
Checking autograms in Julia

This example implements an algorithm for verifying so-called autograms: natural-language sentences that accurately describe their own structure by stating how many times certain letters, punctuation marks, and other symbols occur in them.

Introduction

What is an autogram?

An autogram is a special type of autological (self-descriptive) sentence that describes its own structure. Specifically, such a sentence states how many times each letter of the alphabet, punctuation mark, or other symbol occurs in it. An autogram is correct if the stated count of every mentioned item matches the actual number of its occurrences in that same sentence.

For example, an autogram might say, "There are six letters 'e' in this sentence," and if it really does contain six letters 'e', then that part of the description is correct.
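
A single claim of this kind can be checked by counting the letter directly. The snippet below is a minimal sketch using only Base functions; the variable names are chosen here for illustration, and the sentence is simply the wording quoted above, not a verified autogram.

```julia
# The example wording from the text (an illustration, not a verified autogram)
sentence = "There are six letters 'e' in this sentence"

stated = 6                                      # the count the sentence claims
actual = count(==('e'), lowercase(sentence))    # the count it really contains

println("stated = $stated, actual = $actual, claim holds: ", stated == actual)
```

For this particular wording the two numbers differ (the sentence contains nine e's), so the claim would be rejected. A real autogram has to be worded so that every such count agrees.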

The purpose of the algorithm

The purpose of the implementation is to check whether the autogram describes itself correctly, that is:

  1. For each mentioned letter or symbol, the stated count equals the actual count.
  2. Every symbol that occurs in the sentence is accounted for by a mention.
  3. Punctuation marks are also accounted for, if this is required.

This algorithm can be used in games, linguistic puzzles, and text generation, and even as a playful exercise in self-description for artificial intelligence systems.

The main part

Loading the required packages

In [ ]:
# Installing the necessary DataStructures package if it is not installed
import Pkg; Pkg.add("DataStructures")

# Using the DataStructures package for convenient character frequency calculation
using DataStructures
   Resolving package versions...
   Installed FiniteDiff ─ v2.27.0
    Updating `~/.project/Project.toml`
 [864edb3b] + DataStructures v0.18.22
  No Changes to `~/.project/Manifest.toml`
WARNING: using DataStructures.reset! in module Main conflicts with an existing identifier.

Defining a dictionary of numbers in text form

In [ ]:
# A dictionary with a textual representation of numbers from one to ninety
const textnumbers = Dict(
    "single" => 1, "one" => 1, "two" => 2, "three" => 3, "four" => 4,
    "five" => 5, "six" => 6, "seven" => 7, "eight" => 8, "nine" => 9,
    "ten" => 10, "eleven" => 11, "twelve" => 12, "thirteen" => 13,
    "fourteen" => 14, "fifteen" => 15, "sixteen" => 16,
    "seventeen" => 17, "eighteen" => 18, "nineteen" => 19,
    "twenty" => 20, "thirty" => 30, "forty" => 40, "fifty" => 50,
    "sixty" => 60, "seventy" => 70, "eighty" => 80, "ninety" => 90
)
Out[0]:
Dict{String, Int64} with 28 entries:
  "fourteen" => 14
  "seventy"  => 70
  "twelve"   => 12
  "eight"    => 8
  "twenty"   => 20
  "one"      => 1
  "sixteen"  => 16
  "eighteen" => 18
  "seven"    => 7
  "thirty"   => 30
  "six"      => 6
  "five"     => 5
  "forty"    => 40
  "ten"      => 10
  "sixty"    => 60
  "nineteen" => 19
  "fifty"    => 50
  "thirteen" => 13
  "ninety"   => 90
  "three"    => 3
  "single"   => 1
  "fifteen"  => 15
  "eleven"   => 11
  "two"      => 2
  "four"     => 4
  ⋮          => ⋮

This dictionary maps number words (for example, "five" or "twenty") to their integer values. Compound numbers such as "twenty-five" are not stored as keys; they are assembled from these entries when number words are converted to numeric values.
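
The lookup behaviour can be illustrated with a short sketch. The name samplenumbers is a hypothetical stand-in holding only two of the entries above, so that the snippet runs on its own.

```julia
# Stand-in for a few entries of the dictionary (hypothetical name, illustration only)
samplenumbers = Dict("five" => 5, "twenty" => 20)

println(get(samplenumbers, "five", 0))          # a known number word
println(get(samplenumbers, "sentence", 0))      # not a number word: falls back to the default 0
println(haskey(samplenumbers, "twenty-five"))   # compound forms are not keys
```

Using get with a default of 0 means that ordinary words contribute nothing when values are summed.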

Text-to-number conversion function

In [ ]:
"""
    phrasetointeger(txt)

Convert a number written in words to an integer.
Example: "twenty five" => 25
"""
function phrasetointeger(txt)
    words = split(txt, r"\W+")      # Split the string into words on any run of non-word characters
    n = 0                           # Running total
    for w in words                  # Go through all the words
        n += get(textnumbers, w, 0) # Add the word's value if it is in the dictionary, otherwise 0
        w == "hundred" && (n *= 100) # "hundred" multiplies the total so far by 100
    end
    return n                        # Return the resulting integer
end
Out[0]:
phrasetointeger

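This function takes a string, picks out the number words in it, and sums them into an integer; it is intended for values from 1 to 999 and returns 0 if the string contains no number words. Compound numbers are supported when their parts are separated by a space or a hyphen (for example, "twenty five" or "twenty-five"), because the string is split on any run of non-word characters.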
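
The accumulation rule can be traced on a few inputs. The sketch below is a stand-alone copy under hypothetical names (demonumbers, demophrasetointeger) with a reduced dictionary, so that it runs without the cells above; it is meant only to illustrate the rule, not to replace the function.

```julia
# Reduced copy of the dictionary, enough for these inputs (hypothetical name)
demonumbers = Dict("one" => 1, "five" => 5, "seven" => 7, "twenty" => 20, "ninety" => 90)

# Same accumulation rule as phrasetointeger, under a hypothetical name
function demophrasetointeger(txt)
    n = 0
    for w in split(txt, r"\W+")          # spaces and hyphens both separate words
        n += get(demonumbers, w, 0)      # non-number words add 0
        w == "hundred" && (n *= 100)     # "hundred" scales the total accumulated so far
    end
    return n
end

println(demophrasetointeger("twenty five"))                   # 20 + 5
println(demophrasetointeger("twenty-five"))                   # hyphenated form, same result
println(demophrasetointeger("one hundred and ninety-seven"))  # 1 * 100 + 90 + 7
println(demophrasetointeger("this sentence employs"))         # no number words
```

The four calls print 25, 25, 197 and 0. The last case matters for the checker: a result of 0 signals that a fragment contains no count at all.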

The main autogram-checking function

In [ ]:
"""
    isautogram(txt, countpunctuation; verbose = true)

The function checks whether the txt string is a valid autogram.
Arguments:
- txt: the text that needs to be checked
- countpunctuation: whether to count punctuation marks (true/false)
- verbose: whether to display error messages
"""
function isautogram(txt, countpunctuation; verbose = true)
    # Convert the entire text to lowercase
    s = lowercase(txt)

    # Count the occurrences of each character
    charcounts = counter(s)

    # Characters that still have to be mentioned, mapped to their counts (0 means no mention is required)
    stillneedmention = Dict(
        p[1] => isletter(p[1]) || p[1] != ' ' && countpunctuation ? p[2] : 0
        for p in charcounts
    )

    # Light preprocessing: drop a leading ".employs", ".composed" or ".contains"
    # (the pattern requires a literal leading period) and prepend a space
    s = " " * replace(s, r"^\.(?:employs|composed|contains)" => "")

    # Split the string on commas and colons into mentions, each describing one character
    for mention in split(s, r"\s*,|:\s*")
        mention = replace(mention, r" and$" => "")  # Remove a trailing " and"
        spos = findlast(isspace, mention)           # The last space; the symbol's name follows it

        spos === nothing && continue                # No space found: skip this mention

        # Convert the number words before the last space into an integer
        numfromtext = phrasetointeger(mention[begin:spos-1])
        numfromtext == 0 && continue  # No number found: skip this mention

        # Extract the part of the mention that names the character
        c = mention[begin+spos:end]

        if c == "letters"
            # Checking the total number of letters
            if numfromtext != count(isletter, txt)
                verbose && println("The total letter count (should be $(count(isletter, txt))) is incorrect.")
                return false
            end
            continue
        end

        # Determine the character from its description
        ch = contains(c, "comma") ? ',' : 
             contains(c, "apostrophe") ? '\'' : 
             contains(c, "hyphen") ? '-' : Char(c[1])

        # Check the count of the mentioned character
        if charcounts[ch] == numfromtext
            stillneedmention[ch] = 0  # Mark it as verified
        else
            verbose && println("The count of $ch in the phrase is incorrect.")
            return false
        end
    end

    # Check whether any character that occurs was never mentioned
    for p in stillneedmention
        if p[2] > 0
            verbose && println("The letter and count $p was not mentioned in the counts in the phrase.")
            return false
        end
    end

    return true  # Every check passed: the autogram is valid
end
Out[0]:
isautogram

This main function works in the following steps:

  1. Normalizes the text (converts all letters to lowercase).
  2. Counts the frequency of every character.
  3. Iterates over all character mentions in the text and checks the stated counts.
  4. Checks that every character that actually occurs has been mentioned in the text.
  5. Returns true if everything is correct and false otherwise.

If verbose = true, explanatory error messages are displayed.
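
Steps 2 and 3 can be sketched on a short phrase. The helper charfrequencies below is a hypothetical Base-only substitute for counter from DataStructures, and the phrase is an arbitrary illustration rather than an autogram; the splitting pattern is the same one used in isautogram.

```julia
# Hypothetical helper: character frequencies with a plain Dict instead of DataStructures.counter
function charfrequencies(s)
    freq = Dict{Char, Int}()
    for ch in s
        freq[ch] = get(freq, ch, 0) + 1
    end
    return freq
end

phrase = lowercase("Two t's, two w's")   # arbitrary illustration, not an autogram
freq = charfrequencies(phrase)

# Same splitting pattern as in isautogram: commas and colons delimit the mentions
mentions = split(" " * phrase, r"\s*,|:\s*")

for m in mentions
    spos = findlast(isspace, m)          # the symbol's name follows the last space
    println(repr(m[begin:spos-1]), " -> ", repr(m[spos+1:end]), ", actual count: ", freq[m[spos+1]])
end
```

Each mention splits into a number part (" two") and a symbol part ("t's" or "w's"). Here both letters actually occur three times, so a checker following the steps above would reject the phrase at the first mention.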

Testing autograms

In [ ]:
# A set of test phrases (autograms and non-autograms), each paired with the countpunctuation flag
for (i, t) in enumerate([
    ("This sentence employs two a's, two c's, two d's, twenty-eight e's, five f's, three g's, eight h's, eleven i's, three l's, two m's, thirteen n's, nine o's, two p's, five r's, twenty-five s's, twenty-three t's, six v's, ten w's, two x's, five y's, and one z.", false),
    ("This sentence employs two a's, two c's, two d's, twenty eight e's, five f's, three g's, eight h's, eleven i's, three l's, two m's, thirteen n's, nine o's, two p's, five r's, twenty five s's, twenty three t's, six v's, ten w's, two x's, five y's, and one z.", false),
    ("This pangram contains four as, one b, two cs, one d, thirty es, six fs, five gs, seven hs, eleven is, one j, one k, two ls, two ms, eighteen ns, fifteen os, two ps, one q, five rs, twenty-seven ss, eighteen ts, two us, seven vs, eight ws, two xs, three ys, & one z.", false),
    ("This sentence contains one hundred and ninety-seven letters: four a's, one b, three c's, five d's, thirty-four e's, seven f's, one g, six h's, twelve i's, three l's, twenty-six n's, ten o's, ten r's, twenty-nine s's, nineteen t's, six u's, seven v's, four w's, four x's, five y's, and one z.", false),
    ("Thirteen e's, five f's, two g's, five h's, eight i's, two l's, three n's, six o's, six r's, twenty s's, twelve t's, three u's, four v's, six w's, four x's, two y's.", false),
    ("Fifteen e's, seven f's, four g's, six h's, eight i's, four n's, five o's, six r's, eighteen s's, eight t's, four u's, three v's, two w's, three x's.", true),
    ("Sixteen e's, five f's, three g's, six h's, nine i's, five n's, four o's, six r's, eighteen s's, eight t's, three u's, three v's, two w's, four z's.", false),
   ])
    
    # Check each phrase
    println("Test phrase $i is", isautogram(t[1], t[2]) ? " " : " not ", "a valid autogram.\n")
end
Test phrase 1 is a valid autogram.

Test phrase 2 is a valid autogram.

Test phrase 3 is a valid autogram.

Test phrase 4 is a valid autogram.

Test phrase 5 is a valid autogram.

The letter and count '\'' => 14 was not mentioned in the counts in the phrase.
Test phrase 6 is not a valid autogram.

The count of z in the phrase is incorrect.
Test phrase 7 is not a valid autogram.

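Seven sample texts are checked here, some of which are autograms and some of which are not. The second element of each tuple, t[2], is passed as countpunctuation and specifies whether punctuation marks must be accounted for. Phrase 6 is checked with punctuation counting enabled and is rejected because its 14 apostrophes are never mentioned; phrase 7 is rejected because its stated count of z's is wrong.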
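
The effect of the flag can be sketched with a small stand-alone predicate. The name needsmention is hypothetical; the condition mirrors the one used to build stillneedmention in isautogram, with parentheses added to make the operator precedence explicit.

```julia
# Hypothetical helper: does this character have to be mentioned in the sentence?
# Letters always do; other characters only when punctuation is counted, and spaces never.
needsmention(ch, countpunctuation) = isletter(ch) || (ch != ' ' && countpunctuation)

println(needsmention('e', false))    # letters always need a mention
println(needsmention('\'', false))   # apostrophes are ignored without the flag
println(needsmention('\'', true))    # and required with it
println(needsmention(' ', true))     # spaces are never required
```

This is why the sixth phrase fails only because of its apostrophes: with the flag set to true, every apostrophe in the text has to be covered by a mention.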

Conclusion

We have reviewed an implementation of an algorithm for verifying autograms in the Julia programming language. The external DataStructures package was used to count character frequencies, and functions were implemented for:

  • Converting verbal numbers into numeric ones.
  • Checking the correctness of the description of each character.
  • Checking the completeness of the mention of all characters present in the sentence.

This algorithm can be used in natural language processing tasks and in automatic checks of whether a text's statements about itself are consistent, and it can also serve as an example of an interesting problem at the intersection of programming and linguistics.

The program has been tested on several examples, including both correct and incorrect autograms, and it gives the expected result on each of them.