Checking autograms in the Julia language
This example demonstrates an algorithm for verifying so-called autograms: natural-language sentences that accurately describe their own structure, that is, the number of occurrences of each letter, punctuation mark, and other symbol they contain.
Introduction
What is an autogram?
An autogram is a special type of autological (self-descriptive) sentence that describes its own structure. Namely, such a sentence states how many times each letter of the alphabet, punctuation mark, and other symbol occurs in it. An autogram is correct if the stated count of every mentioned item matches the actual number of occurrences in the same sentence.
For example, an autogram might say, "There are six letters 'e' in this sentence," and if it really does contain six letters 'e', then that part of the description is correct.
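As a quick illustration of what "correct" means here, the following minimal sketch counts the letter 'e' in the sample sentence and compares the result with the claimed value. The variable names are chosen for this illustration only. For this particular sentence the tally comes out to nine rather than six, so, taken literally, its claim would fail the check.

```julia
# Sample sentence from the text, with its claimed number of 'e' letters
sentence = "There are six letters 'e' in this sentence"
claimed = 6

# Count the actual occurrences, ignoring case
actual = count(==('e'), lowercase(sentence))

println("claimed = $claimed, actual = $actual, matches = $(claimed == actual)")
# prints: claimed = 6, actual = 9, matches = false
```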
The purpose of the algorithm
The implementation checks whether an autogram describes itself correctly, that is:
- For each mentioned letter or symbol, the stated count equals the actual count.
- Every symbol that actually occurs in the sentence is mentioned.
- Punctuation marks are included in the check when required.
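The first two conditions can be sketched independently of any text parsing. The example below assumes the claimed counts have already been extracted into a dictionary; `check_claims` and its arguments are hypothetical names used only for this illustration, and the sketch considers letters only, leaving punctuation aside.

```julia
# Hypothetical helper: `claims` stands in for counts already extracted from the sentence.
function check_claims(text, claims::Dict{Char,Int})
    # Tally the letters that actually occur (case-insensitive)
    actual = Dict{Char,Int}()
    for ch in lowercase(text)
        isletter(ch) && (actual[ch] = get(actual, ch, 0) + 1)
    end
    # Letters whose claimed count differs from the actual count
    wrong = Char[ch for (ch, n) in claims if get(actual, ch, 0) != n]
    # Letters that occur in the text but are never mentioned
    unmentioned = Char[ch for ch in keys(actual) if !haskey(claims, ch)]
    return sort(wrong), sort(unmentioned)
end

# "abba c" contains two a's, two b's and one c; the claims get 'b' wrong and omit 'c'
wrong, unmentioned = check_claims("abba c", Dict('a' => 2, 'b' => 3))
println("wrong counts: ", wrong, ", unmentioned: ", unmentioned)
```

A text passes only when both returned lists are empty.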
Such a checker can be useful in word games, linguistic puzzles, and text generation, and even as a playful exercise in self-description for artificial intelligence systems.
The main part
Connecting the necessary packages
# Install the DataStructures package if it is not installed yet
import Pkg; Pkg.add("DataStructures")
# Use DataStructures for convenient counting of character frequencies
using DataStructures
Defining a dictionary of numbers in text form
# Dictionary mapping number words ("single", "one" through "ninety") to integer values
const textnumbers = Dict(
    "single" => 1, "one" => 1, "two" => 2, "three" => 3, "four" => 4,
    "five" => 5, "six" => 6, "seven" => 7, "eight" => 8, "nine" => 9,
    "ten" => 10, "eleven" => 11, "twelve" => 12, "thirteen" => 13,
    "fourteen" => 14, "fifteen" => 15, "sixteen" => 16,
    "seventeen" => 17, "eighteen" => 18, "nineteen" => 19,
    "twenty" => 20, "thirty" => 30, "forty" => 40, "fifty" => 50,
    "sixty" => 60, "seventy" => 70, "eighty" => 80, "ninety" => 90
)
This dictionary maps individual number words (for example, "five" or "twenty") to their integer values. Compound numbers such as "twenty-five" are not stored as keys; they are assembled from their parts by the conversion function described next.
Text-to-number conversion function
"""
    phrasetointeger(txt)
Convert a number written in words to its integer value.
Example: "twenty five" => 25
"""
function phrasetointeger(txt)
    words = split(txt, r"\W+")      # Split the string into words, dropping punctuation
    n = 0                           # Running total
    for w in words                  # Go through every word
        n += get(textnumbers, w, 0) # If the word is in the dictionary, add its value
        w == "hundred" && (n *= 100) # The word "hundred" multiplies the total by 100
    end
    return n                        # Return the resulting integer
end
This function takes a string, picks out the number words, and builds an integer from 1 to 999 from them; if the string contains no number words, the result is 0. Compound numbers are supported whether their parts are separated by a space or by a hyphen (for example, "twenty five" or "twenty-five"), because the string is split on any run of non-word characters.
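To make the accumulation rule concrete, here is a small sketch that applies the same two operations (add the value of a known word, multiply by 100 on "hundred") while recording the running total after each word. The reduced table `demo_numbers` and the helper `trace_phrase` are hypothetical and exist only for this illustration.

```julia
# Hypothetical reduced word table, just enough for this demonstration
const demo_numbers = Dict("one" => 1, "seven" => 7, "twenty" => 20, "ninety" => 90)

# Same accumulation rule as phrasetointeger, but keeping every intermediate total
function trace_phrase(txt)
    n = 0
    steps = Int[]
    for w in split(txt, r"\W+")       # "ninety-seven" becomes "ninety", "seven"
        n += get(demo_numbers, w, 0)  # unknown words such as "and" add 0
        w == "hundred" && (n *= 100)
        push!(steps, n)
    end
    return steps
end

println(trace_phrase("one hundred and ninety-seven"))
# prints: [1, 100, 100, 190, 197]
```

Because "hundred" multiplies whatever has been accumulated so far, the rule relies on the hundreds coming first in the phrase, as they do in ordinary English number names.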
The main function of checking the autogram
"""
    isautogram(txt, countpunctuation; verbose = true)
Check whether the string txt is a correct autogram.
Arguments:
- txt: the text to check
- countpunctuation: whether punctuation marks are counted (true/false)
- verbose: whether to print error messages
"""
function isautogram(txt, countpunctuation; verbose = true)
    # Convert the whole text to lowercase
    s = lowercase(txt)
    
    # Count the occurrences of each character
    charcounts = counter(s)
    # Characters that still have to be mentioned: letters always,
    # other non-space characters only when countpunctuation is true
    stillneedmention = Dict(
        p[1] => ((isletter(p[1]) || (p[1] != ' ' && countpunctuation)) ? p[2] : 0)
        for p in charcounts
    )
    # Light preprocessing: drop a leading ".employs", ".composed" or ".contains"
    # (the pattern requires a literal leading period) and prepend a space
    s = " " * replace(s, r"^\.(?:employs|composed|contains)" => "")
    # Split the string at commas and colons into mentions, each describing one symbol
    for mention in split(s, r"\s*,|:\s*")
        mention = replace(mention, r" and$" => "")  # Remove a trailing "and"
        spos = findlast(isspace, mention)           # Last space; the symbol word follows it
        if spos === nothing continue end            # No space found, skip this mention
        # Extract the number written in words (everything before the last space)
        numfromtext = phrasetointeger(mention[begin:spos-1])
        numfromtext == 0 && continue  # Skip mentions that contain no number words
        # Extract the part of the mention that names the symbol
        c = mention[begin+spos:end]
        if c == "letters"
            # Check the total number of letters
            if numfromtext != count(isletter, txt)
                verbose && println("The total letter count (should be $(count(isletter, txt))) is incorrect.")
                return false
            end
            continue
        end
        # Determine the symbol from its description
        ch = contains(c, "comma") ? ',' : 
             contains(c, "apostrophe") ? '\'' : 
             contains(c, "hyphen") ? '-' : Char(c[1])
        # Check the count of the mentioned symbol
        if charcounts[ch] == numfromtext
            stillneedmention[ch] = 0  # Mark as verified
        else
            verbose && println("The count of $ch in the phrase is incorrect.")
            return false
        end
    end
    # Check that no occurring symbol was left unmentioned
    for p in stillneedmention
        if p[2] > 0
            verbose && println("The letter and count $p was not mentioned in the counts in the phrase.")
            return false
        end
    end
    return true  # Everything was checked and matches, so the autogram is correct
end
The main function proceeds in the following steps:
- Normalizes the text (converts all letters to lowercase).
- Counts the frequency of every character.
- Iterates over all mentions of characters in the text and checks the stated counts.
- Checks that every character that actually occurs is mentioned in the text.
- Returns true if everything is correct, otherwise false.
If verbose = true, explanatory error messages are displayed.
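The mention-parsing step is the least obvious part, so here is a reduced sketch of it in isolation. It uses the same separator pattern and the same "cut at the last space" rule as the function above, but only returns the pieces instead of checking them. The helper name `split_mentions` is hypothetical and is not part of the program.

```julia
# Hypothetical helper: split a sentence into (count phrase, symbol word) pairs
function split_mentions(sentence)
    s = " " * lowercase(sentence)       # leading space so the first mention has one too
    result = Tuple{String,String}[]
    for mention in split(s, r"\s*,|:\s*")   # same separators as in isautogram
        spos = findlast(isspace, mention)
        spos === nothing && continue
        countpart = String(strip(mention[begin:spos-1]))  # words before the last space
        symbolpart = String(mention[spos+1:end])          # word after the last space
        push!(result, (countpart, symbolpart))
    end
    return result
end

for (countpart, symbolpart) in split_mentions("Two a's, three b's, and one z.")
    println(repr(countpart), " -> ", repr(symbolpart))
end
# prints:
# "two" -> "a's"
# "three" -> "b's"
# "and one" -> "z."
```

Extra words such as "and" in the count phrase are harmless, since the number conversion ignores any word that is not in its dictionary, and only the first character of the symbol word is used for ordinary letters.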
Testing autograms
# Test strings: some are valid autograms and some are not
for (i, t) in enumerate([
    ("This sentence employs two a's, two c's, two d's, twenty-eight e's, five f's, three g's, eight h's, eleven i's, three l's, two m's, thirteen n's, nine o's, two p's, five r's, twenty-five s's, twenty-three t's, six v's, ten w's, two x's, five y's, and one z.", false),
    ("This sentence employs two a's, two c's, two d's, twenty eight e's, five f's, three g's, eight h's, eleven i's, three l's, two m's, thirteen n's, nine o's, two p's, five r's, twenty five s's, twenty three t's, six v's, ten w's, two x's, five y's, and one z.", false),
    ("This pangram contains four as, one b, two cs, one d, thirty es, six fs, five gs, seven hs, eleven is, one j, one k, two ls, two ms, eighteen ns, fifteen os, two ps, one q, five rs, twenty-seven ss, eighteen ts, two us, seven vs, eight ws, two xs, three ys, & one z.", false),
    ("This sentence contains one hundred and ninety-seven letters: four a's, one b, three c's, five d's, thirty-four e's, seven f's, one g, six h's, twelve i's, three l's, twenty-six n's, ten o's, ten r's, twenty-nine s's, nineteen t's, six u's, seven v's, four w's, four x's, five y's, and one z.", false),
    ("Thirteen e's, five f's, two g's, five h's, eight i's, two l's, three n's, six o's, six r's, twenty s's, twelve t's, three u's, four v's, six w's, four x's, two y's.", false),
    ("Fifteen e's, seven f's, four g's, six h's, eight i's, four n's, five o's, six r's, eighteen s's, eight t's, four u's, three v's, two w's, three x's.", true),
    ("Sixteen e's, five f's, three g's, six h's, nine i's, five n's, four o's, six r's, eighteen s's, eight t's, three u's, three v's, two w's, four z's.", false),
   ])
    
    # Check each string
    println("Test phrase $i is", isautogram(t[1], t[2]) ? " " : " not ", "a valid autogram.\n")
end
Seven sample texts are checked here, some of which are autograms and some of which are not. The second element of each tuple, t[2], specifies whether punctuation marks are counted during the check.
Conclusion
We have reviewed an implementation of an algorithm for verifying autograms in the Julia programming language. The external package DataStructures was used to count character frequencies, and the following functions were implemented:
- Converting verbal numbers to numeric ones.
- Checking the correctness of the description of each character.
- Checking the completeness of the mention of all characters present in the sentence.
This algorithm can be used in natural language processing tasks and in automatic checks that a text describes itself consistently, and it can also serve as an example of an entertaining problem at the intersection of programming and linguistics.
The program was run on several examples, including both correct and incorrect autograms, which shows that it works on these cases.