Checking autograms in the Julia language
This example demonstrates an algorithm for verifying so-called autograms: natural-language sentences that accurately describe their own composition by stating how many times certain letters, punctuation marks, and other symbols occur in them.
Introduction
What is an autogram?
An autogram is a special type of autological (self-descriptive) sentence that describes its own structure: it states how many times each letter of the alphabet, punctuation mark, or other symbol occurs in it. An autogram is correct if the count stated for each mentioned item matches its actual number of occurrences in the same sentence.
For example, an autogram might say, "There are six letters 'e' in this sentence," and if it does contain six letters 'e', then that part of the description is correct.
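Checking a single claim of this kind comes down to counting one character. The following is a minimal sketch using only Base Julia; the names `sentence`, `claimed`, and `actual` are illustrative and do not appear in the program described later.

```julia
# Minimal sketch: verify one self-referential claim by counting a character.
# `sentence`, `claimed`, and `actual` are illustrative names only.
sentence = "There are six letters 'e' in this sentence"
claimed = 6

# count(predicate, string) visits every character of the string
actual = count(==('e'), lowercase(sentence))

println("claimed: $claimed, actual: $actual, matches: $(claimed == actual)")
```

For the example wording quoted above, the count comes out as nine rather than six, so that particular sentence would fail the check. A real autogram has to be tuned until every stated count agrees with the text.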
The purpose of the algorithm
The purpose of the implementation is to check whether an autogram describes itself correctly, that is:
- For each mentioned letter or symbol, the stated count equals the actual count.
- Every symbol that occurs in the sentence is accounted for in its description.
- Punctuation marks are included in the check when required.
This algorithm can be used in games, linguistic puzzles, and text generation, and it can serve as an exercise in self-reference for text-generating AI systems.
The main part
Connecting the necessary packages
# Install the DataStructures package (this line can be skipped if it is already installed)
import Pkg; Pkg.add("DataStructures")
# Using the DataStructures package for convenient character frequency calculation
using DataStructures
Defining a dictionary of numbers in text form
# Dictionary of number words: one to nineteen, plus the tens up to ninety
const textnumbers = Dict(
    "single" => 1, "one" => 1, "two" => 2, "three" => 3, "four" => 4,
    "five" => 5, "six" => 6, "seven" => 7, "eight" => 8, "nine" => 9,
    "ten" => 10, "eleven" => 11, "twelve" => 12, "thirteen" => 13,
    "fourteen" => 14, "fifteen" => 15, "sixteen" => 16,
    "seventeen" => 17, "eighteen" => 18, "nineteen" => 19,
    "twenty" => 20, "thirty" => 30, "forty" => 40, "fifty" => 50,
    "sixty" => 60, "seventy" => 70, "eighty" => 80, "ninety" => 90
)
This dictionary maps number words (for example, "five" or "twenty") to their integer values. Compound numbers such as "twenty-five" are not stored directly; they are assembled from these entries by the conversion function.
Text-to-number conversion function
"""
    phrasetointeger(txt)

Convert a number written in words to its integer value.

Example: "twenty five" => 25
"""
function phrasetointeger(txt)
    words = split(txt, r"\W+")  # Split the string into words, dropping punctuation
    n = 0                       # Running total
    for w in words              # Go through all the words
        n += get(textnumbers, w, 0)   # Add the word's value if it is in the dictionary
        w == "hundred" && (n *= 100)  # "hundred" multiplies the running total by 100
    end
    return n                    # Return the total numeric value
end
This function takes a string, looks up each number word in it, and accumulates an integer from 1 to 999. Because the text is split on non-word characters, compound numbers may be written with either a space or a hyphen (for example, "twenty five" or "twenty-five"); words that are not in the dictionary, such as "and", contribute nothing.
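To make the accumulation rule concrete, here is a small sketch that records the running total after each word. The names `smallnumbers` and `tracephrase` are hypothetical and exist only for this illustration; `smallnumbers` is a reduced copy of the dictionary above, and the rule itself is the same as in `phrasetointeger`.

```julia
# Sketch of the accumulation rule with a trace of the running total.
# `smallnumbers` and `tracephrase` are hypothetical names for illustration.
smallnumbers = Dict("one" => 1, "five" => 5, "seven" => 7, "twenty" => 20, "ninety" => 90)

function tracephrase(txt)
    n = 0
    steps = Int[]
    for w in split(lowercase(txt), r"\W+"; keepempty = false)
        n += get(smallnumbers, w, 0)   # unknown words such as "and" add 0
        w == "hundred" && (n *= 100)   # "hundred" scales what has been read so far
        push!(steps, n)                # remember the total after this word
    end
    return n, steps
end

total, steps = tracephrase("one hundred and ninety-seven")
println(total)   # 197
println(steps)   # [1, 100, 100, 190, 197]
```

Since "hundred" multiplies everything accumulated so far, the rule relies on the hundreds being stated first, as in ordinary English number phrases.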
The main function of checking the autogram
"""
    isautogram(txt, countpunctuation; verbose = true)

Check whether the string txt is a valid autogram.

Arguments:
- txt: the text to check
- countpunctuation: whether punctuation marks must also be counted (true/false)
- verbose: whether to print a message explaining why the check failed
"""
function isautogram(txt, countpunctuation; verbose = true)
    # Convert the entire text to lowercase
    s = lowercase(txt)
    # Count the occurrences of each character
    charcounts = counter(s)
    # Characters that still need to be mentioned: letters always,
    # other non-space characters only if countpunctuation is true
    stillneedmention = Dict(
        p[1] => isletter(p[1]) || p[1] != ' ' && countpunctuation ? p[2] : 0
        for p in charcounts
    )
    # Light preprocessing before parsing: prepend a space and strip a leading
    # ".employs", ".composed" or ".contains" if the text starts with one
    s = " " * replace(s, r"^\.(?:employs|composed|contains)" => "")
    # Split the string at commas and colons into mentions, each describing one character
    for mention in split(s, r"\s*,|:\s*")
        mention = replace(mention, r" and$" => "")  # Drop a trailing "and"
        spos = findlast(isspace, mention)           # The symbol's name follows the last space
        spos === nothing && continue                # No space found: skip this mention
        # Extract the number words (everything before the last space)
        numfromtext = phrasetointeger(mention[begin:spos-1])
        numfromtext == 0 && continue                # No number found: skip this mention
        # Extract the part of the string that names the character
        c = mention[begin+spos:end]
        if c == "letters"
            # Check the total number of letters
            if numfromtext != count(isletter, txt)
                verbose && println("The total letter count (should be $(count(isletter, txt))) is incorrect.")
                return false
            end
            continue
        end
        # Determine the character from its description
        ch = contains(c, "comma") ? ',' :
             contains(c, "apostrophe") ? '\'' :
             contains(c, "hyphen") ? '-' : Char(c[1])
        # Check the count of the mentioned character
        if charcounts[ch] == numfromtext
            stillneedmention[ch] = 0  # Mark it as verified
        else
            verbose && println("The count of $ch in the phrase is incorrect.")
            return false
        end
    end
    # Check whether any required character was never mentioned
    for p in stillneedmention
        if p[2] > 0
            verbose && println("The letter and count $p was not mentioned in the counts in the phrase.")
            return false
        end
    end
    return true  # Everything was checked and matches, so the autogram is valid
end
This function works in the following steps:
- Normalizes the text (converts all letters to lowercase).
- Counts the frequency of every character.
- Iterates over all character mentions in the text and checks their stated counts.
- Checks whether every character that actually occurs has been mentioned.
- Returns true if everything is correct, otherwise false.
If verbose = true, explanatory error messages are printed.
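The step that turns a sentence into individual mentions can be looked at on its own. The sketch below uses a hypothetical helper, `splitmentions`, which is not part of the program: it applies the same splitting pattern and the same "last space" rule as `isautogram`, but only returns the pieces instead of validating them.

```julia
# Sketch of how a sentence is cut into (count phrase, symbol) pairs.
# `splitmentions` is a hypothetical helper; it mirrors the splitting used in
# `isautogram` but performs no counting or validation.
function splitmentions(txt)
    result = Tuple{String,String}[]
    for mention in split(lowercase(txt), r"\s*,|:\s*")
        spos = findlast(isspace, mention)      # the symbol follows the last space
        spos === nothing && continue           # no space: nothing to split
        countphrase = strip(mention[begin:prevind(mention, spos)])
        symbol = mention[nextind(mention, spos):end]
        push!(result, (String(countphrase), String(symbol)))
    end
    return result
end

for (phrase, symbol) in splitmentions("Fifteen e's, seven f's, and two w's.")
    println(rpad(phrase, 12), " -> ", symbol)
end
```

The leftover "and" in the last count phrase and the trailing period in the last symbol do no harm in the full function: unknown words add nothing to the number, and only the first character of the symbol is used.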
Testing autograms
# A set of test sentences: autograms and non-autograms
for (i, t) in enumerate([
("This sentence employs two a's, two c's, two d's, twenty-eight e's, five f's, three g's, eight h's, eleven i's, three l's, two m's, thirteen n's, nine o's, two p's, five r's, twenty-five s's, twenty-three t's, six v's, ten w's, two x's, five y's, and one z.", false),
("This sentence employs two a's, two c's, two d's, twenty eight e's, five f's, three g's, eight h's, eleven i's, three l's, two m's, thirteen n's, nine o's, two p's, five r's, twenty five s's, twenty three t's, six v's, ten w's, two x's, five y's, and one z.", false),
("This pangram contains four as, one b, two cs, one d, thirty es, six fs, five gs, seven hs, eleven is, one j, one k, two ls, two ms, eighteen ns, fifteen os, two ps, one q, five rs, twenty-seven ss, eighteen ts, two us, seven vs, eight ws, two xs, three ys, & one z.", false),
("This sentence contains one hundred and ninety-seven letters: four a's, one b, three c's, five d's, thirty-four e's, seven f's, one g, six h's, twelve i's, three l's, twenty-six n's, ten o's, ten r's, twenty-nine s's, nineteen t's, six u's, seven v's, four w's, four x's, five y's, and one z.", false),
("Thirteen e's, five f's, two g's, five h's, eight i's, two l's, three n's, six o's, six r's, twenty s's, twelve t's, three u's, four v's, six w's, four x's, two y's.", false),
("Fifteen e's, seven f's, four g's, six h's, eight i's, four n's, five o's, six r's, eighteen s's, eight t's, four u's, three v's, two w's, three x's.", true),
("Sixteen e's, five f's, three g's, six h's, nine i's, five n's, four o's, six r's, eighteen s's, eight t's, three u's, three v's, two w's, four z's.", false),
])
    # Check each sentence
    println("Test phrase $i is", isautogram(t[1], t[2]) ? " " : " not ", "a valid autogram.\n")
end
Seven sample texts are checked here, some of which are autograms and some of which are not. The second element of each tuple, t[2], is not the expected result: it is passed as countpunctuation and specifies whether punctuation marks must be taken into account during the check.
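The effect of the countpunctuation flag can be seen in isolation with a minimal sketch. The helper `mustmention` and the string `sample` are hypothetical names introduced here; the condition is the one used when `isautogram` builds its dictionary of characters that still need a mention, with the implicit precedence (&& binds tighter than ||) written out in parentheses.

```julia
# Sketch of the rule that decides which characters must be mentioned.
# `mustmention` and `sample` are hypothetical names for this illustration.
mustmention(c::Char, countpunctuation::Bool) =
    isletter(c) || (c != ' ' && countpunctuation)

sample = "two x's, one y."

# Distinct characters that would require a mention under each setting
required_without = sort(unique(filter(c -> mustmention(c, false), sample)))
required_with    = sort(unique(filter(c -> mustmention(c, true), sample)))

println(join(required_without))   # letters only
println(join(required_with))      # letters plus apostrophe, comma and period
```

With the flag set to true, every non-space punctuation character, including the final period, has to be mentioned with the right count, so a sentence that lists only its letters is rejected in that mode.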
Conclusion
We have reviewed an implementation of an algorithm for verifying autograms in the Julia programming language. The external DataStructures package was used to count character frequencies, and the following functionality was implemented:
- Converting verbal numbers into numeric ones.
- Checking the correctness of the description of each character.
- Checking the completeness of the mention of all characters present in the sentence.
This algorithm can be used in natural language processing tasks, automatic verification of the logical correctness of texts, and can also serve as an example of interesting tricks in the field of programming and linguistics.
The program has been tested on several examples, including both correct and incorrect autograms, which shows that it works as intended on these inputs.