Engee documentation
Notebook

Anagram generation algorithm

This guide will help you understand how the algorithm for searching and composing anagram phrases from a given word or phrase works using the Julia programming language.

Introduction

An anagram generation algorithm is a way of rearranging the letters of an original word (or phrase) in such a way as to obtain new meaningful words or phrases. It is widely used in linguistics, cryptography, mindfulness games, and entertainment tasks.

For example, the word Listen can be converted to Silent because both of these combinations use the same letters, but in a different order. Our application will search for such anagrams among a set of English words from a file. unixdict.txt.

The main part of the program

Reading the dictionary

Reading the word list file (unixdict.txt ) and convert everything to lowercase. Then we divide the string into words using a regular expression. \s+ for spaces, to remove unnecessary characters and turn them into an internal collection - an array.

In [ ]:
const unixwords = split(read("unixdict.txt", String) |> lowercase, r"\s+")
Out[0]:
25104-element Vector{SubString{String}}:
 "10th"
 "1st"
 "2nd"
 "3rd"
 "4th"
 "5th"
 "6th"
 "7th"
 "8th"
 "9th"
 "a"
 "a&m"
 "a&p"
 ⋮
 "zombie"
 "zone"
 "zoo"
 "zoology"
 "zoom"
 "zorn"
 "zoroaster"
 "zoroastrian"
 "zounds"
 "zucchini"
 "zurich"
 "zygote"

Function findphrases

Function findphrases accepts the string (anastring) — what needs to be anagrammed — as well as a dictionary array choices. You can also specify a minimum word length (sizelong, 4 characters by default) and how many short words can be added to the result (n_shortpermitted, zero by default). The returned result is a list of matching phrases.

In [ ]:
function findphrases(anastring::AbstractString, choices, sizelong = 4, n_shortpermitted = 1)
    # Подсчет частот каждой буквы входной строки (только a-z)
    anadict = Dict{Char, Int}()
    for c in lowercase(anastring)
        if 'a' <= c <= 'z'
            anadict[c] = get(anadict, c, 0) + 1
        end
    end
    
    # Здесь будем хранить найденные анаграммы
    phrases = String[]
    
    # Рекурсивно подставляем слова
    function addword(remaining, phrase, numshort)
        # Проходим через каждое слово в нашем словаре
        for w in unixwords
            len = length(w)
            
            # Если нет возможности использовать ещё короткие слова и это слово слишком маленькое → пропускаем
            numshort < 1 && len < sizelong && continue
            
            # Проверяем, достаточно ли букв из оставшихся для использования слова W
            any(c -> get(remaining, c, 0) < count(==(c), w), w) && @goto nextword
            
            # Создаем новую версию оставшихся букв без тех, что уже использовали
            cdict = copy(remaining)
            for c in w
                cdict[c] -= 1
            end
            
            # Все буквы совпадают? То есть остались только нули?
            if all(==(0), values(cdict))
                # Мы собрали полную анаграмму! Добавляем её в результат
                return strip(phrase * " " * w)
            elseif (newphrase = addword(cdict, phrase * " " * w, numshort - (len < sizelong))) != nothing
                push!(phrases, newphrase)
            end
            
            @label nextword
        end
        
        # Не нашли ничего на этом шаге → возвращаем nothing
        return nothing
    end
    
    # Запуск рекурсивного процесса сбора результатов
    addword(anadict, "", n_shortpermitted)
    
    # Возвращаем все собранные фразы
    return phrases
end
Out[0]:
findphrases (generic function with 3 methods)

Calling a function with test data

In [ ]:
for s in ["Engee the best", "Simulink", "Exponenta"]
    println("\nFrom '$s':")
    # Удаление повторяющихся и вывод отсортированных результатов
    foreach(println, findphrases(s, unixwords, 4, 0) |> unique |> sort!)
end
From 'Engee the best':
beet gene seth
beet seth gene
best gene thee
best thee gene
gene beet seth
gene best thee
gene hebe test
gene seth beet
gene test hebe
gene thee best
hebe gene test
hebe test gene
seth beet gene
seth gene beet
test gene hebe
test hebe gene
thee best gene
thee gene best

From 'Simulink':
luis mink
mini sulk
mink luis
sulk mini

From 'Exponenta':
annex poet
apex tenon
open texan
pate xenon
peat xenon
poet annex
tape xenon
tenon apex
texan open
xenon pate

Conclusion

We have reviewed in detail the implementation of the algorithm for searching for anagrams in the Julia language. Using the file unixdict.txt We were able to find meaningful phrases that are anagrams of certain phrases.

This technique has applications in writing puzzles, analyzing texts, and even verifying the uniqueness of passwords. The main advantage of this code is the purity of the implementation and the use of recursion to accurately match letter combinations. Efficiency depends on the size of the dictionary and the strategy of selecting new words, so if necessary, it can always be improved by selective filtering or optimization of crawling.

The example was developed using materials from Rosetta Code