Anagram generation algorithm

This guide will help you understand how the algorithm for searching and composing anagram phrases from a given word or phrase works using the Julia programming language.

Introduction

An anagram generation algorithm is a way of rearranging the letters of an original word (or phrase) in such a way as to obtain new meaningful words or phrases. It is widely used in linguistics, cryptography, mindfulness games, and entertainment tasks.

For example, the word Listen can be converted to Silent because both of these combinations use the same letters, but in a different order. Our application will search for such anagrams among a set of English words from a file. unixdict.txt.

The main part of the program

Reading the dictionary

Reading the word list file (unixdict.txt ) and convert everything to lowercase. Then we divide the string into words using a regular expression. \s+ for spaces, to remove unnecessary characters and turn them into an internal collection - an array.

const unixwords = split(read("unixdict.txt", String) |> lowercase, r"\s+")

25104-element Vector{SubString{String}}:
 "10th"
 "1st"
 "2nd"
 "3rd"
 "4th"
 "5th"
 "6th"
 "7th"
 "8th"
 "9th"
 "a"
 "a&m"
 "a&p"
 ⋮
 "zombie"
 "zone"
 "zoo"
 "zoology"
 "zoom"
 "zorn"
 "zoroaster"
 "zoroastrian"
 "zounds"
 "zucchini"
 "zurich"
 "zygote"

Function `findphrases`

Function findphrases accepts the string (anastring) — what needs to be anagrammed — as well as a dictionary array choices. You can also specify a minimum word length (sizelong, by default 4 characters) and how many short words can be added to the result (n_shortpermitted, zero by default). The returned result is a list of matching phrases.

function findphrases(anastring::AbstractString, choices, sizelong = 4, n_shortpermitted = 1)
    # Counting the frequencies of each letter of the input string (a-z only)
    anadict = Dict{Char, Int}()
    for c in lowercase(anastring)
        if 'a' <= c <= 'z'
            anadict[c] = get(anadict, c, 0) + 1
        end
    end
    
    # We will store the found anagrams here.
    phrases = String[]
    
    # Recursively substituting words
    function addword(remaining, phrase, numshort)
        # We go through every word in our dictionary.
        for w in unixwords
            len = length(w)
            
            # If it is not possible to use more short words and this word is too small → skip
            numshort < 1 && len < sizelong && continue
            
            # We check whether there are enough letters from the remaining ones to use the word W.
            any(c -> get(remaining, c, 0) < count(==(c), w), w) && @goto nextword
            
            # Creating a new version of the remaining letters without those that have already been used
            cdict = copy(remaining)
            for c in w
                cdict[c] -= 1
            end
            
            # Do all the letters match? So there are only zeros left?
            if all(==(0), values(cdict))
                # We have collected a complete anagram! Adding it to the result
                return strip(phrase * " " * w)
            elseif (newphrase = addword(cdict, phrase * " " * w, numshort - (len < sizelong))) != nothing
                push!(phrases, newphrase)
            end
            
            @label nextword
        end
        
        # Didn't find anything at this step → return nothing
        return nothing
    end
    
    # Starting a recursive process of collecting results
    addword(anadict, "", n_shortpermitted)
    
    # We are returning all the collected phrases
    return phrases
end

findphrases (generic function with 3 methods)

Calling a function with test data

for s in ["Engee the best", "Simulink", "Exponenta"]
    println("\nFrom '$s':")
    # Removing duplicate results and displaying sorted results
    foreach(println, findphrases(s, unixwords, 4, 0) |> unique |> sort!)
end

From 'Engee the best':
beet gene seth
beet seth gene
best gene thee
best thee gene
gene beet seth
gene best thee
gene hebe test
gene seth beet
gene test hebe
gene thee best
hebe gene test
hebe test gene
seth beet gene
seth gene beet
test gene hebe
test hebe gene
thee best gene
thee gene best

From 'Simulink':
luis mink
mini sulk
mink luis
sulk mini

From 'Exponenta':
annex poet
apex tenon
open texan
pate xenon
peat xenon
poet annex
tape xenon
tenon apex
texan open
xenon pate

Conclusion

We have reviewed in detail the implementation of the algorithm for searching for anagrams in the Julia language. Using the file unixdict.txt We were able to find meaningful phrases that are anagrams of certain phrases.

This technique has applications in writing puzzles, analyzing texts, and even verifying the uniqueness of passwords. The main advantage of this code is the purity of the implementation and the use of recursion to accurately match letter combinations. Efficiency depends on the size of the dictionary and the strategy of selecting new words, so if necessary, it can always be improved by selective filtering or optimization of crawling.

The example was developed using materials from Rosetta Code