The Aho-Korasik algorithm
This example examines the implementation of the Aho-Korasik algorithm in the Julia programming language. The algorithm is designed to search for multiple samples in a text at the same time.
The theoretical part
Algorithm Ахо–Корасик is an efficient substring search algorithm designed to find all occurrences of a finite set of template strings (dictionary) in the input text. Designed by Alfred V. Aho and Margaret J. Founded in 1975, it is widely used in tasks that require simultaneous search for multiple patterns, such as text processing, bioinformatics, and network security.
Task description
A text string is given lengths and many of the templates , the total length of which is (the sum of the lengths of all the patterns). The Aho–Korasik algorithm finds all occurrences of any pattern in . Specifically, it returns all pairs. , such that the template matches the substring of the text starting from the position (i.e. The algorithm is especially effective when searching for multiple templates at the same time, as it avoids multiple text scans for each template individually.
The formal statement of the problem:
Input: text and lots of templates .
Output: all pairs , such that the template it occurs in starting from the position .
Applications
The Aho–Korasik algorithm is used in the following areas:
-
Full-text search systems — for searching multiple keywords in documents;
-
Intrusion detection systems — for matching signatures of malicious patterns in network traffic;
-
Bioinformatics — to search for multiple motifs in DNA sequences;
-
Data compression and natural language processing — for template-based analysis.
The main part
Data fields
# All valid characters of the English alphabet from 'a' to 'z'
const VALID_CHARS = collect('a':'z')
The structure of a finite state machine node
Determination of the structure of the Bohr node / Aho-Corasick finite state machine
mutable struct Node
son::Vector{Int} # child vertex numbers for each character ('a'-'z')
ans::Int # the number of matches ending here
fail::Int # a link to the "previous" node with the same suffix
du::Int # the number of dependent nodes to be processed after the current one
idx::Int # a unique index for the detected word
end
Default constructor: create a new node without child nodes and with initial values
Node() = Node(zeros(Int, length(VALID_CHARS)), 0, 1, 0, 0)
The structure of the Aho-Korasik automaton
The definition of the Aho-Korasik automaton itself
mutable struct ACAutomaton
tr::Vector{Node} # an array of all nodes of the prefix boron/ automaton tree
tot::Int # total number of used/existing nodes
final_ans::Vector{Int} # the result of counting the number of occurrences of each pattern
pidx::Int # a counter for unique word indexes
end
Initializing an automaton with a maximum of max_nodes nodes and a root
ACAutomaton(max_nodes) = ACAutomaton([Node() for _ in 1:max_nodes], 1, Int[], 0)
Operations on the machine
Inserting a pattern into an automaton
The function of adding a new template to the machine and assigning it a unique ID
function insert(ac::ACAutomaton, pattern)
u = 1 # the initial node is the root
for ch in pattern # passing through each character of the template
char_code = findfirst(==(ch), VALID_CHARS)
isnothing(char_code) && continue
if ac.tr[u].son[char_code] == 0 # if there is no such child node
ac.tot += 1 # we select a new node
ac.tr[u].son[char_code] = ac.tot
end
u = ac.tr[u].son[char_code] # moving on to the next node
end
# We assign a unique identifier to the pattern, if it has not yet been assigned.
if ac.tr[u].idx == 0
ac.pidx += 1
ac.tr[u].idx = ac.pidx
end
# returning a unique index
return ac.tr[u].idx
end
Building "suffix" links
The function of constructing suffix transitions (fail links) is necessary for the correct operation of the AC algorithm.
function build(ac::ACAutomaton)
q = [ac.tr[begin].son[i] for i in eachindex(VALID_CHARS) if ac.tr[begin].son[i] != 0]
while !isempty(q)
u = q[begin]
q = q[begin+1:end]
for i in eachindex(VALID_CHARS)
son_node_idx = ac.tr[u].son[i]
fail_node_idx = ac.tr[u].fail
if son_node_idx != 0
ac.tr[son_node_idx].fail = ac.tr[fail_node_idx].son[i]
ac.tr[ac.tr[son_node_idx].fail].du += 1
push!(q, son_node_idx)
else
ac.tr[u].son[i] = ac.tr[fail_node_idx].son[i] # Making a greedy sequel
end
end
end
end
Search for words in the input text
The function of searching for all words in the text using a built-in automaton
function query(ac::ACAutomaton, text)
u = 1
for ch in text # we follow the characters of the text
char_code = findfirst(==(ch), VALID_CHARS)
isnothing(char_code) && continue
u = ac.tr[u].son[char_code]
ac.tr[u].ans += 1 # increasing the count of matches ending at this vertex
end
end
Calculation of final results
After completing the main pass, we calculate the exact match values for each template, taking into account the possible dependence between nodes through fail links.
function calculate_final_answers(ac::ACAutomaton)
ac.final_ans = zeros(Int, ac.pidx + 1)
# We start with nodes that have no dependencies.
q = filter(i -> ac.tr[i].du == 0, 1:ac.tot)
while !isempty(q)
u = pop!(q)
idx = ac.tr[u].idx
if idx != 0
ac.final_ans[idx] = ac.tr[u].ans
end
v = ac.tr[u].fail
if v > 1
ac.tr[v].ans += ac.tr[u].ans
ac.tr[v].du -= 1
ac.tr[v].du == 0 && push!(q, v) # if this node is now ready for processing, we add it to the queue.
end
end
end
Implementation testing
A feature that automatically tests the entire system
function test_aho_corasick(input, text)
max_nodes, n = 200000 + 6, 5 # limit on the maximum number of nodes
ac = ACAutomaton(max_nodes) # creating a new slot machine
ids = zeros(Int, n + 1) # an array for storing the IDs of each template
# Adding all the templates to the machine
for i in 1:n
pattern = input[i]
ids[i+1] = insert(ac, pattern)
end
build(ac) # we have built suffix links
query(ac, text) # completed the search
calculate_final_answers(ac) # We have collected the answers
# We display the results
for i in 1:n
uniqueid = ids[i]
println("Number of instances of '$(input[i])': " => ac.final_ans[uniqueid+1])
end
end
Let's start testing:
# Several test templates
input = ["a", "bb", "aa", "abaa", "abaaa"]
text = "abaaabaa" # Text for analysis
test_aho_corasick(input, text)
Conclusion
We have reviewed the implementation of the Aho-Korasik algorithm in Julia, which makes it possible to efficiently find all occurrences of several substrings in a text in one pass. This implementation demonstrates the use of graph-type data structures (prefix tree) along with special pointers (fail-links) for fast state transitions.
The algorithm is especially useful for text filtering, dictionary compilation, network protocols, or any other tasks that require parallel analysis of multiple samples. The implemented code can be extended to a wider range of characters and used in large NLP projects.
The example was developed using materials from Rosetta Code