The Aho-Korasik algorithm
This example examines the implementation of the Aho-Korasik algorithm in the Julia programming language. The algorithm is designed to search for multiple samples in a text at the same time.
The theoretical part
Algorithm Ахо–Корасик is an efficient substring search algorithm designed to find all occurrences of a finite set of template strings (dictionary) in the input text. Designed by Alfred V. Aho and Margaret J. Founded in 1975, it is widely used in tasks that require simultaneous search for multiple patterns, such as text processing, bioinformatics, and network security.
Task description
A text string is given lengths and many of the templates , the total length of which is (the sum of the lengths of all the patterns). The Aho–Korasik algorithm finds all occurrences of any pattern in . Specifically, it returns all pairs. , such that the template matches the substring of the text starting from the position (i.e. The algorithm is especially effective when searching for multiple templates at the same time, as it avoids multiple text scans for each template individually.
The formal statement of the problem:
Input: text and lots of templates .
Output: all pairs , such that the template it occurs in starting from the position .
Applications
The Aho–Korasik algorithm is used in the following areas:
-
Full-text search systems — for searching multiple keywords in documents;
-
Intrusion detection systems — for matching signatures of malicious patterns in network traffic;
-
Bioinformatics — to search for multiple motifs in DNA sequences;
-
Data compression and natural language processing — for template-based analysis.
The main part
Data fields
# Все допустимые символы английского алфавита от 'a' до 'z'
const VALID_CHARS = collect('a':'z')
The structure of a finite state machine node
Determination of the structure of the Bohr node / Aho-Corasick finite state machine
mutable struct Node
son::Vector{Int} # номера дочерних вершин по каждому символу ('a'-'z')
ans::Int # количество совпадений, оканчивающихся здесь
fail::Int # ссылка на «предыдущий» узел с таким же суффиксом
du::Int # число зависимых узлов, которые должны быть обработаны после текущего
idx::Int # уникальный индекс для обнаруженного слова
end
Default constructor: create a new node without child nodes and with initial values
Node() = Node(zeros(Int, length(VALID_CHARS)), 0, 1, 0, 0)
The structure of the Aho-Korasik automaton
The definition of the Aho-Korasik automaton itself
mutable struct ACAutomaton
tr::Vector{Node} # массив всех узлов дерева префиксного бора / автомата
tot::Int # общее количество использованных/существующих узлов
final_ans::Vector{Int} # результат подсчета количества вхождений каждого шаблона
pidx::Int # счетчик уникальных индексов слов
end
Initializing an automaton with a maximum of max_nodes nodes and a root
ACAutomaton(max_nodes) = ACAutomaton([Node() for _ in 1:max_nodes], 1, Int[], 0)
Operations on the machine
Inserting a pattern into a slot machine
The function of adding a new template to the machine and assigning it a unique ID
function insert(ac::ACAutomaton, pattern)
u = 1 # начальный узел -- корень
for ch in pattern # проход по каждому символу шаблона
char_code = findfirst(==(ch), VALID_CHARS)
isnothing(char_code) && continue
if ac.tr[u].son[char_code] == 0 # если такого дочернего узла нет
ac.tot += 1 # выделяем новый узел
ac.tr[u].son[char_code] = ac.tot
end
u = ac.tr[u].son[char_code] # переходим к следующему узлу
end
# Присваиваем паттерну уникальный идентификатор, если ещё не назначен
if ac.tr[u].idx == 0
ac.pidx += 1
ac.tr[u].idx = ac.pidx
end
# возвращаем уникальный индекс
return ac.tr[u].idx
end
Building "suffix" links
The function of constructing suffix transitions (fail links) is necessary for the correct operation of the AC algorithm.
function build(ac::ACAutomaton)
q = [ac.tr[begin].son[i] for i in eachindex(VALID_CHARS) if ac.tr[begin].son[i] != 0]
while !isempty(q)
u = q[begin]
q = q[begin+1:end]
for i in eachindex(VALID_CHARS)
son_node_idx = ac.tr[u].son[i]
fail_node_idx = ac.tr[u].fail
if son_node_idx != 0
ac.tr[son_node_idx].fail = ac.tr[fail_node_idx].son[i]
ac.tr[ac.tr[son_node_idx].fail].du += 1
push!(q, son_node_idx)
else
ac.tr[u].son[i] = ac.tr[fail_node_idx].son[i] # делаем жадное продолжение
end
end
end
end
Search for words in the input text
The function of searching for all words in the text using a built-in automaton
function query(ac::ACAutomaton, text)
u = 1
for ch in text # идём по символам текста
char_code = findfirst(==(ch), VALID_CHARS)
isnothing(char_code) && continue
u = ac.tr[u].son[char_code]
ac.tr[u].ans += 1 # увеличиваем счётчик совпадений, заканчивающихся в этой вершине
end
end
Calculation of final results
After completing the main pass, we calculate the exact match values for each template, taking into account the possible dependence between nodes through fail links.
function calculate_final_answers(ac::ACAutomaton)
ac.final_ans = zeros(Int, ac.pidx + 1)
# Начинаем с узлов, не имеющих зависимостей
q = filter(i -> ac.tr[i].du == 0, 1:ac.tot)
while !isempty(q)
u = pop!(q)
idx = ac.tr[u].idx
if idx != 0
ac.final_ans[idx] = ac.tr[u].ans
end
v = ac.tr[u].fail
if v > 1
ac.tr[v].ans += ac.tr[u].ans
ac.tr[v].du -= 1
ac.tr[v].du == 0 && push!(q, v) # если теперь этот узел готов к обработке, добавляем в очередь
end
end
end
Implementation testing
A feature that automatically tests the entire system
function test_aho_corasick(input, text)
max_nodes, n = 200000 + 6, 5 # ограничение на максимальное количество узлов
ac = ACAutomaton(max_nodes) # создаем новый автомат
ids = zeros(Int, n + 1) # массив хранения идентификаторов каждого шаблона
# Добавляем все шаблоны в автомат
for i in 1:n
pattern = input[i]
ids[i+1] = insert(ac, pattern)
end
build(ac) # построили суффиксные ссылки
query(ac, text) # выполнили поиск
calculate_final_answers(ac) # собрали ответы
# Выводим результаты
for i in 1:n
uniqueid = ids[i]
println("Number of instances of '$(input[i])': " => ac.final_ans[uniqueid+1])
end
end
Let's start testing:
# Несколько тестовых шаблонов
input = ["a", "bb", "aa", "abaa", "abaaa"]
text = "abaaabaa" # Текст для анализа
test_aho_corasick(input, text)
Conclusion
We have reviewed the implementation of the Aho-Korasik algorithm in Julia, which makes it possible to efficiently find all occurrences of several substrings in a text in one pass. This implementation demonstrates the use of graph-type data structures (prefix tree) along with special pointers (fail-links) for fast state transitions.
The algorithm is especially useful for text filtering, dictionary compilation, network protocols, or any other tasks that require parallel analysis of multiple samples. The implemented code can be extended to a wider range of characters and used in large NLP projects.
The example was developed using materials from Rosetta Code