Engee documentation
Notebook

The Aho-Korasik algorithm

This example examines the implementation of the Aho-Korasik algorithm in the Julia programming language. The algorithm is designed to search for multiple samples in a text at the same time.

The theoretical part

Algorithm Ахо–Корасик is an efficient substring search algorithm designed to find all occurrences of a finite set of template strings (dictionary) in the input text. Designed by Alfred V. Aho and Margaret J. Founded in 1975, it is widely used in tasks that require simultaneous search for multiple patterns, such as text processing, bioinformatics, and network security.

Task description

A text string is given lengths and many of the templates , the total length of which is (the sum of the lengths of all the patterns). The Aho–Korasik algorithm finds all occurrences of any pattern in . Specifically, it returns all pairs. , such that the template matches the substring of the text starting from the position (i.e. The algorithm is especially effective when searching for multiple templates at the same time, as it avoids multiple text scans for each template individually.

The formal statement of the problem:

Input: text and lots of templates .

Output: all pairs , such that the template it occurs in starting from the position .

Applications

The Aho–Korasik algorithm is used in the following areas:

  • Full-text search systems — for searching multiple keywords in documents;

  • Intrusion detection systems — for matching signatures of malicious patterns in network traffic;

  • Bioinformatics — to search for multiple motifs in DNA sequences;

  • Data compression and natural language processing — for template-based analysis.

The main part

Data fields

In [ ]:
# Все допустимые символы английского алфавита от 'a' до 'z'
const VALID_CHARS = collect('a':'z')
Out[0]:
26-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 'f': ASCII/Unicode U+0066 (category Ll: Letter, lowercase)
 'g': ASCII/Unicode U+0067 (category Ll: Letter, lowercase)
 'h': ASCII/Unicode U+0068 (category Ll: Letter, lowercase)
 'i': ASCII/Unicode U+0069 (category Ll: Letter, lowercase)
 'j': ASCII/Unicode U+006A (category Ll: Letter, lowercase)
 'k': ASCII/Unicode U+006B (category Ll: Letter, lowercase)
 'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)
 'm': ASCII/Unicode U+006D (category Ll: Letter, lowercase)
 'n': ASCII/Unicode U+006E (category Ll: Letter, lowercase)
 'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)
 'p': ASCII/Unicode U+0070 (category Ll: Letter, lowercase)
 'q': ASCII/Unicode U+0071 (category Ll: Letter, lowercase)
 'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
 's': ASCII/Unicode U+0073 (category Ll: Letter, lowercase)
 't': ASCII/Unicode U+0074 (category Ll: Letter, lowercase)
 'u': ASCII/Unicode U+0075 (category Ll: Letter, lowercase)
 'v': ASCII/Unicode U+0076 (category Ll: Letter, lowercase)
 'w': ASCII/Unicode U+0077 (category Ll: Letter, lowercase)
 'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
 'y': ASCII/Unicode U+0079 (category Ll: Letter, lowercase)
 'z': ASCII/Unicode U+007A (category Ll: Letter, lowercase)

The structure of a finite state machine node

Determination of the structure of the Bohr node / Aho-Corasick finite state machine

In [ ]:
mutable struct Node
    son::Vector{Int}   # номера дочерних вершин по каждому символу ('a'-'z')
    ans::Int           # количество совпадений, оканчивающихся здесь
    fail::Int          # ссылка на «предыдущий» узел с таким же суффиксом 
    du::Int            # число зависимых узлов, которые должны быть обработаны после текущего 
    idx::Int           # уникальный индекс для обнаруженного слова
end

Default constructor: create a new node without child nodes and with initial values

In [ ]:
Node() = Node(zeros(Int, length(VALID_CHARS)), 0, 1, 0, 0)
Out[0]:
Node

The structure of the Aho-Korasik automaton

The definition of the Aho-Korasik automaton itself

In [ ]:
mutable struct ACAutomaton
    tr::Vector{Node}     # массив всех узлов дерева префиксного бора / автомата
    tot::Int             # общее количество использованных/существующих узлов
    final_ans::Vector{Int} # результат подсчета количества вхождений каждого шаблона
    pidx::Int            # счетчик уникальных индексов слов
end

Initializing an automaton with a maximum of max_nodes nodes and a root

In [ ]:
ACAutomaton(max_nodes) = ACAutomaton([Node() for _ in 1:max_nodes], 1, Int[], 0)
Out[0]:
ACAutomaton

Operations on the machine

Inserting a pattern into a slot machine

The function of adding a new template to the machine and assigning it a unique ID

In [ ]:
function insert(ac::ACAutomaton, pattern)
    u = 1                      # начальный узел -- корень
    for ch in pattern          # проход по каждому символу шаблона
        char_code = findfirst(==(ch), VALID_CHARS)
        isnothing(char_code) && continue
        if ac.tr[u].son[char_code] == 0         # если такого дочернего узла нет
            ac.tot += 1                          # выделяем новый узел
            ac.tr[u].son[char_code] = ac.tot
        end
        u = ac.tr[u].son[char_code]              # переходим к следующему узлу
    end

    # Присваиваем паттерну уникальный идентификатор, если ещё не назначен
    if ac.tr[u].idx == 0
        ac.pidx += 1
        ac.tr[u].idx = ac.pidx
    end
    
    # возвращаем уникальный индекс
    return ac.tr[u].idx
end
Out[0]:
insert (generic function with 1 method)

Building "suffix" links

The function of constructing suffix transitions (fail links) is necessary for the correct operation of the AC algorithm.

In [ ]:
function build(ac::ACAutomaton)
    q = [ac.tr[begin].son[i] for i in eachindex(VALID_CHARS) if ac.tr[begin].son[i] != 0]

    while !isempty(q)
        u = q[begin]
        q = q[begin+1:end]

        for i in eachindex(VALID_CHARS)
            son_node_idx = ac.tr[u].son[i]
            fail_node_idx = ac.tr[u].fail

            if son_node_idx != 0
                ac.tr[son_node_idx].fail = ac.tr[fail_node_idx].son[i]
                ac.tr[ac.tr[son_node_idx].fail].du += 1
                push!(q, son_node_idx)
            else
                ac.tr[u].son[i] = ac.tr[fail_node_idx].son[i] # делаем жадное продолжение
            end
        end
    end
end
Out[0]:
build (generic function with 1 method)

Search for words in the input text

The function of searching for all words in the text using a built-in automaton

In [ ]:
function query(ac::ACAutomaton, text)
    u = 1
    for ch in text             # идём по символам текста
        char_code = findfirst(==(ch), VALID_CHARS)
        isnothing(char_code) && continue
        u = ac.tr[u].son[char_code]
        ac.tr[u].ans += 1       # увеличиваем счётчик совпадений, заканчивающихся в этой вершине
    end
end
Out[0]:
query (generic function with 1 method)

Calculation of final results

After completing the main pass, we calculate the exact match values for each template, taking into account the possible dependence between nodes through fail links.

In [ ]:
function calculate_final_answers(ac::ACAutomaton)
    ac.final_ans = zeros(Int, ac.pidx + 1)
    
    # Начинаем с узлов, не имеющих зависимостей
    q = filter(i -> ac.tr[i].du == 0, 1:ac.tot)
    
    while !isempty(q)
        u = pop!(q)
        idx = ac.tr[u].idx
        if idx != 0
            ac.final_ans[idx] = ac.tr[u].ans
        end

        v = ac.tr[u].fail
        if v > 1
            ac.tr[v].ans += ac.tr[u].ans
            ac.tr[v].du -= 1
            ac.tr[v].du == 0 && push!(q, v)  # если теперь этот узел готов к обработке, добавляем в очередь
        end
    end
end
Out[0]:
calculate_final_answers (generic function with 1 method)

Implementation testing

A feature that automatically tests the entire system

In [ ]:
function test_aho_corasick(input, text)
    max_nodes, n = 200000 + 6, 5    # ограничение на максимальное количество узлов
    ac = ACAutomaton(max_nodes)      # создаем новый автомат
    ids = zeros(Int, n + 1)          # массив хранения идентификаторов каждого шаблона
    
    # Добавляем все шаблоны в автомат
    for i in 1:n
        pattern = input[i]
        ids[i+1] = insert(ac, pattern)
    end

    build(ac)                        # построили суффиксные ссылки
    query(ac, text)                  # выполнили поиск
    calculate_final_answers(ac)      # собрали ответы

    # Выводим результаты
    for i in 1:n
        uniqueid = ids[i]
        println("Number of instances of '$(input[i])': " => ac.final_ans[uniqueid+1])
    end
end
Out[0]:
test_aho_corasick (generic function with 1 method)

Let's start testing:

In [ ]:
# Несколько тестовых шаблонов
input = ["a", "bb", "aa", "abaa", "abaaa"]   
text = "abaaabaa"               # Текст для анализа

test_aho_corasick(input, text)
"Number of instances of 'a': " => 6
"Number of instances of 'bb': " => 0
"Number of instances of 'aa': " => 3
"Number of instances of 'abaa': " => 2
"Number of instances of 'abaaa': " => 1

Conclusion

We have reviewed the implementation of the Aho-Korasik algorithm in Julia, which makes it possible to efficiently find all occurrences of several substrings in a text in one pass. This implementation demonstrates the use of graph-type data structures (prefix tree) along with special pointers (fail-links) for fast state transitions.

The algorithm is especially useful for text filtering, dictionary compilation, network protocols, or any other tasks that require parallel analysis of multiple samples. The implemented code can be extended to a wider range of characters and used in large NLP projects.

The example was developed using materials from Rosetta Code