Engee documentation
Notebook

The Aho-Korasik algorithm

This example examines the implementation of the Aho-Korasik algorithm in the Julia programming language. The algorithm is designed to search for multiple samples in a text at the same time.

The theoretical part

Algorithm Ахо–Корасик is an efficient substring search algorithm designed to find all occurrences of a finite set of template strings (dictionary) in the input text. Designed by Alfred V. Aho and Margaret J. Founded in 1975, it is widely used in tasks that require simultaneous search for multiple patterns, such as text processing, bioinformatics, and network security.

Task description

A text string is given lengths and many of the templates , the total length of which is (the sum of the lengths of all the patterns). The Aho–Korasik algorithm finds all occurrences of any pattern in . Specifically, it returns all pairs. , such that the template matches the substring of the text starting from the position (i.e. The algorithm is especially effective when searching for multiple templates at the same time, as it avoids multiple text scans for each template individually.

The formal statement of the problem:

Input: text and lots of templates .

Output: all pairs , such that the template it occurs in starting from the position .

Applications

The Aho–Korasik algorithm is used in the following areas:

  • Full-text search systems — for searching multiple keywords in documents;

  • Intrusion detection systems — for matching signatures of malicious patterns in network traffic;

  • Bioinformatics — to search for multiple motifs in DNA sequences;

  • Data compression and natural language processing — for template-based analysis.

The main part

Data fields

In [ ]:
# All valid characters of the English alphabet from 'a' to 'z'
const VALID_CHARS = collect('a':'z')
Out[0]:
26-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 'f': ASCII/Unicode U+0066 (category Ll: Letter, lowercase)
 'g': ASCII/Unicode U+0067 (category Ll: Letter, lowercase)
 'h': ASCII/Unicode U+0068 (category Ll: Letter, lowercase)
 'i': ASCII/Unicode U+0069 (category Ll: Letter, lowercase)
 'j': ASCII/Unicode U+006A (category Ll: Letter, lowercase)
 'k': ASCII/Unicode U+006B (category Ll: Letter, lowercase)
 'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)
 'm': ASCII/Unicode U+006D (category Ll: Letter, lowercase)
 'n': ASCII/Unicode U+006E (category Ll: Letter, lowercase)
 'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)
 'p': ASCII/Unicode U+0070 (category Ll: Letter, lowercase)
 'q': ASCII/Unicode U+0071 (category Ll: Letter, lowercase)
 'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
 's': ASCII/Unicode U+0073 (category Ll: Letter, lowercase)
 't': ASCII/Unicode U+0074 (category Ll: Letter, lowercase)
 'u': ASCII/Unicode U+0075 (category Ll: Letter, lowercase)
 'v': ASCII/Unicode U+0076 (category Ll: Letter, lowercase)
 'w': ASCII/Unicode U+0077 (category Ll: Letter, lowercase)
 'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
 'y': ASCII/Unicode U+0079 (category Ll: Letter, lowercase)
 'z': ASCII/Unicode U+007A (category Ll: Letter, lowercase)

The structure of a finite state machine node

Determination of the structure of the Bohr node / Aho-Corasick finite state machine

In [ ]:
mutable struct Node
    son::Vector{Int}   # child vertex numbers for each character ('a'-'z')
    ans::Int           # the number of matches ending here
    fail::Int          # a link to the "previous" node with the same suffix
    du::Int            # the number of dependent nodes to be processed after the current one
    idx::Int           # a unique index for the detected word
end

Default constructor: create a new node without child nodes and with initial values

In [ ]:
Node() = Node(zeros(Int, length(VALID_CHARS)), 0, 1, 0, 0)
Out[0]:
Node

The structure of the Aho-Korasik automaton

The definition of the Aho-Korasik automaton itself

In [ ]:
mutable struct ACAutomaton
    tr::Vector{Node}     # an array of all nodes of the prefix boron/ automaton tree
    tot::Int             # total number of used/existing nodes
    final_ans::Vector{Int} # the result of counting the number of occurrences of each pattern
    pidx::Int            # a counter for unique word indexes
end

Initializing an automaton with a maximum of max_nodes nodes and a root

In [ ]:
ACAutomaton(max_nodes) = ACAutomaton([Node() for _ in 1:max_nodes], 1, Int[], 0)
Out[0]:
ACAutomaton

Operations on the machine

Inserting a pattern into an automaton

The function of adding a new template to the machine and assigning it a unique ID

In [ ]:
function insert(ac::ACAutomaton, pattern)
    u = 1                      # the initial node is the root
    for ch in pattern          # passing through each character of the template
        char_code = findfirst(==(ch), VALID_CHARS)
        isnothing(char_code) && continue
        if ac.tr[u].son[char_code] == 0         # if there is no such child node
            ac.tot += 1                          # we select a new node
            ac.tr[u].son[char_code] = ac.tot
        end
        u = ac.tr[u].son[char_code]              # moving on to the next node
    end

    # We assign a unique identifier to the pattern, if it has not yet been assigned.
    if ac.tr[u].idx == 0
        ac.pidx += 1
        ac.tr[u].idx = ac.pidx
    end
    
    # returning a unique index
    return ac.tr[u].idx
end
Out[0]:
insert (generic function with 1 method)

Building "suffix" links

The function of constructing suffix transitions (fail links) is necessary for the correct operation of the AC algorithm.

In [ ]:
function build(ac::ACAutomaton)
    q = [ac.tr[begin].son[i] for i in eachindex(VALID_CHARS) if ac.tr[begin].son[i] != 0]

    while !isempty(q)
        u = q[begin]
        q = q[begin+1:end]

        for i in eachindex(VALID_CHARS)
            son_node_idx = ac.tr[u].son[i]
            fail_node_idx = ac.tr[u].fail

            if son_node_idx != 0
                ac.tr[son_node_idx].fail = ac.tr[fail_node_idx].son[i]
                ac.tr[ac.tr[son_node_idx].fail].du += 1
                push!(q, son_node_idx)
            else
                ac.tr[u].son[i] = ac.tr[fail_node_idx].son[i] # Making a greedy sequel
            end
        end
    end
end
Out[0]:
build (generic function with 1 method)

Search for words in the input text

The function of searching for all words in the text using a built-in automaton

In [ ]:
function query(ac::ACAutomaton, text)
    u = 1
    for ch in text             # we follow the characters of the text
        char_code = findfirst(==(ch), VALID_CHARS)
        isnothing(char_code) && continue
        u = ac.tr[u].son[char_code]
        ac.tr[u].ans += 1       # increasing the count of matches ending at this vertex
    end
end
Out[0]:
query (generic function with 1 method)

Calculation of final results

After completing the main pass, we calculate the exact match values for each template, taking into account the possible dependence between nodes through fail links.

In [ ]:
function calculate_final_answers(ac::ACAutomaton)
    ac.final_ans = zeros(Int, ac.pidx + 1)
    
    # We start with nodes that have no dependencies.
    q = filter(i -> ac.tr[i].du == 0, 1:ac.tot)
    
    while !isempty(q)
        u = pop!(q)
        idx = ac.tr[u].idx
        if idx != 0
            ac.final_ans[idx] = ac.tr[u].ans
        end

        v = ac.tr[u].fail
        if v > 1
            ac.tr[v].ans += ac.tr[u].ans
            ac.tr[v].du -= 1
            ac.tr[v].du == 0 && push!(q, v)  # if this node is now ready for processing, we add it to the queue.
        end
    end
end
Out[0]:
calculate_final_answers (generic function with 1 method)

Implementation testing

A feature that automatically tests the entire system

In [ ]:
function test_aho_corasick(input, text)
    max_nodes, n = 200000 + 6, 5    # limit on the maximum number of nodes
    ac = ACAutomaton(max_nodes)      # creating a new slot machine
    ids = zeros(Int, n + 1)          # an array for storing the IDs of each template
    
    # Adding all the templates to the machine
    for i in 1:n
        pattern = input[i]
        ids[i+1] = insert(ac, pattern)
    end

    build(ac)                        # we have built suffix links
    query(ac, text)                  # completed the search
    calculate_final_answers(ac)      # We have collected the answers

    # We display the results
    for i in 1:n
        uniqueid = ids[i]
        println("Number of instances of '$(input[i])': " => ac.final_ans[uniqueid+1])
    end
end
Out[0]:
test_aho_corasick (generic function with 1 method)

Let's start testing:

In [ ]:
# Several test templates
input = ["a", "bb", "aa", "abaa", "abaaa"]   
text = "abaaabaa"               # Text for analysis

test_aho_corasick(input, text)
"Number of instances of 'a': " => 6
"Number of instances of 'bb': " => 0
"Number of instances of 'aa': " => 3
"Number of instances of 'abaa': " => 2
"Number of instances of 'abaaa': " => 1

Conclusion

We have reviewed the implementation of the Aho-Korasik algorithm in Julia, which makes it possible to efficiently find all occurrences of several substrings in a text in one pass. This implementation demonstrates the use of graph-type data structures (prefix tree) along with special pointers (fail-links) for fast state transitions.

The algorithm is especially useful for text filtering, dictionary compilation, network protocols, or any other tasks that require parallel analysis of multiple samples. The implemented code can be extended to a wider range of characters and used in large NLP projects.

The example was developed using materials from Rosetta Code