Engee documentation

Regular expressions in Julia

Introduction

Regular expressions (regex) are a versatile tool for searching, extracting, and processing text according to specified patterns. They let you solve tasks such as validating an email format, extracting phone numbers, or analyzing data in text. In Julia, regular expressions are particularly convenient thanks to their tight integration with the language and concise syntax. Julia does not require additional modules to work with regex (unlike some other languages) and uses the r"..." prefix to create patterns, which keeps the code readable. This material shows how to use regular expressions in Julia, with an emphasis on practical examples and language features.


Regular Expression syntax in Julia

Julia uses the [PCRE](https://ru.wikipedia.org/wiki/PCRE) syntax, which supports a rich set of features. Let's look at the main elements of regular expressions and their use in Julia.

Basic syntax elements

  • Literals: ordinary characters (for example, a, b, 1) are matched as is. For example, r"cat" matches the word cat.
  • Metacharacters:
    • . - any character other than a newline. For example, r"c.t" matches cat and cot, but not c\nt.
    • ^ - the beginning of the string (or of every line with the m flag): r"^hello" finds hello only at the beginning.
    • $ - the end of the string (or of every line with the m flag): r"world$" finds world only at the end.
    • \ - escaping: r"\." matches a literal dot rather than any character, and r"\^" matches the literal ^ character.
  • Quantifiers:
    • * - zero or more: r"a*" matches "", a, aa, and so on.
    • + - one or more: r"a+" matches a and aa, but not "".
    • ? - zero or one: r"colou?r" matches color and colour.
    • {n,m} - a range of repetitions: r"a{2,4}" matches aa, aaa, and aaaa.
  • Character classes:
    • [abc] - one of the listed characters: r"[abc]" matches a, b, or c.
    • [a-z] - a range: r"[a-z]+" matches any run of lowercase Latin letters.
    • [^abc] - negation: r"[^abc]" matches any character other than a, b, and c.
    • \d - a digit: r"\d+" matches numbers such as 123.
    • \w - a letter, a digit, or _: r"\w+" matches words such as hello123.
    • \s - a whitespace character: r"\s+" matches runs of spaces, tabs, or newlines.
  • Grouping:
    • () - captures part of the pattern: r"(\d+)-(\d+)" applied to the string 12-34 captures and saves the numbers 12 and 34.
    • (?:...) - groups, but does not save the result. A non-capturing group is useful for simplifying the structure of an expression.
regex = r"(?:abc)+(\d+)(?:def)+(\&+)"
text = "abcabc123defdefdef&&&"
match(regex, text)[1] # returns "123"
match(regex, text)[2] # returns "&&&"
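
The syntax elements listed above can be checked quickly with occursin. The following is a minimal sketch; the sample strings and the name checks are illustrative assumptions, not part of the original notebook.

```julia
# Each tuple is (pattern, sample text, expected result of occursin).
# The samples are made up to exercise one syntax element each.
checks = [
    (r"c.t",      "cat",         true),   # . matches any character...
    (r"c.t",      "c\nt",        false),  # ...except a newline
    (r"^hello",   "hello world", true),   # ^ anchors at the start
    (r"world$",   "hello world", true),   # $ anchors at the end
    (r"\.",       "abc",         false),  # escaped dot is a literal dot
    (r"colou?r",  "color",       true),   # ? makes the u optional
    (r"^a{2,4}$", "aaaaa",       false),  # five a's exceed the {2,4} range
    (r"[^abc]",   "abc",         false),  # no character outside a, b, c
    (r"^\w+$",    "hello123",    true),   # letters, digits, underscore
]

for (pattern, text, expected) in checks
    result = occursin(pattern, text)
    println(rpad(repr(pattern), 14), rpad(repr(text), 16), result)
end
```

Anchoring a pattern with ^ and $, as in r"^a{2,4}$", is what turns a "contains" check into a "whole string" check.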

Features in Julia

Julia offers convenient functions for working with regular expressions:

  • match(r"pattern", string): finds the first match. Returns a RegexMatch object, or nothing if there is no match.

A convenient feature of Julia is that r"..." literals need no double escaping (for example, r"\d" instead of "\\d"), which makes patterns easier to write.

In [ ]:
m = match(r"\d+", "Age: 42")  # `\d` matches a digit, `+` means one or more occurrences
println(m.match)
42
  • eachmatch(r"pattern", string): an iterator over all matches.
In [ ]:
for m in eachmatch(r"\w+", "Здравствуй, дорогой друг!")  # `\w` is a letter, digit, or _ (`,` and `!` do not match)
    println(m.match)
end
Здравствуй
дорогой
друг
  • replace(string, r"pattern" => "replacement"): replaces matches.
In [ ]:
new = replace("Date format: 01-02-2025", r"\d" => "X")
println(new)
Date format: XX-XX-XXXX
  • occursin(r"pattern", string): checks whether a match exists.
In [ ]:
@show occursin(r"[A-Z]", "Hello")
@show occursin(r"[A-Z]", "hello")
@show occursin(r"[A-Z]+", "HELLO");
occursin(r"[A-Z]", "Hello") = true
occursin(r"[A-Z]", "hello") = false
occursin(r"[A-Z]+", "HELLO") = true
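
A RegexMatch object carries more than the matched text. The sketch below, using made-up sample strings, shows the match, captures, and offset fields, the check against nothing, replacement with capture-group references through an s"..." substitution string, and the equivalence between r"\d+" and a double-escaped ordinary string.

```julia
# Sample strings here are illustrative assumptions.
m = match(r"(\d+)-(\d+)", "Order 12-34 shipped")
if m !== nothing
    println(m.match)     # the whole matched text
    println(m.captures)  # the captured groups, in order
    println(m.offset)    # index at which the match starts
end

# match returns nothing when the pattern is not found
no_match = match(r"\d+", "no digits here")
println(no_match === nothing)

# s"..." lets the replacement refer to captured groups as \1, \2, \3
iso = replace("01-02-2025", r"(\d+)-(\d+)-(\d+)" => s"\3-\2-\1")
println(iso)

# The r"..." literal avoids the double escaping an ordinary string needs
same_pattern = Regex("\\d+").pattern == r"\d+".pattern
println(same_pattern)
```

Checking the result against nothing before accessing its fields avoids an error when the text does not contain the pattern.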

Practical applications

Example 1: First and last name extraction

Suppose we have a photograph of a conservatory's visitor log.
We digitized this document and ran text recognition on it, obtaining the "document" OCR_text.
It turned out that some letters became lowercase, and extra spaces appeared in some places and disappeared in others. In one case, a stray pen stroke was recognized as a letter.

In [ ]:
# Text with the data
OCR_text = """
Журнал посетителей:
Фамилия: иванов имя: Иван
фамилия : Петров имя : пётр     l
Фамилия - Римский-Корсаков Имя  -Николай    
"""
Out[0]:
"Журнал посетителей:\nФамилия: иванов имя: Иван\nфамилия : Петров имя : пётр     l\nФамилия - Римский-Корсаков Имя  -Николай    \n"

The flags after the closing quote change how the pattern is interpreted:

  • i in r"..."i makes the match case-insensitive, so "фамилия" and "Фамилия" are treated as equivalent.
  • m in r"..."m enables multiline mode: ^ matches at the beginning of every line (after each \n), not only at the beginning of the whole OCR_text string.
  • x in r"..."x enables extended mode: whitespace in the pattern is ignored, and comments can be added with #.

We will discuss the meaning of parentheses below.
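
The effect of each flag can be seen on a small example. This is a sketch with made-up sample text; the variable names are illustrative.

```julia
text = "first line\nSecond line"

no_flags = occursin(r"^second", text)    # ^ anchors only at the start of the string
only_m   = occursin(r"^second"m, text)   # ^ now matches after \n, but the case differs
both_i_m = occursin(r"^second"im, text)  # multiline and case-insensitive together

# x flag: whitespace in the pattern is ignored and # starts a comment
date = r"""
    (\d{2})   # day
    -         # separator
    (\d{2})   # month
"""x
parts = match(date, "31-12").captures

println((no_flags, only_m, both_i_m))
println(parts)
```

Flags are combined simply by writing them one after another, as in r"..."im.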

In [ ]:
regex_fullname = r"
        ^Фамилия\s*   # `Фамилия` at the start of a line, then 0 or more whitespace characters
        [:-]\s*       # then a single `:` or `-` followed by 0 or more whitespace characters
        ([\p{L}-]+)   # last name: `\p{L}` is any Unicode letter, `-` is a hyphen
        \s*           # 0 or more whitespace characters after the last name
        Имя\s*[:-]\s* # same as for the last name
        (\p{L}+)      # first name: any sequence of letters (including Cyrillic)"imx;

To extract the useful information from our document OCR_text using the regular expression regex_fullname, we will use eachmatch.

Note that we have 3 people, and each person has 2 attributes: a last name and a first name.

eachmatch returns an iterator of RegexMatch objects, where each object represents one match of the pattern in the text.

Our pattern captures the last name and the first name. The last name comes first in the expression, so it is available as m.captures[1]; the first name is the second group, m.captures[2].

The comprehension below builds an array of (last name, first name) tuples for the visitors.

In [ ]:
fullnames = [(m.captures[1], m.captures[2]) for m in eachmatch(regex_fullname, OCR_text)]
Out[0]:
3-element Vector{Tuple{SubString{String}, SubString{String}}}:
 ("иванов", "Иван")
 ("Петров", "пётр")
 ("Римский-Корсаков", "Николай")
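
As an alternative to numeric indices, PCRE named groups can make the extraction easier to read. The sketch below is an assumption-labeled variant of the pattern above: the group names surname and name and the single sample line are chosen here for illustration.

```julia
# One line of the log, copied from OCR_text for a self-contained example
line = "Фамилия: иванов имя: Иван"

# (?<surname>...) and (?<name>...) are named capturing groups (names are illustrative)
named = r"^Фамилия\s*[:-]\s*(?<surname>[\p{L}-]+)\s*Имя\s*[:-]\s*(?<name>\p{L}+)"i

m = match(named, line)
println(m[:surname], " ", m[:name])  # access by group name
println(m[1] == m[:surname])         # numeric indices still work
```

Named groups keep working if another group is later inserted earlier in the pattern, whereas numeric indices would shift.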

We will display the first and last names in title case:

titlecase("abc"); # Abc
titlecase("aBC"); # Abc
In [ ]:
for (surname, name) in fullnames
    println("Hello, $(titlecase(surname)) $(titlecase(name))!")
end
Hello, Иванов Иван!
Hello, Петров Пётр!
Hello, Римский-Корсаков Николай!

Example 2: Extracting phone numbers

Let's say we need to find a number in the format +7-XXX-XXX-XX-XX or 8-XXX-XXX-XX-XX:

Explanation:

\d{3} means exactly 3 digits,

\+ escapes the plus sign as a literal.

| means "or".

(?:...) is a non-capturing group, that is, a subpart of the expression that we want to treat as a unit (here, the alternatives +7 or 8).

We are NOT interested in whether the phone number is written with +7 or with 8, which is why the group is non-capturing.

In [ ]:
text1 = "Russian numbers look like +7-912-345-67-89 or 8-987-654-32-10, but not +1-234-567-89-10"

russian_phone_regex = r"(?:\+7|8)-\d{3}-\d{3}-\d{2}-\d{2}"

for m in eachmatch(russian_phone_regex, text1)
    println("Russian number found: ", m.match)
end
Russian number found: +7-912-345-67-89
Russian number found: 8-987-654-32-10
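
The difference between a capturing and a non-capturing group shows up in the captures field. A short sketch, using a shortened version of the pattern above on a single sample number:

```julia
sample = "+7-912-345-67-89"

# Non-capturing group: the prefix is matched but not saved
non_capturing = match(r"(?:\+7|8)-\d{3}", sample)
# Capturing group: the prefix is matched and saved as group 1
capturing = match(r"(\+7|8)-\d{3}", sample)

println(non_capturing.match, " ", length(non_capturing.captures))
println(capturing.match, " ", capturing.captures)
```

Both patterns match the same text; only the saved groups differ.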

Example 3: Checking email addresses

Let's check the correctness of the email:

In [ ]:
email = "test_User-name.123@pochta.ru"

email_regex = r"^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-z]{2,}$"

if match(email_regex, email) !== nothing
    println("Email is valid")
else
    println("Invalid email")
end
Email is valid

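Explanation: ^[a-zA-Z0-9._-]+ requires a user name consisting of letters, digits, and the characters ., _, and -; [a-zA-Z0-9.-]+ matches the domain name; and \.[a-z]{2,}$ requires a top-level domain of at least 2 lowercase letters. This is a simplified check and does not cover every valid address.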
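
To see which parts of the pattern reject an address, it can be run against a few deliberately broken candidates. The candidate list below is made up for illustration; occursin is used here as a shorter equivalent of comparing match(...) with nothing.

```julia
email_regex = r"^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-z]{2,}$"

# Hypothetical test addresses
candidates = [
    "test_User-name.123@pochta.ru",  # well-formed
    "no-at-sign.ru",                 # missing @
    "user@domain",                   # missing top-level domain
    "user@site.c",                   # top-level domain shorter than 2 letters
]

results = [occursin(email_regex, c) for c in candidates]
for (c, ok) in zip(candidates, results)
    println(rpad(c, 32), ok ? "valid" : "invalid")
end
```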

Example 4: Extracting literature references

Extract references of the form [1] or [1, 2]:

In [ ]:
text2 = "The text refers to [1], [2, 3] and [4], and contains mathematical expressions: 1 + (2{3 - x[y-z]})."
ref_regex = r"\[\d+(?:,\s*\d+)*\]"
matches = [m.match for m in eachmatch(ref_regex, text2)]
println("References: ", join(matches, ", "))
References: [1], [2, 3], [4]

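Explanation: (?:,\s*\d+)* is a non-capturing group that matches any further numbers separated by commas. The expression x[y-z] is skipped because its brackets do not contain only digits.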
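
If the individual reference numbers are needed rather than the bracketed strings, a second pass with r"\d+" over each match is one possible approach. This is a sketch on a made-up sentence; the name numbers is illustrative.

```julia
text = "See [1], [2, 3] and [4]; x[y-z] is not a reference."
ref_regex = r"\[\d+(?:,\s*\d+)*\]"

# Outer loop: each bracketed reference; inner loop: each number inside it
numbers = [parse(Int, d.match)
           for m in eachmatch(ref_regex, text)
           for d in eachmatch(r"\d+", m.match)]
println(numbers)
```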

Example 5: Extracting Shakespeare's sonnets

Suppose we are given [William Shakespeare's sonnets](https://engee.com/community/ru/catalogs/projects/analiz-tekstovykh-dannykh-s-pomoshchiu-massivov-strok), numbered with Roman numerals. Let's create an array of these sonnets, in their original order, so that they can be easily accessed by index.

In [ ]:
sonnets_text = read("sonnets.txt",String);
print(sonnets_text[1:1000])
THE SONNETS

by William Shakespeare




  I

  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But as the riper should by time decease,
  His tender heir might bear his memory:
  But thou, contracted to thine own bright eyes,
  Feed'st thy light's flame with self-substantial fuel,
  Making a famine where abundance lies,
  Thy self thy foe, to thy sweet self too cruel:
  Thou that art now the world's fresh ornament,
  And only herald to the gaudy spring,
  Within thine own bud buriest thy content,
  And tender churl mak'st waste in niggarding:
    Pity the world, or else this glutton be,
    To eat the world's due, by the grave and thee.

  II

  When forty winters shall besiege thy brow,
  And dig deep trenches in thy beauty's field,
  Thy youth's proud livery so gazed on now,
  Will be a tatter'd weed of small worth held:
  Then being asked, where all thy beauty lies,
  Where all the treasure of thy lusty days;
  To say, within thine own deep

There are line breaks inside the sonnets, so the dot cannot be used to mean "any character" (see the description of . under "Basic syntax elements"). To get around this, we use:

\s means a whitespace character.

\S means any NON-whitespace character.

Therefore [\s\S] means any character, including a newline.
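
The behaviour of the dot across a line break can be checked directly. The sketch below uses a made-up two-line string and also shows the s flag, which makes the dot match newlines as an alternative to [\s\S].

```julia
two_lines = "start\nend"

dot_default = occursin(r"start.end", two_lines)       # . does not match \n
any_char    = occursin(r"start[\s\S]end", two_lines)  # [\s\S] matches any character
dot_s_flag  = occursin(r"start.end"s, two_lines)      # with the s flag, . matches \n too

println((dot_default, any_char, dot_s_flag))
```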

In [ ]:
function split_sonnets(text)
    pattern = r"""
        ^                    # Start of a line (with the m flag, of every line)
        \s*                  # Zero or more whitespace characters before the Roman numeral
        [IVXLCDM]+           # One or more Roman numeral characters (I, V, X, L, C, D, M)
        \s*                  # Zero or more whitespace characters after the numeral
        $                    # End of the line (the line contains only the numeral)
        \s*                  # Whitespace or blank lines after the numeral
#___________________________________________________________________________________________________
        (                    # Start of the capturing group for the sonnet text
            [\s\S]*?         # Any character (including \n), lazily (up to the nearest stopping point)
        )                    # End of the capturing group
#___________________________________________________________________________________________________
        (?=                  # Positive lookahead (stopping condition)
            ^                # Start of the next line
            \s*              # Whitespace before the next numeral
            [IVXLCDM]+       # The next Roman numeral
            \s*              # Whitespace after it
            $                # End of the numeral line
            |                # Or
            \z               # Absolute end of the text (for the last sonnet)
        )                    # End of the lookahead
    """mx                    # Flags: m (multiline mode), x (extended mode)
    sonnets = [strip(m.captures[1]) for m in eachmatch(pattern, text)]
    return sonnets
end
Out[0]:
split_sonnets (generic function with 1 method)
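
The combination of a lazy quantifier and a lookahead can be tried on a tiny text without the sonnets file. The sketch below uses a compact single-line form of the same pattern; the sample string mini and the name poems are made up for illustration.

```julia
# A made-up miniature text with the same layout: numeral line, blank line, body
mini = "  I\n\n  first poem, line one\n  first poem, line two\n\n  II\n\n  second poem\n"

# Compact form of the pattern: numeral line, lazily captured body,
# then a lookahead for the next numeral line or the end of the text
pattern = r"^\s*[IVXLCDM]+\s*$\s*([\s\S]*?)(?=^\s*[IVXLCDM]+\s*$|\z)"m

poems = [strip(m.captures[1]) for m in eachmatch(pattern, mini)]
for (i, poem) in enumerate(poems)
    println(i, ": ", repr(poem))
end
```

Because the lookahead does not consume text, the next match can start at the following numeral line.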
In [ ]:
sonnets = split_sonnets(sonnets_text)
Out[0]:
154-element Vector{SubString{String}}:
 "From fairest creatures we desir" ⋯ 579 bytes ⋯ "'s due, by the grave and thee."
 "When forty winters shall besieg" ⋯ 597 bytes ⋯ "arm when thou feel'st it cold."
 "Look in thy glass and tell the " ⋯ 576 bytes ⋯ "nd thine image dies with thee."
 "Unthrifty loveliness, why dost " ⋯ 558 bytes ⋯ "sed, lives th' executor to be."
 "Those hours, that with gentle w" ⋯ 593 bytes ⋯ "r substance still lives sweet."
 "Then let not winter's ragged ha" ⋯ 580 bytes ⋯ "est and make worms thine heir."
 "Lo! in the orient when the grac" ⋯ 552 bytes ⋯ "n diest unless thou get a son."
 "Music to hear, why hear'st thou" ⋯ 612 bytes ⋯ "'Thou single wilt prove none.'"
 "Is it for fear to wet a widow's" ⋯ 585 bytes ⋯ " such murd'rous shame commits."
 "For shame! deny that thou bear'" ⋯ 591 bytes ⋯ "ill may live in thine or thee."
 "As fast as thou shalt wane, so " ⋯ 645 bytes ⋯ "t more, not let that copy die."
 "When I do count the clock that " ⋯ 594 bytes ⋯ " him when he takes thee hence."
 "O! that you were your self; but" ⋯ 574 bytes ⋯ "a father: let your son say so."
 ⋮
 "Lo, as a careful housewife runs" ⋯ 597 bytes ⋯ "back and my loud crying still."
 "Two loves I have of comfort and" ⋯ 542 bytes ⋯ "ad angel fire my good one out."
 "Those lips that Love's own hand" ⋯ 475 bytes ⋯ "v'd my life, saying 'not you'."
 "Poor soul, the centre of my sin" ⋯ 576 bytes ⋯ "d, there's no more dying then."
 "My love is as a fever longing s" ⋯ 563 bytes ⋯ "ack as hell, as dark as night."
 "O me! what eyes hath Love put i" ⋯ 594 bytes ⋯ "g thy foul faults should find."
 "Canst thou, O cruel! say I love" ⋯ 548 bytes ⋯ "e thou lov'st, and I am blind."
 "O! from what power hast thou th" ⋯ 576 bytes ⋯ "orthy I to be belov'd of thee."
 "Love is too young to know what " ⋯ 576 bytes ⋯ "ose dear love I rise and fall."
 "In loving thee thou know'st I a" ⋯ 602 bytes ⋯ "ainst the truth so foul a lie!"
 "Cupid laid by his brand and fel" ⋯ 578 bytes ⋯ "t new fire; my mistress' eyes."
 "The little Love-god lying once " ⋯ 563 bytes ⋯ "s water, water cools not love."

Let us now print only the first line of each of the first five sonnets.

To do this, we use split to divide each sonnet into 2 parts:

  • Part 1: the first line
  • Part 2: all subsequent lines except the first one
In [ ]:
s = """line 1
       line 2
       line 3
       line 4"""
split(s,'\n',limit=2)
Out[0]:
2-element Vector{SubString{String}}:
 "line 1"
 "line 2\nline 3\nline 4"
In [ ]:
for (i, sonnet) in enumerate(sonnets[1:5])
    println("""Sonnet $i: $(split(sonnet,'\n',limit=2)[1])\n...""")
end
Sonnet 1: From fairest creatures we desire increase,
...
Sonnet 2: When forty winters shall besiege thy brow,
...
Sonnet 3: Look in thy glass and tell the face thou viewest
...
Sonnet 4: Unthrifty loveliness, why dost thou spend
...
Sonnet 5: Those hours, that with gentle work did frame
...
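
Since the dot does not match a newline, the first line can also be taken with a regular expression instead of split. A small sketch on a made-up string:

```julia
poem = "line one\nline two\nline three"

first_by_split = split(poem, '\n', limit=2)[1]
first_by_regex = match(r"^.*", poem).match  # .* stops at the first newline

println(first_by_split == first_by_regex)
```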

Let's measure the speed of our function.

In [ ]:
import Pkg
Pkg.add("BenchmarkTools")
In [ ]:
using BenchmarkTools
@btime split_sonnets(sonnets_text);
  921.698 μs (621 allocations: 39.19 KiB)

About 0.92 milliseconds (the 921.698 μs reported above) is a pretty good result for a file of about 2.5 thousand lines. However, regular expressions may be slower than a classical approach. In our case, the problem can also be solved quite explicitly, line by line. Note that the function below takes a file name and reads the file itself, so its timing includes file reading. (You do not need to study the code in detail; just compare its execution time.)

In [ ]:
function split_sonnets_fast(path)
    sonnets = String[]
    current_sonnet = String[]
    in_sonnet = false
    
    # eachline called with a string opens the file at that path
    for line in eachline(path)
        stripped = strip(line)
        isempty(stripped) && continue  # skip empty and whitespace-only lines
        # If every (all) character of the line belongs to IVXLCDM, it is a Roman numeral
        if all(c -> c in "IVXLCDM", stripped)
            if in_sonnet && !isempty(current_sonnet)
                push!(sonnets, join(current_sonnet, '\n'))
            end
            current_sonnet = String[]
            in_sonnet = true
        elseif in_sonnet
            push!(current_sonnet, line)
        end
    end
    
    # Save the last sonnet, which is not followed by a numeral
    if in_sonnet && !isempty(current_sonnet)
        push!(sonnets, join(current_sonnet, '\n'))
    end
    
    return sonnets
end


# Check that it works
sonnets = split_sonnets_fast("sonnets.txt")

@btime split_sonnets_fast("sonnets.txt");
  752.183 μs (3870 allocations: 392.40 KiB)
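
One detail of the line-based version is worth spelling out: eachline called with a string treats it as a file path. To iterate over the lines of text that is already in memory, the text can be wrapped in an IOBuffer. The sketch below compares both on a made-up three-line text written to a temporary file.

```julia
text = "alpha\nbeta\ngamma"

# Lines of an in-memory string: wrap it in an IOBuffer
lines_from_memory = collect(eachline(IOBuffer(text)))

# Lines of a file: pass the path itself (a temporary file is used here)
path, io = mktemp()
write(io, text)
close(io)
lines_from_file = collect(eachline(path))
rm(path)

println(lines_from_memory == lines_from_file)
```

With IOBuffer, a line-based splitter could be benchmarked on the same in-memory string as the regex version, which would make the two timings directly comparable.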

Conclusion

Regular expressions in Julia are a powerful and convenient tool for working with text. Thanks to the simple syntax (r"..."), built-in functions such as match and replace, and the high performance of the language, they are well suited to data processing and analysis and to search-and-replace tasks. However, it is important to understand that regular expressions can be slow, and are a poor fit, for parsing complex (for example, nested) structures such as JSON or HTML files.

Although regular expression syntax is complex, the extended mode (the x flag) lets you annotate patterns with comments. Regular expressions are also a more versatile tool for working with text than Julia's built-in functions for characters and strings.