Regular expressions in Julia

Introduction

Regular expressions (regex) are a versatile tool for searching, extracting and processing text according to patterns. They help with tasks such as checking email formats, extracting phone numbers, or analysing textual data. In Julia, regular expressions are built into the language: no additional modules are needed (unlike in some other languages), and patterns are written with the r"..." prefix, which keeps the code concise and readable. This material shows how to use regular expressions in Julia, with an emphasis on practical examples and language features.


Regular expression syntax in Julia

Julia uses the PCRE syntax, which supports a rich set of features. Let's look at the basic elements of regular expressions and their use in Julia.

Basic elements of the syntax

  • Literal: Regular characters (e.g., a, b, 1) are searched for in the text as is. For example, r"cat" matches the word cat.
  • Metacharacters:
    • . is any character other than a newline. For example, r"c.t" matches cat, cot, but not c\nt.
    • ^ - beginning of a string: r"^hello" will only find hello at the beginning of a string.
    • $ - end of a string: r"world$" will find world only at the end of a string.
    • \ - escape: r"\." looks for a dot as a character, not any character. r"\^" looks specifically for the character ^.
  • Quantifiers:
    • * - zero or more: r"a*" matches "", a, aa, etc.
    • + - one or more: r"a+" matches a, aa, but not "".
    • ? - zero or one: r"colou?r" matches color and colour.
    • {n,m} - range of repetition: r"a{2,4}" matches aa, aaa, aaaa.
  • Character classes:
    • [abc] - one of the characters: r"[abc]" matches a, b or c.
    • [a-z] - range: r"[a-z]+" will find any lowercase word.
    • [^abc] - negation: r"[^abc]" any character except a, b, c.
    • \d - digit: r"\d+" will find numbers like 123.
    • \w - letter, digit or _: r"\w+" will find words like hello123.
    • \s - whitespace character: r"\s+" will find spaces or tabs.
  • Grouping:
    • () - captures part of the pattern: r"(\d+)-(\d+)" matches the string 12-34 and saves the numbers 12 and 34 as captures.
    • (?:...) - groups without saving the result. A non-capturing group is useful for applying a quantifier or an alternation to a subpattern without adding an extra capture.
regex = r"(?:abc)+(\d+)(?:def)+(\&+)"
text = "abcabc123defdefdef&&&&&"
match(regex, text)[1] # returns "123"
match(regex, text)[2] # returns "&&&&&"
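
The elements listed above can be tried out directly with occursin, which returns true when the pattern matches anywhere in the string. The sketch below is only an illustration: the sample strings and the names checks and results are chosen for this example and are not part of the original material.

```julia
# Each entry pairs a pattern with a sample string (samples are illustrative).
checks = [
    (r"c.t",     "cot"),          # . matches any character except a newline
    (r"c.t",     "c\nt"),         # ...so a newline between c and t does not match
    (r"^hello",  "say hello"),    # ^ anchors the match to the start of the string
    (r"world$",  "hello world"),  # $ anchors the match to the end of the string
    (r"\.",      "314"),          # \. is a literal dot, and there is none here
    (r"colou?r", "color"),        # ? makes the preceding u optional
    (r"a{2,4}",  "a"),            # needs at least two consecutive a's
    (r"[^abc]",  "abcd"),         # d is a character other than a, b, c
    (r"\d+",     "no digits"),    # there are no digits to find
]

results = [occursin(re, s) for (re, s) in checks]

# Print pattern, sample string and whether the pattern was found.
for ((re, s), found) in zip(checks, results)
    println(rpad(re.pattern, 10), rpad(repr(s), 16), found)
end
```

Changing a sample string and re-running the loop is a quick way to check your reading of a pattern before using it on real data.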

Functions in Julia

Julia offers convenient methods for working with regular expressions:

  • match(r"pattern", string): Finds the first match. Returns a RegexMatch object, or nothing if there is no match.

A convenient feature of Julia is that the r"..." literal needs no double escaping (you write r"\d" rather than "\\d"), which makes patterns easier to write.

In [ ]:
m = match(r"\d+", "Age: 42")  # `\d` matches a digit, `+` means one or more occurrences
println(m.match)
42
  • eachmatch(r"pattern", string): An iterator over all matches.
In [ ]:
for m in eachmatch(r"\w+", "Здравствуй, дорогой друг!")  # `\w` is a letter, digit or _ (Cyrillic letters included; `,` and `!` do not match)
    println(m.match)
end
Здравствуй
дорогой
друг
  • replace(string, r"pattern" => "replacement"): Replaces matches.
In [ ]:
new = replace("Date format: 01-02-2025", r"\d" => "X")
println(new)
Date format: XX-XX-XXXX
  • occursin(r"pattern", string): Checks whether there is a match.
In [ ]:
@show occursin(r"[A-Z]", "Hello")
@show occursin(r"[A-Z]", "hello")
@show occursin(r"[A-Z]+", "HELLO");
occursin(r"[A-Z]", "Hello") = true
occursin(r"[A-Z]", "hello") = false
occursin(r"[A-Z]+", "HELLO") = true
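
The four functions combine naturally in everyday code. The following sketch, with sample strings made up for illustration, shows two details not covered by the cells above: guarding against a nothing result from match, and referring to captured groups in a replacement through a substitution string s"...". The names date_regex and describe_date are hypothetical and exist only for this example.

```julia
date_regex = r"(\d{2})-(\d{2})-(\d{4})"   # day-month-year, three capturing groups

# match returns nothing when the pattern is not found, so check before using the result.
function describe_date(s)
    m = match(date_regex, s)
    m === nothing && return "no date found"
    day, month, year = m.captures
    return "day=$day month=$month year=$year (match starts at index $(m.offset))"
end

println(describe_date("Date: 01-02-2025"))
println(describe_date("no date here"))

# In a substitution string, \1, \2 and \3 refer to the captured groups.
iso = replace("01-02-2025", date_regex => s"\3-\2-\1")
println(iso)   # 2025-02-01
```

Checking for nothing explicitly avoids an error from accessing fields such as m.match when the text contains no match.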

Practical applications

Example 1: extraction of first and last name

Suppose someone photographed the Conservatoire's visitor log. The image was then digitised and run through text recognition, producing the "document" OCR_text. In the process some letters became lower case, extra spaces appeared in some places and disappeared in others, and in one case a stray stroke was recognised as a letter.

In [ ]:
# Text with the data
OCR_text = """
Журнал посетителей:
Фамилия: иванов имя: Иван
фамилия : Петров имя : пётр     l
Фамилия - Римский-Корсаков Имя  -Николай    
"""

The pattern below uses three flags:

  • i in the expression r"..."i makes matching case-insensitive. That is, "Фамилия" and "фамилия" are treated as equivalent.
  • m in the expression r"..."m means multiline: ^ matches at the beginning of every line (after each \n), not just at the beginning of the whole string OCR_text.
  • x in the expression r"..."x (x for extended) lets us use whitespace in the pattern and add comments after #.

We will discuss the role of the parentheses below.
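
The effect of each flag is easiest to see on a small input. The sketch below is illustrative only; the sample text and the variable names are assumptions made for this example.

```julia
text = "first line\nSecond line"

no_flags   = occursin(r"second", text)     # the case differs, so there is no match
ignorecase = occursin(r"second"i, text)    # i: case-insensitive matching
anchored   = occursin(r"^second"i, text)   # without m, ^ matches only at the very start
multiline  = occursin(r"^second"im, text)  # m: ^ also matches right after each \n

# x: whitespace in the pattern is ignored and # starts a comment
weight_regex = r"
    \d+    # one or more digits
    \s*    # optional whitespace
    kg     # the unit
"x
extended = occursin(weight_regex, "12 kg")

println((no_flags, ignorecase, anchored, multiline, extended))
```

Flags are simply appended after the closing quote and can be combined in any order, as in r"..."im.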

In [ ]:
regex_fullname = r"
        ^Фамилия\s*   # `Фамилия` (last name) at the start of a line, then 0 or more whitespace characters
        [:-]\s*       # then a single `:` or `-` and 0 or more whitespace characters
        ([\p{L}-]+)   # the last name: `\p{L}` is any Unicode letter, `-` is a hyphen
        \s*           # 0 or more whitespace characters after the last name
        Имя\s*[:-]\s* # `Имя` (first name), handled the same way as the last name label
        (\p{L}+)      # the first name: any sequence of letters (Cyrillic included)"imx;

In order to extract useful information from our document OCR_text using the regular expression regex_fullname, let's use eachmatch.

Note that we have 3 people, and each person has 2 attributes: a last name and a first name.

eachmatch returns an iterator containing objects of type RegexMatch, where each object represents one pattern match in the text.

Our pattern captures a last name and a first name. The last name comes first in the expression, so we use m.captures[1] for last names. The first name is the second group, so we use m.captures[2].

The comprehension below therefore builds an array of (last name, first name) tuples, one per visitor.

In [ ]:
fullnames = [(m.captures[1], m.captures[2]) for m in eachmatch(regex_fullname, OCR_text)]
Out[0]:
3-element Vector{Tuple{SubString{String}, SubString{String}}}:
 ("иванов", "Иван")
 ("Петров", "пётр")
 ("Римский-Корсаков", "Николай")
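
Positional indices become hard to follow once a pattern has several groups. As an alternative sketch (not the approach used in this notebook), PCRE named groups (?<name>...) let you refer to a capture by name. The sample line and the group names surname and name are assumptions made for this example.

```julia
line = "Фамилия: Петров Имя: Пётр"   # illustrative sample line

# (?<surname>...) and (?<name>...) are capturing groups with names
named_regex = r"Фамилия:\s*(?<surname>[\p{L}-]+)\s+Имя:\s*(?<name>\p{L}+)"

m = match(named_regex, line)
println(m[:surname], " / ", m[:name])   # access by name
println(m[1], " / ", m[2])              # the same groups by position
println(m.captures[1] == m[:surname])   # both forms return the same substring
```

Named groups still count as ordinary captures, so m.captures and numeric indexing keep working alongside the names.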

Let's print the last and first names in title case, using titlecase:

titlecase("abc"); # Abc
titlecase("aBC"); # Abc
In [ ]:
for (surname, name) in fullnames
    println("Hello, $(titlecase(surname)) $(titlecase(name))!")
end
Hello, Иванов Иван!
Hello, Петров Пётр!
Hello, Римский-Корсаков Николай!

Example 2: Extracting phone numbers

Suppose we need to find a number in the format +7-XXX-XXX-XX-XX or 8-XXX-XXX-XX-XX:

Explanation:

\d{3} means exactly 3 digits.

\+ escapes the plus sign so that it is treated as a literal.

| means "or".

(?:...) is a non-capturing group, i.e. a subpart of the expression that we want to delimit separately (here, the alternative +7 or 8, which is followed by the digits and hyphens).

Whether the number was written with +7 or 8 is of no interest to us, which is why the group is non-capturing.

In [ ]:
text = "Russian numbers look like +7-912-345-67-89 or 8-987-654-32-10, but not +1-234-567-89-10"

russian_phone_regex = r"(?:\+7|8)-\d{3}-\d{3}-\d{2}-\d{2}"

for m in eachmatch(russian_phone_regex, text)
    println("Found a Russian number: ", m.match)
end
Found a Russian number: +7-912-345-67-89
Found a Russian number: 8-987-654-32-10
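
To see what "non-capturing" changes in practice, compare the captures produced by two otherwise identical patterns. This is a minimal sketch; the shortened patterns and the variable names are chosen for the example.

```julia
number = "+7-912-345-67-89"

# Same pattern twice: the prefix group captures in the first and not in the second.
with_capture    = match(r"(\+7|8)-(\d{3})", number)
without_capture = match(r"(?:\+7|8)-(\d{3})", number)

println(length(with_capture.captures), ": ", with_capture[1], ", ", with_capture[2])
println(length(without_capture.captures), ": ", without_capture[1])
```

With the non-capturing prefix, the first capture is already the three-digit code, so later groups do not shift when the prefix alternatives change.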

Example 3: Checking email addresses

Let's check the correctness of the email:

In [ ]:
email = "test_User-name.123@pochta.ru"

email_regex = r"^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-z]{2,}$"

if match(email_regex, email) !== nothing
    println("The email is valid")
else
    println("The email is invalid")
end
The email is valid

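Explanation: ^[a-zA-Z0-9._-]+ requires a user name made of letters, digits and the characters ., _ and -; [a-zA-Z0-9.-]+ after the @ is the domain name; and \.[a-z]{2,}$ requires a top-level domain of at least 2 lowercase letters. This is a simplified check rather than full validation of email syntax.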
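
The same pattern can be wrapped in a small predicate and run over several candidates to see where it accepts and rejects input. The helper name is_plausible_email and the candidate addresses are hypothetical, introduced only for this sketch.

```julia
email_regex = r"^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-z]{2,}$"

# Hypothetical helper: true when the whole string has the expected shape.
is_plausible_email(s) = occursin(email_regex, s)

candidates = [
    "test_User-name.123@pochta.ru",  # fits the pattern
    "no-at-sign.ru",                 # no @
    "user@host",                     # no top-level domain
    "user@host.c",                   # top-level domain shorter than 2 letters
    "user@host.COM",                 # upper-case top-level domain is rejected by [a-z]
]

verdicts = [is_plausible_email(c) for c in candidates]

for (c, ok) in zip(candidates, verdicts)
    println(rpad(c, 32), ok)
end
```

The last candidate shows a limitation of the pattern rather than of the address: adding the i flag would be one way to accept upper-case domains.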

Example 4: Processing footnotes to literature

Extract footnotes of the form [1], [1, 2]:

In [ ]:
text = "The text refers to [1], [2, 3] and [4], and contains mathematical expressions: 1 + (2{3 - x[y-z]})."
ref_regex = r"\[\d+(?:,\s*\d+)*\]"
matches = [m.match for m in eachmatch(ref_regex, text)]
println("Footnotes: ", join(matches, ", "))
Footnotes: [1], [2, 3], [4]

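Explanation: (?:,\s*\d+)* is a non-capturing group, repeated zero or more times, that matches each further number preceded by a comma. The expression x[y-z] is not matched, because the brackets must contain only digits, commas and whitespace.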
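
Once the bracketed footnotes are found, a second, simpler pattern can pull out the individual numbers. The sketch below assumes we want them as integers; the sample text and the names refs and all_numbers are illustrative.

```julia
text = "See [1], [2, 3] and [4]; x[y-z] and [a] are not footnotes."
ref_regex = r"\[\d+(?:,\s*\d+)*\]"

# For every footnote found, run a second pass with \d+ and convert each match to Int.
refs = [[parse(Int, d.match) for d in eachmatch(r"\d+", m.match)]
        for m in eachmatch(ref_regex, text)]
println(refs)

# Flatten the nested vectors into one list of cited numbers.
all_numbers = reduce(vcat, refs)
println(all_numbers)
```

Splitting the work into two passes keeps each pattern short: the first decides what counts as a footnote, the second only has to find digits.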

Example 5: Obtaining Shakespeare's sonnets

Suppose we are given William Shakespeare's sonnets, numbered with Roman numerals, in the file sonnets.txt. Let's build an array of the sonnets in their original order, so that each one can be accessed by index.

In [ ]:
sonnets_text = read("sonnets.txt",String);
print(sonnets_text[1:1000])
THE SONNETS

by William Shakespeare




  I

  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But as the riper should by time decease,
  His tender heir might bear his memory:
  But thou, contracted to thine own bright eyes,
  Feed'st thy light's flame with self-substantial fuel,
  Making a famine where abundance lies,
  Thy self thy foe, to thy sweet self too cruel:
  Thou that art now the world's fresh ornament,
  And only herald to the gaudy spring,
  Within thine own bud buriest thy content,
  And tender churl mak'st waste in niggarding:
    Pity the world, or else this glutton be,
    To eat the world's due, by the grave and thee.

  II

  When forty winters shall besiege thy brow,
  And dig deep trenches in thy beauty's field,
  Thy youth's proud livery so gazed on now,
  Will be a tatter'd weed of small worth held:
  Then being asked, where all thy beauty lies,
  Where all the treasure of thy lusty days;
  To say, within thine own deep

The sonnets contain line breaks, so we cannot use a dot to denote any character (see "Basic elements of the syntax" above: . does not match a newline). To get around this, we'll use:

\s means a whitespace character.

\S means a non-whitespace character.

So [\s\S] means any character, including a newline.
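
The difference is easy to check on a two-line string. The sketch also shows the s flag, which makes . match newlines as well; it is mentioned here only as an alternative and is not used in this notebook. The sample string and variable names are assumptions for the example.

```julia
text = "first\nsecond"

dot_match    = match(r"first.*second", text)       # . stops at the newline
class_match  = match(r"first[\s\S]*second", text)  # [\s\S] crosses the newline
dotall_match = match(r"first.*second"s, text)      # s flag: . matches newlines too

println(dot_match === nothing)
println(class_match.match == text)
println(dotall_match.match == text)
```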

In [ ]:
function split_sonnets(text)
    pattern = r"""
        ^                    # Start of a line (with the m flag, of every line)
        \s*                  # Zero or more whitespace characters before the Roman numeral
        [IVXLCDM]+           # One or more Roman numeral characters (I, V, X, L, C, D, M)
        \s*                  # Zero or more whitespace characters after the numeral
        $                    # End of the line (the line holds nothing but the numeral)
        \s*                  # Whitespace or blank lines after the numeral
#___________________________________________________________________________________________________
        (                    # Start of the capturing group for the sonnet text
            [\s\S]*?         # Any character (including \n), lazily (up to the nearest stop)
        )                    # End of the capturing group
#___________________________________________________________________________________________________
        (?=                  # Positive lookahead (the stop condition)
            ^                # Start of the next line
            \s*              # Whitespace before the next numeral
            [IVXLCDM]+       # The next Roman numeral
            \s*              # Whitespace after it
            $                # End of the line with the numeral
            |                # Or
            \z               # Absolute end of the text (for the last sonnet)
        )                    # End of the lookahead
    """mx                    # Flags: m (multiline mode), x (extended mode)
    sonnets = [strip(m.captures[1]) for m in eachmatch(pattern, text)]
    return sonnets
end
Out[0]:
split_sonnets (generic function with 1 method)
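
The pattern in split_sonnets relies on two constructs that the syntax overview did not cover: a lazy quantifier (*?) and a positive lookahead ((?=...)). The small sketch below shows each one in isolation; the sample strings and variable names are made up for illustration.

```julia
tags = "<a><b>"
greedy = match(r"<.*>", tags).match    # * takes as much as possible
lazy   = match(r"<.*?>", tags).match   # *? takes as little as possible

# (?= kg) requires " kg" to follow the digits but does not include it in the match.
weights = [m.match for m in eachmatch(r"\d+(?= kg)", "10 kg, 20 lb, 30 kg")]

println(greedy)
println(lazy)
println(weights)
```

In split_sonnets the lazy group stops at the first place where the lookahead succeeds, and because the lookahead consumes nothing, the next numeral stays available as the start of the next match.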
In [ ]:
sonnets = split_sonnets(sonnets_text)
Out[0]:
154-element Vector{SubString{String}}:
 "From fairest creatures we desir" ⋯ 579 bytes ⋯ "'s due, by the grave and thee."
 "When forty winters shall besieg" ⋯ 597 bytes ⋯ "arm when thou feel'st it cold."
 "Look in thy glass and tell the " ⋯ 576 bytes ⋯ "nd thine image dies with thee."
 "Unthrifty loveliness, why dost " ⋯ 558 bytes ⋯ "sed, lives th' executor to be."
 "Those hours, that with gentle w" ⋯ 593 bytes ⋯ "r substance still lives sweet."
 "Then let not winter's ragged ha" ⋯ 580 bytes ⋯ "est and make worms thine heir."
 "Lo! in the orient when the grac" ⋯ 552 bytes ⋯ "n diest unless thou get a son."
 "Music to hear, why hear'st thou" ⋯ 612 bytes ⋯ "'Thou single wilt prove none.'"
 "Is it for fear to wet a widow's" ⋯ 585 bytes ⋯ " such murd'rous shame commits."
 "For shame! deny that thou bear'" ⋯ 591 bytes ⋯ "ill may live in thine or thee."
 "As fast as thou shalt wane, so " ⋯ 645 bytes ⋯ "t more, not let that copy die."
 "When I do count the clock that " ⋯ 594 bytes ⋯ " him when he takes thee hence."
 "O! that you were your self; but" ⋯ 574 bytes ⋯ "a father: let your son say so."
 ⋮
 "Lo, as a careful housewife runs" ⋯ 597 bytes ⋯ "back and my loud crying still."
 "Two loves I have of comfort and" ⋯ 542 bytes ⋯ "ad angel fire my good one out."
 "Those lips that Love's own hand" ⋯ 475 bytes ⋯ "v'd my life, saying 'not you'."
 "Poor soul, the centre of my sin" ⋯ 576 bytes ⋯ "d, there's no more dying then."
 "My love is as a fever longing s" ⋯ 563 bytes ⋯ "ack as hell, as dark as night."
 "O me! what eyes hath Love put i" ⋯ 594 bytes ⋯ "g thy foul faults should find."
 "Canst thou, O cruel! say I love" ⋯ 548 bytes ⋯ "e thou lov'st, and I am blind."
 "O! from what power hast thou th" ⋯ 576 bytes ⋯ "orthy I to be belov'd of thee."
 "Love is too young to know what " ⋯ 576 bytes ⋯ "ose dear love I rise and fall."
 "In loving thee thou know'st I a" ⋯ 602 bytes ⋯ "ainst the truth so foul a lie!"
 "Cupid laid by his brand and fel" ⋯ 578 bytes ⋯ "t new fire; my mistress' eyes."
 "The little Love-god lying once " ⋯ 563 bytes ⋯ "s water, water cools not love."

Let us now print only the first line of each of the first five sonnets.

To do this, let's divide each sonnet into two parts using split in this way:

  • Part 1: the first line
  • Part 2: all subsequent lines except the first line
In [ ]:
s = """line 1
       line 2
       line 3
       line 4"""
split(s,'\n',limit=2)
Out[0]:
2-element Vector{SubString{String}}:
 "line 1"
 "line 2\nline 3\nline 4"
In [ ]:
for (i, sonnet) in enumerate(sonnets[1:5])
    println("""Sonnet $i:\n$(split(sonnet,'\n',limit=2)[1])\n...""")
end
Sonnet 1:
From fairest creatures we desire increase,
...
Sonnet 2:
When forty winters shall besiege thy brow,
...
Sonnet 3:
Look in thy glass and tell the face thou viewest
...
Sonnet 4:
Unthrifty loveliness, why dost thou spend
...
Sonnet 5:
Those hours, that with gentle work did frame
...

Let's measure the speed of execution of our function.

In [ ]:
import Pkg
Pkg.add("BenchmarkTools")
In [ ]:
using BenchmarkTools
@btime split_sonnets(sonnets_text);
  1.244 ms (621 allocations: 39.19 KiB)

1.24 milliseconds is quite a good result for a file of about 2.5 thousand lines. Still, regular expressions can be slower than a hand-written approach. In our case the problem can also be solved explicitly, line by line. (You do not have to dig into the code below; just look at its execution time.) Note that this version is called with a file name and reads the file itself, so its timing includes file I/O, whereas split_sonnets was timed on text already in memory.

In [ ]:
function split_sonnets_fast(path)
    sonnets = String[]
    current_sonnet = String[]
    in_sonnet = false
    
    for line in eachline(path)  # eachline is given a file name and reads that file line by line
        if !isempty(line)  # Skip empty lines before stripping whitespace with strip
            stripped = strip(line)
            # If every character of the line (all) belongs to IVXLCDM, the line is a Roman numeral
            if all(c -> c in "IVXLCDM", stripped)  
                if in_sonnet && !isempty(current_sonnet)
                    push!(sonnets, join(current_sonnet, '\n'))
                end
                current_sonnet = String[]
                in_sonnet = true
            elseif in_sonnet
                push!(current_sonnet, line)
            end
        end
    end
    
    if in_sonnet && !isempty(current_sonnet)
        push!(sonnets, join(current_sonnet, '\n'))
    end
    
    return sonnets
end


# Check that it works
sonnets = split_sonnets_fast("sonnets.txt")

@btime split_sonnets_fast("sonnets.txt");
  888.738 μs (3870 allocations: 392.40 KiB)
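
A comparison of this kind can also be tried without sonnets.txt or external packages. The sketch below is not the benchmark from this notebook: it uses a synthetic text, two hypothetical helpers (headers_regex and headers_explicit) that only find the Roman numeral header lines, and the @elapsed macro from Base, so the timings it prints are rough and will differ between machines.

```julia
# Synthetic input, a hypothetical stand-in for sonnets.txt
toy = join(["  I", "", "  first poem line", "", "  II", "", "  second poem line"], '\n')
big = repeat(toy * "\n\n", 1_000)

# Regex approach: lines that consist only of Roman numeral characters
headers_regex(text) = [strip(m.match) for m in eachmatch(r"^[ \t]*[IVXLCDM]+[ \t]*$"m, text)]

# Explicit approach: walk over the lines and test every character
function headers_explicit(text)
    out = SubString{String}[]
    for line in eachline(IOBuffer(text))
        s = strip(line)
        if !isempty(s) && all(in("IVXLCDM"), s)
            push!(out, s)
        end
    end
    return out
end

# Both approaches should agree before their speed is compared.
println(headers_regex(toy) == headers_explicit(toy))

# Warm up once so that compilation time is not measured, then time each call.
headers_regex(big); headers_explicit(big)
t_regex    = @elapsed headers_regex(big)
t_explicit = @elapsed headers_explicit(big)
println("regex: ", t_regex, " s, explicit: ", t_explicit, " s")
```

Here both helpers receive the text already in memory, which keeps file I/O out of the measurement; for more reliable numbers, @btime from BenchmarkTools remains the better tool.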

Conclusion

Regular expressions in Julia are a powerful and convenient tool for working with text. Thanks to the simple r"..." syntax, built-in functions such as match and replace, and the high performance of the language, they are well suited to data processing and analysis and to search-and-replace tasks. It is important to keep in mind, however, that regular expressions can be slow when parsing complex (for example, nested) structures such as JSON or HTML files.

Although regular expression syntax can be hard to read, extended mode (the x flag) lets you add comments to a pattern. Regular expressions are also a more versatile tool for working with text than Julia's built-in functions for characters and strings.