Regular expressions in Julia¶
Introduction¶
Regular expressions (regex) are a versatile tool for searching, extracting and processing text according to patterns. They help with tasks such as checking email formats, extracting phone numbers, or analysing data in text. In Julia, regular expressions are built into the language: no additional modules are required (unlike in some other languages), and patterns are created with the `r"..."` prefix, which keeps the code concise and readable. This material shows how to use regular expressions in Julia, with an emphasis on practical examples and language features.
Regular expression syntax in Julia¶
Julia uses Perl-compatible regular expressions (the PCRE library), which support a rich set of features. Let's look at the basic elements of regular expressions and their use in Julia.
Basic elements of the syntax¶
- Literals: ordinary characters (e.g. `a`, `b`, `1`) are matched as is. For example, `r"cat"` matches the word `cat`.
- Metacharacters:
  - `.` - any character other than a newline. For example, `r"c.t"` matches `cat` and `cot`, but not `c\nt`.
  - `^` - beginning of a string: `r"^hello"` finds `hello` only at the beginning of a string.
  - `$` - end of a string: `r"world$"` finds `world` only at the end of a string.
  - `\` - escape: `r"\."` looks for a literal dot rather than any character, and `r"\^"` looks for the character `^` itself.
- Quantifiers:
  - `*` - zero or more: `r"a*"` matches `""`, `a`, `aa`, etc.
  - `+` - one or more: `r"a+"` matches `a`, `aa`, but not `""`.
  - `?` - zero or one: `r"colou?r"` matches `color` and `colour`.
  - `{n,m}` - range of repetition: `r"a{2,4}"` matches `aa`, `aaa`, `aaaa`.
- Character classes:
  - `[abc]` - one of the characters: `r"[abc]"` matches `a`, `b` or `c`.
  - `[a-z]` - range: `r"[a-z]+"` finds any run of lowercase Latin letters.
  - `[^abc]` - negation: `r"[^abc]"` matches any character except `a`, `b`, `c`.
  - `\d` - digit: `r"\d+"` finds numbers like `123`.
  - `\w` - letter, digit or `_`: `r"\w+"` finds words like `hello123`.
  - `\s` - whitespace character: `r"\s+"` finds runs of spaces or tabs.
- Grouping:
  - `()` - captures part of the pattern: `r"(\d+)-(\d+)"` extracts the numbers `12` and `34` from the string `12-34` and saves them.
  - `(?:...)` - groups without saving the result. A non-capturing group is useful for simplifying the structure of an expression.
regex = r"(?:abc)+(\d+)(?:def)+(\&+)"
text = "abcabc123defdefdef&&&&&"
match(regex, text)[1] # returns "123"
match(regex, text)[2] # returns "&&&&&"
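As a quick check of the elements listed above, the following sketch runs a few of the patterns through `occursin`, a built-in that returns `true` when a pattern matches anywhere in a string. The `checks` table and its sample inputs are illustrative and are not part of the original material.

```julia
# Each entry is (pattern, input, expected result of occursin).
checks = [
    (r"c.t",       "cot",         true),   # `.` matches any single character
    (r"c.t",       "c\nt",        false),  # ... except a newline
    (r"^hello",    "hello world", true),   # anchored at the start
    (r"^hello",    "say hello",   false),
    (r"world$",    "hello world", true),   # anchored at the end
    (r"colou?r",   "color",       true),   # the `u` is optional
    (r"^a{2,4}$",  "aaaaa",       false),  # five repetitions are outside 2..4
    (r"^[^abc]+$", "xyz",         true),   # none of a, b, c
    (r"^\d+$",     "123",         true),
    (r"^\w+$",     "hello123",    true),
]

for (pattern, input, expected) in checks
    @assert occursin(pattern, input) == expected
end
```

Anchoring with `^` and `$` is what turns "contains a match" into "the whole string matches".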
Functions in Julia¶
Julia offers convenient methods for working with regular expressions:
`match(r"pattern", string)`: finds the first match. Returns a `RegexMatch` object, or `nothing` if there is no match.
A convenient feature of Julia is that patterns written with `r"..."` need no double escaping (e.g. `r"\d"` instead of `"\\d"`), which makes patterns easier to write.
m = match(r"\d+", "Age: 42") # `\d` matches a digit, `+` means one or more occurrences
println(m.match)
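Because `match` returns `nothing` when the pattern is not found, accessing `.match` directly fails on inputs without a match. A minimal sketch of handling both cases follows; the helper name `first_number` is hypothetical.

```julia
# Hypothetical helper: returns the first integer found in `s`, or `nothing`.
function first_number(s)
    m = match(r"\d+", s)
    m === nothing && return nothing   # no digits in the string
    return parse(Int, m.match)        # m.match is the matched substring
end

found   = first_number("Age: 42")
missing_number = first_number("no digits here")

# The RegexMatch object also stores where the match starts (1-based index).
start_index = match(r"\d+", "Age: 42").offset
```

Here `found` is the integer `42`, `missing_number` is `nothing`, and `start_index` is `6`.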
`eachmatch(r"pattern", string)`: an iterator over all matches.
for m in eachmatch(r"\w+", "Hello, dear friend!") # `\w` is a letter, digit or _ (`,` and `!` do not match)
    println(m.match)
end
`replace(string, r"pattern" => "replacement")`: replaces matches.
new = replace("Date format: 01-02-2025", r"\d" => "X")
println(new)
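Captured groups can also be reused in the replacement. The sketch below uses a substitution string (`s"..."`), in which `\1`, `\2`, ... refer to the captured groups; the date string is an illustrative input.

```julia
date = "Date format: 01-02-2025"

# Three capturing groups: the two 2-digit parts and the 4-digit year.
# The substitution string swaps the first two groups.
swapped = replace(date, r"(\d{2})-(\d{2})-(\d{4})" => s"\2-\1-\3")

println(swapped)  # Date format: 02-01-2025
```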
`occursin(r"pattern", string)`: checks whether there is a match.
@show occursin(r"[A-Z]", "Hello")
@show occursin(r"[A-Z]", "hello")
@show occursin(r"[A-Z]+", "HELLO");
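Since `occursin` returns a `Bool`, it combines naturally with `filter`. A small sketch with made-up input lines:

```julia
lines = ["42", "abc", "2025", "x1"]

# Keep only the strings that consist entirely of digits.
only_numbers = filter(line -> occursin(r"^\d+$", line), lines)

println(only_numbers)  # ["42", "2025"]
```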
Practical applications¶
Example 1: extraction of first and last name¶
Suppose a photograph was taken of the Conservatoire's visitor log.
We then digitised the document and ran text recognition on it, obtaining the "document" `OCR_text`.
It turned out that some letters became lowercase, extra spaces appeared in some places and disappeared in others, and in some cases a stray stroke was recognised as a letter.
# Text with the data
OCR_text = """
Журнал посетителей:
Фамилия: иванов имя: Иван
фамилия : Петров имя : пётр l
Фамилия - Римский-Корсаков Имя -Николай
"""
Flags are written right after the closing quote of the pattern:
- `i` in `r"..."i` makes the match case-insensitive, so `Фамилия` and `фамилия` are treated as equivalent.
- `m` in `r"..."m` means multiline: `^` matches at the beginning of every line (after each `\n`), not just at the beginning of the whole string `OCR_text`.
- `x` in `r"..."x` (x for extended) lets us use whitespace inside the pattern and add comments after `#`.
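The effect of the flags can be seen on a small input. This is a sketch with made-up strings; the variable names are illustrative.

```julia
text = "first line\nSecond line"

# Without flags: case-sensitive, and `^` matches only at the very start.
no_flags = occursin(r"^second", text)

# With `i` and `m`: case is ignored and `^` matches after each newline.
with_flags = occursin(r"^second"im, text)

# With `x`: whitespace in the pattern is ignored and `#` starts a comment.
date_pattern = r"
    (\d{2})  # day
    -
    (\d{2})  # month
"x
m = match(date_pattern, "01-02")
```

Here `no_flags` is `false`, `with_flags` is `true`, and `m` captures `01` and `02`.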
We will discuss the meaning of parentheses below.
regex_fullname = r"
    ^Фамилия\s*      # `Фамилия` at the start of a line, followed by 0 or more whitespace characters
    [:-]\s*          # then a single `:` or `-` and 0 or more whitespace characters
    ([\p{L}-]+)      # the last name: `\p{L}` is any Unicode letter, `-` is a hyphen
    \s*              # 0 or more whitespace characters after the last name
    Имя\s*[:-]\s*    # same as for the last name
    (\p{L}+)         # any sequence of letters (Cyrillic included): the first name
"imx;
To extract the useful information from our document `OCR_text` with the regular expression `regex_fullname`, let's use `eachmatch`.
Note that we have 3 people, and each person has 2 attributes: a last name and a first name.
`eachmatch` returns an iterator of `RegexMatch` objects, where each object represents one match of the pattern in the text.
Our pattern contains a last name and a first name. The last name comes first in the expression, so we use `m.captures[1]` for last names. The first name is the second group, so we use `m.captures[2]`.
The comprehension below builds an array of (last name, first name) tuples for the visitors.
fullnames = [(m.captures[1], m.captures[2]) for m in eachmatch(regex_fullname, OCR_text)]
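The same technique on a simplified ASCII input, as a sketch: the sample text and the pattern are made up for illustration. It also shows that indexing a match with `m[1]` is equivalent to `m.captures[1]`.

```julia
sample = "Surname: Smith Name: John\nSurname: Lee Name: Ann"

# Two capturing groups per line; `m` makes `^` match at every line start.
pattern = r"^Surname:\s*(\p{L}+)\s+Name:\s*(\p{L}+)"m

pairs_found = [(m[1], m[2]) for m in eachmatch(pattern, sample)]

first_match = match(pattern, sample)
same_access = first_match[1] == first_match.captures[1]
```

`pairs_found` holds `("Smith", "John")` and `("Lee", "Ann")`, and `same_access` is `true`.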
Let's print the names in title case (first letter uppercase, the rest lowercase):
titlecase("abc"); # Abc
titlecase("aBC"); # Abc
for (surname, name) in fullnames
    println("Hello, $(titlecase(surname)) $(titlecase(name))!")
end
Example 2: Extracting phone numbers¶
Suppose we need to find a number in the format `+7-XXX-XXX-XX-XX` or `8-XXX-XXX-XX-XX`.
Explanation:
- `\d{3}` means exactly 3 digits.
- `\+` escapes the plus sign so that it is matched literally.
- `|` means "or".
- `(?:...)` is a non-capturing group, i.e. a sub-part of the expression that we want to delimit separately (`+7` or `8`, followed by the digits and hyphens). Whether the number is written with `+7` or `8` is of no interest to us, which is why the group is non-capturing.
text = "Russian numbers are +7-912-345-67-89 or 8-987-654-32-10, but not +1-234-567-89-10"
russian_phone_regex = r"(?:\+7|8)-\d{3}-\d{3}-\d{2}-\d{2}"
for m in eachmatch(russian_phone_regex, text)
    println("Found a Russian number: ", m.match)
end
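To see what "non-capturing" changes in practice, the sketch below compares the pattern above with a capturing variant on an illustrative string. The matched text is the same; only the contents of `captures` differ.

```julia
text = "Call +7-912-345-67-89 or 8-987-654-32-10"

non_capturing = r"(?:\+7|8)-\d{3}-\d{3}-\d{2}-\d{2}"
capturing     = r"(\+7|8)-\d{3}-\d{3}-\d{2}-\d{2}"

m1 = match(non_capturing, text)
m2 = match(capturing, text)

n_groups = length(m1.captures)   # the prefix is not stored
prefix   = m2.captures[1]        # the prefix is stored as group 1
```

Here `n_groups` is `0` and `prefix` is `+7`, while `m1.match` and `m2.match` are identical.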
Example 3: Checking email addresses¶
Let's check whether an email address is well-formed:
email = "test_User-name.123@pochta.ru"
email_regex = r"^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-z]{2,}$"
if match(email_regex, email) !== nothing
    println("The email is valid")
else
    println("The email is invalid")
end
Explanation: `^[a-zA-Z0-9._-]+` requires a user name made of letters, digits and a few other characters, and `\.[a-z]{2,}$` requires a top-level domain of at least 2 lowercase letters.
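A sketch of how this pattern behaves on several candidate addresses (all made up for illustration). It is a simplified check rather than a full validator: for instance, an uppercase top-level domain is rejected because the last class is `[a-z]`.

```julia
email_regex = r"^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-z]{2,}$"

candidates = [
    "test_User-name.123@pochta.ru",  # well-formed
    "no-at-sign.ru",                 # no `@`
    "user@host",                     # no top-level domain
    "user@host.c",                   # top-level domain shorter than 2 letters
    "user@host.RU",                  # uppercase top-level domain
]

results = [occursin(email_regex, c) for c in candidates]
```

Only the first candidate passes; the other four are rejected.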
Example 4: Processing footnotes to literature¶
Let's extract footnotes of the form `[1]` or `[1, 2]`:
text = "The text refers to [1], [2, 3] and [4], and contains mathematical expressions: 1 + (2{3 - x[y-z]})."
ref_regex = r"\[\d+(?:,\s*\d+)*\]"
matches = [m.match for m in eachmatch(ref_regex, text)]
println("Footnotes: ", join(matches, ", "))
Explanation: `(?:,\s*\d+)*` is a non-capturing group that matches any further comma-separated numbers inside the brackets.
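Because the repeated group is non-capturing, the individual numbers are not stored in the match. One possible way to obtain them, shown here as a sketch on an illustrative string, is a second pass with `\d+` over each matched footnote:

```julia
text = "See [1], [2, 3] and [4]; not a footnote: x[y-z]."
ref_regex = r"\[\d+(?:,\s*\d+)*\]"

numbers = Int[]
for m in eachmatch(ref_regex, text)
    # m.match is e.g. "[2, 3]"; pull out every number inside it.
    for d in eachmatch(r"\d+", m.match)
        push!(numbers, parse(Int, d.match))
    end
end

println(numbers)  # [1, 2, 3, 4]
```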
Example 5: Obtaining Shakespeare's sonnets¶
Suppose we are given William Shakespeare's sonnets, numbered with Roman numerals. Let's build an array of the sonnets in their original order, so that each one can easily be accessed by index.
sonnets_text = read("sonnets.txt", String);
print(first(sonnets_text, 1000)) # the first 1000 characters; safe even for non-ASCII text
The sonnets contain line breaks, so a dot cannot be used to denote any character (as noted in "Basic elements of the syntax", `.` does not match a newline). To get around this, we'll use:
`\s` means a whitespace character.
`\S` means a non-whitespace character.
And so `[\s\S]` means any character.
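The difference is easy to verify on a two-line string. The sketch below also shows Julia's `s` flag, which makes `.` match newlines as well; that flag is an alternative not used in this material and is mentioned only for comparison.

```julia
text = "first\nsecond"

dot_only = match(r"first.second", text)       # `.` does not match "\n"
any_char = match(r"first[\s\S]second", text)  # `[\s\S]` matches any character
dotall   = match(r"first.second"s, text)      # with `s`, `.` matches "\n" too
```

Here `dot_only` is `nothing`, while the other two match the whole string.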
function split_sonnets(text)
    pattern = r"""
        ^                # Start of a line (with the m flag: of every line)
        \s*              # Zero or more whitespace characters before the Roman numeral
        [IVXLCDM]+       # One or more Roman digits (I, V, X, L, C, D, M)
        \s*              # Zero or more whitespace characters after the numeral
        $                # End of the line (the line contains nothing but the numeral)
        \s*              # Whitespace or empty lines after the numeral
        #___________________________________________________________________________________________________
        (                # Start of the capturing group for the sonnet text
            [\s\S]*?     # Any character (including \n), non-greedy (up to the nearest stop)
        )                # End of the capturing group
        #___________________________________________________________________________________________________
        (?=              # Positive lookahead (the stop condition)
            ^            # Start of the next line
            \s*          # Whitespace before the next numeral
            [IVXLCDM]+   # The next Roman numeral
            \s*          # Whitespace after it
            $            # End of the line with the numeral
            |            # Or
            \z           # Absolute end of the text (for the last sonnet)
        )                # End of the lookahead
    """mx  # Flags: m (multiline mode), x (extended mode)
    sonnets = [strip(m.captures[1]) for m in eachmatch(pattern, text)]
    return sonnets
end
sonnets = split_sonnets(sonnets_text)
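The core of the pattern, a non-greedy capture stopped by a lookahead, can be tried without the file. The sketch below uses a made-up sample and a compact one-line variant of the pattern (the optional whitespace around the numerals is omitted for brevity).

```julia
# Hypothetical sample standing in for the sonnets file.
sample = "I\nalpha\nbeta\n\nII\ngamma\n"

# Numeral line, then a lazy capture that stops right before the next
# numeral line (lookahead) or at the absolute end of the text (\z).
pattern = r"^[IVXLCDM]+$\s*([\s\S]*?)(?=^[IVXLCDM]+$|\z)"m

parts = [strip(m.captures[1]) for m in eachmatch(pattern, sample)]
```

`parts` contains two elements, `"alpha\nbeta"` and `"gamma"`. The lookahead does not consume the next numeral line, so the following match can start from it.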
Now let's print only the first line of each of the first five sonnets.
To do this, we divide each sonnet into two parts using `split` with `limit=2`:
- Part 1: the first line
- Part 2: all the remaining lines
s = """line 1
line 2
line 3
line 4"""
split(s, '\n', limit=2)
for (i, sonnet) in enumerate(sonnets[1:5])
    println("""Sonnet $i: $(split(sonnet, '\n', limit=2)[1])\n...""")
end
Let's measure the speed of execution of our function.
import Pkg
Pkg.add("BenchmarkTools")
using BenchmarkTools
@btime split_sonnets($sonnets_text); # `$` interpolates the global variable so that only the call is timed
1.24 milliseconds is quite a good result for a file of 2.5 thousand lines. However, regular expressions can be slower than classical approaches. In our case the problem can be solved in a rather explicit way. (You need not dive into the code below; just look at its execution speed.)
function split_sonnets_fast(path)
    sonnets = String[]
    current_sonnet = String[]
    in_sonnet = false
    for line in eachline(path) # `path` is a file name: lines are read straight from the file
        stripped = strip(line)
        isempty(stripped) && continue # skip empty and whitespace-only lines
        # If every (all) character of the line belongs to IVXLCDM, the line is a Roman numeral
        if all(c -> c in "IVXLCDM", stripped)
            if in_sonnet && !isempty(current_sonnet)
                push!(sonnets, join(current_sonnet, '\n'))
            end
            current_sonnet = String[]
            in_sonnet = true
        elseif in_sonnet
            push!(current_sonnet, line)
        end
    end
    if in_sonnet && !isempty(current_sonnet)
        push!(sonnets, join(current_sonnet, '\n'))
    end
    return sonnets
end
# Check that it works
sonnets = split_sonnets_fast("sonnets.txt")
@btime split_sonnets_fast("sonnets.txt");
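Note that this function receives a file name, so its timing includes reading the file, whereas `split_sonnets` was timed on text already in memory. For a like-for-like comparison, the same loop could iterate over an in-memory string, since `eachline` also accepts an `IO` object. A minimal sketch with a made-up sample:

```julia
# Hypothetical in-memory text standing in for the file contents.
text = "I\nline one\nline two"

# Wrapping the string in an IOBuffer lets eachline iterate over it
# exactly as it would over a file.
lines_from_memory = collect(eachline(IOBuffer(text)))
```

`lines_from_memory` holds the three lines without their trailing newlines.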
Conclusion¶
Regular expressions in Julia are a powerful and convenient tool for working with text. Thanks to their simple syntax (`r"..."`), built-in functions such as `match` and `replace`, and the high performance of the language, they are well suited to data processing and analysis and to search-and-replace tasks. But it is important to realise that regular expressions can be slow when parsing complex (for example, nested) structures such as JSON or HTML files.
Regular expression syntax can be hard to read, but the extended mode (the `x` flag) lets you add comments to a pattern. Regular expressions are also a more versatile tool for working with text than Julia's built-in functions for characters and strings.