Regular expressions in Julia
Introduction
Regular expressions (regex) are a universal tool for searching, extracting, and processing text according to specified patterns. They let you solve tasks such as validating an email format, extracting phone numbers, or analyzing data in text. In Julia, regular expressions are especially convenient because they are built into the language and have a concise syntax. Unlike some other languages, Julia requires no additional modules to work with regex, and patterns are created with the `r"..."` prefix, which keeps the code intuitive and readable. This material shows how to use regular expressions in Julia, with an emphasis on practical examples and language-specific features.
Regular Expression syntax in Julia
Julia uses the [PCRE](https://ru.wikipedia.org/wiki/PCRE) syntax, which supports a rich set of features. Let's look at the main elements of regular expressions and how they are used in Julia.
Basic syntax elements
- Literals: ordinary characters (for example, `a`, `b`, `1`) are matched as is. For example, `r"cat"` matches the word `cat`.
- Metacharacters:
  - `.` (any character other than a newline): `r"c.t"` matches `cat` and `cot`, but not `c\nt`.
  - `^` (the beginning of the line): `r"^hello"` finds `hello` only at the beginning of the line.
  - `$` (the end of the line): `r"world$"` finds `world` only at the end of the line.
  - `\` (escaping): `r"\."` searches for a literal dot rather than any character, and `r"\^"` searches for the `^` character itself.
- Quantifiers:
  - `*` (zero or more): `r"a*"` matches `""`, `a`, `aa`, and so on.
  - `+` (one or more): `r"a+"` matches `a` and `aa`, but not `""`.
  - `?` (zero or one): `r"colou?r"` matches both `color` and `colour`.
  - `{n,m}` (a range of repetitions): `r"a{2,4}"` matches `aa`, `aaa`, and `aaaa`.
- Character classes:
  - `[abc]` (one of the listed characters): `r"[abc]"` matches `a`, `b`, or `c`.
  - `[a-z]` (a range): `r"[a-z]+"` finds any word made of lowercase Latin letters.
  - `[^abc]` (negation): `r"[^abc]"` matches any character other than `a`, `b`, `c`.
  - `\d` (a digit): `r"\d+"` finds numbers such as `123`.
  - `\w` (a letter, a digit, or `_`): `r"\w+"` finds words such as `hello123`.
  - `\s` (a whitespace character): `r"\s+"` finds runs of spaces or tabs.
- Grouping:
  - `()` (capturing group): captures part of the pattern. `r"(\d+)-(\d+)"` extracts the numbers `12` and `34` from the string `12-34` and stores them.
  - `(?:...)` (non-capturing group): groups without storing the result. A non-capturing group is useful for simplifying the structure of an expression.
regex = r"(?:abc)+(\d+)(?:def)+(\&+)"
text = "abcabc123defdefdef&&&"
match(regex,text)[1] # returns "123"
match(regex,text)[2] # will return "&&&"
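The basic elements listed above can be checked directly with `occursin`. The following is a minimal sketch; the `checks` table and its sample strings are made up for illustration and are not part of the original material.

```julia
# Each entry is (pattern, text, expected result of occursin)
checks = [
    (r"c.t",       "cat",         true),   # `.` matches any single character
    (r"c.t",       "c\nt",        false),  # ...except a newline
    (r"^hello",    "hello world", true),   # anchored at the start
    (r"^hello",    "say hello",   false),
    (r"world$",    "hello world", true),   # anchored at the end
    (r"\.",        "abc",         false),  # an escaped dot matches only a literal dot
    (r"colou?r",   "color",       true),   # `?` makes the `u` optional
    (r"^a{2,4}$",  "aaaaa",       false),  # five repetitions exceed the {2,4} range
    (r"^[^abc]+$", "xyz",         true),   # negated character class
    (r"^\w+$",     "hello123",    true),   # letters, digits, underscore
]

for (pattern, text, expected) in checks
    result = occursin(pattern, text)
    @assert result == expected
    println(rpad(repr(pattern), 14), " in ", rpad(repr(text), 16), " => ", result)
end
```

Anchoring a pattern with `^` and `$` is what turns "the text contains a match" into "the whole text matches".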
Features in Julia
Julia offers convenient functions for working with regular expressions:
`match(r"pattern", string)`: finds the first match. Returns a `RegexMatch` object or `nothing`.
A convenient feature of Julia is that no double escaping is needed inside `r"..."` (for example, `r"\d"` instead of `"\\d"`), which makes patterns easier to write.
m = match(r"\d+", "Age: 42") # `\d` matches a digit, `+` means one or more occurrences
println(m.match)
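Because `match` may return `nothing`, it is usually worth checking the result before reading its fields. The sketch below uses an illustrative input string and shows the fields of `RegexMatch` that are used most often.

```julia
m = match(r"(\d+)", "Age: 42")

if m === nothing
    println("no match")
else
    println(m.match)           # the matched substring: 42
    println(m.offset)          # 1-based index where the match starts: 6
    println(m.captures[1])     # text captured by the first group: 42
    age = parse(Int, m.match)  # convert the matched text to a number
    println(age + 1)           # 43
end

# A pattern that does not occur in the text yields `nothing`
@assert match(r"\d+", "no digits here") === nothing
```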
`eachmatch(r"pattern", string)`: returns an iterator over all matches.
for m in eachmatch(r"\w+", "Hello, dear friend!") # `\w` is a letter, digit, or _ (`,` and `!` do not match)
    println(m.match)
end
`replace(string, r"pattern" => "replacement")`: replaces matches.
new = replace("Date format: 01-02-2025", r"\d" => "X")
println(new)
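The replacement does not have to be a fixed string. As a small sketch (the sample strings are illustrative), a substitution string `s"..."` can refer to captured groups by number, and a function can compute the replacement from each matched substring.

```julia
date = "Date format: 01-02-2025"

# Reorder day-month-year into year-month-day using group references
iso = replace(date, r"(\d{2})-(\d{2})-(\d{4})" => s"\3-\2-\1")
println(iso)      # Date format: 2025-02-01

# A function on the right-hand side receives each matched substring
doubled = replace("1 apple, 20 pears", r"\d+" => x -> string(2 * parse(Int, x)))
println(doubled)  # 2 apple, 40 pears
```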
`occursin(r"pattern", string)`: checks whether the string contains a match.
@show occursin(r"[A-Z]", "Hello")
@show occursin(r"[A-Z]", "hello")
@show occursin(r"[A-Z]+", "HELLO");
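Since `occursin` returns a `Bool`, it combines naturally with `filter`. A minimal sketch with a made-up list of words:

```julia
words = ["Hello", "hello", "HELLO", "h3llo"]

# Keep only the strings that contain at least one uppercase Latin letter
with_upper = filter(w -> occursin(r"[A-Z]", w), words)
println(with_upper)  # ["Hello", "HELLO"]

# Keep only the strings made entirely of uppercase Latin letters
all_upper = filter(w -> occursin(r"^[A-Z]+$", w), words)
println(all_upper)   # ["HELLO"]
```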
Practical applications
Example 1: First and last name extraction
Suppose someone took a photo of the visitor log at a conservatory.
We then digitized this document and ran text recognition on it, obtaining the "document" `OCR_text`.
It turned out that some letters became lowercase, and extra spaces appeared in some places and disappeared in others. In some cases, a stray stroke was recognized as a letter.
# Text with the data
OCR_text = """
Журнал посетителей:
Фамилия: иванов имя: Иван
фамилия : Петров имя : пётр l
Фамилия - Римский-Корсаков Имя -Николай
"""
A flag written after the closing quote changes how the pattern is interpreted:
- `i` in `r"..."i` makes matching case-insensitive, so `фамилия` and `Фамилия` are treated as equivalent.
- `m` in `r"..."m` enables multiline mode: `^` matches at the beginning of every line (after each `\n`), not only at the beginning of the whole "big" string `OCR_text`.
- `x` in `r"..."x` enables extended mode (x for "extended"): whitespace in the pattern is ignored and comments can be added with `#`.
We will discuss the meaning of parentheses below.
regex_fullname = r"
^Фамилия\s* # `Фамилия` at the start of a line, followed by 0 or more whitespace characters
[:-]\s* # then a single `:` or `-` and 0 or more whitespace characters
([\p{L}-]+) # the last name: `\p{L}` is any Unicode letter, `-` is a hyphen
\s* # again 0 or more whitespace characters after the last name
Имя\s*[:-]\s* # the same as for the last name
(\p{L}+) # any sequence of letters (Cyrillic included): this is the first name
"imx;
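To see what each of the three flags does in isolation, here is a small sketch on a two-line string. The string and the `pattern` variable are illustrative and unrelated to `OCR_text`.

```julia
text = "first line\nSecond line"

# Without `i`, the pattern is case-sensitive
@assert !occursin(r"second", text)
@assert occursin(r"second"i, text)

# Without `m`, `^` matches only at the very start of the whole string
@assert !occursin(r"^Second", text)
@assert occursin(r"^Second"m, text)

# With `x`, whitespace and `#` comments inside the pattern are ignored
pattern = r"
    (\w+)   # first word
    \s+     # whitespace between the words
    (\w+)   # second word
"x
m = match(pattern, text)
println(m.captures[1], " / ", m.captures[2])  # first / line
```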
To extract the useful information from our document `OCR_text` with the regular expression `regex_fullname`, we will use `eachmatch`.
Note that we have 3 people, and each person has 2 attributes: a last name and a first name.
`eachmatch` returns an iterator of `RegexMatch` objects, where each object represents one match of the pattern in the text.
Our pattern captures both the last name and the first name. The last name comes first in the expression, so we read it from `m.captures[1]`; the first name is the second group, `m.captures[2]`.
The following comprehension builds an array of (last name, first name) tuples for the visitors.
fullnames = [(m.captures[1], m.captures[2]) for m in eachmatch(regex_fullname, OCR_text)]
We will display the last and first names in title case:
titlecase("abc"); # Abc
titlecase("aBC"); # Abc
for (surname, name) in fullnames
    println("Hello, $(titlecase(surname)) $(titlecase(name))!")
end
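Numbered captures become hard to follow as a pattern grows. One alternative is named groups, written `(?<name>...)` and read back with `m[:name]`. The sketch below is an assumption-labeled variant of the pattern above: `regex_named`, `sample`, and the group names `surname` and `name` are illustrative and do not come from the original.

```julia
# Same idea as regex_fullname, written on one line with named groups
regex_named = r"^Фамилия\s*[:-]\s*(?<surname>[\p{L}-]+)\s*Имя\s*[:-]\s*(?<name>\p{L}+)"im

# Shortened sample data in the same shape as the visitor log
sample = """
Фамилия: иванов имя: Иван
Фамилия - Римский-Корсаков Имя -Николай
"""

people = [(m[:surname], m[:name]) for m in eachmatch(regex_named, sample)]

for (surname, name) in people
    println(titlecase(surname), " ", titlecase(name))
end
```

Named groups keep working if another group is later inserted into the pattern, whereas numeric indices would shift.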
Example 2: Extracting phone numbers
Suppose we need to find a number in the format +7-XXX-XXX-XX-XX or 8-XXX-XXX-XX-XX.
Explanation:
`\d{3}` means exactly 3 digits.
`\+` escapes the plus sign so that it is treated as a literal.
`|` means "or".
`(?:...)` is a non-capturing group, that is, a sub-part of the expression that we want to delimit separately (here `+7` or `8`, which is then followed by digits and hyphens). We are NOT interested in whether the number was written with +7 or with 8, which is why the group is non-capturing.
text1 = "Russian numbers are +7-912-345-67-89 or 8-987-654-32-10, but not +1-234-567-89-10"
russian_phone_regex = r"(?:\+7|8)-\d{3}-\d{3}-\d{2}-\d{2}"
for m in eachmatch(russian_phone_regex, text1)
    println("Found a Russian number: ", m.match)
end
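If the digit groups themselves are needed, they can be wrapped in capturing groups while the prefix stays non-capturing. As a sketch, the example below rewrites every number in one canonical form. The `+7 (XXX) XXX-XX-XX` target format and the names `phone_parts` and `normalized` are arbitrary choices made for illustration.

```julia
text = "Call +7-912-345-67-89 or 8-987-654-32-10"

# Capture the digit groups; the +7/8 prefix stays in a non-capturing group
phone_parts = r"(?:\+7|8)-(\d{3})-(\d{3})-(\d{2})-(\d{2})"

# \1 to \4 in the substitution string refer to the four captured groups
normalized = replace(text, phone_parts => s"+7 (\1) \2-\3-\4")
println(normalized)  # Call +7 (912) 345-67-89 or +7 (987) 654-32-10
```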
Example 3: Checking email addresses
Let's check whether an email address has a valid format:
email = "test_User-name.123@pochta.ru"
email_regex = r"^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-z]{2,}$"
if match(email_regex, email) !== nothing
    println("Email is valid")
else
    println("Email is invalid")
end
Explanation:
`^[a-zA-Z0-9._-]+` requires a user name made of letters, digits, and a few symbols, and `\.[a-z]{2,}$` requires a top-level domain at least 2 characters long. This is a simplified check, not a complete validation of email address syntax.
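The same check can be wrapped in a small predicate and tried on several inputs. This is a sketch: the helper name `is_valid_email` and the test addresses are hypothetical, and the pattern is the simplified one from this example.

```julia
# Hypothetical helper: true when the address fits the simplified pattern
is_valid_email(address) =
    occursin(r"^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-z]{2,}$", address)

addresses = [
    "test_User-name.123@pochta.ru",  # fits the pattern
    "no-at-sign.ru",                 # no @
    "user@domain",                   # no top-level domain
    "user@mail.c",                   # top-level domain shorter than 2 characters
]

for address in addresses
    println(rpad(address, 32), is_valid_email(address) ? "valid" : "invalid")
end
```

Using `occursin` instead of `match(...) !== nothing` expresses the yes/no intent directly.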
Example 4: Processing references to literature
Let's extract references of the form [1] or [1, 2]:
text2 = "The text refers to [1], [2, 3] and [4], and contains mathematical expressions: 1 + (2{3 - x[y-z]})."
ref_regex = r"\[\d+(?:,\s*\d+)*\]"
matches = [m.match for m in eachmatch(ref_regex, text2)]
println("References: ", join(matches, ", "))
Explanation: `(?:,\s*\d+)*` is a non-capturing group that matches the additional comma-separated numbers.
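Once the bracketed references are found, a second, simpler pattern can pull out the individual numbers. A minimal sketch with an illustrative input string; `numbers` is a hypothetical variable name.

```julia
text = "See [1], [2, 3] and [4]; x[y-z] is not a reference."
ref_regex = r"\[\d+(?:,\s*\d+)*\]"

# For each bracketed reference, extract the numbers it contains
numbers = [parse.(Int, [d.match for d in eachmatch(r"\d+", m.match)])
           for m in eachmatch(ref_regex, text)]
println(numbers)                # [[1], [2, 3], [4]]

# Flatten into a single list of cited sources
println(reduce(vcat, numbers))  # [1, 2, 3, 4]
```

Splitting the work into two passes keeps each pattern short: the first decides what counts as a reference, the second only looks for digits.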
Example 5: Extracting Shakespeare's sonnets
Suppose we are given [William Shakespeare's sonnets](https://engee.com/community/ru/catalogs/projects/analiz-tekstovykh-dannykh-s-pomoshchiu-massivov-strok), numbered with Roman numerals. Let's build an array of these sonnets in their original order, so that each one can be accessed easily by index.
sonnets_text = read("sonnets.txt",String);
print(first(sonnets_text, 1000)) # the first 1000 characters; also safe if the text contains non-ASCII characters
The sonnets contain line breaks, so the dot cannot be used to mean "any character" (see the description of `.` in the "Basic syntax elements" section). To get around this, we use the following:
`\s` means a whitespace character.
`\S` means any character that is NOT whitespace.
Therefore `[\s\S]` means any character at all.
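The difference is easy to verify on a two-character example. This sketch also shows the `s` flag, which is another way to let the dot match newlines.

```julia
text = "a\nb"

@assert !occursin(r"a.b", text)       # `.` stops at the line break
@assert occursin(r"a[\s\S]b", text)   # [\s\S] matches the newline too
@assert occursin(r"a.b"s, text)       # the `s` flag makes `.` match newlines

println("all three checks passed")
```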
function split_sonnets(text)
    pattern = r"""
    ^                   # Start of a line (with the m flag: of every line)
    \s*                 # Zero or more whitespace characters before the Roman numeral
    [IVXLCDM]+          # One or more Roman numeral characters (I, V, X, L, C, D, M)
    \s*                 # Zero or more whitespace characters after the numeral
    $                   # End of the line (the line holds only the numeral)
    \s*                 # Whitespace or blank lines after the numeral
    #___________________________________________________________________________________________________
    (                   # Start of the capturing group for the sonnet text
        [\s\S]*?        # Any character (including \n), lazily (up to the nearest stop)
    )                   # End of the capturing group
    #___________________________________________________________________________________________________
    (?=                 # Positive lookahead (the stop condition)
        ^               # Start of the next line
        \s*             # Whitespace before the next numeral
        [IVXLCDM]+      # The next Roman numeral
        \s*             # Whitespace after it
        $               # End of the line with the numeral
    |                   # Or
        \z              # Absolute end of the text (for the last sonnet)
    )                   # End of the lookahead
    """mx               # Flags: m (multiline mode), x (extended mode)
    sonnets = [strip(m.captures[1]) for m in eachmatch(pattern, text)]
    return sonnets
end
sonnets = split_sonnets(sonnets_text)
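If `sonnets.txt` is not at hand, the same pattern can be tried on a short inline sample. The sketch below uses a compact one-line form of the pattern without the `x` flag. The names `sample`, `compact`, and `parts` are illustrative, and the sample contains only the opening lines of the first two sonnets.

```julia
sample = """
I
From fairest creatures we desire increase,
That thereby beauty's rose might never die,

II
When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
"""

# Numeral line, then a lazy capture up to the next numeral line or the end of the text
compact = r"^\s*[IVXLCDM]+\s*$\s*([\s\S]*?)(?=^\s*[IVXLCDM]+\s*$|\z)"m
parts = [strip(m.captures[1]) for m in eachmatch(compact, sample)]

println(length(parts))                 # 2
println(first(split(parts[2], '\n')))  # When forty winters shall besiege thy brow,
```

The lookahead does not consume the next numeral line, so the following match can start from it.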
Now let's print only the first line of each of the first five sonnets.
To do this, we use `split` to divide each sonnet into 2 parts:
- Part 1: the first line
- Part 2: all the remaining lines
s = """line 1
line 2
line 3
line 4"""
split(s, '\n', limit=2) # two parts: the first line and everything after it
for (i, sonnet) in enumerate(sonnets[1:5])
    println("""Sonnet $i: $(split(sonnet, '\n', limit=2)[1])\n...""")
end
Let's measure the speed of our function.
import Pkg
Pkg.add("BenchmarkTools")
using BenchmarkTools
@btime split_sonnets($sonnets_text); # `$` interpolates the global variable so that its lookup is not timed
1.24 milliseconds is a pretty good result for a file of 2.5 thousand lines. However, keep in mind that regular expressions can be slower than classical approaches. In our case, the problem can also be solved in a fairly explicit way. **(You don't need to dig into the code below; just look at how fast it runs.)**
function split_sonnets_fast(path)
    sonnets = String[]
    current_sonnet = String[]
    in_sonnet = false
    for line in eachline(path) # `path` is a file name: eachline reads the file line by line
        stripped = strip(line) # drop surrounding whitespace
        isempty(stripped) && continue # skip blank and whitespace-only lines
        # If every (all) character of the line belongs to IVXLCDM, it is a Roman numeral
        if all(c -> c in "IVXLCDM", stripped)
            if in_sonnet && !isempty(current_sonnet)
                push!(sonnets, join(current_sonnet, '\n'))
            end
            current_sonnet = String[]
            in_sonnet = true
        elseif in_sonnet
            push!(current_sonnet, line)
        end
    end
    if in_sonnet && !isempty(current_sonnet)
        push!(sonnets, join(current_sonnet, '\n'))
    end
    return sonnets
end
# Check that it works
sonnets = split_sonnets_fast("sonnets.txt")
@btime split_sonnets_fast("sonnets.txt"); # note: this timing also includes reading the file
Conclusion
Regular expressions in Julia are a powerful and convenient tool for working with text. Thanks to the simple `r"..."` syntax, built-in functions such as `match` and `replace`, and the high performance of the language, they are well suited to data processing and analysis and to search-and-replace tasks. However, it is important to understand that regular expressions can be slow when parsing complex (for example, nested) structures such as JSON or HTML files.
Although the syntax of regular expressions is complex, the extended mode (the `x` flag) lets you document patterns with comments. Overall, regular expressions are a more versatile tool for working with text than Julia's built-in functions for characters and strings.