Regular expressions in Julia
Introduction
Regular expressions (regex) are a universal tool for searching, extracting, and processing text according to specified patterns. They let you solve tasks such as validating an email format, extracting phone numbers, or analyzing textual data. In Julia, regular expressions are particularly convenient because they are built into the language and have a concise syntax. Julia does not require additional modules to work with regex (unlike some other languages) and uses the `r"..."` prefix to create patterns, which keeps the code intuitive and readable. This material shows how to use regular expressions in Julia, with an emphasis on practical examples and language features.
Regular Expression syntax in Julia
Julia uses the [PCRE](https://ru.wikipedia.org/wiki/PCRE) syntax, which supports a rich set of features. Let's look at the main elements of regular expressions and how they are used in Julia.
Basic syntax elements
- Literals: ordinary characters (for example, `a`, `b`, `1`) are matched in the text as is. For example, `r"cat"` matches the word `cat`.
- Metacharacters:
  - `.`: any character other than a newline. For example, `r"c.t"` matches `cat` and `cot`, but not `c\nt`.
  - `^`: the beginning of the line. `r"^hello"` finds `hello` only at the beginning of the line.
  - `$`: the end of the line. `r"world$"` finds `world` only at the end of the line.
  - `\`: escaping. `r"\."` searches for a literal dot rather than any character, and `r"\^"` searches for the literal `^` symbol.
- Quantifiers:
  - `*`: zero or more. `r"a*"` matches `""`, `a`, `aa`, and so on.
  - `+`: one or more. `r"a+"` matches `a` and `aa`, but not `""`.
  - `?`: zero or one. `r"colou?r"` matches `color` and `colour`.
  - `{n,m}`: from n to m repetitions. `r"a{2,4}"` matches `aa`, `aaa`, and `aaaa`.
- Character classes:
  - `[abc]`: one of the listed characters. `r"[abc]"` matches `a`, `b`, or `c`.
  - `[a-z]`: a range. `r"[a-z]+"` finds any word made of lowercase Latin letters.
  - `[^abc]`: negation. `r"[^abc]"` matches any character other than `a`, `b`, `c`.
  - `\d`: a digit. `r"\d+"` finds numbers such as `123`.
  - `\w`: a letter, a digit, or `_`. `r"\w+"` finds words such as `hello123`.
  - `\s`: a whitespace character. `r"\s+"` finds runs of spaces or tabs.
- Grouping:
  - `()`: captures part of the pattern. `r"(\d+)-(\d+)"` applied to the string `12-34` extracts the numbers `12` and `34` and saves them.
  - `(?:...)`: groups without saving the result. Such a non-capturing group is useful for simplifying the structure of an expression.
regex = r"(?:abc)+(\d+)(?:def)+(\&+)"
text = "abcabc123defdefdef&&&"
match(regex,text)[1] # returns "123"
match(regex,text)[2] # will return "&&&"
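The elements listed above can be checked directly with `occursin`. The following is a minimal sketch; the sample strings are made up for illustration and each pattern is taken from the list above or is a small variation of it.

```julia
# Each pair is (pattern, text); occursin reports whether the pattern matches anywhere in the text.
checks = [
    (r"c.t",      "cat"),          # . matches any single character except a newline
    (r"c.t",      "c\nt"),         # so a newline between c and t does not match
    (r"^hello",   "hello world"),  # ^ anchors the match to the beginning
    (r"world$",   "hello world"),  # $ anchors the match to the end
    (r"colou?r",  "colour"),       # ? makes the u optional
    (r"^a{2,4}$", "aaaaa"),        # five a's exceed the {2,4} range
    (r"^[^abc]$", "d"),            # any single character other than a, b, c
]

results = [occursin(p, t) for (p, t) in checks]

for ((p, t), ok) in zip(checks, results)
    println(rpad(repr(t), 16), p, " => ", ok)
end
```

Anchoring with `^` and `$` in the last two checks forces the whole string to satisfy the pattern, which is why `"aaaaa"` is rejected even though it contains `aaaa`.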
Features in Julia
Julia offers convenient functions for working with regular expressions:
`match(r"pattern", string)`: finds the first match. Returns a `RegexMatch` object or `nothing`.
A convenient feature of Julia is that no double escaping is needed (for example, `r"\d"` instead of `"\\d"`), which makes patterns easier to write.
m = match(r"\d+", "Age: 42") # `\d` matches a digit, `+` means one or more occurrences
println(m.match)
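Because `match` returns either a `RegexMatch` or `nothing`, it is worth checking the result before using it. Below is a small sketch with a made-up input string; it shows the `match`, `captures`, and `offset` fields of `RegexMatch`.

```julia
m = match(r"(\d+)\s*(kg|g)", "Weight: 42 kg")

if m === nothing
    println("no match")
else
    println(m.match)     # the whole matched substring
    println(m.captures)  # the captured groups, in order
    println(m.offset)    # 1-based index at which the match starts
end

# When nothing matches, the result is `nothing` rather than an empty object.
no_match = match(r"\d+", "no digits here")
println(no_match === nothing)
```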
`eachmatch(r"pattern", string)`: returns an iterator over all matches.
for m in eachmatch(r"\w+", "Hello, dear friend!") # `\w` is a letter, digit, or _ (`,` and `!` do not match)
    println(m.match)
end
`replace(string, r"pattern" => "replacement")`: replaces matches.
new = replace("Date format: 01-02-2025", r"\d" => "X")
println(new)
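The replacement does not have to be a fixed string. As a sketch: a substitution string `s"..."` can refer to captured groups by number, and a function can compute the replacement from the matched text. The example assumes the date is written as day-month-year, which the source does not state.

```julia
date = "Date format: 01-02-2025"

# s"..." is a substitution string; \1, \2, \3 refer to the captured groups.
reordered = replace(date, r"(\d{2})-(\d{2})-(\d{4})" => s"\3-\2-\1")
println(reordered)

# A function can also serve as the replacement; it receives the matched substring.
doubled = replace("1 2 3", r"\d+" => x -> string(2 * parse(Int, x)))
println(doubled)
```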
`occursin(r"pattern", string)`: checks whether there is a match.
@show occursin(r"[A-Z]", "Hello")
@show occursin(r"[A-Z]", "hello")
@show occursin(r"[A-Z]+", "HELLO");
Practical applications
Example 1: First and last name extraction
Suppose we took a photo of the visitor log of a conservatory.
We then digitized this document and ran text recognition on it, obtaining the "document" `OCR_text`.
It turned out that some letters became lowercase, and spaces were added in some places and lost in others. In some cases a pen stroke was recognized as a letter.
# Text with data
OCR_text = """
User Log:
Last name: Ivanov First name: Ivan
Last name : Petrov first name : Peter L
Last name - Rimsky-Korsakov First name -Nikolai
"""
The pattern below uses three flags:
- `i` in `r"..."i` makes the match case-insensitive, so "last name" and "Last name" are treated as equivalent.
- `m` in `r"..."m` enables multiline mode: `^` matches at the beginning of every line (after each `\n`), not only at the beginning of the whole string `OCR_text`.
- `x` in `r"..."x` enables extended mode (x for extended): whitespace in the pattern is ignored, and comments can be added with `#`.
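The effect of each flag can be seen on a two-line string. This is a minimal sketch with made-up sample text and variable names.

```julia
text = "first line\nSecond LINE"

case_sensitive   = occursin(r"second", text)     # the pattern is lowercase, the text is not
case_insensitive = occursin(r"second"i, text)    # i: ignore case
start_of_string  = occursin(r"^second"i, text)   # without m, ^ is only the start of the whole string
start_of_line    = occursin(r"^second"im, text)  # m: ^ also matches after each \n

# x: whitespace and # comments inside the pattern are ignored
weight = r"
    \d+   # one or more digits
    \s*   # optional whitespace
    kg    # the unit
"x
extended = occursin(weight, "42 kg")

println((case_sensitive, case_insensitive, start_of_string, start_of_line, extended))
```

Flags are combined by writing them one after another, as in `r"..."im`.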
We will discuss the meaning of parentheses below.
regex_fullname = r"
^Last\s*name\s*         # `Last name` at the beginning of the line (the x flag ignores literal spaces, hence \s*), then 0 or more spaces
[:-]\s*                 # then one `:` or `-` character and 0 or more spaces
([\p{L}-]+)             # the last name: `\p{L}` is any Unicode letter, `-` is a hyphen
\s*                     # 0 or more spaces after the last name
First\s*name\s*[:-]\s*  # the same as for the last name
(\p{L}+)                # the first name: any sequence of letters (including non-Latin ones)
"imx;
To extract the useful information from our document `OCR_text` with the regular expression `regex_fullname`, we will use `eachmatch`.
Note that we have 3 people, and each person has 2 attributes: a last name and a first name.
`eachmatch` returns an iterator of `RegexMatch` objects, where each object represents one match of the pattern in the text.
Our pattern captures the last name and the first name. The last name comes first in the expression, so for last names we use `m.captures[1]`. The first name is the second group, so we use `m.captures[2]`.
In other words, we build an array of (last name, first name) tuples for the visitors.
fullnames = [(m.captures[1], m.captures[2]) for m in eachmatch(regex_fullname, OCR_text)]
We will print the first and last names in title case using `titlecase`:
titlecase("abc"); # Abc
titlecase("aBC"); # Abc
for (surname, name) in fullnames
    println("Hello, $(titlecase(surname)) $(titlecase(name))!")
end
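Numbered captures can become hard to follow as a pattern grows. One possible alternative is named groups, written `(?<name>...)`, which can then be read from the match by name. The sketch below uses a simplified single-line pattern and a made-up input line; it is not the pattern from the example above.

```julia
line = "Last name: Ivanov First name: Ivan"

# (?<surname>...) and (?<name>...) are named capturing groups.
named = r"Last\s+name\s*:\s*(?<surname>[\p{L}-]+)\s+First\s+name\s*:\s*(?<name>\p{L}+)"i

m = match(named, line)
println(m[:surname], " ", m["name"])  # a group can be indexed by Symbol or by String
```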
Example 2: Extracting phone numbers
Let's say we need to find numbers in the format +7-XXX-XXX-XX-XX or 8-XXX-XXX-XX-XX.
Explanation:
`\d{3}` means exactly 3 digits.
`\+` escapes the plus sign so that it is treated as a literal.
`|` means "or".
`(?:...)` is a non-capturing group: it delimits the alternatives `+7` and `8`, after which the digits and hyphens follow. We are not interested in whether the number was written with +7 or with 8, which is why the group is non-capturing.
text1 = "The Russian numbers are +7-912-345-67-89 or 8-987-654-32-10, but not +1-234-567-89-10"
russian_phone_regex = r"(?:\+7|8)-\d{3}-\d{3}-\d{2}-\d{2}"
for m in eachmatch(russian_phone_regex, text1)
    println("The Russian number was found: ", m.match)
end
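Once the numbers are found, they often need to be brought to a single form. The sketch below uses a hypothetical helper, `normalize_phone`, that strips everything except digits and rewrites the leading 7 or 8 as +7; the input text is made up.

```julia
phone_regex = r"(?:\+7|8)-\d{3}-\d{3}-\d{2}-\d{2}"
text = "Call +7-912-345-67-89 or 8-987-654-32-10"

# Hypothetical helper: keep only the digits, then replace the leading 7 or 8 with +7.
function normalize_phone(raw)
    digits = replace(raw, r"\D" => "")  # \D matches any non-digit character
    return "+7" * digits[2:end]
end

normalized = [normalize_phone(m.match) for m in eachmatch(phone_regex, text)]
println(normalized)
```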
Example 3: Checking email addresses
Let's check whether an email address is well-formed:
email = "test_User-name.123@pochta.ru"
email_regex = r"^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-z]{2,}$"
if match(email_regex, email) !== nothing
    println("Email is correct")
else
    println("Incorrect email address")
end
Explanation:
`^[a-zA-Z0-9._-]+` requires a user name consisting of letters, digits, and a few symbols, and `\.[a-z]{2,}$` requires a top-level domain of at least 2 lowercase letters.
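To see what this pattern accepts and rejects, it can be wrapped in a small predicate and run over several candidates. The helper name `is_valid_email` and the extra addresses are made up for illustration. The pattern is a simplified check, not a full validation of email syntax.

```julia
email_regex = r"^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-z]{2,}$"

# Hypothetical helper: occursin is enough when only a yes/no answer is needed.
is_valid_email(s) = occursin(email_regex, s)

candidates = [
    "test_User-name.123@pochta.ru",  # matches the pattern
    "no-at-sign.ru",                 # no @ character
    "user@host",                     # no top-level domain
    "user@host.RU",                  # uppercase top-level domain is rejected by [a-z]
]

verdicts = [is_valid_email(c) for c in candidates]
for (c, ok) in zip(candidates, verdicts)
    println(rpad(c, 32), ok)
end
```

The last candidate shows a limitation of this particular pattern: adding the `i` flag would make it accept uppercase domains as well.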
Example 4: Processing bibliographic footnotes
Extract the footnotes of the form [1], [1, 2]:
text2 = "The text refers to [1], [2, 3] and [4], and contains mathematical expressions: 1 + (2{3 - x[y-z]})."
ref_regex = r"\[\d+(?:,\s*\d+)*\]"
matches = [m.match for m in eachmatch(ref_regex, text2)]
println("Footnotes: ", join(matches, ", "))
Explanation: `(?:,\s*\d+)*` is a non-capturing group that matches the further comma-separated numbers, zero or more times.
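If the individual reference numbers are needed rather than the bracketed strings, a second, simpler pattern can be applied to each match. This is a sketch with a shortened, made-up input text.

```julia
text = "The text refers to [1], [2, 3] and [4], and x[y-z] is not a reference."
ref_regex = r"\[\d+(?:,\s*\d+)*\]"

# Outer loop: each bracketed footnote; inner loop: each number inside it.
refs = [parse(Int, n.match)
        for m in eachmatch(ref_regex, text)
        for n in eachmatch(r"\d+", m.match)]

println(refs)
```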
Example 5: Extracting Shakespeare's sonnets
Suppose we are given [William Shakespeare's sonnets](https://engee.com/community/ru/catalogs/projects/analiz-tekstovykh-dannykh-s-pomoshchiu-massivov-strok), numbered with Roman numerals. Let's create an array of these sonnets in their original order, so that each one can be easily accessed by index.
sonnets_text = read("sonnets.txt",String);
print(first(sonnets_text, 1000)) # first 1000 characters; safe even if the text contains non-ASCII characters
The sonnets contain line breaks, so the dot cannot be used to stand for any character (see the description of `.` in the section "Basic syntax elements"). To get around this, we use the following:
`\s` means a whitespace character.
`\S` means any character that is NOT whitespace.
Therefore `[\s\S]` means any character at all, including a newline.
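The difference is easy to check on a two-line string. The sketch below also shows the `s` flag, which is another way to let the dot match a newline; the sample text is made up.

```julia
text = "first\nsecond"

m_dot   = match(r"first.*second", text)        # . stops at the newline, so there is no match
m_class = match(r"first[\s\S]*second", text)   # [\s\S] matches any character, including \n
m_flag  = match(r"first.*second"s, text)       # the s flag lets . match \n as well

println(m_dot === nothing)
println(m_class.match == text)
println(m_flag.match == text)
```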
function split_sonnets(text)
    pattern = r"""
        ^                   # Beginning of a line (the m flag applies ^ to every line)
        \s*                 # Zero or more whitespace characters before the Roman numeral
        [IVXLCDM]+          # One or more Roman numeral characters (I, V, X, L, C, D, M)
        \s*                 # Zero or more whitespace characters after the numeral
        $                   # End of the line (the line contains only the numeral)
        \s*                 # Whitespace or empty lines after the numeral
        # ___________________________________________________________________________________________________
        (                   # Start of the capturing group for the sonnet text
            [\s\S]*?        # Any characters (including \n), lazily: up to the nearest stop condition
        )                   # End of the capturing group
        # ___________________________________________________________________________________________________
        (?=                 # Positive lookahead (stop condition)
            ^               # Beginning of the next line
            \s*             # Whitespace before the next numeral
            [IVXLCDM]+      # The next Roman numeral
            \s*             # Whitespace after it
            $               # End of the line with the numeral
            |               # Or
            \z              # The absolute end of the text (for the last sonnet)
        )                   # End of the lookahead
    """mx  # Flags: m (multiline mode), x (extended mode)
    sonnets = [strip(m.captures[1]) for m in eachmatch(pattern, text)]
    return sonnets
end
sonnets = split_sonnets(sonnets_text)
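The same idea (a lazy capture bounded by a lookahead) can be tried without the file. The sketch below writes the pattern on one line and applies it to a short inline excerpt: the opening two lines of the first two sonnets. The names `sample`, `compact`, and `parts` are introduced only for this illustration.

```julia
sample = """
I
From fairest creatures we desire increase,
That thereby beauty's rose might never die,

II
When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
"""

# Header line, then a lazy capture that stops before the next header line or at the end of the text.
compact = r"^\s*[IVXLCDM]+\s*$\s*([\s\S]*?)(?=^\s*[IVXLCDM]+\s*$|\z)"m

parts = [strip(m.captures[1]) for m in eachmatch(compact, sample)]

for (i, part) in enumerate(parts)
    println("--- ", i, " ---")
    println(part)
end
```

Because the lookahead does not consume the next header, that header is still available as the start of the following match.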
Now let's print only the first line of each of the first five sonnets.
To do this, we use `split` to divide each sonnet into 2 parts:
- Part 1: the first line
- Part 2: all the remaining lines
s = """Line 1
Line 2
Line 3
Line 4"""
split(s,'\n',limit=2)
for (i, sonnet) in enumerate(sonnets[1:5])
    println("""Sonnet $i:$(split(sonnet,'\n',limit=2)[1])\n...""")
end
Let's measure the speed of our function.
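import Pkg
Pkg.add("BenchmarkTools")
using BenchmarkTools
@btime split_sonnets($sonnets_text); # $ interpolates the global variable so that access to it is not part of the measurement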
1.24 milliseconds is a pretty good result for a file of 2.5 thousand lines. However, keep in mind that regular expressions can be slower than classical approaches. In our case, the problem can also be solved in a fairly explicit way. **(You do not have to dive into the code below; just look at its execution speed.)**
function split_sonnets_fast(text)
    # Note: `text` is a file name here, because eachline opens the file and reads it line by line
    sonnets = String[]
    current_sonnet = String[]
    in_sonnet = false
    for line in eachline(text)
        stripped = strip(line)
        # Skip empty and whitespace-only lines (otherwise `all` below would be true for an empty string)
        isempty(stripped) && continue
        # If every character of the line belongs to IVXLCDM, the line is a Roman numeral
        if all(c -> c in "IVXLCDM", stripped)
            if in_sonnet && !isempty(current_sonnet)
                push!(sonnets, join(current_sonnet, '\n'))
            end
            current_sonnet = String[]
            in_sonnet = true
        elseif in_sonnet
            push!(current_sonnet, line)
        end
    end
    if in_sonnet && !isempty(current_sonnet)
        push!(sonnets, join(current_sonnet, '\n'))
    end
    return sonnets
end
# Checking the work
sonnets = split_sonnets_fast("sonnets.txt")
@btime split_sonnets_fast("sonnets.txt");
Conclusion
Regular expressions in Julia are a powerful and convenient tool for working with text. Thanks to the simple syntax (`r"..."`), built-in functions such as `match` and `replace`, and the high performance of the language, they are well suited to data processing and analysis and to search-and-replace tasks. However, it is important to understand that regular expressions can be slow or ill-suited for parsing complex (for example, nested) structures such as JSON or HTML files.
Although regular expression syntax is complex, the extended mode (the `x` flag) lets you document a pattern with comments. Overall, regular expressions are a more versatile tool for working with text than Julia's built-in functions for characters and strings.