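Regular Expressions in Julia

Introduction

Regular expressions (regex) are a general-purpose tool for searching, extracting, and processing text according to a given pattern. They help with tasks such as checking the format of an email address, extracting phone numbers, or analyzing text data. In Julia, regular expressions are especially convenient thanks to easy integration and concise syntax. Unlike some other languages, Julia needs no additional module to use regular expressions, and patterns are created with the r"..." prefix, which keeps code readable. This material shows how to work with regular expressions in Julia, with a focus on practical examples and language features.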

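Regular expression syntax in Julia

Julia uses [PCRE](https://ru.wikipedia.org/wiki/PCRE) syntax, which supports a rich set of features. Let us look at the main elements of regular expressions and how they are used in Julia.

Basic syntax elements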

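- Literals: ordinary characters (for example a, b, 1) are matched as is. For example, r"cat" matches the word cat.
- Metacharacters:

. - any character except a newline. For example, r"c.t" matches cat and cot, but not c\nt.
^ - start of the string (or of each line when the m flag is set): r"^hello" finds hello only at the start.
$ - end of the string (or of each line when the m flag is set): r"world$" finds world only at the end.
\ - escaping: r"\." matches a literal dot rather than any character. r"\^" matches the literal character ^.
- Quantifiers:
* - zero or more: r"a*" matches "", a, aa, and so on.
+ - one or more: r"a+" matches a, aa, but not "".
? - zero or one: r"colou?r" matches color and colour.
{n,m} - repetition range: r"a{2,4}" matches aa, aaa, aaaa.
- Character classes:
[abc] - one of the listed characters: r"[abc]" matches a, b, or c.
[a-z] - range: r"[a-z]+" finds any run of lowercase Latin letters.
[^abc] - negation: r"[^abc]" matches any character except a, b, c.
\d - digit: r"\d+" finds numbers such as 123.
\w - letter, digit, or _: r"\w+" finds words such as hello123.
\s - whitespace character: r"\s+" finds spaces or tabs.
- Grouping:
() - captures part of the pattern: r"(\d+)-(\d+)" extracts the numbers 12 and 34 from the string 12-34 and stores them.
(?:...) - a group that does not "store" its result. Non-capturing groups are useful for simplifying the structure of an expression.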

regex = r"(?:abc)+(\d+)(?:def)+(\&+)"
text = "abcabc123defdefdef&&&"
match(regex, text)[1] # returns "123"
match(regex, text)[2] # returns "&&&"
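The elements listed above can be checked one by one with occursin. The following is a minimal sketch: the sample strings and the `checks` table are made up for illustration, and the third column records the result each pattern is expected to give.

```julia
# (pattern, sample text, expected result of occursin)
checks = [
    (r"c.t",      "cot",         true),   # . matches any character except a newline
    (r"c.t",      "c\nt",        false),  # so a newline between c and t does not match
    (r"^hello",   "hello world", true),   # hello is at the start
    (r"world$",   "hello world", true),   # world is at the end
    (r"colou?r",  "colour",      true),   # ? makes the u optional
    (r"^a{2,4}$", "aaaaa",       false),  # five a's exceed the {2,4} range
    (r"[^abc]",   "abc",         false),  # there is no character other than a, b, c
    (r"\d+",      "abc",         false),  # there are no digits at all
]

for (pattern, text, expected) in checks
    found = occursin(pattern, text)
    note = found == expected ? "" : "  <- unexpected"
    println(rpad(repr(pattern), 14), rpad(repr(text), 16), found, note)
end
```

Anchoring the quantifier example with ^ and $ matters: without them, r"a{2,4}" would still find a match inside "aaaaa".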

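Julia specifics

Julia provides convenient functions for working with regular expressions:

match(r"pattern", string): finds the first match. Returns a RegexMatch object, or nothing if there is no match.

A feature of Julia is that no double escaping is needed in r"..." literals (for example, r"\d" rather than "\\d"), which makes patterns easier to write.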

m = match(r"\d+", "Age: 42")  # \d matches a digit, + means one or more occurrences
println(m.match)

42
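Because match may return nothing, it is safer to test the result before reading its fields. The sketch below uses an illustrative input string and shows the fields of RegexMatch that are needed most often, plus the doubled backslashes required when a pattern is built from an ordinary string with the Regex constructor.

```julia
m = match(r"(\d+)-(\d+)", "range 12-34")
if m !== nothing
    println(m.match)      # the whole matched substring: 12-34
    println(m.captures)   # the two captured groups, "12" and "34"
    println(m.offset)     # index at which the match starts: 7
end

# There are no digits in this string, so match returns nothing
no_match = match(r"\d+", "no digits here")
println(no_match === nothing)   # true

# The same pattern built from an ordinary string needs doubled backslashes
println(Regex("\\d+").pattern == r"\d+".pattern)   # true
```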

eachmatch(r"pattern", string): an iterator over all matches.

for m in eachmatch(r"\w+", "Hello, dear friend!")  # \w is a letter, a digit, or _ (',' and '!' do not qualify)
    println(m.match)
end

Hello
dear
friend

replace(string, r"pattern" => "replacement"): replaces the matches.

new = replace("Date format: 01-02-2025", r"\d" => "X")
println(new)

Date format: XX-XX-XXXX
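The replacement does not have to be a fixed string. As a hedged sketch (the date is assumed to be in DD-MM-YYYY order, and the variable names are illustrative), a substitution string s"..." can refer to captured groups by number, and a function can compute the replacement from each match:

```julia
date = "01-02-2025"   # assumed to be DD-MM-YYYY

# \1, \2, \3 refer to the first, second, and third capturing group
iso = replace(date, r"(\d{2})-(\d{2})-(\d{4})" => s"\3-\2-\1")
println(iso)      # 2025-02-01

# A function receives each matched substring and returns its replacement
titled = replace("regex in julia", r"\b\w" => uppercase)
println(titled)   # Regex In Julia
```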

occursin(r"pattern", string): checks whether the string contains a match.

@show occursin(r"[A-Z]", "Hello")
@show occursin(r"[A-Z]", "hello")
@show occursin(r"[A-Z]+", "HELLO");

occursin(r"[A-Z]", "Hello") = true
occursin(r"[A-Z]", "hello") = false
occursin(r"[A-Z]+", "HELLO") = true
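Since occursin returns a Bool, it combines naturally with filter. A small sketch with made-up log lines follows; the one-argument form occursin(pattern), which returns a predicate, is assumed to be available (Julia 1.6 or later).

```julia
lines = ["Error: disk full", "Info: started", "Error: timeout"]

# Keep only the lines that start with "Error"
errors = filter(line -> occursin(r"^Error", line), lines)
println(errors)

# occursin with a single argument returns a function that tests a string
has_digit = occursin(r"\d")
with_digits = filter(has_digit, ["abc", "a1", "42"])
println(with_digits)
```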

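Practical applications

Example 1: Extracting first and last names

Suppose a page of a greenhouse visitor log has been photographed.
We then digitized this document and ran text recognition, obtaining the "document" OCR_text.
It turned out that some letters became lowercase, and extra spaces appeared in some places and disappeared in others. In some cases, strokes were recognized as letters.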

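# Text with the data
OCR_text = """
User log:
Surname: Ivanov Name: Ivan
surname:petrov name:petr l
Surname - Rimsky-Korsakov Name - Nikolai    
"""

"User log:\nSurname: Ivanov Name: Ivan\nsurname:petrov name:petr l\nSurname - Rimsky-Korsakov Name - Nikolai    \n"

Flags are written after the closing quote of the pattern:

i in r"..."i makes matching case-insensitive. That is, "Surname" and "surname" are treated as equivalent.
m in r"..."m enables multiline mode: ^ matches at the start of every line (after each \n), not only at the start of the whole string OCR_text.
x in r"..."x enables extended mode: whitespace in the pattern is ignored and comments can be added with # (x stands for extended).

The meaning of the parentheses is discussed below.

regex_fullname = r"
        ^Surname\s*    # 'Surname' at the start of a line, followed by 0 or more spaces
        [:-]\s*        # then a ':' or '-' character and 0 or more spaces
        ([\p{L}-]+)    # capture 1: Unicode letters (\p{L}) and hyphens
        \s*            # 0 or more spaces after the surname
        Name\s*[:-]\s* # same as for the surname
        (\p{L}+)       # capture 2: any sequence of letters (including non-Latin ones), the first name
"imx;

To extract the useful information from our document OCR_text with the regular expression regex_fullname, we use eachmatch.

Note that we have 3 people. Each has 2 attributes: a surname and a first name.

eachmatch returns an iterator of RegexMatch objects, each of which represents one match of the pattern in the text.

Our pattern contains two capturing groups. The surname group comes first in the expression, so the surname is m.captures[1]. The first name comes second, so it is m.captures[2].

In other words, we build an array of tuples holding each visitor's surname and first name.

fullnames = [(m.captures[1], m.captures[2]) for m in eachmatch(regex_fullname, OCR_text)]

3-element Vector{Tuple{SubString{String}, SubString{String}}}:
 ("Ivanov", "Ivan")
 ("petrov", "petr")
 ("Rimsky-Korsakov", "Nikolai")

We will print the surnames and first names in title case:

titlecase("abc"); # Abc
titlecase("aBC"); # Abc

for (surname, name) in fullnames
    println("Hello, $(titlecase(surname)) $(titlecase(name))!")
end

Hello, Ivanov Ivan!
Hello, Petrov Petr!
Hello, Rimsky-Korsakov Nikolai!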
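The effect of the i, m, and x flags used in this example can be checked in isolation. The following is a minimal sketch on made-up strings; each pattern is tried with and without the flag.

```julia
text = "first line\nSecond line"

# i: case-insensitive matching
println(occursin(r"second", text))     # false, the case differs
println(occursin(r"second"i, text))    # true

# m: ^ also matches right after each newline
println(occursin(r"^Second", text))    # false, ^ only matches at the start of the string
println(occursin(r"^Second"m, text))   # true

# x: whitespace inside the pattern is ignored, so this means \d{2}-\d{2}
println(occursin(r"\d{2} - \d{2}"x, "12-34"))   # true
println(occursin(r"\d{2} - \d{2}", "12-34"))    # false, the spaces are literal here
```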

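Example 2: Extracting phone numbers

Suppose we need to find numbers in the format +7-XXX-XXX-XX-XX or 8-XXX-XXX-XX-XX:

Explanation:

\d{3} means exactly 3 digits.

\+ escapes the plus sign so that it is matched literally.

| means "or".

(?:...) is a non-capturing group, that is, a sub-part of the expression that we want to delimit separately (+7 or 8, followed by the groups of digits and hyphens).

Whether a number was written with +7 or with 8 does not interest us, which is why the group is non-capturing.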

text1 = "Russian numbers are +7-912-345-67-89 or 8-987-654-32-10, but not +1-234-567-89-10"

russian_phone_regex = r"(?:\+7|8)-\d{3}-\d{3}-\d{2}-\d{2}"

for m in eachmatch(russian_phone_regex, text1)
    println("Russian number found: ", m.match)
end

Russian number found: +7-912-345-67-89
Russian number found: 8-987-654-32-10
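To make the difference between capturing and non-capturing groups concrete, here is a small sketch that matches the same prefix both ways. The shortened patterns and variable names are illustrative only.

```julia
number = "+7-912-345-67-89"

# (?:...) groups the alternatives but stores nothing
non_capturing = match(r"(?:\+7|8)-(\d{3})", number)
println(non_capturing.captures)   # only the three digits after the prefix

# (...) stores the prefix as the first capture
capturing = match(r"(\+7|8)-(\d{3})", number)
println(capturing.captures)       # the prefix and the three digits
```

With the non-capturing form the numbering of the remaining groups does not shift, which keeps captures[1] pointing at the data we actually need.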

Example 3: Validating an email address

Let us check whether an email address is well formed:

email = "test_User-name.123@pochta.ru"

email_regex = r"^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-z]{2,}$"

if match(email_regex, email) !== nothing
    println("The email address is valid")
else
    println("The email address is not valid")
end

The email address is valid

Explanation: ^[a-zA-Z0-9._-]+ requires a user name made of letters, digits, and a few symbols, and \.[a-z]{2,}$ requires a top-level domain of at least 2 lowercase letters.
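The same check can be wrapped in a small helper and tried on several inputs. This is only a sketch: the helper name is_valid_email and the candidate addresses are made up, and the pattern is the simplified one from this example rather than a complete email validator.

```julia
email_pattern = r"^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-z]{2,}$"

# occursin is enough here because only a yes/no answer is needed
is_valid_email(address) = occursin(email_pattern, address)

candidates = [
    "test_User-name.123@pochta.ru",  # well formed
    "no-at-sign.example.com",        # no @ character
    "user@host",                     # no top-level domain
    "user@host.C",                   # top-level domain too short and not lowercase
]

for address in candidates
    println(rpad(address, 32), is_valid_email(address))
end
```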

Example 4: Processing bibliographic footnotes

Extract footnotes of the form [1], [1, 2]:

text2 = "The text cites [1], [2, 3] and [4], and contains a mathematical expression: 1+(2{3-x[y-z]})."
ref_regex = r"\[\d+(?:,\s*\d+)*\]"
matches = [m.match for m in eachmatch(ref_regex, text2)]
println("Footnotes: ", join(matches, ", "))

Footnotes: [1], [2, 3], [4]

Explanation: (?:,\s*\d+)* is a non-capturing group that matches each further number preceded by a comma.
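Once the bracketed footnotes are found, the individual numbers can be pulled out with a second, simpler pattern applied to each match. A minimal sketch, with an illustrative input string and variable names:

```julia
text = "See [1], [2, 3] and [10]."
footnote = r"\[\d+(?:,\s*\d+)*\]"

# The outer loop finds each bracketed footnote, the inner loop finds each number inside it
numbers = [parse(Int, d.match)
           for ref in eachmatch(footnote, text)
           for d in eachmatch(r"\d+", ref.match)]
println(numbers)   # [1, 2, 3, 10]
```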

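Example 5: Extracting Shakespeare's sonnets

Suppose we have [the sonnets of William Shakespeare](https://engee.com/community/ru/catalogs/projects/analiz-tekstovykh-dannykh-s-pomoshchiu-massivov-strok), numbered with Roman numerals. Let us build an array of these sonnets, kept in their original order, so that each one can be accessed easily by index.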

sonnets_text = read("sonnets.txt",String);
print(sonnets_text[1:1000])

THE SONNETS

by William Shakespeare


  I

  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But as the riper should by time decease,
  His tender heir might bear his memory:
  But thou, contracted to thine own bright eyes,
  Feed'st thy light's flame with self-substantial fuel,
  Making a famine where abundance lies,
  Thy self thy foe, to thy sweet self too cruel:
  Thou that art now the world's fresh ornament,
  And only herald to the gaudy spring,
  Within thine own bud buriest thy content,
  And tender churl mak'st waste in niggarding:
    Pity the world, or else this glutton be,
    To eat the world's due, by the grave and thee.

  II

  When forty winters shall besiege thy brow,
  And dig deep trenches in thy beauty's field,
  Thy youth's proud livery so gazed on now,
  Will be a tatter'd weed of small worth held:
  Then being asked, where all thy beauty lies,
  Where all the treasure of thy lusty days;
  To say, within thine own deep

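The sonnets contain line breaks, so the dot cannot be used to mean any character (see the beginning of the section on regular expression syntax). To work around this, use:

\s matches a whitespace character.

\S matches a non-whitespace character.

This means that [\s\S] matches any character.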
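A short sketch on made-up strings shows the difference between the dot and [\s\S], and between a greedy and a lazy quantifier. The s flag, which lets the dot match newlines, is shown as an alternative.

```julia
two_lines = "start\nend"

println(match(r"start.*end", two_lines) === nothing)        # true, the dot stops at the newline
println(match(r"start[\s\S]*end", two_lines) !== nothing)   # true, [\s\S] crosses the newline
println(match(r"start.*end"s, two_lines) !== nothing)       # true, with the s flag the dot matches \n too

tags = "<a>\n<b>"
greedy = match(r"<[\s\S]*>", tags).match    # takes as much as possible
lazy   = match(r"<[\s\S]*?>", tags).match   # stops at the first possible end
println(repr(greedy))   # "<a>\n<b>"
println(repr(lazy))     # "<a>"
```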

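function split_sonnets(text)
    pattern = r"""
        ^                    # start of a line (of every line, thanks to the m flag)
        \s*                  # zero or more whitespace characters before the Roman numeral
        [IVXLCDM]+           # one or more Roman numeral characters (I, V, X, L, C, D, M)
        \s*                  # zero or more whitespace characters after the numeral
        $                    # end of the line (the line holds only the numeral)
        \s*                  # whitespace or blank lines after the numeral
# ___________________________________________________________________________________________________
        (                    # start of the capturing group for the sonnet text
            [\s\S]*?         # any characters (including \n), lazily (up to the nearest stop)
        )                    # end of the capturing group
# ___________________________________________________________________________________________________
        (?=                  # positive lookahead (the stop condition)
            ^                # start of the next line
            \s*              # whitespace before the next numeral
            [IVXLCDM]+       # the next Roman numeral
            \s*              # whitespace after it
            $                # end of the line with the numeral
            |                # or
            \z               # absolute end of the text (for the last sonnet)
        )                    # end of the lookahead
    """mx                    # flags: m (multiline mode), x (extended mode)
    sonnets = [strip(m.captures[1]) for m in eachmatch(pattern, text)]
    return sonnets
end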

split_sonnets (generic function with 1 method)
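To see how the lazy group and the lookahead cooperate, here is a minimal sketch that applies a compact, single-line form of the same pattern to a tiny made-up text. The sample string and the names `compact` and `parts` are illustrative and do not come from the sonnets file.

```julia
# Two numbered sections, laid out like the sonnets file
sample = "  I\n\n  first poem\n  second line\n\n  II\n\n  another poem\n"

# Numeral line, then a lazy capture that stops before the next numeral line or at the end of the text
compact = r"^\s*[IVXLCDM]+\s*$\s*([\s\S]*?)(?=^\s*[IVXLCDM]+\s*$|\z)"m

parts = [strip(m.captures[1]) for m in eachmatch(compact, sample)]
foreach(part -> println(repr(part)), parts)
```

Because the lookahead does not consume the next numeral line, the following match can start exactly there.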

sonnets = split_sonnets(sonnets_text)

154-element Vector{SubString{String}}:
 "From fairest creatures we desir" ⋯ 579 bytes ⋯ "'s due, by the grave and thee."
 "When forty winters shall besieg" ⋯ 597 bytes ⋯ "arm when thou feel'st it cold."
 "Look in thy glass and tell the " ⋯ 576 bytes ⋯ "nd thine image dies with thee."
 "Unthrifty loveliness, why dost " ⋯ 558 bytes ⋯ "sed, lives th' executor to be."
 "Those hours, that with gentle w" ⋯ 593 bytes ⋯ "r substance still lives sweet."
 "Then let not winter's ragged ha" ⋯ 580 bytes ⋯ "est and make worms thine heir."
 "Lo! in the orient when the grac" ⋯ 552 bytes ⋯ "n diest unless thou get a son."
 "Music to hear, why hear'st thou" ⋯ 612 bytes ⋯ "'Thou single wilt prove none.'"
 "Is it for fear to wet a widow's" ⋯ 585 bytes ⋯ " such murd'rous shame commits."
 "For shame! deny that thou bear'" ⋯ 591 bytes ⋯ "ill may live in thine or thee."
 "As fast as thou shalt wane, so " ⋯ 645 bytes ⋯ "t more, not let that copy die."
 "When I do count the clock that " ⋯ 594 bytes ⋯ " him when he takes thee hence."
 "O! that you were your self; but" ⋯ 574 bytes ⋯ "a father: let your son say so."
 ⋮
 "Lo, as a careful housewife runs" ⋯ 597 bytes ⋯ "back and my loud crying still."
 "Two loves I have of comfort and" ⋯ 542 bytes ⋯ "ad angel fire my good one out."
 "Those lips that Love's own hand" ⋯ 475 bytes ⋯ "v'd my life, saying 'not you'."
 "Poor soul, the centre of my sin" ⋯ 576 bytes ⋯ "d, there's no more dying then."
 "My love is as a fever longing s" ⋯ 563 bytes ⋯ "ack as hell, as dark as night."
 "O me! what eyes hath Love put i" ⋯ 594 bytes ⋯ "g thy foul faults should find."
 "Canst thou, O cruel! say I love" ⋯ 548 bytes ⋯ "e thou lov'st, and I am blind."
 "O! from what power hast thou th" ⋯ 576 bytes ⋯ "orthy I to be belov'd of thee."
 "Love is too young to know what " ⋯ 576 bytes ⋯ "ose dear love I rise and fall."
 "In loving thee thou know'st I a" ⋯ 602 bytes ⋯ "ainst the truth so foul a lie!"
 "Cupid laid by his brand and fel" ⋯ 578 bytes ⋯ "t new fire; my mistress' eyes."
 "The little Love-god lying once " ⋯ 563 bytes ⋯ "s water, water cools not love."

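Now let us print only the first line of each of the first five sonnets.

To do this, we use split to divide each sonnet into 2 parts:

- Part 1: the first line
- Part 2: all the remaining lines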

s = """line 1
       line 2
       line 3
       line 4"""
split(s, '\n', limit=2)

2-element Vector{SubString{String}}:
 "line 1"
 "line 2\nline 3\nline 4"

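for (i, sonnet) in enumerate(sonnets[1:5])
    println("Sonnet $i: $(strip(split(sonnet, '\n', limit=2)[1]))\n...")
end

Sonnet 1: From fairest creatures we desire increase,
...
Sonnet 2: When forty winters shall besiege thy brow,
...
Sonnet 3: Look in thy glass and tell the face thou viewest
...
Sonnet 4: Unthrifty loveliness, why dost thou spend
...
Sonnet 5: Those hours, that with gentle work did frame
...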

Let us measure the speed of our function.

import Pkg
Pkg.add("BenchmarkTools")

using BenchmarkTools
@btime split_sonnets(sonnets_text);

  921.698 μs (621 allocations: 39.19 KiB)
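If BenchmarkTools is not available, a rough timing can be taken with the built-in @elapsed macro. This is only a sketch: the helper count_words and the generated sample text are made up, and a single run is far noisier than the repeated measurements that @btime performs.

```julia
# Hypothetical workload: count the words in a generated text
count_words(text) = length(collect(eachmatch(r"\w+", text)))

sample = repeat("lorem ipsum dolor ", 10_000)

n = count_words(sample)                   # first call, so compilation is not included in the timing
seconds = @elapsed count_words(sample)    # wall-clock time of one more call, in seconds

println("found $n words in $(round(seconds * 1e3, digits=3)) ms")
```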

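A bit under a millisecond (921.698 μs in the run above) is a fairly good result for a file of about 2.5 thousand lines. Keep in mind, however, that regular expressions can be slower than a classical approach. In our case the problem can also be solved in a fairly explicit way. (You do not have to dig into the code below; just look at how fast it runs.)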

function split_sonnets_fast(path)
    sonnets = String[]
    current_sonnet = String[]
    in_sonnet = false

    for line in eachline(path)  # path is a file name; eachline reads the file line by line
        stripped = strip(line)
        isempty(stripped) && continue  # skip empty and whitespace-only lines
        # A line is a Roman numeral if every one of its characters belongs to IVXLCDM.
        if all(c -> c in "IVXLCDM", stripped)
            if in_sonnet && !isempty(current_sonnet)
                push!(sonnets, join(current_sonnet, '\n'))
            end
            current_sonnet = String[]
            in_sonnet = true
        elseif in_sonnet
            push!(current_sonnet, line)
        end
    end

    if in_sonnet && !isempty(current_sonnet)
        push!(sonnets, join(current_sonnet, '\n'))
    end

    return sonnets
end


# Check that it works. The function takes a file name and reads the file itself,
# so the timing below also includes file I/O.
sonnets = split_sonnets_fast("sonnets.txt")

@btime split_sonnets_fast("sonnets.txt");

  752.183 μs (3870 allocations: 392.40 KiB)
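The test at the heart of the loop-based version, whether a line consists only of Roman numeral characters, can be isolated and tried on a few inputs. A minimal sketch follows; the helper name is_roman_line is made up for illustration.

```julia
# True when the line, ignoring surrounding whitespace, is non-empty
# and consists only of the characters I, V, X, L, C, D, M
function is_roman_line(line)
    stripped = strip(line)
    return !isempty(stripped) && all(c -> c in "IVXLCDM", stripped)
end

for line in ["  XIV  ", "   ", "Ivan", "  From fairest creatures we desire increase,"]
    println(rpad(repr(line), 50), is_roman_line(line))
end
```

The explicit emptiness check matters because all returns true for an empty collection, so a whitespace-only line would otherwise be mistaken for a numeral.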

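Conclusion

Regular expressions in Julia are a powerful and convenient tool for working with text. Thanks to the simple syntax (r"..."), built-in functions such as match and replace, and the high performance of the language, they are well suited to data processing and analysis and to search-and-replace tasks. It is important to understand, however, that regular expressions can be slow when parsing complex (for example, nested) structures such as JSON or HTML files.

Although regular expression syntax is complex, extended mode lets you annotate patterns with comments. Compared with Julia's built-in functions for handling characters and strings, regular expressions are a more general tool for processing text.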