Engee documentation

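Checking autograms in Julia

This example demonstrates an implementation of an algorithm for validating so-called autograms in natural language: sentences that contain an accurate description of their own structure, that is, a description of certain letters, punctuation marks, and other…

Introduction

What is an autogram?

An autogram is a special kind of autological (self-describing) sentence that describes its own structure. In other words, such a sentence states how many times each letter of the alphabet, punctuation mark, or other symbol occurs in it. An autogram is correct if the stated count for every mentioned item matches the actual number of occurrences in that same sentence.

For example, an autogram might say "this sentence contains six letters 'e'"; if the sentence really does contain six letters 'e', that part of the description is correct.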
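
As a quick illustration of what "the stated count matches the actual count" means, the sketch below counts a single letter in one of the test sentences used later in this example. It relies only on Base functions, and the variable names are illustrative.

```julia
# One of the test sentences from this example; it claims to contain fifteen e's.
sentence = "Fifteen e's, seven f's, four g's, six h's, eight i's, four n's, five o's, six r's, eighteen s's, eight t's, four u's, three v's, two w's, three x's."

# Count the occurrences of 'e' after lowercasing, as the checker does.
actual = count(==('e'), lowercase(sentence))

println("stated: 15, actual: ", actual)
```

A full check repeats this comparison for every character the sentence mentions.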

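Purpose of the algorithm

The implementation checks whether an autogram describes itself correctly, namely:

  1. For every mentioned letter or symbol, the stated number equals the actual number.
  2. Every symbol that occurs in the sentence is actually accounted for in the description.
  3. Punctuation marks are taken into account when required.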
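
The third point can be sketched as a small predicate that decides which characters have to be mentioned. The helper name `needsmention` is hypothetical and does not appear in this example's code; it mirrors the condition used there: letters always count, spaces never do, and any other character counts only when punctuation counting is enabled.

```julia
# Hypothetical helper: does character `c` have to be mentioned in the autogram?
needsmention(c::Char, countpunctuation::Bool) =
    isletter(c) || (c != ' ' && countpunctuation)

# Show the decision for a letter, an apostrophe, and a space,
# first without and then with punctuation counting.
for c in ('e', '\'', ' ')
    println(repr(c), " => ", needsmention(c, false), " / ", needsmention(c, true))
end
```

The explicit parentheses spell out how Julia already groups the expression, since `&&` binds more tightly than `||`.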

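The algorithm can be used in games, linguistic puzzles, and text generation, and even as a playful form of self-reflection in artificial intelligence.

Main part

Loading the required packages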

In [ ]:
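# Install the DataStructures package (if it is not installed yet)
import Pkg; Pkg.add("DataStructures")

# DataStructures provides `counter`, which makes character frequency counting convenient
using DataStructures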
   Resolving package versions...
   Installed FiniteDiff ─ v2.27.0
    Updating `~/.project/Project.toml`
 [864edb3b] + DataStructures v0.18.22
  No Changes to `~/.project/Manifest.toml`
WARNING: using DataStructures.reset! in module Main conflicts with an existing identifier.
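
For readers who want to see what the frequency counting amounts to, here is a minimal Base-only sketch. It assumes a plain `Dict` is sufficient for illustration; the example itself uses `counter` from DataStructures, and the name `charfrequencies` is made up for this sketch.

```julia
# Hypothetical helper: count how often each character occurs in `s`.
function charfrequencies(s)
    counts = Dict{Char,Int}()
    for ch in s
        counts[ch] = get(counts, ch, 0) + 1   # start from 0 for unseen characters
    end
    return counts
end

freq = charfrequencies(lowercase("Two e's"))
println(freq)

# A character that never occurs needs an explicit default with a plain Dict.
println(get(freq, 'z', 0))
```

With a plain `Dict`, looking up an absent character requires `get` with a default, which is one reason a dedicated counting structure is more convenient here.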

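Defining the dictionary of number words

In [ ]:
# Dictionary of number words: "single", one to nineteen, and the tens from twenty to ninety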
const textnumbers = Dict(
    "single" => 1, "one" => 1, "two" => 2, "three" => 3, "four" => 4,
    "five" => 5, "six" => 6, "seven" => 7, "eight" => 8, "nine" => 9,
    "ten" => 10, "eleven" => 11, "twelve" => 12, "thirteen" => 13,
    "fourteen" => 14, "fifteen" => 15, "sixteen" => 16,
    "seventeen" => 17, "eighteen" => 18, "nineteen" => 19,
    "twenty" => 20, "thirty" => 30, "forty" => 40, "fifty" => 50,
    "sixty" => 60, "seventy" => 70, "eighty" => 80, "ninety" => 90
)
Out[0]:
Dict{String, Int64} with 28 entries:
  "fourteen" => 14
  "seventy"  => 70
  "twelve"   => 12
  "eight"    => 8
  "twenty"   => 20
  "one"      => 1
  "sixteen"  => 16
  "eighteen" => 18
  "seven"    => 7
  "thirty"   => 30
  "six"      => 6
  "five"     => 5
  "forty"    => 40
  "ten"      => 10
  "sixty"    => 60
  "nineteen" => 19
  "fifty"    => 50
  "thirteen" => 13
  "ninety"   => 90
  "three"    => 3
  "single"   => 1
  "fifteen"  => 15
  "eleven"   => 11
  "two"      => 2
  "four"     => 4
  ⋮          => ⋮

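This dictionary maps individual number words (for example, "five" or "twenty") to their integer values. Compound numerals such as "twenty-five" are not keys themselves; they are assembled from these entries when a number written in words is converted to a numeric value.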
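
The lookup behaviour can be sketched with a two-entry excerpt of the dictionary. The name `numberwords` is illustrative; the default value 0 passed to `get` is the same fallback the conversion function in this example uses.

```julia
# A small excerpt of the number-word dictionary (illustrative name).
numberwords = Dict("five" => 5, "twenty" => 20)

# Single words are found directly; anything else falls back to 0.
direct = get(numberwords, "five", 0)
missingkey = get(numberwords, "twenty-five", 0)

# A compound numeral is the sum of its parts.
total = sum(get(numberwords, w, 0) for w in split("twenty-five", '-'))

println((direct, missingkey, total))
```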

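Text-to-number conversion function

In [ ]:
"""
    phrasetointeger(txt)

Convert a number written in words to its integer value.
Example: "twenty five" => 25
"""
function phrasetointeger(txt)
    words = split(txt, r"\W+")      # split the string into words, dropping punctuation
    n = 0                           # running total
    for w in words                  # go through all words
        n += get(textnumbers, w, 0) # if the word is in the dictionary, add its value
        w == "hundred" && (n *= 100) # if the word is "hundred", multiply the total by 100
    end
    return n                        # return the resulting number
end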
Out[0]:
phrasetointeger

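The function takes a text string, extracts the number words from it, and assembles an integer in the range 1 to 999; it returns 0 when the string contains no number words. Importantly, it handles compound numerals whose parts are separated by a space or a hyphen (for example, "twenty five" or "twenty-five"), because the string is split on every run of non-word characters.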
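
A few sample conversions make the accumulation rule concrete. The snippet below is a condensed, self-contained copy of the same logic with a reduced dictionary; the names `numwords` and `wordstointeger` are chosen for this sketch only.

```julia
# Reduced dictionary, just enough for the sample phrases below.
numwords = Dict(
    "one" => 1, "two" => 2, "five" => 5, "seven" => 7,
    "twenty" => 20, "ninety" => 90,
)

function wordstointeger(txt)
    n = 0
    for w in split(txt, r"\W+")
        n += get(numwords, w, 0)        # unknown words contribute 0
        w == "hundred" && (n *= 100)    # "hundred" scales what has been read so far
    end
    return n
end

for phrase in ("twenty five", "twenty-five", "one hundred and ninety-seven", "no numbers here")
    println(repr(phrase), " => ", wordstointeger(phrase))
end
```

Words that are neither in the dictionary nor "hundred", such as "and", are simply ignored, which is why the surrounding sentence text does not disturb the result.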

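The main autogram-checking function

In [ ]:
"""
    isautogram(txt, countpunctuation; verbose = true)

Check whether the string `txt` is a valid autogram.
Arguments:
- txt: the text to check
- countpunctuation: whether punctuation marks are counted (true/false)
- verbose: whether to print error messages
"""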
function isautogram(txt, countpunctuation; verbose = true)
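    # Convert the whole text to lowercase
    s = lowercase(txt)
    
    # Count the occurrences of every character
    charcounts = counter(s)

    # Characters that still have to be mentioned in the sentence, with their counts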
    stillneedmention = Dict(
        p[1] => isletter(p[1]) || p[1] != ' ' && countpunctuation ? p[2] : 0
        for p in charcounts
    )

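    # Light preprocessing before parsing: drop the lead-in up to and including the verb
    # (e.g. "this sentence employs"), so that only the list of counts remains
    s = " " * replace(s, r"^.*?(?:employs|composed|contains)" => "")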

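    # Split the string on commas and colons into mentions, each describing one character
    for mention in split(s, r"\s*,|:\s*")
        mention = replace(mention, r" and$" => "")  # drop a trailing " and"
        spos = findlast(isspace, mention)           # last space: the symbol token follows it

        if spos === nothing continue end            # no space found: skip this mention

        # Extract the number phrase (everything before the last space)
        numfromtext = phrasetointeger(mention[begin:spos-1])
        numfromtext == 0 && continue  # no number found: skip this mention

        # Extract the part of the string that names the character
        c = mention[begin+spos:end]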

        if c == "letters"
            # Check the total number of letters
            if numfromtext != count(isletter, txt)
                verbose && println("The total letter count (should be $(count(isletter, txt))) is incorrect.")
                return false
            end
            continue
        end

        # Determine the character from its description
        ch = contains(c, "comma") ? ',' : 
             contains(c, "apostrophe") ? '\'' : 
             contains(c, "hyphen") ? '-' : Char(c[1])

        # Check the count of the mentioned character
        if charcounts[ch] == numfromtext
            stillneedmention[ch] = 0  # mark it as verified
        else
            verbose && println("The count of $ch in the phrase is incorrect.")
            return false
        end
    end

    # Check whether any character was left unmentioned
    for p in stillneedmention
        if p[2] > 0
            verbose && println("The letter and count $p was not mentioned in the counts in the phrase.")
            return false
        end
    end

    return true  # everything was checked and matched: the autogram is valid
end
Out[0]:
isautogram

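The function works in the following steps:

  1. Normalize the text (convert all letters to lowercase).
  2. Count the frequency of every character.
  3. Go through every character mention in the text and check its stated count.
  4. Check that every character that actually occurs in the text has been accounted for.
  5. Return true if everything is correct, otherwise false.

If verbose = true, explanatory error messages are printed.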
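
Step 3 depends on how the sentence is cut into mentions. The sketch below isolates that tokenization, using the same split pattern and last-space rule as the function above. The helper `splitmentions` is hypothetical and returns the raw pieces instead of checking them.

```julia
# Hypothetical helper: split a lowercase sentence into (number phrase, symbol token) pairs.
function splitmentions(s)
    pairs = Tuple{String,String}[]
    for mention in split(s, r"\s*,|:\s*")            # cut on commas and colons
        mention = replace(mention, r" and$" => "")   # drop a trailing " and"
        spos = findlast(isspace, mention)            # the symbol token follows the last space
        spos === nothing && continue                 # nothing to split: skip
        numberpart = String(strip(mention[begin:spos-1]))
        symbolpart = String(mention[begin+spos:end])
        push!(pairs, (numberpart, symbolpart))
    end
    return pairs
end

pairs = splitmentions(" four a's, two b's, and one z.")
println(pairs)
```

Each number phrase would then be converted with `phrasetointeger`, and the first character of the symbol token (or a named symbol such as "comma") identifies what is being counted.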

Testing autograms

In [ ]:
# A set of test strings: autograms and non-autograms
for (i, t) in enumerate([
    ("This sentence employs two a's, two c's, two d's, twenty-eight e's, five f's, three g's, eight h's, eleven i's, three l's, two m's, thirteen n's, nine o's, two p's, five r's, twenty-five s's, twenty-three t's, six v's, ten w's, two x's, five y's, and one z.", false),
    ("This sentence employs two a's, two c's, two d's, twenty eight e's, five f's, three g's, eight h's, eleven i's, three l's, two m's, thirteen n's, nine o's, two p's, five r's, twenty five s's, twenty three t's, six v's, ten w's, two x's, five y's, and one z.", false),
    ("This pangram contains four as, one b, two cs, one d, thirty es, six fs, five gs, seven hs, eleven is, one j, one k, two ls, two ms, eighteen ns, fifteen os, two ps, one q, five rs, twenty-seven ss, eighteen ts, two us, seven vs, eight ws, two xs, three ys, & one z.", false),
    ("This sentence contains one hundred and ninety-seven letters: four a's, one b, three c's, five d's, thirty-four e's, seven f's, one g, six h's, twelve i's, three l's, twenty-six n's, ten o's, ten r's, twenty-nine s's, nineteen t's, six u's, seven v's, four w's, four x's, five y's, and one z.", false),
    ("Thirteen e's, five f's, two g's, five h's, eight i's, two l's, three n's, six o's, six r's, twenty s's, twelve t's, three u's, four v's, six w's, four x's, two y's.", false),
    ("Fifteen e's, seven f's, four g's, six h's, eight i's, four n's, five o's, six r's, eighteen s's, eight t's, four u's, three v's, two w's, three x's.", true),
    ("Sixteen e's, five f's, three g's, six h's, nine i's, five n's, four o's, six r's, eighteen s's, eight t's, three u's, three v's, two w's, four z's.", false),
   ])
    
    # Check each string
    println("Test phrase $i is", isautogram(t[1], t[2]) ? " " : " not ", "a valid autogram.\n")
end
Test phrase 1 is a valid autogram.

Test phrase 2 is a valid autogram.

Test phrase 3 is a valid autogram.

Test phrase 4 is a valid autogram.

Test phrase 5 is a valid autogram.

The letter and count '\'' => 14 was not mentioned in the counts in the phrase.
Test phrase 6 is not a valid autogram.

The count of z in the phrase is incorrect.
Test phrase 7 is not a valid autogram.

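Seven sample texts are checked here; some of them are valid autograms and some are not. The second element of each tuple, t[2], specifies whether punctuation marks should be taken into account during the check. Phrases 1 to 5 pass. Phrase 6 is checked with punctuation counting enabled and fails because its 14 apostrophes are never mentioned, and phrase 7 fails because the stated number of z's does not match the actual count.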
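
The two failures reported in the output can be reproduced by hand with plain character counts. This is a small Base-only sketch; the variable names are illustrative.

```julia
phrase6 = "Fifteen e's, seven f's, four g's, six h's, eight i's, four n's, five o's, six r's, eighteen s's, eight t's, four u's, three v's, two w's, three x's."
phrase7 = "Sixteen e's, five f's, three g's, six h's, nine i's, five n's, four o's, six r's, eighteen s's, eight t's, three u's, three v's, two w's, four z's."

# Phrase 6 never mentions its apostrophes, which matters once punctuation is counted.
apostrophes = count(==('\''), phrase6)

# Phrase 7 claims four z's.
zs = count(==('z'), lowercase(phrase7))

println("apostrophes in phrase 6: ", apostrophes)
println("z's in phrase 7: ", zs, " (stated: 4)")
```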

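Conclusion

We have reviewed an implementation, in the Julia programming language, of an algorithm that validates autograms. It uses the external package DataStructures to count character frequencies and implements functions that:

- convert numbers written in words to numeric values;
- check that the description of each character is correct;
- check that all characters in the sentence are accounted for.

The algorithm can be used in natural language processing tasks and for automatically verifying the logical consistency of text, and it also serves as an example of an entertaining trick at the intersection of programming and linguistics.

The program was tested on several examples, including both correct and incorrect autograms, which confirms that it works as intended.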