Analyze Text Data with String Arrays

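This example shows how to store text from a file as a string array, sort the words by frequency of occurrence, plot the result, and collect basic statistics for the words found in the file.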

Import Text File to String Array

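Use the read() function to read the text of Shakespeare's sonnets. With String as the second argument, it returns the text as a single string of 100266 characters.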

In [ ]:
sonnets = read("/user/Отсортированное/Base_demo/AnalizeTextDataExample/sonnets.txt", String)
sonnets[1:35]
Out[0]:
"THE SONNETS\n\nby William Shakespeare"

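Use the string() function to convert the text to a string, then split it into lines with split(). sonnets becomes a 2625-element vector of strings, where each element holds one line of the poems. Display the first five lines of sonnets.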

In [ ]:
sonnets = string(sonnets)
sonnets = split(sonnets, "\n")
sonnets[1:5]
Out[0]:
5-element Vector{SubString{String}}:
 "THE SONNETS"
 ""
 "by William Shakespeare"
 ""
 ""

Clean String Array

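To calculate the frequency of the words in sonnets, first clean it by removing empty lines and punctuation marks. Then convert it into a string array that contains the individual words as elements.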

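Remove the zero-length strings ("") from the string array. Compare each element of sonnets to the empty string "". You can create strings, including the empty string, using double quotes. TF is a Boolean vector that contains true wherever sonnets contains a zero-length string. Index into sonnets with the negation of TF to keep only the non-empty strings.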

In [ ]:
TF = sonnets .== ""
sonnets = sonnets[.!TF]
sonnets[1:10]
Out[0]:
10-element Vector{SubString{String}}:
 "THE SONNETS"
 "by William Shakespeare"
 "  I"
 "  From fairest creatures we desire increase,"
 "  That thereby beauty's rose might never die,"
 "  But as the riper should by time decease,"
 "  His tender heir might bear his memory:"
 "  But thou, contracted to thine own bright eyes,"
 "  Feed'st thy light's flame with self-substantial fuel,"
 "  Making a famine where abundance lies,"

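The same cleanup can also be written with filter() instead of a logical mask. The following is a minimal sketch on a small made-up vector; the names demo_lines, demo_mask, kept_by_mask, and kept_by_filter are illustrative and do not come from the original example.

```julia
# Hypothetical sample standing in for the lines read from the file.
demo_lines = split("THE SONNETS\n\nby William Shakespeare\n\n\n  I", "\n")

# Logical-mask approach, as in the cell above.
demo_mask = demo_lines .== ""
kept_by_mask = demo_lines[.!demo_mask]

# Alternative: keep only the elements that are not empty.
kept_by_filter = filter(!isempty, demo_lines)
```

Both approaches should leave the same three non-empty lines. The mask form is convenient when the Boolean vector is needed again later, while filter() avoids the intermediate variable.
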
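Replace some punctuation marks with space characters. For example, replace periods, commas, and semicolons. Keep apostrophes because they can be part of some words in the sonnets, such as light's.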

In [ ]:
p = ['.','?','!',',',';',':'];
In [ ]:
sonnets = replace.(sonnets[:],p=>" ")
Out[0]:
2311-element Vector{String}:
 "THE SONNETS"
 "by William Shakespeare"
 "  I"
 "  From fairest creatures we desire increase "
 "  That thereby beauty's rose might never die "
 "  But as the riper should by time decease "
 "  His tender heir might bear his memory "
 "  But thou  contracted to thine own bright eyes "
 "  Feed'st thy light's flame with self-substantial fuel "
 "  Making a famine where abundance lies "
 "  Thy self thy foe  to thy sweet self too cruel "
 "  Thou that art now the world's fresh ornament "
 "  And only herald to the gaudy spring "
 ⋮
 "  Whilst many nymphs that vow'd chaste life to keep"
 "  Came tripping by  but in her maiden hand"
 "  The fairest votary took up that fire"
 "  Which many legions of true hearts had warm'd "
 "  And so the general of hot desire"
 "  Was  sleeping  by a virgin hand disarm'd "
 "  This brand she quenched in a cool well by "
 "  Which from Love's fire took heat perpetual "
 "  Growing a bath and healthful remedy "
 "  For men diseas'd  but I  my mistress' thrall "
 "    Came there for cure and this by that I prove "
 "    Love's fire heats water  water cools not love "

Strip leading and trailing space characters from each element of sonnets.

In [ ]:
sonnets = strip.(sonnets[:],' ')
sonnets[1:10]
Out[0]:
10-element Vector{SubString{String}}:
 "THE SONNETS"
 "by William Shakespeare"
 "I"
 "From fairest creatures we desire increase"
 "That thereby beauty's rose might never die"
 "But as the riper should by time decease"
 "His tender heir might bear his memory"
 "But thou  contracted to thine own bright eyes"
 "Feed'st thy light's flame with self-substantial fuel"
 "Making a famine where abundance lies"

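To see what the two cleaning steps do to a single line, they can be chained on one string. This is only a sketch: demo_line, demo_punctuation, and demo_cleaned are illustrative names, and strip() is called here without a second argument, which trims all leading and trailing whitespace rather than only the space character.

```julia
# Illustrative line, typed in by hand rather than read from the file.
demo_line = "  But thou, contracted to thine own bright eyes,"
demo_punctuation = ['.', '?', '!', ',', ';', ':']

# Replace every listed character with a space, then trim both ends.
demo_cleaned = strip(replace(demo_line, demo_punctuation => " "))
```

Note that the comma inside the line becomes a space, so two consecutive spaces remain between "thou" and "contracted". This is harmless because the later call to split() treats any run of whitespace as one separator.
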
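Split sonnets into a string array whose elements are individual words. split() operates on one string at a time and returns a vector of substrings, splitting at whitespace by default or at a delimiter that you specify. Because the elements of sonnets contain different numbers of words, the results have different lengths and must be concatenated into a single vector. To use split() on sonnets, write a for loop that calls split() on one element at a time.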

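Create the string array sonnetWords. Write a for loop that splits each element of sonnets, and concatenate the output of split() onto sonnetWords. Each element of sonnetWords is an individual word from the sonnets.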

In [ ]:
# Start with the words of the first line, then append the words of each remaining line.
sonnetWords = split(sonnets[1])
for i = 2:length(sonnets)
    sonnetWords = [sonnetWords; split(sonnets[i])]
end

sonnetWords[1:10]
Out[0]:
10-element Vector{SubString{String}}:
 "THE"
 "SONNETS"
 "by"
 "William"
 "Shakespeare"
 "I"
 "From"
 "fairest"
 "creatures"
 "we"

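The loop above builds a new vector on every iteration. One possible alternative, shown here as a sketch rather than as the method of the original example, is to broadcast split() over all lines and join the resulting vectors with reduce(vcat, ...). The names demo_sonnet_lines and demo_words are illustrative.

```julia
# Illustrative lines with different numbers of words.
demo_sonnet_lines = ["THE SONNETS", "by William Shakespeare", "From fairest creatures"]

# split.() yields one vector of words per line; reduce(vcat, ...) joins them end to end.
demo_words = reduce(vcat, split.(demo_sonnet_lines))
```

The result is a single flat vector holding the eight words in their original order.
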
Sort Words Based on Frequency

Find the unique words in sonnetWords. Count them and sort them based on their frequency.

To count words that differ only by case as the same word, convert sonnetWords to lowercase. For example, The and the count as the same word.

Load the StatsBase and Statistics libraries. Functions from these libraries are used in the steps below.

In [ ]:
import Pkg; 
Pkg.add("StatsBase")
Pkg.add("Statistics")
In [ ]:
using Statistics, StatsBase

To convert each word to lowercase, use the lowercase function.

In [ ]:
sonnetWords = lowercase.(sonnetWords)
Out[0]:
17711-element Vector{String}:
 "the"
 "sonnets"
 "by"
 "william"
 "shakespeare"
 "i"
 "from"
 "fairest"
 "creatures"
 "we"
 "desire"
 "increase"
 "that"
 ⋮
 "by"
 "that"
 "i"
 "prove"
 "love's"
 "fire"
 "heats"
 "water"
 "water"
 "cools"
 "not"
 "love"

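Use the unique() function to find the unique words. The result is also sorted in ascending (alphabetical) order, so that words lines up with the key order of the sorted dictionary built in the next step.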

In [ ]:
words = sort(unique(sonnetWords))
Out[0]:
3435-element Vector{String}:
 "'"
 "''tis"
 "'amen'"
 "'fair"
 "'fore"
 "'gainst"
 "'greeing"
 "'had"
 "'hues'"
 "'i"
 "'love"
 "'no'"
 "'not"
 ⋮
 "you'"
 "you've"
 "young"
 "youngly"
 "your"
 "yours"
 "yourself"
 "yourself's"
 "youth"
 "youth's"
 "youthful"
 "zealous"

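Then use the countmap() function to count the number of times each unique word occurs. The function returns a dictionary that maps each unique value in sonnetWords to its number of occurrences. Sorting the dictionary orders its entries by key.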

In [ ]:
numOccurrences = sort(countmap(sonnetWords))
Out[0]:
OrderedCollections.OrderedDict{String, Int64} with 3435 entries:
  "'"        => 16
  "''tis"    => 1
  "'amen'"   => 1
  "'fair"    => 2
  "'fore"    => 1
  "'gainst"  => 6
  "'greeing" => 1
  "'had"     => 1
  "'hues'"   => 1
  "'i"       => 3
  "'love"    => 1
  "'no'"     => 1
  "'not"     => 1
  "'now"     => 1
  "'scap'd"  => 1
  "'since"   => 1
  "'this"    => 2
  "'thou"    => 1
  "'thus"    => 1
  "'thy"     => 1
  "'tis"     => 11
  "'truth"   => 2
  "'twixt"   => 2
  "'will"    => 5
  "'will'"   => 5
  ⋮          => ⋮

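To make clear what the counting step computes, here is a hand-written equivalent that uses only a plain Dict. It is a sketch for illustration, not the implementation of countmap(); the names demo_tokens and demo_counts are made up for this example.

```julia
# Illustrative, already lowercased words.
demo_tokens = ["the", "rose", "and", "the", "thorn", "and", "the"]

# Map each distinct word to the number of times it appears.
demo_counts = Dict{String,Int}()
for w in demo_tokens
    # get() returns 0 the first time a word is seen.
    demo_counts[w] = get(demo_counts, w, 0) + 1
end
```

A plain Dict has no guaranteed key order, which is why the example above sorts the dictionary before relying on the order of its values.
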
Sort the occurrence counts from the most to the least common.

In [ ]:
rankOfOccurrences = sort(collect(values(numOccurrences)), rev = true)
Out[0]:
3435-element Vector{Int64}:
 490
 436
 409
 371
 370
 341
 321
 320
 280
 233
 181
 171
 168
   ⋮
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1

Also store the indices that sort the words by number of occurrences. To do this, use sortperm().

In [ ]:
rankIndex = sortperm(collect(values(numOccurrences)), rev = true)
Out[0]:
3435-element Vector{Int64}:
  143
 2881
 2957
 1912
 1994
 1458
 1493
 2879
 2939
 2915
 3293
 1150
 1536
    ⋮
 3411
 3412
 3413
 3415
 3419
 3422
 3424
 3425
 3427
 3431
 3433
 3435

Using the stored indices, display the ten most frequent words in the sonnets.

In [ ]:
wordsByFrequency = words[rankIndex]
wordsByFrequency[1:10]
Out[0]:
10-element Vector{String}:
 "and"
 "the"
 "to"
 "my"
 "of"
 "i"
 "in"
 "that"
 "thy"
 "thou"

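The ranking relies on the word list and the count list being in the same order, so that one permutation can reorder both. A small sketch with made-up data (demo_vocab, demo_freqs, demo_order, demo_ranked_words, and demo_ranked_freqs are illustrative names):

```julia
# Alphabetically sorted words and their counts, stored in the same order.
demo_vocab = ["and", "rose", "the", "thorn"]
demo_freqs = [2, 1, 3, 1]

# sortperm returns the indices that would sort the counts in descending order.
demo_order = sortperm(demo_freqs, rev = true)

# Applying the same permutation to both vectors keeps words and counts paired.
demo_ranked_words = demo_vocab[demo_order]
demo_ranked_freqs = demo_freqs[demo_order]
```

If the two vectors were not aligned, for example if the words were left unsorted while the dictionary was sorted by key, the permutation would pair counts with the wrong words.
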
Plot Word Frequency

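Plot the occurrences of the words in the sonnets, from the most to the least common. According to Zipf's law, the distribution of word occurrences in a large body of text follows a power law, which appears as a roughly straight line on log-log axes.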

In [ ]:
plot(rankOfOccurrences, xscale=:log10, yscale=:log10)
Out[0]:
[Figure: log-log plot of word occurrence counts against frequency rank]

Collect Basic Statistics in a Table

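Calculate the total number of occurrences of each word in sonnetWords. Calculate the number of occurrences as a percentage of the total number of words, and calculate the cumulative percentage from the most to the least common word. Store the words and their basic statistics in a table.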

In [ ]:
using DataFrames
In [ ]:
T = DataFrame();
T.Words = wordsByFrequency;
T.NumOccurrences = rankOfOccurrences;
T.PercentOfText = rankOfOccurrences / length(sonnetWords) * 100.0;
T.CumulativePercentOfText = cumsum(rankOfOccurrences) / length(sonnetWords) * 100.0;
T[1:10, :]
Out[0]:
10×4 DataFrame
 Row │ Words   NumOccurrences  PercentOfText  CumulativePercentOfText
     │ String  Int64           Float64        Float64
─────┼────────────────────────────────────────────────────────────────
   1 │ and                490        2.76664                  2.76664
   2 │ the                436        2.46175                  5.22839
   3 │ to                 409        2.3093                   7.53769
   4 │ my                 371        2.09474                  9.63243
   5 │ of                 370        2.0891                  11.7215
   6 │ i                  341        1.92536                 13.6469
   7 │ in                 321        1.81243                 15.4593
   8 │ that               320        1.80679                 17.2661
   9 │ thy                280        1.58094                 18.847
  10 │ thou               233        1.31557                 20.1626

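The two percentage columns can be checked on a small example without DataFrames. This is a sketch with made-up counts; demo_sorted_counts, demo_total, demo_percent, and demo_cumulative are illustrative names.

```julia
# Illustrative counts, already sorted from most to least frequent.
demo_sorted_counts = [3, 2, 1, 1]

# The total number of words equals the sum of all counts.
demo_total = sum(demo_sorted_counts)

# Share of each word, and running share of the words up to and including it.
demo_percent = demo_sorted_counts ./ demo_total .* 100
demo_cumulative = cumsum(demo_sorted_counts) ./ demo_total .* 100
```

The cumulative column never decreases and reaches 100 at the last row, which is a quick sanity check for the full table as well.
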
Conclusion

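The most common word in the sonnets is and. It occurs 490 times. Together, the ten most common words account for 20.163% of the text. In analyzing the text of the sonnets, we read the data from a file and used statistics libraries to sort and process the information.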