Analyze Text Data with String Arrays

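This example shows how to store the text of a file as a string array, sort the words by frequency of occurrence, plot the result, and collect basic statistics for the words found in the file.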

Import Text File to String Array

Read the text of Shakespeare's sonnets with the read() function. It returns the text as a single string of 100266 characters.

sonnets = read("/user/start/examples/language_basics/analizetextdataexample/sonnets.txt", String)
sonnets[1:35]

"THE SONNETS\n\nby William Shakespeare"

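Convert the text to a string with the string function (read() with the String argument already returns a String, so this call leaves the text unchanged). Then split it at newline characters with split(). sonnets becomes a 2625-element string array in which each element holds one line of the poems. Display the first five lines of sonnets.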
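
Before running this on the full text, the behavior of split() can be checked on a small in-memory sample. The sketch below is only an illustration: sample_text and sample_lines are hypothetical names, and the string is not the contents of the file.

```julia
# Illustrative sample text; not the contents of sonnets.txt
sample_text = "THE SONNETS\n\nby William Shakespeare\n"

# Split at newline characters; empty lines are kept as empty strings
sample_lines = split(sample_text, "\n")

println(length(sample_lines))
println(sample_lines[1])
```

Note that the blank line and the trailing newline both produce empty elements, which is why empty strings appear in the split result.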

sonnets = string(sonnets)
sonnets = split(sonnets, "\n")
sonnets[1:5]

5-element Vector{SubString{String}}:
 "THE SONNETS"
 ""
 "by William Shakespeare"
 ""
 ""

Clean String Array

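To count the frequency of the words in sonnets, first clean it by removing empty lines and punctuation marks. Then convert it into a string array whose elements are individual words.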

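Remove the lines with zero characters ("") from the string array. Compare each element of sonnets to the empty string "". You can create strings, including the empty string, with double quotes. TF is a logical vector that contains true wherever sonnets contains a string with zero characters. Index into sonnets with the negation of TF to keep only the lines that are not empty.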
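
As a minimal sketch of the masking step on illustrative data (the names sample_lines, is_empty_line, kept, and kept_alt are hypothetical), the same result can also be obtained with filter() and isempty, which avoids building the intermediate mask:

```julia
# Illustrative lines; not read from the file
sample_lines = ["THE SONNETS", "", "by William Shakespeare", "", ""]

# Element-wise comparison produces a Boolean mask
is_empty_line = sample_lines .== ""

# Keep the elements where the mask is false
kept = sample_lines[.!is_empty_line]

# Alternative: filter with a negated predicate
kept_alt = filter(!isempty, sample_lines)

println(kept)
```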

TF = sonnets .== ""
sonnets = sonnets[.!TF]
sonnets[1:10]

10-element Vector{SubString{String}}:
 "THE SONNETS"
 "by William Shakespeare"
 "  I"
 "  From fairest creatures we desire increase,"
 "  That thereby beauty's rose might never die,"
 "  But as the riper should by time decease,"
 "  His tender heir might bear his memory:"
 "  But thou, contracted to thine own bright eyes,"
 "  Feed'st thy light's flame with self-substantial fuel,"
 "  Making a famine where abundance lies,"

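Replace some punctuation marks with space characters. For example, replace periods, commas, and semicolons. Keep apostrophes, because they can be part of some words in the sonnets, such as light's.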
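
The replacement can be tried on a single line first. This is a small sketch with hypothetical names (punct, line, cleaned); it relies on replace() accepting a vector of characters as the pattern, so that every listed character is replaced by a space:

```julia
# Punctuation characters to replace (apostrophes are deliberately not listed)
punct = ['.', '?', '!', ',', ';', ':']

line = "  But thou, contracted to thine own bright eyes,"

# Each character found in punct is replaced by a single space
cleaned = replace(line, punct => " ")

println(cleaned)
```

Because each punctuation mark becomes a space rather than being deleted, the cleaned line can contain runs of two spaces and a trailing space.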

p = ['.','?','!',',',';',':'];

sonnets = replace.(sonnets[:],p=>" ")

2311-element Vector{String}:
 "THE SONNETS"
 "by William Shakespeare"
 "  I"
 "  From fairest creatures we desire increase "
 "  That thereby beauty's rose might never die "
 "  But as the riper should by time decease "
 "  His tender heir might bear his memory "
 "  But thou  contracted to thine own bright eyes "
 "  Feed'st thy light's flame with self-substantial fuel "
 "  Making a famine where abundance lies "
 "  Thy self thy foe  to thy sweet self too cruel "
 "  Thou that art now the world's fresh ornament "
 "  And only herald to the gaudy spring "
 ⋮
 "  Whilst many nymphs that vow'd chaste life to keep"
 "  Came tripping by  but in her maiden hand"
 "  The fairest votary took up that fire"
 "  Which many legions of true hearts had warm'd "
 "  And so the general of hot desire"
 "  Was  sleeping  by a virgin hand disarm'd "
 "  This brand she quenched in a cool well by "
 "  Which from Love's fire took heat perpetual "
 "  Growing a bath and healthful remedy "
 "  For men diseas'd  but I  my mistress' thrall "
 "    Came there for cure and this by that I prove "
 "    Love's fire heats water  water cools not love "

Remove the leading and trailing space characters from each element of sonnets.
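
A short sketch of this step on one illustrative line (padded, trimmed, and owned are hypothetical names). strip() returns a SubString that refers to the original string; String() can be used if an independent copy is wanted:

```julia
padded = "  From fairest creatures we desire increase "

# Remove leading and trailing space characters only
trimmed = strip(padded, ' ')

println(trimmed)
println(typeof(trimmed))

# Optional: materialize an independent String
owned = String(trimmed)
```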

sonnets = strip.(sonnets[:],' ')
sonnets[1:10]

10-element Vector{SubString{String}}:
 "THE SONNETS"
 "by William Shakespeare"
 "I"
 "From fairest creatures we desire increase"
 "That thereby beauty's rose might never die"
 "But as the riper should by time decease"
 "His tender heir might bear his memory"
 "But thou  contracted to thine own bright eyes"
 "Feed'st thy light's flame with self-substantial fuel"
 "Making a famine where abundance lies"

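Split sonnets into a string array whose elements are individual words. You can use split() to split a string at whitespace or at a delimiter that you specify. However, split() operates on one string at a time and returns a vector of substrings. The elements of sonnets contain different numbers of words, so the per-line results have different lengths. To collect all the words of sonnets in one array, write a for loop that calls split() on one element at a time.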

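Create the string array sonnetWords. Write a for loop that splits each element of sonnets. Concatenate the output of split() onto sonnetWords. Each element of sonnetWords is an individual word from the sonnets.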
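
The same flattening can be sketched on a few illustrative lines. The helper flatten_words and the variable names below are hypothetical; the helper uses append!() to grow one array in place, and reduce(vcat, ...) over the broadcast split is shown as an alternative that should give the same result:

```julia
# Illustrative lines; not read from the file
sample_lines = ["THE SONNETS", "by William Shakespeare", "But thou  contracted"]

# Hypothetical helper: split each line at whitespace and collect all words
function flatten_words(lines)
    words = SubString{String}[]
    for line in lines
        append!(words, split(line))   # split() drops empty pieces between spaces
    end
    return words
end

words_loop = flatten_words(sample_lines)

# Alternative: split every line, then concatenate the per-line vectors
words_vcat = reduce(vcat, split.(sample_lines))

println(words_loop)
```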

sonnetWords = split(sonnets[1]);
for i = 2:length(sonnets)
    sonnetWords = [sonnetWords ; split(sonnets[i])];
end

sonnetWords[1:10]

10-element Vector{SubString{String}}:
 "THE"
 "SONNETS"
 "by"
 "William"
 "Shakespeare"
 "I"
 "From"
 "fairest"
 "creatures"
 "we"

Sort Words by Frequency

Find the unique words in the sonnets. Count them and sort them by frequency of occurrence.

To count words that differ only in case as the same word, convert sonnetWords to lowercase. For example, The and the count as the same word.

Load the StatsBase and Statistics packages. The steps below use functions from these libraries.

import Pkg; 
Pkg.add("StatsBase")
Pkg.add("Statistics")

using Statistics, StatsBase

To convert the words to lowercase, use the lowercase function.

sonnetWords = lowercase.(sonnetWords)

17711-element Vector{String}:
 "the"
 "sonnets"
 "by"
 "william"
 "shakespeare"
 "i"
 "from"
 "fairest"
 "creatures"
 "we"
 "desire"
 "increase"
 "that"
 ⋮
 "by"
 "that"
 "i"
 "prove"
 "love's"
 "fire"
 "heats"
 "water"
 "water"
 "cools"
 "not"
 "love"

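Find the unique words with the unique() function. The result is also sorted in ascending order with sort(). This simplifies the later steps, because it puts words in the same order as the key-sorted dictionary built in the next step.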

words = sort(unique(sonnetWords))

3435-element Vector{String}:
 "'"
 "''tis"
 "'amen'"
 "'fair"
 "'fore"
 "'gainst"
 "'greeing"
 "'had"
 "'hues'"
 "'i"
 "'love"
 "'no'"
 "'not"
 ⋮
 "you'"
 "you've"
 "young"
 "youngly"
 "your"
 "yours"
 "yourself"
 "yourself's"
 "youth"
 "youth's"
 "youthful"
 "zealous"

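Then count how many times each unique word occurs with the countmap() function. It returns a dictionary that maps each unique value in sonnetWords to its number of occurrences. Sorting the dictionary orders its entries by key.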

numOccurrences = sort(countmap(sonnetWords))

OrderedCollections.OrderedDict{String, Int64} with 3435 entries:
  "'"        => 16
  "''tis"    => 1
  "'amen'"   => 1
  "'fair"    => 2
  "'fore"    => 1
  "'gainst"  => 6
  "'greeing" => 1
  "'had"     => 1
  "'hues'"   => 1
  "'i"       => 3
  "'love"    => 1
  "'no'"     => 1
  "'not"     => 1
  "'now"     => 1
  "'scap'd"  => 1
  "'since"   => 1
  "'this"    => 2
  "'thou"    => 1
  "'thus"    => 1
  "'thy"     => 1
  "'tis"     => 11
  "'truth"   => 2
  "'twixt"   => 2
  "'will"    => 5
  "'will'"   => 5
  ⋮          => ⋮
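
To make the counting step concrete, here is a dependency-free sketch of roughly what a count map holds, written with a plain Dict on illustrative data. It is not the StatsBase implementation, and the names sample_words, counts, sorted_words, and sorted_counts are hypothetical:

```julia
# Illustrative words; not taken from the sonnets
sample_words = ["the", "rose", "and", "the", "thorn", "and", "the"]

# Count occurrences: start each word at 0 and add 1 per occurrence
counts = Dict{String,Int}()
for w in sample_words
    counts[w] = get(counts, w, 0) + 1
end

# Order the keys alphabetically and read the counts in that order
sorted_words = sort(collect(keys(counts)))
sorted_counts = [counts[w] for w in sorted_words]

println(sorted_words)
println(sorted_counts)
```

Keeping the words and the counts in the same key order is what allows an index computed on the counts to be applied to the word list.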

Rank the occurrence counts of the words in the sonnets from most to least common.

rankOfOccurrences = sort(collect(values(numOccurrences)), rev = true)

3435-element Vector{Int64}:
 490
 436
 409
 371
 370
 341
 321
 320
 280
 233
 181
 171
 168
   ⋮
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1

Also record the indices that order the words by number of occurrences. To do this, use sortperm().

rankIndex = sortperm(collect(values(numOccurrences)), rev = true)

3435-element Vector{Int64}:
  143
 2881
 2957
 1912
 1994
 1458
 1493
 2879
 2939
 2915
 3293
 1150
 1536
    ⋮
 3411
 3412
 3413
 3415
 3419
 3422
 3424
 3425
 3427
 3431
 3433
 3435
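
The relationship between sort() and sortperm() can be sketched on a small illustrative example (sample_words, sample_counts, and order are hypothetical names). sortperm() returns the indices that would sort the counts, so indexing a parallel array of words with those indices reorders the words by frequency:

```julia
# Illustrative parallel arrays: sample_counts[i] belongs to sample_words[i]
sample_words  = ["and", "rose", "the", "thorn"]
sample_counts = [2, 1, 3, 1]

# Indices that put the counts in descending order
order = sortperm(sample_counts, rev = true)

println(sample_counts[order])
println(sample_words[order])
```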

Using the indices ordered by word frequency, list the 10 most frequent words in the sonnets.

wordsByFrequency = words[rankIndex]
wordsByFrequency[1:10]

10-element Vector{String}:
 "and"
 "the"
 "to"
 "my"
 "of"
 "i"
 "in"
 "that"
 "thy"
 "thou"

Plot Word Frequency

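Create a plot that shows the frequency of the words in the sonnets, from the most common to the least common. According to Zipf's law, the distribution of word frequencies in a large body of text follows a power law. On logarithmic axes, a power law appears as an approximately straight line.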

using Plots  # assumption: the Plots package is installed and plot() is not already in scope
plot(rankOfOccurrences, xscale=:log10, yscale=:log10)
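
To illustrate why logarithmic axes are used, the sketch below builds synthetic frequencies that follow an exact power law and estimates the slope of the log-log relationship by least squares. All values here (C, s, n) are illustrative assumptions and not estimates from the sonnets:

```julia
# Synthetic frequencies that follow an exact power law f(r) = C / r^s
# (C, s, and n are illustrative values, not estimates from the sonnets)
C, s, n = 1000.0, 1.0, 50
ranks = 1:n
freqs = C ./ ranks .^ s

# Least-squares slope of log10(frequency) against log10(rank)
x = log10.(ranks)
y = log10.(freqs)
xm, ym = sum(x) / n, sum(y) / n
slope = sum((x .- xm) .* (y .- ym)) / sum((x .- xm) .^ 2)

println(slope)   # close to -s for an exact power law
```

Real word counts only follow the law approximately, so a slope fitted to them is a rough summary and not an exact parameter.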

Collect Basic Statistics in a Table

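Calculate the total number of occurrences of each word in sonnetWords. Express the number of occurrences as a percentage of the total number of words, and calculate the cumulative percentage from the most common word to the least common. Write the words and their basic statistics to a table.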
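
The percentage and cumulative-percentage calculations can be sketched on a small illustrative vector of counts (sample_counts, total, percent, and cumulative are hypothetical names):

```julia
# Illustrative counts, already sorted from most to least frequent
sample_counts = [3, 2, 1, 1]
total = sum(sample_counts)

# Share of each word, and running share of the top words, in percent
percent = sample_counts / total * 100.0
cumulative = cumsum(sample_counts) / total * 100.0

println(percent)
println(cumulative)
```

The last cumulative value reaches 100 because the counts of all unique words add up to the total number of words.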

using DataFrames

T = DataFrame();
T.Words = wordsByFrequency;
T.NumOccurrences = rankOfOccurrences;
T.PercentOfText = rankOfOccurrences / length(sonnetWords) * 100.0;
T.CumulativePercentOfText = cumsum(rankOfOccurrences) / length(sonnetWords) * 100.0;
T[1:10, :]

Row  Words   NumOccurrences  PercentOfText  CumulativePercentOfText
     String  Int64           Float64        Float64
1    and     490             2.76664        2.76664
2    the     436             2.46175        5.22839
3    to      409             2.3093         7.53769
4    my      371             2.09474        9.63243
5    of      370             2.0891         11.7215
6    i       341             1.92536        13.6469
7    in      321             1.81243        15.4593
8    that    320             1.80679        17.2661
9    thy     280             1.58094        18.847
10   thou    233             1.31557        20.1626

Conclusion

The most common word in the sonnets is "and". It occurs 490 times. Together, the ten most common words account for 20.163% of the text.
In analyzing the text of the sonnets, we read data from a file and used statistics libraries to sort and process the information.