Analyzing text data with string arrays

This example shows how to read text from a file into an array of strings, sort the words by frequency of occurrence, plot the result, and collect basic statistics on the words found in the file.

Importing a text file into an array of strings

Read the text of Shakespeare's Sonnets using the function read(). With String as the second argument, it returns the whole file as a single string of 100266 characters.
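
As a minimal, self-contained sketch of the same call, the snippet below writes a small temporary file and reads it back. The file contents and the names path, text, and header are illustrative assumptions, not part of the example data. Note that text[1:35] indexes by byte position, which matches character positions only for ASCII text, while first(text, 35) counts characters.

```julia
# Create a small temporary file so the sketch does not depend on sonnets.txt.
path = tempname()
write(path, "THE SONNETS\n\nby William Shakespeare\n")

text = read(path, String)    # the whole file as one String
header = first(text, 35)     # first 35 characters, counted as characters
rm(path)                     # clean up the temporary file
```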

In [ ]:
sonnets = read("/user/start/examples/language_basics/analizetextdataexample/sonnets.txt", String)
sonnets[1:35]
Out[0]:
"THE SONNETS\n\nby William Shakespeare"

Pass the text through string(); this leaves it unchanged here, because read() already returned a String. Then split it into lines using split(). sonnets becomes a 2625-element vector of strings, where each element holds one line of the poems. Display the first five lines of sonnets.
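
A short sketch of what split() returns, using a three-line sample string (the names text, lines, and owned are illustrative). The elements are SubString views into the original string; broadcasting String over them makes independent copies if those are needed.

```julia
text = "THE SONNETS\n\nby William Shakespeare"

lines = split(text, "\n")   # one element per line, empty lines included
owned = String.(lines)      # optional: convert the views to independent Strings
```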

In [ ]:
sonnets = string(sonnets)
sonnets = split(sonnets, "\n")
sonnets[1:5]
Out[0]:
5-element Vector{SubString{String}}:
 "THE SONNETS"
 ""
 "by William Shakespeare"
 ""
 ""

Cleaning the string array

To calculate the frequency of the words in sonnets, first clean it up by removing blank lines and punctuation marks. Then convert it to an array of strings whose elements are individual words.

Delete the lines with zero characters ("") from the string array. Compare each element of sonnets with "", the empty string, using the broadcast operator .==. You can create strings, including empty ones, using double quotes. TF is a logical vector that contains true wherever sonnets contains a string with zero characters. Index sonnets with the negation .!TF to keep only the non-empty lines.
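
The mask-based deletion can be sketched on a small sample vector, alongside an equivalent one-line form using filter(). The sample data and the names lines, kept_mask, and kept_filter are illustrative.

```julia
lines = ["THE SONNETS", "", "by William Shakespeare", "", "", "  I"]

# Mask-based approach: mark the empty strings, then keep everything else.
TF = lines .== ""
kept_mask = lines[.!TF]

# Equivalent form: keep the elements for which isempty() is false.
kept_filter = filter(!isempty, lines)
```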

In [ ]:
TF = sonnets .== ""
sonnets = sonnets[.!TF]
sonnets[1:10]
Out[0]:
10-element Vector{SubString{String}}:
 "THE SONNETS"
 "by William Shakespeare"
 "  I"
 "  From fairest creatures we desire increase,"
 "  That thereby beauty's rose might never die,"
 "  But as the riper should by time decease,"
 "  His tender heir might bear his memory:"
 "  But thou, contracted to thine own bright eyes,"
 "  Feed'st thy light's flame with self-substantial fuel,"
 "  Making a famine where abundance lies,"

Replace some punctuation marks with spaces: periods, question marks, exclamation marks, commas, semicolons, and colons. Keep the apostrophes, because they can be part of words in the sonnets, such as light's.
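
A minimal sketch of the replacement on a single line. A vector of characters used as the pattern matches any one of those characters; the regular-expression form is shown only as an equivalent alternative. The names line, cleaned, cleaned_re, and kept are illustrative.

```julia
p = ['.', '?', '!', ',', ';', ':']
line = "  But thou, contracted to thine own bright eyes,"

cleaned = replace(line, p => " ")                # any character in p becomes a space
cleaned_re = replace(line, r"[.?!,;:]" => " ")   # same effect with a regular expression
kept = replace("light's flame.", p => " ")       # the apostrophe is left untouched
```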

In [ ]:
p = ['.','?','!',',',';',':'];
In [ ]:
sonnets = replace.(sonnets, p => " ")
Out[0]:
2311-element Vector{String}:
 "THE SONNETS"
 "by William Shakespeare"
 "  I"
 "  From fairest creatures we desire increase "
 "  That thereby beauty's rose might never die "
 "  But as the riper should by time decease "
 "  His tender heir might bear his memory "
 "  But thou  contracted to thine own bright eyes "
 "  Feed'st thy light's flame with self-substantial fuel "
 "  Making a famine where abundance lies "
 "  Thy self thy foe  to thy sweet self too cruel "
 "  Thou that art now the world's fresh ornament "
 "  And only herald to the gaudy spring "
 ⋮
 "  Whilst many nymphs that vow'd chaste life to keep"
 "  Came tripping by  but in her maiden hand"
 "  The fairest votary took up that fire"
 "  Which many legions of true hearts had warm'd "
 "  And so the general of hot desire"
 "  Was  sleeping  by a virgin hand disarm'd "
 "  This brand she quenched in a cool well by "
 "  Which from Love's fire took heat perpetual "
 "  Growing a bath and healthful remedy "
 "  For men diseas'd  but I  my mistress' thrall "
 "    Came there for cure and this by that I prove "
 "    Love's fire heats water  water cools not love "

Remove the leading and trailing space characters from each element of sonnets.
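
The sketch below contrasts the two forms of strip() on one sample string (the names s, only_spaces, and all_whitespace are illustrative). Passing ' ' removes only space characters, whereas calling strip() without a second argument removes any leading or trailing whitespace, including tabs.

```julia
s = "  From fairest creatures we desire increase \t"

only_spaces = strip(s, ' ')   # stops at the trailing tab
all_whitespace = strip(s)     # removes spaces and the tab
```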

In [ ]:
sonnets = strip.(sonnets, ' ')
sonnets[1:10]
Out[0]:
10-element Vector{SubString{String}}:
 "THE SONNETS"
 "by William Shakespeare"
 "I"
 "From fairest creatures we desire increase"
 "That thereby beauty's rose might never die"
 "But as the riper should by time decease"
 "His tender heir might bear his memory"
 "But thou  contracted to thine own bright eyes"
 "Feed'st thy light's flame with self-substantial fuel"
 "Making a famine where abundance lies"

Split sonnets into an array of strings whose elements are individual words. You can use split() to separate a string at whitespace or at delimiters that you specify. However, split() works on one string at a time, and the elements of sonnets contain different numbers of words, so each call returns a vector of a different length. To apply split() to all of sonnets, write a for loop that calls split() on one element at a time.

Create an array of strings sonnetWords. Write a for loop that splits each element of sonnets and concatenates the output of split() onto sonnetWords. Each element of sonnetWords is an individual word from sonnets.
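
As a sketch of two alternative ways to collect the words, the snippet below uses a hypothetical helper split_words() that appends to one vector in place, and a one-line form that broadcasts split() and concatenates the results. Both avoid rebuilding the whole array on every iteration; the sample lines are illustrative.

```julia
# Hypothetical helper: split every line on whitespace and gather the words.
function split_words(lines)
    words = SubString{String}[]
    for line in lines
        append!(words, split(line))   # grows `words` in place
    end
    return words
end

lines = ["THE SONNETS", "by William Shakespeare", "But thou  contracted to thine own"]

words_loop = split_words(lines)
words_reduce = reduce(vcat, split.(lines))   # split each line, then concatenate
```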

In [ ]:
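sonnetWords = split(sonnets[1]);   # with no delimiter, split() breaks on whitespace
for i = 2:length(sonnets)
    # append the words of line i to the words collected so far
    sonnetWords = [sonnetWords; split(sonnets[i])]
end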

sonnetWords[1:10]
Out[0]:
10-element Vector{SubString{String}}:
 "THE"
 "SONNETS"
 "by"
 "William"
 "Shakespeare"
 "I"
 "From"
 "fairest"
 "creatures"
 "we"

Sorting an array by frequency

Find unique words in sonnetWords. Count them and sort them by frequency of occurrence.

To count words that differ only in case as the same word, convert sonnetWords to lowercase. For example, The and the are then counted as the same word.

Load the StatsBase and Statistics packages; functions from these libraries are used below.

In [ ]:
import Pkg; 
Pkg.add("StatsBase")
Pkg.add("Statistics")
In [ ]:
using Statistics, StatsBase

To convert all the words to lowercase, use the function lowercase().

In [ ]:
sonnetWords = lowercase.(sonnetWords)
Out[0]:
17711-element Vector{String}:
 "the"
 "sonnets"
 "by"
 "william"
 "shakespeare"
 "i"
 "from"
 "fairest"
 "creatures"
 "we"
 "desire"
 "increase"
 "that"
 ⋮
 "by"
 "that"
 "i"
 "prove"
 "love's"
 "fire"
 "heats"
 "water"
 "water"
 "cools"
 "not"
 "love"

Find the unique words using the function unique(). sort() is also applied here to put them in ascending (alphabetical) order, which simplifies the later steps.
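
A small sketch of the lowercase, unique, and sort steps on a made-up word list (the names w, lowered, and uniq are illustrative). Strings are compared character by character, so words that begin with an apostrophe sort before words that begin with a letter.

```julia
w = ["The", "the", "'Tis", "Youth", "and", "And"]

lowered = lowercase.(w)        # fold case so "The" and "the" match
uniq = sort(unique(lowered))   # drop duplicates, then order ascending
```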

In [ ]:
words = sort(unique(sonnetWords))
Out[0]:
3435-element Vector{String}:
 "'"
 "''tis"
 "'amen'"
 "'fair"
 "'fore"
 "'gainst"
 "'greeing"
 "'had"
 "'hues'"
 "'i"
 "'love"
 "'no'"
 "'not"
 ⋮
 "you'"
 "you've"
 "young"
 "youngly"
 "your"
 "yours"
 "yourself"
 "yourself's"
 "youth"
 "youth's"
 "youthful"
 "zealous"

Then count the number of times each unique word occurs using the function countmap(). It returns a dictionary that maps each unique value of sonnetWords to its number of occurrences. sort() then orders the dictionary by key.

In [ ]:
numOccurrences = sort(countmap(sonnetWords))
Out[0]:
OrderedCollections.OrderedDict{String, Int64} with 3435 entries:
  "'"        => 16
  "''tis"    => 1
  "'amen'"   => 1
  "'fair"    => 2
  "'fore"    => 1
  "'gainst"  => 6
  "'greeing" => 1
  "'had"     => 1
  "'hues'"   => 1
  "'i"       => 3
  "'love"    => 1
  "'no'"     => 1
  "'not"     => 1
  "'now"     => 1
  "'scap'd"  => 1
  "'since"   => 1
  "'this"    => 2
  "'thou"    => 1
  "'thus"    => 1
  "'thy"     => 1
  "'tis"     => 11
  "'truth"   => 2
  "'twixt"   => 2
  "'will"    => 5
  "'will'"   => 5
  ⋮          => ⋮
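
The kind of result that countmap() produces can be sketched with a plain dictionary and no packages. The helper count_words() below is hypothetical and is not StatsBase's implementation; the input words are made up.

```julia
# Hypothetical helper: map each distinct word to the number of times it occurs.
function count_words(words)
    counts = Dict{String,Int}()
    for w in words
        counts[w] = get(counts, w, 0) + 1   # start at 0 for unseen words
    end
    return counts
end

counts = count_words(["the", "and", "the", "thy", "and", "the"])
```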

Sort the occurrence counts in descending order, from the most common word to the least common.

In [ ]:
rankOfOccurrences = sort(collect(values(numOccurrences)), rev = true)
Out[0]:
3435-element Vector{Int64}:
 490
 436
 409
 371
 370
 341
 321
 320
 280
 233
 181
 171
 168
   ⋮
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1

Also record the indices that sort the counts in descending order. To do this, use sortperm().

In [ ]:
rankIndex = sortperm(collect(values(numOccurrences)), rev = true)
Out[0]:
3435-element Vector{Int64}:
  143
 2881
 2957
 1912
 1994
 1458
 1493
 2879
 2939
 2915
 3293
 1150
 1536
    ⋮
 3411
 3412
 3413
 3415
 3419
 3422
 3424
 3425
 3427
 3431
 3433
 3435
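
The relationship between sort() and sortperm() can be sketched on a toy example with made-up counts. sortperm() returns the permutation of indices, which can then reorder any array aligned with the counts. The names words, counts, order, ranked_counts, and ranked_words are illustrative.

```julia
words  = ["and", "love", "the", "thy"]   # unique words in alphabetical order
counts = [5, 1, 3, 2]                    # made-up counts, aligned with `words`

order = sortperm(counts, rev = true)     # indices that sort the counts descending
ranked_counts = counts[order]            # same result as sort(counts, rev = true)
ranked_words = words[order]              # words reordered by the same permutation
```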

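Use the recorded indices to reorder the words by frequency, and display the 10 most frequent words in the Sonnets. This works because words and the sorted dictionary hold the unique words in the same alphabetical order.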

In [ ]:
wordsByFrequency = words[rankIndex]
wordsByFrequency[1:10]
Out[0]:
10-element Vector{String}:
 "and"
 "the"
 "to"
 "my"
 "of"
 "i"
 "in"
 "that"
 "thy"
 "thou"

Plotting word frequency

Create a plot showing the frequency of words in the Sonnets, from the most common to the least common, on logarithmic axes. According to Zipf's law, the frequency distribution of words in a large body of text follows a power law.
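
A power law f(r) = C * r^(-s) becomes a straight line on log-log axes, because log f = log C - s * log r, so the slope of that line estimates -s. The sketch below fits the slope by least squares on synthetic, ideal 1/rank data, not on the sonnets; all names are illustrative.

```julia
ranks = 1:100
freqs = 1000.0 ./ ranks   # synthetic data that follows f(r) = 1000 / r exactly

x = log10.(ranks)
y = log10.(freqs)

# Least-squares slope of y against x.
xm = sum(x) / length(x)
ym = sum(y) / length(y)
slope = sum((x .- xm) .* (y .- ym)) / sum((x .- xm) .^ 2)
```

For this synthetic input the fitted slope is approximately -1; real word counts follow the law only approximately.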

In [ ]:
using Plots   # can be skipped if Plots is already loaded in the session
plot(rankOfOccurrences, xscale=:log10, yscale=:log10)
Out[0]:
[Figure: log-log plot of the number of occurrences against word rank]

Collecting basic statistics in a table

Take the number of occurrences of each word in sonnetWords. Express it as a percentage of the total number of words, and calculate the cumulative percentage from the most common words to the least common. Store the words and their basic statistics in a table.

In [ ]:
using DataFrames
In [ ]:
T = DataFrame();
T.Words = wordsByFrequency;
T.NumOccurrences = rankOfOccurrences;
T.PercentOfText = rankOfOccurrences / length(sonnetWords) * 100.0;
T.CumulativePercentOfText = cumsum(rankOfOccurrences) / length(sonnetWords) * 100.0;
T[1:10, :]
Out[0]:
10×4 DataFrame
 Row │ Words   NumOccurrences  PercentOfText  CumulativePercentOfText
     │ String  Int64           Float64        Float64
─────┼────────────────────────────────────────────────────────────────
   1 │ and                490        2.76664                  2.76664
   2 │ the                436        2.46175                  5.22839
   3 │ to                 409        2.3093                   7.53769
   4 │ my                 371        2.09474                  9.63243
   5 │ of                 370        2.0891                  11.7215
   6 │ i                  341        1.92536                 13.6469
   7 │ in                 321        1.81243                 15.4593
   8 │ that               320        1.80679                 17.2661
   9 │ thy                280        1.58094                 18.847
  10 │ thou               233        1.31557                 20.1626
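
The percentage columns can be checked by hand with plain arrays. The sketch below uses the three largest counts from the table above and the total of 17711 words reported for sonnetWords; the variable names are illustrative.

```julia
counts = [490, 436, 409]   # counts for "and", "the", "to"
total_words = 17711        # length(sonnetWords) in this example

percent = counts ./ total_words .* 100.0
cumulative = cumsum(counts) ./ total_words .* 100.0
```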

Conclusion

The most frequent word in the Sonnets is and. It occurs 490 times. Together, the ten most frequent words account for 20.163% of the text.
In analyzing the text of the Sonnets, we read data from a file, cleaned it and split it into words, and then sorted and summarized the results using statistical libraries.