
Analysing text data with string arrays

This example shows how to save text from a file as an array of strings, sort words by their frequency of occurrence, plot a graph, and collect basic statistics on the words found in the file.

Importing a text file into an array of strings

Read the text of Shakespeare's Sonnets using the function read(). With String as the second argument, read() returns the entire file as a single string of 100266 characters.

In [ ]:
sonnets = read("/user/Отсортированное/Base_demo/AnalizeTextDataExample/sonnets.txt", String)
sonnets[1:35]
Out[0]:
"THE SONNETS\n\nby William Shakespeare"

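As a minimal sketch of how read() behaves with String as its second argument, the example below writes a short sample text to a temporary file and reads it back. The temporary file and the sample text are assumptions for illustration; they stand in for sonnets.txt.

```julia
# Write a small sample text to a temporary file (stand-in for sonnets.txt).
path, io = mktemp()
write(io, "THE SONNETS\n\nby William Shakespeare\n")
close(io)

# read(path, String) returns the entire file contents as one String.
text = read(path, String)

println(typeof(text))   # concrete type of the result
println(length(text))   # number of characters in the text
println(text[1:11])     # indexing with a range extracts part of the text
```

Because the result is one string rather than a collection of lines, it still has to be split before it can be processed line by line.
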
Convert the text to a string using the string function. Because read() has already returned a String, this call leaves the text unchanged. Then split it into lines using split(). sonnets becomes a vector of 2625 strings, where each element contains one line from the poem. Display the first five lines of sonnets.

In [ ]:
sonnets = string(sonnets)
sonnets = split(sonnets, "\n")
sonnets[1:5]
Out[0]:
5-element Vector{SubString{String}}:
 "THE SONNETS"
 ""
 "by William Shakespeare"
 ""
 ""

Cleaning the array of strings

To calculate the frequency of words in sonnets, first clean it up by removing empty lines and punctuation marks. Then convert it into a string array containing individual words as elements.

Remove strings with zero characters ("") from the array of strings. Compare each element of sonnets with "", an empty string, using the element-wise operator .==. You can create strings, including empty strings, using double quotes. TF is a boolean vector that contains the value true wherever sonnets contains a string with zero characters. Index sonnets with .!TF to keep only the strings that are not empty.

In [ ]:
TF = sonnets .== ""
sonnets = sonnets[.!TF]
sonnets[1:10]
Out[0]:
10-element Vector{SubString{String}}:
 "THE SONNETS"
 "by William Shakespeare"
 "  I"
 "  From fairest creatures we desire increase,"
 "  That thereby beauty's rose might never die,"
 "  But as the riper should by time decease,"
 "  His tender heir might bear his memory:"
 "  But thou, contracted to thine own bright eyes,"
 "  Feed'st thy light's flame with self-substantial fuel,"
 "  Making a famine where abundance lies,"

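The same cleanup can be expressed in other ways. The sketch below compares the boolean mask with filter() and with the keepempty keyword of split(); sampleText and the variable names are made up for illustration and stand in for the sonnets data.

```julia
# Hypothetical sample text standing in for the contents of the file.
sampleText = "THE SONNETS\n\nby William Shakespeare\n\n\n  I"
sampleLines = split(sampleText, "\n")

# Approach from the text: build a boolean mask and keep the non-empty lines.
TF = sampleLines .== ""
viaMask = sampleLines[.!TF]

# Alternative 1: keep the elements for which isempty() is false.
viaFilter = filter(!isempty, sampleLines)

# Alternative 2: ask split() to drop empty pieces while splitting.
viaSplit = split(sampleText, "\n"; keepempty = false)

println(viaMask == viaFilter)
println(viaFilter == viaSplit)
println(viaFilter)
```

All three forms keep the same three non-empty lines here; the mask form is the one that mirrors the explanation above most directly.
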
Replace some punctuation marks with spaces. For example, replace full stops, commas, and semicolons. Keep apostrophes because they may be part of some words in sonnets, such as light's.

In [ ]:
p = ['.','?','!',',',';',':'];
In [ ]:
sonnets = replace.(sonnets[:],p=>" ")
Out[0]:
2311-element Vector{String}:
 "THE SONNETS"
 "by William Shakespeare"
 "  I"
 "  From fairest creatures we desire increase "
 "  That thereby beauty's rose might never die "
 "  But as the riper should by time decease "
 "  His tender heir might bear his memory "
 "  But thou  contracted to thine own bright eyes "
 "  Feed'st thy light's flame with self-substantial fuel "
 "  Making a famine where abundance lies "
 "  Thy self thy foe  to thy sweet self too cruel "
 "  Thou that art now the world's fresh ornament "
 "  And only herald to the gaudy spring "
 ⋮
 "  Whilst many nymphs that vow'd chaste life to keep"
 "  Came tripping by  but in her maiden hand"
 "  The fairest votary took up that fire"
 "  Which many legions of true hearts had warm'd "
 "  And so the general of hot desire"
 "  Was  sleeping  by a virgin hand disarm'd "
 "  This brand she quenched in a cool well by "
 "  Which from Love's fire took heat perpetual "
 "  Growing a bath and healthful remedy "
 "  For men diseas'd  but I  my mistress' thrall "
 "    Came there for cure and this by that I prove "
 "    Love's fire heats water  water cools not love "

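To see the replacement on a single string, the sketch below applies the same character list to one line taken from the output above and to a short made-up phrase containing an apostrophe. The variable names line, cleaned, and possessive are illustrative only.

```julia
# One line from the sonnets, as shown in the earlier output.
line = "  But thou, contracted to thine own bright eyes,"

# A collection of characters on the left of the pair matches any one of them.
p = ['.', '?', '!', ',', ';', ':']
cleaned = replace(line, p => " ")

# Apostrophes are not in p, so words such as light's stay intact.
possessive = replace("thy light's flame;", p => " ")

println(repr(cleaned))
println(repr(possessive))
```

Each punctuation mark becomes exactly one space, which is why some lines end up with double spaces and trailing spaces.
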
Remove the leading and trailing space characters from each element of the array sonnets.

In [ ]:
sonnets = strip.(sonnets[:],' ')
sonnets[1:10]
Out[0]:
10-element Vector{SubString{String}}:
 "THE SONNETS"
 "by William Shakespeare"
 "I"
 "From fairest creatures we desire increase"
 "That thereby beauty's rose might never die"
 "But as the riper should by time decease"
 "His tender heir might bear his memory"
 "But thou  contracted to thine own bright eyes"
 "Feed'st thy light's flame with self-substantial fuel"
 "Making a famine where abundance lies"

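The element type in the output above changes to SubString{String} because strip() returns a view into the original string rather than a copy. The following sketch, using one sample line, illustrates this and the difference between stripping only spaces and stripping all whitespace; the variable names are assumptions for illustration.

```julia
line = "  From fairest creatures we desire increase "

# strip(s, ' ') removes only space characters from both ends.
trimmed = strip(line, ' ')

# The result is a SubString; String() makes an independent copy if needed.
println(typeof(trimmed))
println(repr(String(trimmed)))

# With no second argument, strip() removes all leading and trailing whitespace.
println(repr(strip("\tI \n")))
```
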
Split sonnets into an array of strings whose elements are individual words. You can use split() to separate a string by whitespace or by delimiters you specify. However, split() operates on a single string and returns a vector of substrings. The elements of sonnets contain different numbers of words, so splitting each element produces vectors of different lengths that cannot be arranged as a rectangular array. To use the split() function for sonnets, write a for loop that calls split() for one element at a time.

Create an array of strings sonnetWords. Write a for loop that splits each element of sonnets into words. Concatenate the output of split() with sonnetWords. Each element of sonnetWords is a single word from sonnets.

In [ ]:
sonnetWords = split(sonnets[1]);   # words of the first line
for i = 2:length(sonnets)
    # append the words of line i to the words collected so far
    sonnetWords = [sonnetWords ; split(sonnets[i])];
end

sonnetWords[1:10]
Out[0]:
10-element Vector{SubString{String}}:
 "THE"
 "SONNETS"
 "by"
 "William"
 "Shakespeare"
 "I"
 "From"
 "fairest"
 "creatures"
 "we"

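Broadcasting split() over the lines yields one vector of words per line, so the pieces can also be joined without an explicit loop. The sketch below is one possible loop-free form, not the method used in this example; words_of and sampleLines are hypothetical names introduced for illustration.

```julia
# Split every line on whitespace and concatenate the pieces into one vector.
# With no delimiter, split() treats a run of whitespace as one separator.
function words_of(lines)
    return reduce(vcat, split.(lines))
end

sampleLines = ["THE SONNETS", "by William Shakespeare", "But thou  contracted to thine own"]
sampleWords = words_of(sampleLines)

println(length(sampleWords))
println(sampleWords[1:5])
```

Note that the double space in the third sample line does not produce an empty word, which matches the behaviour of split() inside the loop.
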
Sort the array by frequency

Find unique words in sonnetWords. Count them and sort them by frequency of occurrence.

To count words that differ only in case as the same word, convert sonnetWords to lower case. For example, The and the are counted as the same word.

Install and load the StatsBase and Statistics packages. Functions from these libraries are used in the following steps.

In [ ]:
import Pkg; 
Pkg.add("StatsBase")
Pkg.add("Statistics")
In [ ]:
using Statistics, StatsBase

To convert all words to lower case, use the function lowercase.

In [ ]:
sonnetWords = lowercase.(sonnetWords)
Out[0]:
17711-element Vector{String}:
 "the"
 "sonnets"
 "by"
 "william"
 "shakespeare"
 "i"
 "from"
 "fairest"
 "creatures"
 "we"
 "desire"
 "increase"
 "that"
 ⋮
 "by"
 "that"
 "i"
 "prove"
 "love's"
 "fire"
 "heats"
 "water"
 "water"
 "cools"
 "not"
 "love"

Find unique words using the function unique(), then sort them in ascending (alphabetical) order with sort(). Sorting puts words in the same order as the keys of the sorted dictionary built in the next step, so an index into one also applies to the other.

In [ ]:
words = sort(unique(sonnetWords))
Out[0]:
3435-element Vector{String}:
 "'"
 "''tis"
 "'amen'"
 "'fair"
 "'fore"
 "'gainst"
 "'greeing"
 "'had"
 "'hues'"
 "'i"
 "'love"
 "'no'"
 "'not"
 ⋮
 "you'"
 "you've"
 "young"
 "youngly"
 "your"
 "yours"
 "yourself"
 "yourself's"
 "youth"
 "youth's"
 "youthful"
 "zealous"

Then count how many times each unique word occurs, using the function countmap(). It returns a dictionary that maps each unique value in the array sonnetWords to its number of occurrences. Applying sort() orders the dictionary entries by key, so they follow the same alphabetical order as words.

In [ ]:
numOccurrences = sort(countmap(sonnetWords))
Out[0]:
OrderedCollections.OrderedDict{String, Int64} with 3435 entries:
  "'"        => 16
  "''tis"    => 1
  "'amen'"   => 1
  "'fair"    => 2
  "'fore"    => 1
  "'gainst"  => 6
  "'greeing" => 1
  "'had"     => 1
  "'hues'"   => 1
  "'i"       => 3
  "'love"    => 1
  "'no'"     => 1
  "'not"     => 1
  "'now"     => 1
  "'scap'd"  => 1
  "'since"   => 1
  "'this"    => 2
  "'thou"    => 1
  "'thus"    => 1
  "'thy"     => 1
  "'tis"     => 11
  "'truth"   => 2
  "'twixt"   => 2
  "'will"    => 5
  "'will'"   => 5
  ⋮          => ⋮
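
To make the counting step concrete, here is a minimal sketch that tallies words using only Base Julia. It is an illustration of the idea, not the implementation of countmap(); count_words and sampleWords are hypothetical names.

```julia
# Tally how many times each word occurs, using a plain dictionary.
function count_words(words)
    counts = Dict{String,Int}()
    for w in words
        # get() returns 0 for a word that has not been seen yet
        counts[w] = get(counts, w, 0) + 1
    end
    return counts
end

sampleWords = lowercase.(["The", "rose", "the", "Rose", "and", "the"])
counts = count_words(sampleWords)

# Print the entries in alphabetical order of the words.
for w in sort(collect(keys(counts)))
    println(w, " => ", counts[w])
end
```

A plain Dict has no defined iteration order, which is why the keys are sorted explicitly before printing.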

Sort the occurrence counts from largest to smallest, so that the counts of the most common words come first.

In [ ]:
rankOfOccurrences = sort(collect(values(numOccurrences)), rev = true)
Out[0]:
3435-element Vector{Int64}:
 490
 436
 409
 371
 370
 341
 321
 320
 280
 233
 181
 171
 168
   ⋮
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1

Also record, in an array, the indices that arrange the counts in descending order. The same indices arrange the words from most common to least common. To do this, use sortperm().

In [ ]:
rankIndex = sortperm(collect(values(numOccurrences)), rev = true)
Out[0]:
3435-element Vector{Int64}:
  143
 2881
 2957
 1912
 1994
 1458
 1493
 2879
 2939
 2915
 3293
 1150
 1536
    ⋮
 3411
 3412
 3413
 3415
 3419
 3422
 3424
 3425
 3427
 3431
 3433
 3435
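
The relationship between sort() and sortperm() can be seen on a small example. In the sketch below, sampleWords and sampleCounts are made-up values, not data from the Sonnets.

```julia
sampleWords  = ["and", "love", "rose", "the"]   # alphabetical order
sampleCounts = [7, 3, 1, 5]                      # made-up counts, same order

# Indices that would arrange the counts in descending order.
order = sortperm(sampleCounts, rev = true)

println(order)
# Indexing with the permutation reproduces the sorted counts.
println(sampleCounts[order] == sort(sampleCounts, rev = true))
# The same permutation reorders the words to match.
println(sampleWords[order])
```

This only works when the words and the counts are stored in the same order, which is why both were sorted alphabetically first.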

Using the recorded indices, arrange the words by frequency and display the ten most frequent words in the Sonnets.

In [ ]:
wordsByFrequency = words[rankIndex]
wordsByFrequency[1:10]
Out[0]:
10-element Vector{String}:
 "and"
 "the"
 "to"
 "my"
 "of"
 "i"
 "in"
 "that"
 "thy"
 "thou"

Word frequency graph

Create a graph showing the frequency of words in the Sonnets, starting with the most frequent and ending with the least frequent. According to Zipf's law, the distribution of word frequencies in a large body of text follows a power law. On logarithmic axes, a power law appears as an approximately straight line.

In [ ]:
using Plots  # provides plot(); harmless if the environment has already loaded it
plot(rankOfOccurrences, xscale=:log10, yscale=:log10)
Out[0]:

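One rough way to check how closely a set of counts follows a power law is to fit a straight line to log10(count) against log10(rank); a slope near -1 corresponds to the classic form of Zipf's law. The sketch below does this with ordinary least squares on synthetic data. loglog_slope and syntheticCounts are hypothetical names, and the data is constructed to follow an exact power law.

```julia
# Least-squares slope of log10(count) against log10(rank).
function loglog_slope(counts)
    x = log10.(1:length(counts))   # log of the rank
    y = log10.(counts)             # log of the count
    xm = sum(x) / length(x)
    ym = sum(y) / length(y)
    return sum((x .- xm) .* (y .- ym)) / sum((x .- xm) .^ 2)
end

# Synthetic counts that follow an exact power law: count = 1000 / rank.
syntheticCounts = [1000 / r for r in 1:50]

println(loglog_slope(syntheticCounts))
```

The same function could be applied to rankOfOccurrences, since the counts there are already sorted by rank.
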
Let's compile the statistics into a table

Count the total number of occurrences of each word in sonnetWords. Express each count as a percentage of the total number of words, and calculate the cumulative percentage from the most frequent words to the least frequent words. Record the words and their key statistics in a table.

In [ ]:
using DataFrames
In [ ]:
T = DataFrame();
T.Words = wordsByFrequency;
T.NumOccurrences = rankOfOccurrences;
T.PercentOfText = rankOfOccurrences / length(sonnetWords) * 100.0;
T.CumulativePercentOfText = cumsum(rankOfOccurrences) / length(sonnetWords) * 100.0;
T[1:10, :]
Out[0]:
10×4 DataFrame
 Row │ Words   NumOccurrences  PercentOfText  CumulativePercentOfText
     │ String  Int64           Float64        Float64
─────┼───────────────────────────────────────────────────────────────
   1 │ and                490        2.76664                  2.76664
   2 │ the                436        2.46175                  5.22839
   3 │ to                 409        2.3093                   7.53769
   4 │ my                 371        2.09474                  9.63243
   5 │ of                 370        2.0891                  11.7215
   6 │ i                  341        1.92536                 13.6469
   7 │ in                 321        1.81243                 15.4593
   8 │ that               320        1.80679                 17.2661
   9 │ thy                280        1.58094                 18.847
  10 │ thou               233        1.31557                 20.1626
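
The percentage columns follow directly from the counts. The sketch below reproduces the arithmetic on a tiny made-up vector, sampleCounts, so that the results can be checked by hand; in the example itself the total is length(sonnetWords).

```julia
sampleCounts = [4, 2, 2]        # made-up counts, already sorted descending
total = sum(sampleCounts)       # total number of words in this toy example

# Share of the text taken by each word, in percent.
percent = sampleCounts / total * 100.0

# Running total of the shares, from most to least frequent.
cumulative = cumsum(sampleCounts) / total * 100.0

println(percent)
println(cumulative)
```

The last cumulative value is always 100, because the counts of all unique words add up to the total number of words.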

Conclusion

The most frequent word in the Sonnets is and. It occurs 490 times. Together, the ten most frequent words make up 20.163% of the text. In analysing the text of the Sonnets, we read data from a file, cleaned it, and then sorted and summarised it using statistics libraries.