Analyzing text data with string arrays
This example shows how to read text from a file into an array of strings, sort the words by frequency of occurrence, plot the result, and collect basic statistics on the words found in the file.
Importing a text file into an array of strings
Read the text of Shakespeare's Sonnets using the function read(). With String as the second argument, it returns the text as a single string of 100266 characters.
sonnets = read("/user/start/examples/language_basics/analizetextdataexample/sonnets.txt", String)
sonnets[1:35]
Make sure the text is a string using the string() function. Then split it into lines using split(). sonnets becomes an array of 2625 strings, where each element contains one line from the poems. Display the first five lines of sonnets.
sonnets = string(sonnets)
sonnets = split(sonnets, "\n")
sonnets[1:5]
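As a minimal sketch of the line-splitting step, the snippet below uses a short in-memory string (sampleText, a hypothetical stand-in for the file contents) instead of sonnets.txt. It also shows readlines() as one possible alternative, which gives the same lines for this particular input.

```julia
# Hypothetical sample text standing in for the contents of sonnets.txt
sampleText = "THE SONNETS\n\nby William Shakespeare"

# split() on "\n" keeps blank lines as empty strings
sampleLines = split(sampleText, "\n")

# readlines() on an in-memory buffer yields the same lines for this input
bufferLines = readlines(IOBuffer(sampleText))

println(length(sampleLines))
println(sampleLines == bufferLines)
```

Note that split() returns substrings that share storage with the original string, while readlines() returns independent strings. The two compare equal by content.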
Cleaning the string array
To calculate the frequency of words in sonnets, first clean it up by removing blank lines and punctuation marks. Then convert it to an array of strings containing individual words as elements.
Delete lines with zero characters ("") from the string array. Compare each element of sonnets with "", the empty string. You can create strings, including empty ones, using double quotes. TF is a logical vector that contains true wherever sonnets contains a string with zero characters. Index sonnets with the negation of TF to delete all lines with zero characters.
TF = sonnets .== ""
sonnets = sonnets[.!TF]
sonnets[1:10]
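The masking step can be tried on a small array. This is a sketch with hypothetical sample data (sampleLines). The filter() call is shown only as an equivalent alternative, not as the approach used in this example.

```julia
# Hypothetical sample lines, including two empty ones
sampleLines = ["THE SONNETS", "", "by William Shakespeare", ""]

# Logical mask, as in the text: true where the line is empty
mask = sampleLines .== ""
kept = sampleLines[.!mask]

# Equivalent one-liner using filter() and isempty()
keptFiltered = filter(!isempty, sampleLines)

println(kept)
```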
Replace some punctuation marks with spaces. For example, replace periods, commas, and semicolons. Keep the apostrophes, because they may be part of some words in sonnets, such as light's.
p = ['.','?','!',',',';',':'];
sonnets = replace.(sonnets[:],p=>" ")
Remove the leading and trailing space characters from each element of the array sonnets.
sonnets = strip.(sonnets[:],' ')
sonnets[1:10]
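To see what the two cleaning steps do to a single line, here is a small sketch. The line sampleLine is a hypothetical example, not quoted from the file. The punctuation list is the same as p above.

```julia
# Hypothetical line with leading/trailing spaces and punctuation
sampleLine = " Thy light's flame, with self-substantial fuel; "

# Replace each listed punctuation character with a space
punctuation = ['.', '?', '!', ',', ';', ':']
noPunct = replace(sampleLine, punctuation => " ")

# Remove leading and trailing spaces only; inner spacing is untouched
cleaned = strip(noPunct, ' ')

println(cleaned)
```

The apostrophe in light's survives because it is not in the list. The doubled space left where the comma was is harmless, since split() without a separator treats any run of whitespace as one delimiter.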
Split sonnets into an array of strings whose elements are individual words. split() separates a single string by whitespace or by the separators you specify and returns an array of substrings. The elements of sonnets contain different numbers of spaces and therefore split into different numbers of words, so the results cannot be stacked into a rectangular array. To apply split() to sonnets, write a for loop that calls split() on one element at a time.
Create an array of strings sonnetWords. Write a for loop that splits each element of sonnets. Concatenate the output of split() with sonnetWords. Each element of sonnetWords is an individual word from sonnets.
sonnetWords = split(sonnets[1]);
for i = 2:length(sonnets)
    sonnetWords = [sonnetWords ; split(sonnets[i])];
end
sonnetWords[1:10]
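The loop can be compared with a broadcast-and-flatten variant on a small input. This is only a sketch: sampleLines and the helper functions splitWithLoop and splitWithBroadcast are hypothetical names introduced for illustration. The loop is wrapped in a function so it behaves the same in a script as in an interactive session.

```julia
# Hypothetical sample lines with uneven word counts and a doubled space
sampleLines = ["shall I compare thee", "to a  summer's day"]

# Loop version, mirroring the text: grow the word list one line at a time
function splitWithLoop(lines)
    words = split(lines[1])
    for i in 2:length(lines)
        words = [words; split(lines[i])]
    end
    return words
end

# Broadcast split() over all lines, then flatten the per-line results
function splitWithBroadcast(lines)
    return reduce(vcat, split.(lines))
end

loopWords = splitWithLoop(sampleLines)
flatWords = splitWithBroadcast(sampleLines)
println(flatWords)
```

Both versions produce the same word list. The broadcast form avoids repeatedly reallocating the growing array, which may matter for larger texts.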
Sorting an array by frequency
Find unique words in sonnetWords. Count them and sort them by frequency of occurrence.
To count words that differ only in case as the same word, convert sonnetWords to lowercase. For example, The and the are treated as the same word.
Load the StatsBase and Statistics packages. Functions from these libraries are used in the following steps.
import Pkg;
Pkg.add("StatsBase")
Pkg.add("Statistics")
using Statistics, StatsBase
To convert all words to lowercase, use the function lowercase().
sonnetWords = lowercase.(sonnetWords)
Find the unique words using the function unique(). sort() is also applied here to arrange the words in ascending (alphabetical) order, which simplifies the steps that follow.
words = sort(unique(sonnetWords))
Then count the number of times each unique word occurs using the function countmap(). It returns a dictionary that maps each unique value in sonnetWords to its number of occurrences. Applying sort() orders the dictionary by key, so the counts line up with the entries of words.
numOccurrences = sort(countmap(sonnetWords))
Sort the occurrence counts from the most common to the least common.
rankOfOccurrences = sort(collect(values(numOccurrences)), rev = true)
Also record the indices that sort the words by number of occurrences. To do this, use sortperm().
rankIndex = sortperm(collect(values(numOccurrences)), rev = true)
Using the recorded indices, order the words by frequency and display the 10 most frequent words in the Sonnets.
wordsByFrequency = words[rankIndex]
wordsByFrequency[1:10]
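The counting and ranking steps can be traced on a tiny word list. The sketch below uses only Base Julia: the hand-written counting loop is a stand-in for countmap(), and sampleWords is hypothetical data. The key idea is the same as above: keep words and counts in the same (alphabetical) order, then reorder both with the indices from sortperm().

```julia
# Hypothetical, already lowercased word list
sampleWords = ["the", "rose", "and", "the", "thorn", "and", "the"]

# Hand-rolled stand-in for countmap(): word => number of occurrences
counts = Dict{String,Int}()
for w in sampleWords
    counts[w] = get(counts, w, 0) + 1
end

# Sort the unique words, then read the counts in that same order
sampleUnique = sort(collect(keys(counts)))
sampleCounts = [counts[w] for w in sampleUnique]

# Indices that order the counts from largest to smallest
order = sortperm(sampleCounts, rev = true)
ranked = sampleUnique[order]
rankedCounts = sampleCounts[order]

println(ranked[1], " occurs ", rankedCounts[1], " times")
```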
Plotting word frequency
Create a plot showing the frequency of words in the Sonnets, from the most common to the least common. According to Zipf's law, the frequency distribution of words in a large body of text follows a power law, so both axes use a logarithmic scale.
using Plots  # assumed plotting package; can be skipped if your environment already loads it
plot(rankOfOccurrences, xscale=:log10, yscale=:log10)
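A power law appears as a straight line on log-log axes, because frequency = C / rank^s implies log(frequency) = log(C) - s * log(rank). The sketch below checks this on synthetic, idealized data (not the Sonnets counts) by fitting a least-squares line to the logarithms. The names ranks, freqs, slope, and intercept are introduced here for illustration only.

```julia
# Synthetic ideal Zipf data: frequency = 1000 / rank (exponent s = 1)
ranks = 1:50
freqs = 1000.0 ./ ranks

# Work in log-log space, where a power law becomes a straight line
x = log10.(ranks)
y = log10.(freqs)

# Ordinary least-squares fit of y = intercept + slope * x
xbar = sum(x) / length(x)
ybar = sum(y) / length(y)
slope = sum((x .- xbar) .* (y .- ybar)) / sum((x .- xbar) .^ 2)
intercept = ybar - slope * xbar

println("slope = ", round(slope, digits = 3))
println("intercept = ", round(intercept, digits = 3))
```

For real text the points only approximately follow a line, so a fitted slope from rankOfOccurrences should be read as a rough estimate of the exponent.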
Collecting basic statistics in a table
Calculate the number of occurrences of each word in sonnetWords. Express each count as a percentage of the total number of words, and calculate the cumulative percentage from the most common words to the least common. Store the words and their basic statistics in a table.
using DataFrames
T = DataFrame();
T.Words = wordsByFrequency;
T.NumOccurrences = rankOfOccurrences;
T.PercentOfText = rankOfOccurrences / length(sonnetWords) * 100.0;
T.CumulativePercentOfText = cumsum(rankOfOccurrences) / length(sonnetWords) * 100.0;
T[1:10, :]
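The two percentage columns can be verified by hand on a small example. The sketch below uses hypothetical counts (sampleCounts) and plain arrays instead of a DataFrame. The total number of words equals the sum of the counts.

```julia
# Hypothetical counts, already sorted from most to least frequent
sampleCounts = [3, 2, 1, 1]
totalWords = sum(sampleCounts)

# Share of the text taken by each word, in percent
percentOfText = sampleCounts / totalWords * 100.0

# Running share of the text covered by the top-ranked words
cumulativePercent = cumsum(sampleCounts) / totalWords * 100.0

for (c, p, cp) in zip(sampleCounts, percentOfText, cumulativePercent)
    println(c, "  ", round(p, digits = 1), "  ", round(cp, digits = 1))
end
```

The cumulative column always ends at 100, which is a quick sanity check for the table built above.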
Conclusion
The most frequent word in the Sonnets is "and". It occurs 490 times. Together, the ten most frequent words make up 20.163% of the text.
In analyzing the text of the Sonnets, we read data from a file, then sorted and processed the information using statistical libraries.