Analysing text data with string arrays
This example shows how to save text from a file as an array of strings, sort words by their frequency of occurrence, plot a graph, and collect basic statistics on the words found in the file.
Importing a text file into an array of strings
Read the text of Shakespeare's Sonnets using the function read(). With String as the second argument, it returns the whole file as a single string of 100266 characters.
sonnets = read("/user/Отсортированное/Base_demo/AnalizeTextDataExample/sonnets.txt", String)
sonnets[1:35]
Make sure the text is a string by calling string(); read() with String already returns one, so this call leaves it unchanged. Then split it into lines using split(). sonnets becomes a vector of 2625 elements, where each element contains one line from the poems. Display the first five lines of sonnets.
sonnets = string(sonnets)
sonnets = split(sonnets, "\n")
sonnets[1:5]
Cleaning the string array
To calculate the frequency of words in sonnets, first clean it up by removing empty lines and punctuation marks. Then convert it into a string array containing individual words as elements.
Remove strings with zero characters ("") from the array of strings. Compare each element of sonnets with "", the empty string, using the broadcast operator .==. You can create strings, including empty strings, using double quotes. TF is a Boolean vector that contains the value true wherever sonnets contains a string with zero characters. Index sonnets with the negation .!TF to keep only the non-empty strings.
TF = sonnets .== ""
sonnets = sonnets[.!TF]
sonnets[1:10]
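As a minimal sketch of this filtering step, the example below applies the same comparison to a few hypothetical sample lines (not the real file contents) and shows an equivalent form using filter() and isempty() from Base. The names lines, isEmptyLine, kept and keptAlt are chosen for illustration only.

```julia
# Hypothetical sample lines standing in for the contents of sonnets.txt
lines = ["THE SONNETS", "", "by William Shakespeare", "", "  I"]

# Broadcast comparison with the empty string, as in the text above
isEmptyLine = lines .== ""
kept = lines[.!isEmptyLine]

# Equivalent alternative: keep every element for which isempty() is false
keptAlt = filter(!isempty, lines)
```

Both forms keep the same elements; filter() avoids creating the intermediate Boolean vector.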
Replace some punctuation marks with spaces. For example, replace full stops, commas, and semicolons. Keep apostrophes because they may be part of some words in sonnets, such as light's.
p = ['.','?','!',',',';',':'];
sonnets = replace.(sonnets[:],p=>" ")
Remove the leading and trailing space characters from each element of the array sonnets.
sonnets = strip.(sonnets[:],' ')
sonnets[1:10]
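The two cleaning steps can be tried on a small scale as follows. This is a sketch on two hypothetical sample lines; the names sample, punct, cleaned and trimmed are assumptions for illustration. The pair is wrapped in Ref() so that broadcasting explicitly treats it as a single value, and strip() is called without a second argument, which removes all leading and trailing whitespace rather than only spaces.

```julia
# Hypothetical sample lines with punctuation and leading spaces
sample = ["From fairest creatures we desire increase,",
          "  But thy light's flame;"]

# Punctuation marks to replace; the apostrophe is deliberately not listed
punct = ['.', '?', '!', ',', ';', ':']

# Replace every listed character with a space in each line
cleaned = replace.(sample, Ref(punct => " "))

# Remove leading and trailing whitespace from each line
trimmed = strip.(cleaned)
```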
Split sonnets into an array of strings whose elements are individual words. You can use split() to separate a string by whitespace or by delimiters you specify. However, split() works on one string at a time and returns a vector of substrings. The elements of sonnets contain different numbers of words, so the results have different lengths and cannot be stacked into a rectangular array. To use the split() function for sonnets, write a for loop that calls split() for one element at a time.
Create an array of strings sonnetWords. Write a for loop that splits each element of sonnets into words and appends the output of split() to sonnetWords. Each element of sonnetWords is an individual word from sonnets.
sonnetWords = split(sonnets[1]);
for i = 2:length(sonnets)
    sonnetWords = [sonnetWords; split(sonnets[i])];
end
sonnetWords[1:10]
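The loop above builds a new array on every iteration. As a possible alternative, the sketch below splits every line with a broadcast call and concatenates the results once with reduce(vcat, ...). The sample lines and the names wordsPerLine and allWords are assumptions for illustration.

```julia
# Hypothetical sample lines with different numbers of words
lines = ["From fairest creatures",
         "That thereby beauty's rose might never die"]

# One vector of words per line; the inner vectors have different lengths
wordsPerLine = split.(lines)

# Concatenate all per-line vectors into a single flat vector of words
allWords = reduce(vcat, wordsPerLine)
```

The result holds the same words in the same order as the loop version would produce for these lines.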
Sort the array by frequency
Find unique words in sonnetWords. Count them and sort them by frequency of occurrence.
To count words that differ only in case as the same word, convert sonnetWords to lower case. For example, The and the are counted as the same word.
Load the StatsBase and Statistics packages; functions from these libraries are used in the steps below.
import Pkg;
Pkg.add("StatsBase")
Pkg.add("Statistics")
using Statistics, StatsBase
To convert all words to lower case, use the function lowercase.
sonnetWords = lowercase.(sonnetWords)
Find the unique words using the function unique(). The result is also sorted in ascending (alphabetical) order with sort(), which makes the following steps easier.
words = sort(unique(sonnetWords))
Then count how many times each unique word occurs, using the function countmap(). It returns a dictionary that maps each unique value of the array sonnetWords to the number of its occurrences. Sorting the dictionary by key puts its entries in the same alphabetical order as words.
numOccurrences = sort(countmap(sonnetWords))
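To make the counting step concrete, here is a minimal sketch of the kind of tally that countmap() produces, written with a plain Dict from Base. The sample word list and the names counts, sortedKeys and sortedCounts are assumptions for illustration; this is not a description of how StatsBase implements countmap().

```julia
# Hypothetical sample of lower-case words
sampleWords = ["the", "rose", "the", "and", "rose", "the"]

# Tally occurrences: start each word at 0 and add 1 per occurrence
counts = Dict{String,Int}()
for w in sampleWords
    counts[w] = get(counts, w, 0) + 1
end

# Read the counts out in alphabetical key order
sortedKeys = sort(collect(keys(counts)))
sortedCounts = [counts[k] for k in sortedKeys]
```

Reading the counts in key order is what keeps them aligned with an alphabetically sorted list of unique words.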
Sort the occurrence counts in descending order, from the most common word to the least common.
rankOfOccurrences = sort(collect(values(numOccurrences)), rev = true)
Also store the indices that order the words by number of occurrences in an array. To do this, use sortperm().
rankIndex = sortperm(collect(values(numOccurrences)), rev = true)
Using the stored indices, display the ten most frequent words in the Sonnets.
wordsByFrequency = words[rankIndex]
wordsByFrequency[1:10]
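The role of sortperm() can be seen on a small example. The sketch below uses hypothetical words and counts (the names uniqueWords, occurrences, order, byFrequency and sortedCounts are illustrative) and assumes, as in the steps above, that the two arrays list the words and their counts in the same order.

```julia
# Hypothetical unique words (alphabetical) and their counts in the same order
uniqueWords = ["and", "rose", "the", "thy"]
occurrences = [5, 2, 9, 4]

# Indices that would sort the counts in descending order
order = sortperm(occurrences, rev = true)

# Apply the same permutation to both arrays so they stay aligned
byFrequency = uniqueWords[order]
sortedCounts = occurrences[order]
```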
Word frequency graph
Create a graph showing the frequency of words in Sonnets, starting with the most frequent and ending with the least frequent. According to Zipf's law, the distribution of word frequencies in an extensive text follows a power law.
using Plots
plot(rankOfOccurrences, xscale=:log10, yscale=:log10)
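A power law of the form count = C * rank^(-s) appears as a straight line with slope -s on logarithmic axes. As a rough sketch of how the exponent could be estimated, the example below fits a least-squares line to log-transformed values. It uses synthetic counts that follow an exact power law with s = 1, not the sonnet data, and the names ranks, counts, intercept and slope are assumptions for illustration.

```julia
using LinearAlgebra

# Synthetic counts following count = 1000 / rank (hypothetical data)
ranks = 1:50
counts = 1000.0 ./ ranks

# Work in log-log coordinates
x = log10.(ranks)
y = log10.(counts)

# Least-squares fit of y ≈ intercept + slope * x
A = [ones(length(x)) x]
intercept, slope = A \ y
```

For real word counts the points only approximately follow a line, so the fitted slope is an estimate whose value depends on which ranks are included.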
Compiling the statistics into a table
Count the total number of occurrences of each word in sonnetWords. Express the number of occurrences as a percentage of the total number of words and calculate the cumulative percentage from the most frequent words to the least frequent words. Record the words and their key statistics in a table.
using DataFrames
T = DataFrame();
T.Words = wordsByFrequency;
T.NumOccurrences = rankOfOccurrences;
T.PercentOfText = rankOfOccurrences / length(sonnetWords) * 100.0;
T.CumulativePercentOfText = cumsum(rankOfOccurrences) / length(sonnetWords) * 100.0;
T[1:10, :]
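The percentage columns can be checked by hand on a small example. The sketch below uses hypothetical counts already sorted in descending order and only Base functions; the names counts, total, percent and cumulative are illustrative.

```julia
# Hypothetical occurrence counts, sorted from most to least frequent
counts = [9, 5, 4, 2]
total = sum(counts)

# Share of the text taken by each word, in percent
percent = counts ./ total .* 100

# Running share of the text covered by the most frequent words so far
cumulative = cumsum(counts) ./ total .* 100
```

The last cumulative value is always 100, because all words together make up the whole text.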
Conclusion
The most frequent word in the Sonnets is and. It occurs 490 times. Together, the ten most frequent words make up 20.163% of the text.
In analysing the text of the Sonnets, we read data from a file, cleaned and split it into words, and then sorted and summarised the words using statistical libraries.