Clustering of documents using the "bag of words" method

In this example, we will explore a way of presenting documents called a "bag of words" (BoW). This way of translating documents into numbers will allow us to translate each text into a vector representation. Despite the primitiveness of this vectorization method, it is possible to implement applied tasks based on it, provided that a suitable data set is available.

In this project, we will solve the clustering problem, which does not require training and test datasets, but simply divides the dataset into groups based on their coordinates in some space. First, we will study how to put texts into a homogeneous vector space, then how to saturate it with information, and finally how to see the result of clustering a set of texts.

# EngeePkg.purge()

Pkg.add(["XLSX", "Clustering", "MultivariateStats", "TextAnalysis", "SnowballStemmer", "StatsBase"])

using DataFrames, CSV, XLSX
using Statistics, StatsBase, LinearAlgebra, Clustering, Distances, MultivariateStats
using TextAnalysis, SnowballStemmer

stopwords_ru = CSV.read("$(@__DIR__)/stopwords.csv", DataFrame, header=0).Column1;
stemmer_ru = SnowballStemmer.Stemmer("russian");

We have prepared the environment and uploaded the necessary objects - a list of "stop words" that are skipped during the analysis and the simplest algorithm for detecting the root of a word (in order to get rid of word forms). Let's start downloading the texts.

Upload the texts

The data set that we will work with in this example will be a table with names and descriptions of examples from the Engee community. This is a set of small articles on various technical topics, from radio to economic calculations. The titles and short descriptions are written by the authors of the articles.

Let's see if we can use a primitive BoW to cluster these texts.

df = DataFrame(XLSX.readtable("$(@__DIR__)/Community.xlsx", "Community")) # Loading the table

titles = df.post_title
texts = df.post_title .* ". " .* df.post_description # We take the name and a short description

include("$(@__DIR__)/_scripts/doc_statistics.jl")
doc_statistics(texts)

Total documents: 862
Length of documents: average: 17.599767981438514
                  the median: 18.0
                      min: 5
                     Max: 32
                   Empty: 0

We should immediately note that the texts are quite short. With a sufficiently variable dictionary, the information content of each text will be quite low.

Creating a case

The set of analyzed texts is called a corpus. They are both objects that fix the linguistic norm, and individual instances that can break out of it. First, we need to reduce the variability of words that occur. We will convert each text to lowercase (lowercase) and assemble the object. Corpus from objects StringDocument.

using TextAnalysis
corpus = Corpus( StringDocument.(lowercase.(texts)) )
update_lexicon!(corpus)
print("Words in the dictionary: $(length(corpus.lexicon))")

Words in the dictionary: 5295

If we were to start working with this corpus now, each document could be represented by a vector of 5220 markers, each of which corresponds to the number of times a particular word occurs in the document. There are a lot more words than documents now. We need to simplify this space a bit. To do this, we will remove all numbers, spaces, and punctuation from the texts and reassemble the corpus.:

prepare!(corpus, strip_punctuation | strip_numbers | strip_whitespace )
update_lexicon!(corpus)
print("Words in the dictionary: $(length(corpus.lexicon))")

Words in the dictionary: 4793

Let's remove stop words and words that are shorter than two characters from it.:

corpus.lexicon = Dict( (String(word), Int(freq)) for (word, freq) in corpus.lexicon if (length(word) > 2 && word ∉ stopwords_ru))
print("Words in the dictionary: $(length(corpus.lexicon))")

Words in the dictionary: 4685

And the remaining words will be passed through a "stemmer" - an algorithm that cuts off typical word endings. Only cropped constructions will remain in the dictionary, which will sometimes match the root of the word. But we'll get rid of the word forms and be able to work with more unique constructions.

using StatsBase
corpus.lexicon = countmap([SnowballStemmer.stem(stemmer_ru, w) for (w,f) in corpus.lexicon for _ in 1:f])
print("Words in the dictionary: $(length(corpus.lexicon))")

Words in the dictionary: 2840

Each document is characterized by a certain set of words. But for our analysis, we won't need unique words that don't occur in any other document. Let's discard them by filtering out those words from the dictionary that do not occur more than once in any document.

corpus.lexicon = Dict( (String(word), Int(freq)) for (word, freq) in corpus.lexicon if freq >= 2)
print("Words in the dictionary: $(length(corpus.lexicon))")

Words in the dictionary: 1480

Function dtm() collects a matrix in which a set of words from the dictionary is assigned to each document.

dtm = DocumentTermMatrix(corpus) # Signs × Documents

A 862 X 1480 DocumentTermMatrix

We have 850 documents, many of which are characterized by the same words (vertical lines in the matrix), so in the final analysis we will see clusters around uninformative words like "work" or "example".

# BoW (frequency of words)
bow_dense = Matrix(dtm[:, :])'  # Признаки × Документы

# PCA
pca = fit(PCA, bow_dense; maxoutdim=3)
X_pca = MultivariateStats.predict(pca, bow_dense)  # (3, n_docs)

# Visualization (color = number of words in the document)
scatter(X_pca[1, :], X_pca[2, :], X_pca[3, :], legend=false,
        title="BoW projection", markersize=2, markerstrokewidth=0, alpha=0.7)

This is not a typical result for document clustering, although it would also be interesting to analyze it. Perhaps these are clusters along the length of the documents, and the variance arises from the fact that we projected the data into a reduced-dimensional space, losing some information about their variability, although the documents were very scattered in the original multidimensional space.

We have arranged all the documents in a 1,458-dimensional space. Naturally, to see them on the graph, we need to reduce the dimension, and the PCA analysis of the main components of the vector space will help us in this. It finds the projection in which the data is most variable.

To highlight the outstanding words, we will continue to work with a sparse representation of the text, but we will try to highlight the subject of each text. We will not establish connections between words (dense text analysis), but will raise the weight of some words relative to others.

Noise reduction using TFIDF

The next method that we need to try to apply is the method of "weighing" words in our bag, which is called TFIDF (term frequency, inverse document frequency). We will add weights to the words so that not all of them have the same meaning for analysis.

In our vector space, the documents will no longer be located in the same place as the rest of the documents where the word "work" is located. Along the axes of uninformative words, documents will slide closer to zero, and these axes will become uninformative.

The most expressive coordinates will be the numbers that characterize the presence of words in the document that are rarely found in the entire corpus. The more often a word occurs in a document and the less frequently it occurs in the corpus, the further away our document will be from zero in the coordinate corresponding to this dictionary word.

dtm = DocumentTermMatrix(corpus)
tfidf_mat = tf_idf(dtm)

862×1480 SparseArrays.SparseMatrixCSC{Float64, Int64} with 1614 stored entries:
⎡⣃⡷⣺⡔⡅⠆⠄⠣⠈⣀⠈⠀⠀⠚⣳⢨⠠⠒⠰⠀⠰⠀⠃⠂⠀⢠⡶⠰⢰⠠⠀⢪⠐⠀⠑⠦⠰⠁⣈⠄⎤
⎢⠳⡿⢯⢽⡣⢃⡀⡀⠙⣦⠀⠀⠠⣺⠁⢈⠂⢂⠀⠀⢈⠀⠣⢀⡁⢸⡷⠔⠡⢙⠀⢝⡁⢂⠈⢀⠠⠡⠡⠐⎥
⎢⡹⣭⠴⢺⡇⠀⠀⣀⠈⠐⠨⡁⠀⢸⠐⠁⠐⠄⠂⠀⠄⠁⡠⠀⡆⠈⡗⠀⠀⠁⠀⠄⠈⠀⡨⠡⠀⢀⣅⠌⎥
⎢⢫⡿⣑⡊⣖⡃⠈⢀⣤⠁⠀⡘⠀⢚⢔⣩⡰⡄⠀⠀⢐⢄⠐⡈⡁⠐⡷⢂⢲⣨⠀⣈⠀⠀⠅⠁⠰⢦⢑⠈⎥
⎢⣼⣿⣛⣅⡗⡄⠁⠡⠁⡂⠀⢀⠠⢼⢐⠼⠅⠀⠀⠀⠠⠀⠀⠀⠄⠠⡃⠏⢰⠨⣀⣌⠈⠀⣙⠁⠀⠎⡦⢡⎥
⎢⢸⡿⠡⡈⢃⡁⣂⢀⠀⠔⠀⠶⠈⢨⠈⢘⣀⠍⠄⠀⢘⢁⢁⠀⠂⠂⡇⢐⠰⢘⠀⠇⢀⡐⠠⡂⠀⠪⡐⠀⎥
⎢⢃⡿⣾⡖⡃⠓⠂⠐⠔⠀⠀⠀⠀⢹⠠⠡⢀⠖⠐⠀⠀⠀⠢⠐⠅⠀⡃⢶⡡⣠⢂⠃⠨⠐⢆⡄⠀⠣⠤⠀⎥
⎢⣜⠿⡶⠊⡁⡀⠀⡠⠠⠧⠀⢤⠀⢸⠰⠀⢚⠺⠀⠀⢀⠠⠁⠅⠆⠠⡯⢰⣸⠁⢄⠗⠀⠈⠀⡔⠀⡂⡄⠈⎥
⎢⠒⣿⢦⢈⠃⡂⡈⠀⢂⠂⠀⡡⠀⢸⠀⠐⠀⡲⠈⢀⠐⠀⠀⠀⠇⠀⡗⠐⢪⠄⠀⡄⠑⠀⠖⡀⠀⣥⡐⢐⎥
⎢⠀⣿⠠⢄⠁⠈⠐⡠⢤⡉⠐⡐⠀⢸⡩⠄⠐⡆⠍⠀⠐⡌⠪⠀⡃⠀⡇⠐⠐⢹⠀⢇⠠⠀⢄⠠⢀⡑⠐⠠⎥
⎢⣖⣿⢏⠳⢐⠤⠀⣥⠡⡽⠀⠑⠀⣸⠤⠃⠅⡁⢈⢐⠠⠡⢈⠀⡊⠀⡇⠑⡠⠂⡅⡧⠈⠐⠀⠅⠈⢃⡦⠂⎥
⎣⠒⠋⠋⠉⠃⠁⠀⠃⠈⠁⠀⠀⠀⠘⠑⠀⠀⠁⠂⠐⠀⠈⠀⠀⠁⠀⠃⠂⠃⠀⠀⠁⠉⠀⠁⠘⠀⠀⠃⠀⎦

A simple TFIDF document matrix looks the same as the frequency matrix after BoW, but we only see the presence or absence of a number, not its value. The most unique or specific words will gain more weight.

# Maximum weight per document (unique/specific words)
vocab = collect(keys(corpus.lexicon))
max_tfidf = maximum(tfidf_mat, dims=2)[:]
top_idx = sortperm(max_tfidf, rev=true)[1:10]
top_words = join(vocab[top_idx], ", ")

"modulator, three-dimensional, permutation, indexation, semitones, minimization, notation, cipher, regular, instrument"

The problem is that we work with short texts. If there are from 2 to 23 words in a text (the average length is 13), then almost any words are rare and will look like noise. Nevertheless, let's see how our texts are arranged in space now.

tfidf_dense = Matrix(tfidf_mat)'; # Dimensions: Features × Documents

mean_vec = mean(tfidf_dense, dims=2)      # Center the data first
centered_data = tfidf_dense .- mean_vec

pca = fit(PCA, centered_data; maxoutdim=3)
X_pca = MultivariateStats.predict(pca, (tfidf_dense))  # (n_docs, n_components)

scatter(X_pca[1, :], X_pca[2, :], X_pca[3, :], title="Projection of the token space", leg=false,
                markersize=2, markerstrokewidth=0, alpha=0.7)

plot!(size=(1000,400), titlefont=font(8))

Interestingly, many tests are located in the middle of the coordinate system (not representative), but there are several "rays" - groups of words that occur together and characterize their groups of documents. This is a typical picture for the thematic modeling task.

Nothing guarantees that these groups of words are related to each other, we have a fairly small data set and short texts, the grouping may be the result of a random distribution.

In addition, this is a grouping after projection, there might not have been any "rays" in the original space. But the texts were definitely located at different distances from the origin.

Let's perform clusterization for the number of clusters we set.

# Clusterization
k = 5
clusters = kmeans(X_pca, k; distance=CosineDist(), maxiter=500)

# For each cluster, we find the top 3 most remote documents.
top_ids_list = Dict{Int, Vector{Int}}()
for cluster_id in 1:k
    idx_in_cluster = findall(clusters.assignments .== cluster_id)
    if isempty(idx_in_cluster); continue; end
    
    center = clusters.centers[:, cluster_id]
    distances = [norm(X_pca[:, i] - center) for i in idx_in_cluster]
    top_idx = idx_in_cluster[sortperm(distances, rev=true)[1:min(3, end)]]
    top_ids_list[cluster_id] = top_idx
end

# We draw each cluster with its legend separately
p = plot()
for cluster_id in 1:k
    idx_in_cluster = findall(clusters.assignments .== cluster_id)
    if isempty(idx_in_cluster); continue; end
    
    # Cluster name in the legend
    label = string("Class ", cluster_id, " (", join(top_ids_list[cluster_id], ", "), ")")
    
    scatter!(p, X_pca[1, idx_in_cluster], X_pca[2, idx_in_cluster], X_pca[3, idx_in_cluster],
             label=label, markersize=3, alpha=0.6)
end

plot!(p, title="Ray clustering (k=$k)", legend=:outertopright, size=(1000,600))
display(p)

Cosine metric clustering, for centered data, allows us to cluster documents by their "spatial angle". We tried to make the individual rays fall into separate clusters.

Here are some of the most representative documents for each cluster.:

max_docs = 6

for cluster_id in 1:k
    idx_in_cluster = findall(clusters.assignments .== cluster_id)
    if isempty(idx_in_cluster); continue; end
    
    # Keywords for the cluster (peripheral documents)
    center = clusters.centers[:, cluster_id]
    distances = [norm(X_pca[:, i] - center) for i in idx_in_cluster]
    threshold = quantile(distances, 0.7)
    peripheral_idx = idx_in_cluster[distances .>= threshold]
    
    # Top 5 words from the periphery
    if !isempty(peripheral_idx)
        avg_tfidf = mean(tfidf_mat[peripheral_idx, :], dims=1)[:]
        top_idx = sortperm(avg_tfidf, rev=true)[1:min(5, end)]
        top_words = vocab[top_idx]
        keywords = join(top_words, ", ")
    else
        keywords = "—"
    end
    
    # Top 3 most deleted documents
    top_docs = idx_in_cluster[sortperm(distances, rev=true)[1:min(max_docs, end)]]
    
    println("\Cluster $cluster_id ($(length(idx_in_cluster)) dock, keywords: $keywords)")
    for (rank, idx) in enumerate(top_docs)
        println("  $rank. [$idx] $(titles[idx])")
    end
end

Cluster 1 (111 doc., keywords: snapshot, taken, projected, klyuchev, earth)
  1. [41] Dimensionality reduction
  2. [173] Bouncing ball (physical model)
  3. [222] B16 Malfunction Lockout Test Program
  4. [424] A model of a missile with a homing system
  5. [469] Permanent magnet motor
  6. [565] Automation of models for IoT using callbacks

Cluster 2 (432 doc., keywords: minimization, non-stationary, seb, fans, matrices)
  1. [19] Ensuring the reliability of measurements
  2. [84] The Julia Set
  3. [95] Christmas tree
  4. [165] Kramer's Rule
  5. [191] The Cantor set
  6. [202] Balanced brackets

Cluster 3 (24 doc., keywords: disadvantage, integrated, shift, diagram, level)
  1. [254] Conveyor model
  2. [261] Asynchronous motor model
  3. [619] Calculation of the temperature field of a flat heat exchanger
  4. [646] Payroll
  5. [648] Calculation of square level roots
  6. [649] Simulation of population growth

Cluster 4 (30 doc., keywords: diaphragm, drone, derivative, modulator, matlab)
  1. [117] Calculation of the steady-state mode of the IEEE 9 Bus model
  2. [169] CUSIP
  3. [388] The Goodwin Model: an empirical analysis in Julia
  4. [87] Rapid prototyping of a garland control algorithm
  5. [371] AFE regenerative voltage rectifier
  6. [833] Modeling of control systems using the example of a PID controller

Cluster 5 (265 doc., keywords: utg, gazprom, receiver, derivative, decimator)
  1. [81] Modeling of the Solar system
  2. [86] Detection curves part 2: Monte Carlo simulation
  3. [93] Emergency flight abort system
  4. [258] Optimization of the stochastic objective function
  5. [295] How to estimate the denominator of a geometric progression?
  6. [303] Real-time signal reconstruction using AI

It's not that the differences are obvious. Let's look at the statistics:

# Let's check the sparsity of the matrix
println("Sparsity of TF-IDF matrix: $(1 - SparseArrays.nnz(tfidf_mat)/prod(size(tfidf_mat)))%")
println("Average number of non-zero elements per document: $(SparseArrays.nnz(tfidf_mat)/size(tfidf_mat, 1))")

# Let's look at the variance
row_variances = [var(collect(row)) for row in eachrow(tfidf_mat)]
println("Variance by documents: average $(mean(row_variances))")
println("                         median $(median(row_variances))")
println("                        min/max $(minimum(row_variances)) / $(maximum(row_variances))")

Sparsity of the TF-IDF matrix: 0.998734871762714%
Average number of non-zero elements per document: 1.8723897911832947
Variance according to documents: average 0.0072900143183377945
                         median 0.004897149381269027
                        Min/max 0.0 / 0.03086995392837496

Our documents don't lend themselves very well to this kind of clustering. The high sparsity of the matrix means that very few documents can be grouped according to such a vector feature.

We also see that each document is characterized by an average of 1.8 unique words (our matrix has so many non-zero elements on average). This is very small, it is better to increase this figure to 5-10% of the vocabulary, but since our texts are very short, we will not be able to do this.

Having 1-2 significant words per document (after stemming), we get almost zero variance - all documents are very poor in information.

Conclusion

Sparse text analysis methods usually show good results in auxiliary tasks: spam filtering, identification of unique texts or symbols, comparison of documents with specific terminology, and basic clustering.

We have also studied the methods of text preprocessing that are used at every step, in every text analysis application project. As you can see, we can use a dozen functions to organize a relatively complex text processing chain.