Engee documentation
Notebook

Clustering

Clustering (or cluster analysis) is the task of dividing a set of objects into groups called clusters. There should be "similar" objects inside each group, and objects from different groups should be as different as possible. The main difference between clustering and classification is that the list of groups is not specified and is determined during the operation of the algorithm.

This example will show how to perform clustering using the K-means method and the fuzzy clustering method (C-means).

Connecting the necessary libraries:

In [ ]:
import Pkg
Pkg.add(["RDatasets", "Clustering"])
In [ ]:
using Plots
using Clustering 

Uploading a dataset for cluster analysis:

In [ ]:
using RDatasets
iris = dataset("datasets", "iris");

Extracting the species labels of the observations (the fifth column of the data set) into a separate variable:

In [ ]:
iris_species = iris[:,5];

Visualization of the source data on a three-dimensional graph:

In [ ]:
plotlyjs()
setosaIndex = findall(x->x=="setosa", iris_species)
versicolorIndex = findall(x->x=="versicolor", iris_species)
virginicaIndex = findall(x->x=="virginica", iris_species)
scatter(iris[setosaIndex, :SepalLength], iris[setosaIndex, :SepalWidth], iris[setosaIndex, :PetalLength], color="red", label="setosa")
scatter!(iris[versicolorIndex, :SepalLength], iris[versicolorIndex, :SepalWidth], iris[versicolorIndex, :PetalLength], color="blue", label="versicolor")
scatter!(iris[virginicaIndex, :SepalLength], iris[virginicaIndex, :SepalWidth], iris[virginicaIndex, :PetalLength], color="green", label="virginica")
Out[0]:

Plotting the same data on a two-dimensional graph for later comparison with the clustering results:

In [ ]:
gr()
p1 = scatter(iris[setosaIndex, :SepalLength], iris[setosaIndex, :SepalWidth], color="red", markersize = 5, label="setosa")
scatter!(p1, iris[versicolorIndex, :SepalLength], iris[versicolorIndex, :SepalWidth], color="blue", markersize = 5, label="versicolor")
scatter!(p1, iris[virginicaIndex, :SepalLength], iris[virginicaIndex, :SepalWidth], color="green", markersize = 5, label="virginica")
p1
Out[0]:

Preparing the data in the format expected by the clustering methods (features as rows, observations as columns):

In [ ]:
features = collect(Matrix(iris[:, 1:4])')
Out[0]:
4×150 Matrix{Float64}:
 5.1  4.9  4.7  4.6  5.0  5.4  4.6  5.0  …  6.8  6.7  6.7  6.3  6.5  6.2  5.9
 3.5  3.0  3.2  3.1  3.6  3.9  3.4  3.4     3.2  3.3  3.0  2.5  3.0  3.4  3.0
 1.4  1.4  1.3  1.5  1.4  1.7  1.4  1.5     5.9  5.7  5.2  5.0  5.2  5.4  5.1
 0.2  0.2  0.2  0.2  0.2  0.4  0.3  0.2     2.3  2.5  2.3  1.9  2.0  2.3  1.8

The K-means method

One of the most popular clustering methods is K-means. The main idea of the method is the iterative repetition of two steps:

  1. Distribution of sample objects by clusters;
  2. Recalculation of cluster centers.
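The two steps above can be sketched in plain Julia. This is a toy illustration of the idea, not the implementation used by Clustering.jl (which adds smarter seeding and convergence criteria):

```julia
# Toy K-means sketch: alternate assignment and center-recalculation steps.
using Statistics, Random

function toy_kmeans(X::Matrix{Float64}, k::Int; maxiter::Int = 100)
    n = size(X, 2)                                   # observations are columns, as in `features`
    centers = X[:, round.(Int, range(1, n, length = k))]  # naive deterministic initialization
    assignments = zeros(Int, n)
    for _ in 1:maxiter
        # Step 1: assign every observation to its nearest center
        new_assignments = [argmin([sum(abs2, X[:, i] .- centers[:, j]) for j in 1:k]) for i in 1:n]
        new_assignments == assignments && break      # converged: assignments stopped changing
        assignments = new_assignments
        # Step 2: recompute every center as the mean of its assigned points
        for j in 1:k
            members = findall(==(j), assignments)
            isempty(members) || (centers[:, j] = vec(mean(X[:, members], dims = 2)))
        end
    end
    return assignments, centers
end

# Two well-separated blobs of 2-D points (columns are observations)
Random.seed!(1)
X = hcat(randn(2, 20), randn(2, 20) .+ 10.0)
labels, centers = toy_kmeans(X, 2)
```

With the blobs this far apart, the toy version separates them exactly; on real data the result depends on initialization, which is why library implementations repeat the procedure from several starting points.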

Applying the clustering method to the source data:

In [ ]:
result_kmeans = kmeans(features, 3); # kmeans() takes the data set and the number of clusters as arguments

Visualization of the solution using a dot graph:

In [ ]:
plotlyjs()
p2 = scatter(iris.SepalLength, iris.SepalWidth, 
        marker_z = result_kmeans.assignments, 
        color =:red, legend = false, markersize = 5)
Out[0]:

The assignments field of the result object (result_kmeans) contains a vector with the cluster number assigned to each observation. The color of each point on the graph corresponds to its cluster number.
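Such an assignments vector can be summarized directly, for example to get the size of each cluster. A minimal sketch with a hypothetical label vector (not the actual result_kmeans output):

```julia
# Hypothetical assignments vector for six observations and three clusters
labels = [1, 1, 2, 3, 2, 1]

# Size of each cluster: how many observations were assigned to it
cluster_sizes = [count(==(j), labels) for j in 1:3]
# cluster_sizes == [3, 2, 1]
```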

Comparison of the results of the K-means algorithm with the initial data:

In [ ]:
plot(p1, p2)
Out[0]:

The graphs show a satisfactory result of the algorithm, the observations are divided into clusters that are close enough to the original data.

The fuzzy clustering method (C-means)

The fuzzy clustering method can be considered an improved K-means in which, instead of a hard assignment, the degree of membership in each cluster is computed for every element of the set.
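For fixed cluster centers, a point's membership degree in each cluster is derived from its distances to all the centers. A minimal 1-D sketch of the C-means membership formula with fuzziness exponent m = 2 (the values are made up, not taken from the iris data):

```julia
# Fuzzy membership degrees of one point relative to two fixed 1-D centers
centers = [0.0, 10.0]   # two cluster centers
m = 2.0                 # fuzziness exponent
x = 2.0                 # the observed point

d = [abs(x - c) for c in centers]   # distances to each center: [2.0, 8.0]
u = [1 / sum((d[j] / d[k])^(2 / (m - 1)) for k in eachindex(centers))
     for j in eachindex(centers)]
# The weights sum to 1, and the nearer center (0.0) receives the larger weight
```

Larger values of m make the weights more uniform (fuzzier); as m approaches 1, the weights approach the hard 0/1 assignment of K-means.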

Applying the clustering method to the source data:

In [ ]:
result_fuzzy = fuzzy_cmeans(features, 3, 2, maxiter=2000, display=:iter) # data, number of clusters, fuzziness exponent
  Iters      center-change
----------------------------
      1       7.771477e+00
      2       4.739898e-01
      3       1.200676e+00
      4       6.755399e-01
      5       3.645266e-01
      6       2.000300e-01
      7       8.829159e-02
      8       4.021639e-02
      9       2.069955e-02
     10       1.182799e-02
     11       7.198234e-03
     12       4.523367e-03
     13       2.885398e-03
     14       1.852891e-03
     15       1.193257e-03
     16       7.693420e-04
Fuzzy C-means converged with 16 iterations (δ = 0.0007693420314963082)
Out[0]:
FuzzyCMeansResult: 3 clusters for 150 points in 4 dimensions (converged in 16 iterations)

Extracting the centers of the three clusters, each described by the four features of the observations (one column per cluster):

In [ ]:
M = result_fuzzy.centers
Out[0]:
4×3 Matrix{Float64}:
 5.00396   5.88825  6.77419
 3.41412   2.76082  3.05214
 1.48276   4.36295  5.64575
 0.253522  1.3968   2.05315

Extracting the weights (degrees of membership in each cluster) of the observations:

In [ ]:
memberships = result_fuzzy.weights
Out[0]:
150×3 Matrix{Float64}:
 0.996624    0.00230446   0.00107189
 0.975835    0.016663     0.00750243
 0.979814    0.0137683    0.00641748
 0.967404    0.0224829    0.0101134
 0.99447     0.0037622    0.00176788
 0.934536    0.0448343    0.0206299
 0.97948     0.0140129    0.00650756
 0.999547    0.000311776  0.000141288
 0.930335    0.0477519    0.0219135
 0.982709    0.0119456    0.00534496
 0.968026    0.0217691    0.0102053
 0.99213     0.00543676   0.0024328
 0.97062     0.020198     0.00918222
 ⋮                        
 0.0217694   0.748803     0.229428
 0.00347922  0.0288342    0.967687
 0.00507985  0.0376451    0.957275
 0.0153794   0.129099     0.855522
 0.0293134   0.614578     0.356108
 0.00527622  0.0338633    0.96086
 0.00971574  0.0631961    0.927088
 0.0112318   0.105921     0.882848
 0.0257965   0.506297     0.467907
 0.0120658   0.155488     0.832446
 0.0215546   0.188514     0.789932
 0.0269334   0.580575     0.392492

Each weight is a number from 0 to 1 characterizing proximity to the center of a cluster, where 1 is maximum membership (the point coincides with the center) and 0 is maximum distance from the center.
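If a hard partition is needed, the fuzzy weights can be collapsed by picking the cluster with the highest membership for each observation. A sketch with made-up weights (each row is one observation's weights across three clusters):

```julia
# Converting fuzzy membership weights into hard cluster labels via argmax
w = [0.97 0.02 0.01;
     0.02 0.75 0.23;
     0.01 0.04 0.95]
hard_labels = [argmax(w[i, :]) for i in 1:size(w, 1)]
# hard_labels == [1, 2, 3]
```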

Visualization of the results of the fuzzy clustering method:

In [ ]:
p3 = scatter(iris.SepalLength, iris.SepalWidth, 
        marker_z = memberships[:,1], 
        legend = true, markersize = 5)

p4 = scatter(iris.SepalLength, iris.SepalWidth, 
        marker_z = memberships[:,3], 
        legend = false, markersize = 5)

p5 = scatter(iris.SepalLength, iris.SepalWidth, 
        marker_z = memberships[:,2], 
        legend = false, markersize = 5)

plot(p3, p4, p5, p1, legend = false)
Out[0]:

The first three graphs show the results of the fuzzy clustering method, where the color scale indicates the degree of membership in the corresponding cluster.

The last graph shows the original (ground-truth) data.

By hovering the mouse cursor first over the graph with the original data, and then over the graphs with the points obtained by the method, you can compare the results and visually evaluate the accuracy of the method.

Conclusion:

In this example, two of the most commonly used clustering methods were considered. The functions implementing these methods were called from the Clustering.jl library, which also provides many other, more advanced methods.

The results obtained are good but not ideal; more data would be needed to improve the quality of the clustering.