Engee documentation
Notebook

Clustering

Clustering (or cluster analysis) is the task of partitioning a set of objects into groups called clusters. Objects within a group should be "similar" to each other, while objects from different groups should be as different as possible. The main difference between clustering and classification is that the list of groups is not specified in advance and is determined by the algorithm itself.

This example shows how to perform clustering using the K-means method and the fuzzy C-means method.

Installing and loading the required libraries:

In [ ]:
Pkg.add(["RDatasets", "Clustering"])
In [ ]:
using Plots
using Clustering 

Loading the dataset for cluster analysis:

In [ ]:
using RDatasets
iris = dataset("datasets", "iris");

Extracting the species labels of the observations into a separate variable:

In [ ]:
iris_species = iris[:,5];

Visualising the raw data on a three-dimensional graph:

In [ ]:
plotlyjs()
setosaIndex = findall(x->x=="setosa", iris_species)
versicolorIndex = findall(x->x=="versicolor", iris_species)
virginicaIndex = findall(x->x=="virginica", iris_species)
scatter(iris[setosaIndex, :SepalLength], iris[setosaIndex, :SepalWidth], iris[setosaIndex, :PetalLength], color="red", label="setosa")
scatter!(iris[versicolorIndex, :SepalLength], iris[versicolorIndex, :SepalWidth], iris[versicolorIndex, :PetalLength], color="blue", label="versicolor")
scatter!(iris[virginicaIndex, :SepalLength], iris[virginicaIndex, :SepalWidth], iris[virginicaIndex, :PetalLength], color="green", label="virginica")
Out[0]:

Building a two-dimensional plot for later comparison of the clustering results with the original data:

In [ ]:
gr()
p1 = scatter(iris[setosaIndex, :SepalLength], iris[setosaIndex, :SepalWidth], color="red", markersize = 5, label="setosa")
scatter!(iris[versicolorIndex, :SepalLength], iris[versicolorIndex, :SepalWidth], color="blue", markersize = 5, label="versicolor")
scatter!(iris[virginicaIndex, :SepalLength], iris[virginicaIndex, :SepalWidth], color="green", markersize = 5, label="virginica")
Out[0]:

Converting the data into the format expected by the clustering functions (a matrix with features in rows and observations in columns):

In [ ]:
features = collect(Matrix(iris[:, 1:4])')
Out[0]:
4×150 Matrix{Float64}:
 5.1  4.9  4.7  4.6  5.0  5.4  4.6  5.0  …  6.8  6.7  6.7  6.3  6.5  6.2  5.9
 3.5  3.0  3.2  3.1  3.6  3.9  3.4  3.4     3.2  3.3  3.0  2.5  3.0  3.4  3.0
 1.4  1.4  1.3  1.5  1.4  1.7  1.4  1.5     5.9  5.7  5.2  5.0  5.2  5.4  5.1
 0.2  0.2  0.2  0.2  0.2  0.4  0.3  0.2     2.3  2.5  2.3  1.9  2.0  2.3  1.8

K-means method

One of the most popular clustering methods is K-means. The basic idea of the method is the iterative repetition of two steps:

  1. Distribution of sample objects into clusters;
  2. Recalculation of cluster centres.
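The two steps above can be sketched directly in Julia. This is a minimal illustration on synthetic data with simple random initialisation, not the implementation used by Clustering.jl:

```julia
using Statistics, LinearAlgebra, Random

# Minimal K-means sketch: `X` holds one observation per column.
function simple_kmeans(X, k; iters = 100, rng = MersenneTwister(1))
    # Initialise the centres with k randomly chosen observations
    centers = X[:, randperm(rng, size(X, 2))[1:k]]
    assignments = zeros(Int, size(X, 2))
    for _ in 1:iters
        # Step 1: assign each object to the nearest centre
        for j in axes(X, 2)
            assignments[j] = argmin([norm(X[:, j] - centers[:, c]) for c in 1:k])
        end
        # Step 2: recompute each centre as the mean of its cluster
        for c in 1:k
            members = X[:, assignments .== c]
            isempty(members) || (centers[:, c] = vec(mean(members, dims = 2)))
        end
    end
    return centers, assignments
end

X = hcat(randn(2, 50) .- 3, randn(2, 50) .+ 3)  # two well-separated blobs
centers, assignments = simple_kmeans(X, 2)
```

On well-separated data such as this, the loop quickly settles into one centre per blob; the `kmeans()` call below does the same job with a more careful initialisation and a convergence criterion.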

Application of the clustering method to the original data:

In [ ]:
result_kmeans = kmeans(features, 3); # kmeans() takes the data set and the number of clusters as arguments

Visualising the solution using a scatter plot:

In [ ]:
plotlyjs()
p2 = scatter(iris.SepalLength, iris.SepalWidth, 
        marker_z = result_kmeans.assignments, 
        color =:red, legend = false, markersize = 5)
Out[0]:

The assignments field of the results variable (result_kmeans) contains the cluster number assigned to each observation. The colour of each point in the scatter plot corresponds to its cluster number.
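For example, the cluster sizes can be read off such an assignment vector by counting the occurrences of each label (a hypothetical six-observation vector, shown for illustration):

```julia
# Hypothetical assignment vector for six observations and three clusters
assignments = [1, 1, 2, 3, 3, 3]

# Number of observations assigned to each cluster
sizes = [count(==(c), assignments) for c in 1:3]
# sizes == [2, 1, 3]
```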

Comparison of the results of the K-means algorithm with the original data:

In [ ]:
plot(p1, p2)
Out[0]:

The graphs show a satisfactory result: the algorithm groups the observations into clusters that are quite close to the original data.

Fuzzy C-means method

The fuzzy clustering method can be seen as a refinement of the K-means method: for each element of the data set, the degree of its membership in each of the clusters is calculated.
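The membership degrees are computed from the distances to the cluster centres. A minimal sketch of the standard C-means membership formula (the assumed textbook form, not code taken from Clustering.jl internals):

```julia
using LinearAlgebra

# Membership of a point x in each cluster, given the centres (one per
# column) and the fuzziness exponent m:
# u[c] = 1 / sum_k (d(x, centre_c) / d(x, centre_k))^(2/(m-1))
function memberships(x, centers, m)
    d = [norm(x - centers[:, c]) for c in axes(centers, 2)]
    return [1 / sum((d[c] / d[k])^(2 / (m - 1)) for k in eachindex(d))
            for c in eachindex(d)]
end

centers = [0.0 4.0;
           0.0 0.0]                      # two centres in 2-D
u = memberships([1.0, 0.0], centers, 2)  # point closer to the first centre
# u == [0.9, 0.1]: the weights sum to 1, and the nearer centre dominates
```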

Application of the clustering method to the original data:

In [ ]:
result_fuzzy = fuzzy_cmeans(features, 3, 2, maxiter=2000, display=:iter) # arguments: data, number of clusters, fuzziness exponent
  Iters      center-change
----------------------------
      1       7.771477e+00
      2       4.739898e-01
      3       1.200676e+00
      4       6.755399e-01
      5       3.645266e-01
      6       2.000300e-01
      7       8.829159e-02
      8       4.021639e-02
      9       2.069955e-02
     10       1.182799e-02
     11       7.198234e-03
     12       4.523367e-03
     13       2.885398e-03
     14       1.852891e-03
     15       1.193257e-03
     16       7.693420e-04
Fuzzy C-means converged with 16 iterations (δ = 0.0007693420314963082)
Out[0]:
FuzzyCMeansResult: 3 clusters for 150 points in 4 dimensions (converged in 16 iterations)

Retrieving the centres of the three clusters (one column per cluster, one row per attribute):

In [ ]:
M = result_fuzzy.centers
Out[0]:
4×3 Matrix{Float64}:
 5.00396   5.88825  6.77419
 3.41412   2.76082  3.05214
 1.48276   4.36295  5.64575
 0.253522  1.3968   2.05315

Retrieving the membership weights (degrees of belonging to each cluster) of the observations:

In [ ]:
memberships = result_fuzzy.weights
Out[0]:
150×3 Matrix{Float64}:
 0.996624    0.00230446   0.00107189
 0.975835    0.016663     0.00750243
 0.979814    0.0137683    0.00641748
 0.967404    0.0224829    0.0101134
 0.99447     0.0037622    0.00176788
 0.934536    0.0448343    0.0206299
 0.97948     0.0140129    0.00650756
 0.999547    0.000311776  0.000141288
 0.930335    0.0477519    0.0219135
 0.982709    0.0119456    0.00534496
 0.968026    0.0217691    0.0102053
 0.99213     0.00543676   0.0024328
 0.97062     0.020198     0.00918222
 ⋮                        
 0.0217694   0.748803     0.229428
 0.00347922  0.0288342    0.967687
 0.00507985  0.0376451    0.957275
 0.0153794   0.129099     0.855522
 0.0293134   0.614578     0.356108
 0.00527622  0.0338633    0.96086
 0.00971574  0.0631961    0.927088
 0.0112318   0.105921     0.882848
 0.0257965   0.506297     0.467907
 0.0120658   0.155488     0.832446
 0.0215546   0.188514     0.789932
 0.0269334   0.580575     0.392492

Each number from 0 to 1 characterises proximity to the cluster centre, where 1 is maximum membership (the point lies at the centre) and 0 is maximum distance from it.
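If hard labels are needed, the soft memberships can be collapsed by assigning each observation to the cluster with the largest weight (a small illustrative matrix, one row per observation):

```julia
# Hypothetical membership matrix: three observations, three clusters
memberships = [0.90 0.07 0.03;
               0.10 0.60 0.30;
               0.05 0.15 0.80]

# Hard label = index of the largest weight in each row
hard = [argmax(memberships[i, :]) for i in axes(memberships, 1)]
# hard == [1, 2, 3]
```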

Visualising the results of the fuzzy clustering method:

In [ ]:
p3 = scatter(iris.SepalLength, iris.SepalWidth, 
        marker_z = memberships[:,1], 
        legend = true, markersize = 5)

p4 = scatter(iris.SepalLength, iris.SepalWidth, 
        marker_z = memberships[:,3], 
        legend = false, markersize = 5)

p5 = scatter(iris.SepalLength, iris.SepalWidth, 
        marker_z = memberships[:,2], 
        legend = false, markersize = 5)

plot(p3, p4, p5, p1, legend = false)
Out[0]:

The first three graphs show the results of the fuzzy clustering method, where the colour scale shows the degree of membership in a particular cluster.

The last graph shows the original (ground-truth) data.

By hovering the mouse cursor first over the graph with the original data and then over the graphs with the points produced by the method, you can compare the results and visually assess the accuracy of the method.

Conclusion:

In this example, two of the most commonly used clustering methods were examined. The functions that were used to implement these methods were called from the Clustering.jl library, which also provides many other, more advanced methods.

The results obtained from their application are good but not perfect; more data would be needed to improve the quality of the clustering.