Clustering of data based on Gaussian mixtures
Introduction
This example demonstrates a hard clustering technique applied to synthetic data generated from a mixture of Gaussian distributions.
Gaussian mixture models are an effective tool for solving cluster analysis problems. The fundamental idea of the method is that the components of a multidimensional normal distribution obtained as a result of training the model can be interpreted as separate clusters in the data structure.
Each component of the mixture is characterized by its own parameters — a vector of mathematical expectations and a covariance matrix, which makes it possible to model clusters of various geometric shapes, sizes, and spatial orientations. This approach provides a statistically sound division of multidimensional data into homogeneous groups.
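The mixture density described above can be written down and evaluated directly. The sketch below uses the two component parameters generated later in this example; the equal weights are an illustrative assumption, not fitted values.

```julia
# Sketch of a two-component Gaussian mixture density
#   p(x) = w1*N(x; μ1, Σ1) + w2*N(x; μ2, Σ2)
# The weights w are illustrative assumptions.
using Distributions

μ1, Σ1 = [1.0, 2.0], [3.0 0.2; 0.2 2.0]
μ2, Σ2 = [-1.0, -2.0], [2.0 0.0; 0.0 1.0]
w = [0.5, 0.5]

components = [MvNormal(μ1, Σ1), MvNormal(μ2, Σ2)]
mixture_pdf(x) = sum(w[k] * pdf(components[k], x) for k in eachindex(w))

# Distributions.jl provides the same density as a single object:
mix = MixtureModel(components, w)
x = [0.0, 0.0]
println(mixture_pdf(x) ≈ pdf(mix, x))   # prints true: the two evaluations agree
```

Working with `MixtureModel` as a single object is convenient later, when we need to sample from the mixture.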
This technique is particularly effective for analyzing data of complex structure, where traditional clustering methods may exhibit limited accuracy due to the assumption of sphericity of clusters.
Initial data
Let's install and load the necessary packages.
import Pkg
Pkg.add(["Random", "Distributions", "LinearAlgebra", "Clustering", "GaussianMixtures", "Plots"])
using Random, Distributions, LinearAlgebra, Clustering, GaussianMixtures, Plots
Let's model data from a mixture of two two-dimensional Gaussian distributions using the function MvNormal.
Random.seed!(123)
μ1 = [1.0, 2.0]
σ1 = [3.0 0.2; 0.2 2.0]
μ2 = [-1.0, -2.0]
σ2 = [2.0 0.0; 0.0 1.0]
X1 = rand(MvNormal(μ1, σ1), 200)'
X2 = rand(MvNormal(μ2, σ2), 100)'
X = vcat(X1, X2)
points = size(X, 1)
Let's display a scatter chart of the generated data.
graph1 = scatter(X[:, 1], X[:, 2], markersize=3, markercolor=:black, marker=:circle,
                 legend=false, title="Set of points")  # , xlabel="X₁", ylabel="X₂"
display(graph1)
Training a model for the generated data
Let's train a two-component Gaussian mixture model for the generated data. In this demo, the number of components of the mixture is known in advance and corresponds to the number of initial distributions used in data generation.
gm = GMM(2, X; kind=:full, nIter=5000, nInit=100, method=:kmeans)
Let's define a function that displays the set of points together with the density contours.
function plot_mixture(X, gm)
    ncomp = size(gm.μ, 1)
    μs = gm.μ
    Σs = gm.Σ
    weights = gm.w
    # Probability density of the fitted mixture at the point (x, y)
    function gmPDF(x, y)
        point = [x, y]
        total = 0.0
        for k in 1:ncomp
            μ = μs[k, :]
            L = Σs[k]
            Σ = L * L'
            distribution = MvNormal(μ, Σ)
            total += weights[k] * pdf(distribution, point)
        end
        return total
    end
    x_grid = range(-5, 6, length=80)
    y_grid = range(-5, 6, length=80)
    plot(size=(800, 600))
    scatter!(X[:, 1], X[:, 2], label="", color=:black, markersize=3, marker=:circle, alpha=0.7)
    contour!(x_grid, y_grid, (x, y) -> gmPDF(x, y), levels=20, color=:jet,
             linewidth=3.5, colorbar=false, alpha=0.8)
    title!("A set of points and contours")
    # xlabel!("X₁")
    # ylabel!("X₂")
    return current()
end
We visualize the contours of the estimated probability density for a two-component mixed distribution.
graph2 = plot_mixture(X, gm)
display(graph2)
The observed separation of the regions of maximum probability density for each component indicates that the data can be meaningfully divided into two clusters. This visual evidence supports the use of a two-component Gaussian mixture model for clustering the data set under consideration.
Clustering data using a trained model
Let's create two functions, predict and clustering. These functions perform hard clustering, a method in which each data point uniquely belongs to exactly one cluster. For the Gaussian mixture model, clustering assigns each observation to one of the two components of the mixture. The center of each cluster corresponds to the expectation vector of the corresponding component, which provides a meaningful interpretation of the clustering results.
function predict(gm::GMM, X::Matrix{Float64})
    points = size(X, 1)
    ncomp = size(gm.μ, 1)
    labels = zeros(Int, points)
    for i in 1:points
        point = X[i, :]
        max_logprob = -Inf
        best = 1
        for k in 1:ncomp
            μ = gm.μ[k, :]
            L = gm.Σ[k]
            Σ = L * L'
            distribution = MvNormal(μ, Σ)
            log_density = logpdf(distribution, point)
            log_prob = log(gm.w[k]) + log_density
            if log_prob > max_logprob
                max_logprob = log_prob
                best = k
            end
        end
        labels[i] = best
    end
    return labels
end

function clustering(gm::GMM, X::Matrix{Float64})
    return predict(gm, X)
end
The function returns a vector of integer labels, where each value corresponds to the cluster number to which the corresponding observation belongs. This allows you to explicitly divide the initial sample into subsets corresponding to the selected components of the mixture.
idx = clustering(gm, X)
Let's display the clustering result.
function chart(X, idx; colors=[:red, :blue], markers=[:+, :o])  # , titles=["Cluster 1", "Cluster 2"]
    p = plot(size=(800, 600), legend=:best)
    clusters = unique(idx)
    for (i, cluster) in enumerate(clusters)
        mask = idx .== cluster
        x_cluster = X[mask, 1]
        y_cluster = X[mask, 2]
        marker_shape = markers[i] == :+ ? :circle : :dtriangle
        scatter!(x_cluster, y_cluster, markershape=marker_shape,
                 markercolor=colors[i], markersize=5, label="")  # , label=titles[i]
    end
    return p
end

graph3 = chart(X, idx)
title!(graph3, "Clustering")
# xlabel!(graph3, "X₁")
# ylabel!(graph3, "X₂")
display(graph3)
Each selected cluster corresponds to one of the two-dimensional normal components in the mixed distribution. The clustering algorithm assigns each point to the component of the mixture that corresponds to the maximum estimated probability, so this approach provides a statistically sound separation of data.
Estimation of cluster membership probabilities
Let's create a special function for the direct calculation of the matrix of estimated probabilities.
function score(gm, X)
    points = size(X, 1)
    ncomp = size(gm.μ, 1)
    P = zeros(points, ncomp)
    for i in 1:points
        logs = zeros(ncomp)
        for k in 1:ncomp
            μ = gm.μ[k, :]
            L = gm.Σ[k]
            Σ = L * L'
            distribution = MvNormal(μ, Σ)
            log_density = logpdf(distribution, X[i, :])
            logs[k] = log(gm.w[k]) + log_density
        end
        # Log-sum-exp trick for numerically stable normalization
        max_log = maximum(logs)
        shifted = logs .- max_log
        exp_probs = exp.(shifted)
        P[i, :] = exp_probs ./ sum(exp_probs)
    end
    return P
end
Let's calculate and display the probabilities of each observation belonging to the first component of the mixture.
P = score(gm, X)
cluster1 = idx .== 1
cluster2 = idx .== 2
graph4 = plot(size=(800, 600))
scatter!(X[cluster1, 1], X[cluster1, 2], marker_z=P[cluster1, 2], markersize=5,
         marker=:dtriangle, linewidth=1.5, clims=(0, 1), color=:jet, colorbar=true,
         colorbar_title="Estimated probability", label="")  # , label="Cluster 1"
scatter!(X[cluster2, 1], X[cluster2, 2], marker_z=P[cluster2, 2], markersize=5,
         marker=:circle, linewidth=1.5, clims=(0, 1), color=:jet,
         label="")  # , label="Cluster 2"
title!("Estimated probabilities")
# xlabel!("X₁")
# ylabel!("X₂")
display(graph4)
The matrix P is an array of dimensions points × k (observations by components) containing the probabilities that each observation belongs to each cluster; every row sums to one.
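As a self-contained illustration of this row-stochastic structure, the sketch below computes posterior membership probabilities with the same log-sum-exp normalization, but uses the generating parameters directly instead of the trained GMM (the weights 2/3 and 1/3 mirror the 200/100 sample sizes).

```julia
# Sketch: posterior membership probabilities for a two-component mixture,
# using the known generating parameters (a stand-in for the trained model).
using Distributions, Random

Random.seed!(1)
comps = [MvNormal([1.0, 2.0], [3.0 0.2; 0.2 2.0]),
         MvNormal([-1.0, -2.0], [2.0 0.0; 0.0 1.0])]
w = [2/3, 1/3]
X = rand(MixtureModel(comps, w), 10)'   # 10 × 2 matrix of points

P = zeros(size(X, 1), 2)
for i in axes(X, 1)
    logs = [log(w[k]) + logpdf(comps[k], X[i, :]) for k in 1:2]
    m = maximum(logs)             # log-sum-exp trick for numerical stability
    p = exp.(logs .- m)
    P[i, :] = p ./ sum(p)
end
# Every row of P is a probability distribution over the two clusters:
@assert all(abs.(sum(P, dims=2) .- 1) .< 1e-9)
```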
Assigning new data to clusters
The model trained on the initial sample is used to cluster additional observations. Each new point is compared with the component of the distribution that was identified at the stage of analyzing the source data.
Let's create a Gaussian mixture model with the function MixtureModel, using the previously defined parameters — the vectors of mathematical expectations and the covariance matrices of the components. This provides a higher level of abstraction, since it allows working with the distribution as a complete object that encapsulates all parameters of the mixture, including component weights and covariance matrices.
mix = MixtureModel([MvNormal(μ1, σ1), MvNormal(μ2, σ2)], [0.75, 0.25])
X0 = collect(rand(mix, 75)')
We will assign new data to the clusters and visualize the result.
idx0 = clustering(gm, X0)
graph5 = plot(size=(800, 600))
ncomp = size(gm.μ, 1)
μs = gm.μ
Σs = gm.Σ
weights = gm.w
function gmPDF(x, y)
    point = [x, y]
    total = 0.0
    for k in 1:ncomp
        μ = μs[k, :]
        L = Σs[k]
        Σ = L * L'
        distribution = MvNormal(μ, Σ)
        total += weights[k] * pdf(distribution, point)
    end
    return total
end
x_min, x_max = minimum(X0[:, 1]) - 1, maximum(X0[:, 1]) + 1
y_min, y_max = minimum(X0[:, 2]) - 1, maximum(X0[:, 2]) + 1
x_grid = range(x_min, x_max, length=80)
y_grid = range(y_min, y_max, length=80)
contour!(x_grid, y_grid, (x, y) -> gmPDF(x, y), levels=20, color=:jet,
         linewidth=1.5, alpha=0.7, colorbar=false)
for (i, cluster) in enumerate(unique(idx0))
    mask = idx0 .== cluster
    x_cluster = X0[mask, 1]
    y_cluster = X0[mask, 2]
    marker_shape = i == 1 ? :dtriangle : :circle
    marker_color = i == 1 ? :blue : :red
    scatter!(x_cluster, y_cluster, markershape=marker_shape, markercolor=marker_color,
             markersize=5, alpha=0.7, label="")  # , label="Cluster $i"
end
title!("Clustering of new data")
# xlabel!("X₁")
# ylabel!("X₂")
display(graph5)
When calculating the probabilities of new observations belonging to clusters, the estimated proportions of the mixture components determined at the model training stage are used. Thus, the distribution of new data across clusters should correspond to the proportions established for the original sample.
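A quick way to check this claim is to draw a large sample from the known mixture and compare the share of MAP assignments with the mixing weights. The sketch below uses the generating parameters directly as a stand-in for the trained model; since the components overlap slightly, the empirical share only approximately matches the weight 0.75.

```julia
# Sketch: the share of new points assigned to each cluster by the MAP rule
# tracks the mixing weights (0.75 / 0.25 here, as in this example's mix object).
using Distributions, Random, Statistics

Random.seed!(42)
comps = [MvNormal([1.0, 2.0], [3.0 0.2; 0.2 2.0]),
         MvNormal([-1.0, -2.0], [2.0 0.0; 0.0 1.0])]
w = [0.75, 0.25]
mix = MixtureModel(comps, w)
X0 = rand(mix, 5000)'   # 5000 × 2 matrix of new observations

# MAP assignment: the component maximizing log-weight + log-density
assign(x) = argmax([log(w[k]) + logpdf(comps[k], x) for k in 1:2])
labels = [assign(X0[i, :]) for i in axes(X0, 1)]

share1 = mean(labels .== 1)
# share1 is close to, though slightly below, w[1] = 0.75 due to component overlap
```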
Conclusion
This paper demonstrates a methodology for cluster analysis of multidimensional data based on Gaussian mixture models using a hard clustering algorithm. Using the example of synthetic data generated from a mixture of two-dimensional normal distributions, the full cycle of analysis is shown: from modeling the initial sample and training the parameters of the mixture to classifying both training and new observations.
The main conclusions:
- Gaussian mixture models are an effective and statistically sound clustering tool: by approximating the empirical distribution with a set of normal components, they can describe clusters of various shapes, sizes, and orientations.
- Hard clustering, implemented by assigning each observation to the component with the highest estimated probability, provides an unambiguous and interpretable division of the data into groups whose centers correspond to the mathematical expectations of the corresponding mixture components.
- The calculation of membership probabilities gives the researcher not only a deterministic clustering but also a quantitative measure of classification uncertainty, opening the way to soft clustering methods in tasks that require a probabilistic approach.
- An important advantage of the considered approach is its generalizing ability: the trained model can be used to cluster new data, provided the statistical homogeneity of the population is maintained, which is supported by reusing the estimated mixing weights obtained at the training stage.
Practical significance:
The presented methodology can be recommended for use in a wide range of applied tasks, including segmentation of client databases, classification of biomedical data, analysis of multidimensional time series and other areas where statistically justified allocation of homogeneous groups of objects in a complex data structure is required.