Engee documentation
Notebook

Gaussian mixture model

Introduction

Modeling probability distributions of complex, multimodal shapes is a fundamental task in statistical analysis and machine learning. One of the most powerful and mathematically elegant tools for solving this problem is the Gaussian mixture model.

A Gaussian mixture is a probabilistic model that approximates a complex data distribution as a weighted sum of a finite number of Gaussian components. Each component of the mixture is defined by its own mean vector, covariance matrix, and weight, which determines its contribution to the overall distribution. The key idea of the Gaussian mixture model is that each subgroup in the data can be described by its own Gaussian, and their superposition forms a flexible model capable of describing asymmetry, multimodality, and complex correlation structures.
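Formally, the density of a K-component mixture can be written as follows (standard notation; not taken verbatim from this notebook):

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \,
\mathcal{N}\!\left(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right),
\qquad \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1,
```

where $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ are the mean vector and covariance matrix of the $k$-th component, and $\pi_k$ is its weight.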

Initial data

We will install and load the necessary libraries.

In [ ]:
import Pkg
Pkg.add("Distributions")
Pkg.add("Plots")
Pkg.add("PlotlyBase")
using Distributions
using Plots
plotly()

Let's define the mean vectors, covariance matrices, and mixing proportions for a two-component mixture of two-dimensional Gaussian distributions.

In [ ]:
mu = [[1.0, 2.0], [-3.0, -5.0]]                  # mean vectors
sigma = [[2.0 0.0; 0.0 0.5], [1.0 0.0; 0.0 1.0]] # covariance matrices
p = [0.5, 0.5]                                   # mixing proportions

The elements of the vector mu are the mean vectors of the components, and the elements of sigma are the corresponding covariance matrices.

Creating a model

Let's create a Gaussian mixture model using the MixtureModel function.

In [ ]:
gm = MixtureModel([MvNormal(μ, Σ) for (μ, Σ) in zip(mu, sigma)], p)
Out[0]:
MixtureModel{FullNormal}(K = 2)
components[1] (prior = 0.5000): FullNormal(
dim: 2
μ: [1.0, 2.0]
Σ: [2.0 0.0; 0.0 0.5]
)

components[2] (prior = 0.5000): FullNormal(
dim: 2
μ: [-3.0, -5.0]
Σ: [1.0 0.0; 0.0 1.0]
)

Let's display the properties of the created model.

In [ ]:
println("Number of components: ", length(gm.components))
println("Dimension: ", length(gm.components[1]))
println("Weights: ", gm.prior.p)
println("Component type: ", typeof(gm.components[1]))
for (i, k) in enumerate(gm.components)
    println("\nComponent $i:")
    println("    Mean: ", mean(k))
    println("    Covariance: ", cov(k))
end
Number of components: 2
Dimension: 2
Weights: [0.5, 0.5]
Component type: FullNormal

Component 1:
    Mean: [1.0, 2.0]
    Covariance: [2.0 0.0; 0.0 0.5]

Component 2:
    Mean: [-3.0, -5.0]
    Covariance: [1.0 0.0; 0.0 1.0]

Visualization

Let's calculate the distribution density function and visualize its three-dimensional shape.

In [ ]:
gmPDF(x, y) = pdf(gm, [x, y]) # mixture density at the point (x, y)
xs = range(-10, 10, length=100)
ys = range(-10, 10, length=100)
Z = [gmPDF(x, y) for x in xs, y in ys]
p1 = surface(xs, ys, Z, title="Distribution density", xlabel="X", ylabel="Y", zlabel="", camera=(45, 30), colorbar=false, color=:viridis, grid=:on)
display(p1)

Using the Monte Carlo method, we estimate the distribution function.

In [ ]:
function gmCDF(x, y)
    n = 3 * 10^4              # number of Monte Carlo samples
    samples = rand(gm, n)     # 2×n matrix of draws from the mixture
    count = sum((samples[1, :] .≤ x) .& (samples[2, :] .≤ y))
    return count / n
end

function gmCDF_vec(xs, ys)
    n = length(xs)
    Z_cdf = zeros(n, n)
    for i in 1:n
        for j in 1:n
            Z_cdf[i, j] = gmCDF(xs[i], ys[j])
        end
    end
    return Z_cdf
end

Z_cdf = gmCDF_vec(xs, ys)
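Since both covariance matrices in this example are diagonal, the coordinates within each component are independent, so the mixture CDF also has a closed form that the Monte Carlo estimate can be checked against. A minimal sketch (the name gmCDF_exact is illustrative and not used elsewhere in this notebook):

```julia
using Distributions

# Same parameters as defined above
mu = [[1.0, 2.0], [-3.0, -5.0]]
sigma = [[2.0 0.0; 0.0 0.5], [1.0 0.0; 0.0 1.0]]
p = [0.5, 0.5]

# For diagonal covariances: F(x, y) = Σ_k p_k · Φ((x-μ_k1)/σ_k1) · Φ((y-μ_k2)/σ_k2)
gmCDF_exact(x, y) = sum(p[k] *
        cdf(Normal(mu[k][1], sqrt(sigma[k][1, 1])), x) *
        cdf(Normal(mu[k][2], sqrt(sigma[k][2, 2])), y) for k in 1:2)
```

Comparing gmCDF_exact with the Monte Carlo gmCDF on a few points gives a quick sanity check of the sampling-based estimate.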

And we visualize its three-dimensional surface.

In [ ]:
p2 = surface(xs, ys, Z_cdf, title="Distribution function", xlabel="X", ylabel="Y", zlabel="", camera=(45, 30), colorbar=false, color=:viridis, legend=false)
display(p2)

Conclusion

Explicit specification of a Gaussian mixture demonstrates the principle of constructing complex probabilistic models from simple components. Direct parameter specification makes it possible to design hypothetical distributions for planning experiments and testing algorithms.

Application areas:

  • Synthesizing test data for evaluating clustering and classification algorithms

  • Specifying prior distributions in Bayesian analysis

  • Modeling multimodal distributions in finance and signal processing

  • Initializing the EM algorithm for fitting models to real data
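As a sketch of the first use case, a mixture defined this way can generate labeled synthetic data for testing clustering algorithms. This is an illustrative snippet, not part of the notebook above; it draws a component index for each point and then samples from that component so the true labels are known:

```julia
using Distributions, Random

Random.seed!(1)

# Same two-component mixture as defined above
gm = MixtureModel(
    [MvNormal([1.0, 2.0], [2.0 0.0; 0.0 0.5]),
     MvNormal([-3.0, -5.0], [1.0 0.0; 0.0 1.0])],
    [0.5, 0.5])

n = 500
labels = rand(Categorical([0.5, 0.5]), n)                   # true component of each point
data = hcat([rand(components(gm)[k]) for k in labels]...)   # 2×n matrix of samples
```

A clustering algorithm run on data can then be scored against labels, which is exactly the kind of controlled benchmark that explicit parameter specification enables.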

This approach forms the basis for the transition to fitting the model to real data and selecting the number of components using information criteria.