Engee documentation
Notebook

Dimensionality reduction in machine learning

Introduction

Dimensionality reduction is a family of methods that transform a high-dimensional feature space into a space of much lower dimension while preserving the meaningful structure of the source data. It is widely used for visualizing high-dimensional datasets, making it possible to spot groups of similar objects and uncover hidden patterns.

These methods matter for two reasons. First, they make it possible to represent high-dimensional data graphically in two or three dimensions, which greatly simplifies interpretation of the results. Second, they help reveal cluster structure in a set of observations that may go unnoticed when the original features are examined individually.

In this example, we will look at three different algorithms:

  • PCA (Principal Component Analysis) is a method for linear projection of the data onto its principal components;

  • t-SNE (t-distributed Stochastic Neighbor Embedding) is an algorithm for nonlinear dimensionality reduction while preserving local relationships;

  • UMAP (Uniform Manifold Approximation and Projection) is a nonlinear projection method based on manifold theory.
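Before turning to the libraries, it helps to see the general idea in miniature. A linear projection such as PCA can be sketched in a few lines of Julia via the SVD of the centered data; this is a toy illustration on synthetic data, not the implementation used below.

```julia
using LinearAlgebra, Statistics

# Toy data: 5 observations × 3 features
X = [1.0  2.0  3.0;
     2.0  4.1  6.2;
     3.0  5.9  9.1;
     4.0  8.2 12.0;
     5.0  9.9 15.1]

Xc = X .- mean(X, dims=1)   # center each feature
U, S, V = svd(Xc)           # columns of V are the principal directions
Y = Xc * V[:, 1:2]          # project onto the first two components: 5×2
```

The first column of Y captures the most variance, the second column the next most; dropping the remaining directions is exactly the "reduction".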

Initial data

Let's install and load the necessary packages.

In [ ]:
# EngeePkg.purge()
import Pkg
Pkg.add(["UMAP", "Plots", "Makie", "XLSX", "VegaDatasets", "DataFrames", "MultivariateStats", "RDatasets", "StatsBase", "Statistics", "LinearAlgebra", "ScikitLearn", "MLBase", "Distances", "TSne", "PyCall"])
using UMAP, Plots, Makie, XLSX, VegaDatasets, DataFrames, MultivariateStats, RDatasets, StatsBase, Statistics, LinearAlgebra, ScikitLearn, MLBase, Distances, TSne, PyCall

For the analysis, we will use the cars dataset from the VegaDatasets package, which contains the parameters of 406 car models.

In [ ]:
C = DataFrame(VegaDatasets.dataset("cars"))
Out[0]:
406×9 DataFrame (381 rows omitted)
 Row │ Name                               Miles_per_Gallon  Cylinders  Displacement  Horsepower  Weight_in_lbs  Acceleration  Year        Origin
     │ String                             Float64?          Int64      Float64       Int64?      Int64          Float64       String      String
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ chevrolet chevelle malibu          18.0              8          307.0         130         3504           12.0          1970-01-01  USA
   2 │ buick skylark 320                  15.0              8          350.0         165         3693           11.5          1970-01-01  USA
   3 │ plymouth satellite                 18.0              8          318.0         150         3436           11.0          1970-01-01  USA
   4 │ amc rebel sst                      16.0              8          304.0         150         3433           12.0          1970-01-01  USA
   5 │ ford torino                        17.0              8          302.0         140         3449           10.5          1970-01-01  USA
   6 │ ford galaxie 500                   15.0              8          429.0         198         4341           10.0          1970-01-01  USA
   7 │ chevrolet impala                   14.0              8          454.0         220         4354            9.0          1970-01-01  USA
   8 │ plymouth fury iii                  14.0              8          440.0         215         4312            8.5          1970-01-01  USA
   9 │ pontiac catalina                   14.0              8          455.0         225         4425           10.0          1970-01-01  USA
  10 │ amc ambassador dpl                 15.0              8          390.0         190         3850            8.5          1970-01-01  USA
  11 │ citroen ds-21 pallas               missing           4          133.0         115         3090           17.5          1970-01-01  Europe
  12 │ chevrolet chevelle concours (sw)   missing           8          350.0         165         4142           11.5          1970-01-01  USA
  13 │ ford torino (sw)                   missing           8          351.0         153         4034           11.0          1970-01-01  USA
   ⋮ │                 ⋮
 395 │ buick century limited              25.0              6          181.0         110         2945           16.4          1982-01-01  USA
 396 │ oldsmobile cutlass ciera (diesel)  38.0              6          262.0          85         3015           17.0          1982-01-01  USA
 397 │ chrysler lebaron medallion         26.0              4          156.0          92         2585           14.5          1982-01-01  USA
 398 │ ford granada l                     22.0              6          232.0         112         2835           14.7          1982-01-01  USA
 399 │ toyota celica gt                   32.0              4          144.0          96         2665           13.9          1982-01-01  Japan
 400 │ dodge charger 2.2                  36.0              4          135.0          84         2370           13.0          1982-01-01  USA
 401 │ chevrolet camaro                   27.0              4          151.0          90         2950           17.3          1982-01-01  USA
 402 │ ford mustang gl                    27.0              4          140.0          86         2790           15.6          1982-01-01  USA
 403 │ vw pickup                          44.0              4           97.0          52         2130           24.6          1982-01-01  Europe
 404 │ dodge rampage                      32.0              4          135.0          84         2295           11.6          1982-01-01  USA
 405 │ ford ranger                        28.0              4          120.0          79         2625           18.6          1982-01-01  USA
 406 │ chevy s-10                         31.0              4          119.0          82         2720           19.4          1982-01-01  USA

We will drop the rows with missing data, collect the numeric columns into a matrix, and display the column names of the dataset.

In [ ]:
dropmissing!(C)           # drop rows containing missing values
M = Matrix(C[:, 2:7])     # numeric feature columns as a matrix
names(C)
Out[0]:
9-element Vector{String}:
 "Name"
 "Miles_per_Gallon"
 "Cylinders"
 "Displacement"
 "Horsepower"
 "Weight_in_lbs"
 "Acceleration"
 "Year"
 "Origin"

Principal Component Analysis (PCA)

At the first stage, we encode the country of origin as integer labels (for coloring the plots later) and standardize the data: each feature is centered and scaled to unit variance.

In [ ]:
car_origin = C[:,:Origin]
carmap = labelmap(car_origin)                  # map each origin label to an integer
uniqueids = labelencode(carmap, car_origin)    # integer class ids for plotting
data = M
data = (data .- mean(data, dims=1)) ./ std(data, dims=1)   # standardize features
Out[0]:
392×6 Matrix{Float64}:
 -0.697747   1.48205    1.07591    0.663285   0.619748   -1.28362
 -1.08212    1.48205    1.48683    1.57258    0.842258   -1.46485
 -0.697747   1.48205    1.18103    1.18288    0.539692   -1.64609
 -0.953992   1.48205    1.04725    1.18288    0.53616    -1.28362
 -0.82587    1.48205    1.02813    0.923085   0.554997   -1.82732
 -1.08212    1.48205    2.24177    2.42992    1.60515    -2.00855
 -1.21024    1.48205    2.48068    3.00148    1.62045    -2.37102
 -1.21024    1.48205    2.34689    2.87158    1.57101    -2.55226
 -1.21024    1.48205    2.49023    3.13138    1.70404    -2.00855
 -1.08212    1.48205    1.86908    2.22208    1.02709    -2.55226
 -1.08212    1.48205    1.80219    1.70248    0.689209   -2.00855
 -1.21024    1.48205    1.39127    1.44268    0.743365   -2.73349
 -1.08212    1.48205    1.96464    1.18288    0.922314   -2.18979
  ⋮                                                       ⋮
  0.199113   0.309571  -0.128168   0.143685  -0.0383613   0.311242
  1.86471    0.309571   0.645885  -0.505815   0.0440496   0.528722
  0.327236  -0.862911  -0.367073  -0.323955  -0.462189   -0.377448
 -0.185255   0.309571   0.359199   0.195645  -0.167864   -0.304954
  1.09597   -0.862911  -0.481748  -0.220035  -0.368005   -0.594928
  1.60847   -0.862911  -0.567753  -0.531795  -0.715308   -0.92115
  0.455359  -0.862911  -0.414854  -0.375915  -0.0324748   0.637463
  0.455359  -0.862911  -0.519972  -0.479835  -0.220842    0.0212673
  2.63345   -0.862911  -0.930889  -1.36315   -0.997859    3.28348
  1.09597   -0.862911  -0.567753  -0.531795  -0.803605   -1.4286
  0.583482  -0.862911  -0.711097  -0.661694  -0.415097    1.10867
  0.967851  -0.862911  -0.720653  -0.583754  -0.303253    1.39865
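A quick sanity check on what this standardization does: after the transform, every column has mean ≈ 0 and standard deviation ≈ 1. A minimal sketch on toy data, using only the standard library:

```julia
using Statistics

# Toy matrix: 4 observations × 2 features (second column is 10× the first)
X = [1.0 10.0; 2.0 20.0; 3.0 30.0; 4.0 40.0]
Z = (X .- mean(X, dims=1)) ./ std(X, dims=1)   # same transform as above

# Every column of Z now has mean ≈ 0 and std ≈ 1
mean(Z, dims=1), std(Z, dims=1)
```

Standardization matters here because the features are on very different scales (weight in pounds vs. acceleration in seconds); without it, the largest-scale feature would dominate the principal components.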

The PCA implementation in MultivariateStats expects each column to be a separate observation, so we need to transpose the original matrix.

In [ ]:
data'
Out[0]:
6×392 adjoint(::Matrix{Float64}) with eltype Float64:
 -0.697747  -1.08212   -0.697747  …   1.09597    0.583482   0.967851
  1.48205    1.48205    1.48205      -0.862911  -0.862911  -0.862911
  1.07591    1.48683    1.18103      -0.567753  -0.711097  -0.720653
  0.663285   1.57258    1.18288      -0.531795  -0.661694  -0.583754
  0.619748   0.842258   0.539692     -0.803605  -0.415097  -0.303253
 -1.28362   -1.46485   -1.64609   …  -1.4286     1.10867    1.39865

Let's fit a principal component model. The maxoutdim parameter sets the output dimension of the data; for two-dimensional visualization we set it to 2.

In [ ]:
p = fit(PCA, data', maxoutdim=2)
Out[0]:
PCA(indim = 6, outdim = 2, principalratio = 0.9194828785333574)

Pattern matrix (unstandardized loadings):
───────────────────────
         PC1        PC2
───────────────────────
1  -0.873037  -0.20899
2   0.942277   0.126601
3   0.97054    0.092613
4   0.94995   -0.141833
5   0.941156   0.244211
6  -0.638795   0.761967
───────────────────────

Importance of components:
─────────────────────────────────────────────
                                PC1       PC2
─────────────────────────────────────────────
SS Loadings (Eigenvalues)  4.78827   0.728631
Variance explained         0.798044  0.121439
Cumulative variance        0.798044  0.919483
Proportion explained       0.867927  0.132073
Cumulative proportion      0.867927  1.0
─────────────────────────────────────────────
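The "Variance explained" and "Cumulative variance" rows above come from the eigenvalues of the data covariance matrix. A from-scratch sketch of that computation on synthetic data (the same idea, not the MultivariateStats internals):

```julia
using LinearAlgebra, Statistics

# Synthetic data: 200 observations × 4 features, two of them correlated
x = randn(200)
X = hcat(x, 0.8 .* x .+ 0.2 .* randn(200), randn(200), 0.5 .* randn(200))

λ = sort(eigvals(Symmetric(cov(X))), rev=true)   # variances along principal axes
explained  = λ ./ sum(λ)                         # per-component variance explained
cumulative = cumsum(explained)                   # analogue of "Cumulative variance"
```

In the fitted model above, the first two components together explain about 92% of the total variance (principalratio ≈ 0.919), which is why a 2D projection is a reasonable summary of these six features.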

The projection matrix can be obtained with the projection function.

In [ ]:
P = projection(p)
Out[0]:
6×2 Matrix{Float64}:
  0.398973  -0.244835
 -0.430615   0.148314
 -0.443531   0.108497
 -0.434122  -0.166158
 -0.430103   0.286095
  0.291926   0.892652

With the projection matrix in hand, we can apply it to an individual car as follows:

In [ ]:
P'*(data[1,:]-mean(p))
Out[0]:
2-element Vector{Float64}:
 -2.323001696522692
 -0.571351964264469

The entire dataset can also be converted using the transform function.

In [ ]:
Yte = MultivariateStats.transform(p, data')
Out[0]:
2×392 Matrix{Float64}:
 -2.323     -3.20196  -2.66658   -2.60214   …   1.22011  1.70921   1.86951
 -0.571352  -0.68187  -0.992744  -0.621975     -1.87471  0.632857  0.815607

We can also perform an inverse transformation from the two-dimensional space back to the original six-dimensional space using the reconstruct function. This time, however, the recovery is only approximate.

In [ ]:
Xr = reconstruct(p, Yte)
Out[0]:
6×392 Matrix{Float64}:
 -0.786928  -1.11055  -0.820834  …   0.945785   0.526984   0.546196
  0.91558    1.27768   1.00103      -0.803445  -0.64215   -0.684075
  0.968334   1.34619   1.075        -0.744559  -0.689425  -0.740696
  1.1034     1.50334   1.32257      -0.218179  -0.847159  -0.947116
  0.835669   1.18209   0.862883     -1.06112   -0.554079  -0.570742
 -1.18816   -1.54341  -1.66462   …  -1.31728    1.06388    1.27381

Let's estimate the reconstruction error after the inverse transformation.

In [ ]:
norm(Xr-data')
Out[0]:
13.743841055569009
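This error has a clean interpretation: for a PCA projection onto k components, the Frobenius norm of the residual equals the square root of the sum of the squared discarded singular values of the centered data. A small self-contained check on synthetic data, standard library only:

```julia
using LinearAlgebra, Statistics

X  = randn(50, 6)                 # 50 observations × 6 features
Xc = X .- mean(X, dims=1)         # center the columns
U, S, V = svd(Xc)

k   = 2
Xk  = Xc * V[:, 1:k] * V[:, 1:k]' # rank-k PCA reconstruction
err = norm(Xc - Xk)               # Frobenius norm of the residual
# err equals sqrt(sum(S[k+1:end] .^ 2)) up to rounding
```

In other words, the 13.74 above is exactly the part of the (standardized) data that lives in the four discarded directions.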

Let's visualize the results with a scatter plot:

In [ ]:
p1 = Plots.scatter(Yte[1,car_origin.=="USA"],Yte[2,car_origin.=="USA"],color=1,label="USA")
Plots.xlabel!(p1,"First component")
Plots.ylabel!(p1,"Second component")
Plots.scatter!(p1,Yte[1,car_origin.=="Japan"],Yte[2,car_origin.=="Japan"],color=2,label="Japan")
Plots.scatter!(p1,Yte[1,car_origin.=="Europe"],Yte[2,car_origin.=="Europe"],color=3,label="Europe")
display(p1)
display(p1)

Visualization reveals three distinct clusters.

Let's fit a PCA model that projects the data into three-dimensional space, apply the transformation, and visualize the results in a 3D scatter plot.

In [ ]:
p = fit(PCA,data',maxoutdim=3)
Yte = MultivariateStats.transform(p, data')
p2 = Plots.scatter3d(Yte[1,:],Yte[2,:],Yte[3,:],color=uniqueids,legend=false)
display(p2)

Nonlinear dimensionality reduction while maintaining local relationships (t-SNE)

Let's apply the t-SNE algorithm to reduce the data dimension to two components and visualize the result on a scatter plot.

In [ ]:
Y2 = tsne(data, 2, 30, 1000, verbose=false, eta=200.0)  # positional args in TSne.jl: output dims, PCA pre-reduction dims, max iterations
In [ ]:
p3 = Plots.scatter(Y2[:,1], Y2[:,2], 
color=uniqueids, legend=false, size=(400, 300), 
markersize=3, title="t-SNE visualization")
Plots.xlabel!(p3,"First component")
Plots.ylabel!(p3,"Second component")
display(p3)

The same cluster structure is visible here, although the plot itself looks different because of how the t-SNE algorithm works.

Nonlinear projection method based on the theory of manifolds (UMAP)

Let's compute the pairwise correlation matrix between observations and apply UMAP to reduce the dimension to two components.

In [ ]:
L = cor(data,data,dims=2)
emb = umap(L, 2)
Out[0]:
2×392 Matrix{Float64}:
 9.05422  8.63462  8.63881  9.05669  8.83635  …  -3.6627   -6.24698  -6.4616
 2.78768  2.53555  3.36606  2.57112  3.25251      5.45083  -1.99136  -1.71899

Let's visualize the resulting UMAP projections on a scatter plot.

In [ ]:
p4 = Plots.scatter(emb[1,:],emb[2,:],color=uniqueids,legend=false)
Plots.xlabel!(p4,"First component")
Plots.ylabel!(p4,"Second component")
display(p4)

The UMAP algorithm can also work with alternative measures of pairwise similarity between objects. Let's compute the Euclidean distances between all pairs of observations, negate them to obtain a similarity matrix, and apply UMAP to it.

In [ ]:
L = pairwise(Euclidean(), data, data,dims=1) 
emb = umap(-L, 2)
Out[0]:
2×392 Matrix{Float64}:
 -6.60819  -8.88751  -7.2234   -7.1497   …   5.48309   2.84922   2.73662
  5.93806   3.48894   5.82943   5.80849     -4.54102  -3.01769  -3.13033
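As a sanity check on what pairwise(Euclidean(), data, data, dims=1) produces, the same Euclidean distance matrix can be computed with the standard library alone (toy data here, not the car dataset):

```julia
using LinearAlgebra

X = [0.0 0.0; 3.0 4.0; 6.0 8.0]   # 3 observations × 2 features
n = size(X, 1)
D = [norm(X[i, :] .- X[j, :]) for i in 1:n, j in 1:n]   # n×n distance matrix

# D is symmetric with zeros on the diagonal;
# D[1, 2] is the 3-4-5 right triangle → 5.0
```

With dims=1, Distances.jl treats rows as observations, so the result for our data is a 392×392 matrix, one row/column per car.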

We visualize the UMAP projections on a scatter plot.

In [ ]:
p5 = Plots.scatter(emb[1,:],emb[2,:],color=uniqueids, legend=false)
Plots.xlabel!(p5,"First component")
Plots.ylabel!(p5,"Second component")
display(p5)

Conclusion

In this example, three dimensionality reduction methods were applied to a dataset of vehicle specifications. Each algorithm projected the original multidimensional feature space onto a plane for visual analysis of the data structure.

The results show a high degree of consistency between the approaches. Three clusters are clearly visible in all visualizations, with American-made cars forming two large clusters, while Japanese and European models form separate, partly mixed groups.

This pattern suggests that the technical characteristics of American cars are more variable and fall into several distinct types, while Japanese and European models of the period under review are more uniform in their parameters.

Dimensionality reduction is thus an effective data analysis tool that reveals hidden structure and relationships between objects inaccessible to direct examination of the original features. The agreement between the results of the different methods confirms the stability of the identified patterns and the reliability of the conclusions drawn.