Engee documentation
Notebook

Dimensionality reduction in machine learning

Introduction

Dimensionality reduction is a family of methods that transform a high-dimensional feature space into a space of much lower dimension while preserving the meaningful structure of the source data. It is widely used for visualizing high-dimensional datasets, making it possible to spot groups of similar objects and uncover hidden patterns.

These methods matter for two reasons. First, they make it possible to represent high-dimensional data graphically in two or three dimensions, which greatly simplifies interpretation of the results. Second, they help reveal cluster structure in a set of observations that may go unnoticed when the original features are examined individually.

In this example, we will look at three different algorithms:

  • PCA (Principal Component Analysis) is a method for linear projection of the data onto its principal components;

  • t-SNE (t-distributed Stochastic Neighbor Embedding) is an algorithm for nonlinear dimensionality reduction while preserving local relationships;

  • UMAP (Uniform Manifold Approximation and Projection) is a nonlinear projection method based on manifold theory.
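Before turning to the libraries, it helps to see the general idea in miniature. A linear projection such as PCA can be sketched in a few lines of Julia via the SVD of the centered data; this is a toy illustration on synthetic data, not the implementation used below.

```julia
using LinearAlgebra, Statistics

# Toy data: 5 observations × 3 features
X = [1.0  2.0  3.0;
     2.0  4.1  6.2;
     3.0  5.9  9.1;
     4.0  8.2 12.0;
     5.0  9.9 15.1]

Xc = X .- mean(X, dims=1)   # center each feature
U, S, V = svd(Xc)           # columns of V are the principal directions
Y = Xc * V[:, 1:2]          # project onto the first two components: 5×2
```

The first column of Y captures the most variance, the second column the next most; dropping the remaining directions is exactly the "reduction".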

Initial data

Let's install and load the necessary packages.

In [ ]:
# EngeePkg.purge()
import Pkg
Pkg.add(["UMAP", "Plots", "Makie", "XLSX", "VegaDatasets", "DataFrames", "MultivariateStats", "RDatasets", "StatsBase", "Statistics", "LinearAlgebra", "ScikitLearn", "MLBase", "Distances", "TSne", "PyCall"])
using UMAP, Plots, Makie, XLSX, VegaDatasets, DataFrames, MultivariateStats, RDatasets, StatsBase, Statistics, LinearAlgebra, ScikitLearn, MLBase, Distances, TSne, PyCall

For the analysis, we will use the cars dataset from the VegaDatasets package, which contains the parameters of 406 car models.

In [ ]:
C = DataFrame(VegaDatasets.dataset("cars"))
Out[0]:
406×9 DataFrame (381 rows omitted)
 Row │ Name                               Miles_per_Gallon  Cylinders  Displacement  Horsepower  Weight_in_lbs  Acceleration  Year        Origin
     │ String                             Float64?          Int64      Float64       Int64?      Int64          Float64       String      String
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ chevrolet chevelle malibu          18.0              8          307.0         130         3504           12.0          1970-01-01  USA
   2 │ buick skylark 320                  15.0              8          350.0         165         3693           11.5          1970-01-01  USA
   3 │ plymouth satellite                 18.0              8          318.0         150         3436           11.0          1970-01-01  USA
   4 │ amc rebel sst                      16.0              8          304.0         150         3433           12.0          1970-01-01  USA
   5 │ ford torino                        17.0              8          302.0         140         3449           10.5          1970-01-01  USA
   6 │ ford galaxie 500                   15.0              8          429.0         198         4341           10.0          1970-01-01  USA
   7 │ chevrolet impala                   14.0              8          454.0         220         4354            9.0          1970-01-01  USA
   8 │ plymouth fury iii                  14.0              8          440.0         215         4312            8.5          1970-01-01  USA
   9 │ pontiac catalina                   14.0              8          455.0         225         4425           10.0          1970-01-01  USA
  10 │ amc ambassador dpl                 15.0              8          390.0         190         3850            8.5          1970-01-01  USA
  11 │ citroen ds-21 pallas               missing           4          133.0         115         3090           17.5          1970-01-01  Europe
  12 │ chevrolet chevelle concours (sw)   missing           8          350.0         165         4142           11.5          1970-01-01  USA
  13 │ ford torino (sw)                   missing           8          351.0         153         4034           11.0          1970-01-01  USA
   ⋮ │                 ⋮
 395 │ buick century limited              25.0              6          181.0         110         2945           16.4          1982-01-01  USA
 396 │ oldsmobile cutlass ciera (diesel)  38.0              6          262.0          85         3015           17.0          1982-01-01  USA
 397 │ chrysler lebaron medallion         26.0              4          156.0          92         2585           14.5          1982-01-01  USA
 398 │ ford granada l                     22.0              6          232.0         112         2835           14.7          1982-01-01  USA
 399 │ toyota celica gt                   32.0              4          144.0          96         2665           13.9          1982-01-01  Japan
 400 │ dodge charger 2.2                  36.0              4          135.0          84         2370           13.0          1982-01-01  USA
 401 │ chevrolet camaro                   27.0              4          151.0          90         2950           17.3          1982-01-01  USA
 402 │ ford mustang gl                    27.0              4          140.0          86         2790           15.6          1982-01-01  USA
 403 │ vw pickup                          44.0              4           97.0          52         2130           24.6          1982-01-01  Europe
 404 │ dodge rampage                      32.0              4          135.0          84         2295           11.6          1982-01-01  USA
 405 │ ford ranger                        28.0              4          120.0          79         2625           18.6          1982-01-01  USA
 406 │ chevy s-10                         31.0              4          119.0          82         2720           19.4          1982-01-01  USA

We will drop the rows with missing data, collect the numeric columns into a matrix, and display the column names of the dataset.

In [ ]:
dropmissing!(C)           # drop rows containing missing values
M = Matrix(C[:, 2:7])     # numeric feature columns as a matrix
names(C)
Out[0]:
9-element Vector{String}:
 "Name"
 "Miles_per_Gallon"
 "Cylinders"
 "Displacement"
 "Horsepower"
 "Weight_in_lbs"
 "Acceleration"
 "Year"
 "Origin"

Principal Component Analysis (PCA)

At the first stage, we encode the country of origin as integer labels (for coloring the plots later) and standardize the data: each feature is centered and scaled to unit variance.

In [ ]:
car_origin = C[:,:Origin]
carmap = labelmap(car_origin)                  # map each origin label to an integer
uniqueids = labelencode(carmap, car_origin)    # integer class ids for plotting
data = M
data = (data .- mean(data, dims=1)) ./ std(data, dims=1)   # standardize features
Out[0]:
392×6 Matrix{Float64}:
 -0.697747   1.48205    1.07591    0.663285   0.619748   -1.28362
 -1.08212    1.48205    1.48683    1.57258    0.842258   -1.46485
 -0.697747   1.48205    1.18103    1.18288    0.539692   -1.64609
 -0.953992   1.48205    1.04725    1.18288    0.53616    -1.28362
 -0.82587    1.48205    1.02813    0.923085   0.554997   -1.82732
 -1.08212    1.48205    2.24177    2.42992    1.60515    -2.00855
 -1.21024    1.48205    2.48068    3.00148    1.62045    -2.37102
 -1.21024    1.48205    2.34689    2.87158    1.57101    -2.55226
 -1.21024    1.48205    2.49023    3.13138    1.70404    -2.00855
 -1.08212    1.48205    1.86908    2.22208    1.02709    -2.55226
 -1.08212    1.48205    1.80219    1.70248    0.689209   -2.00855
 -1.21024    1.48205    1.39127    1.44268    0.743365   -2.73349
 -1.08212    1.48205    1.96464    1.18288    0.922314   -2.18979
  ⋮                                                       ⋮
  0.199113   0.309571  -0.128168   0.143685  -0.0383613   0.311242
  1.86471    0.309571   0.645885  -0.505815   0.0440496   0.528722
  0.327236  -0.862911  -0.367073  -0.323955  -0.462189   -0.377448
 -0.185255   0.309571   0.359199   0.195645  -0.167864   -0.304954
  1.09597   -0.862911  -0.481748  -0.220035  -0.368005   -0.594928
  1.60847   -0.862911  -0.567753  -0.531795  -0.715308   -0.92115
  0.455359  -0.862911  -0.414854  -0.375915  -0.0324748   0.637463
  0.455359  -0.862911  -0.519972  -0.479835  -0.220842    0.0212673
  2.63345   -0.862911  -0.930889  -1.36315   -0.997859    3.28348
  1.09597   -0.862911  -0.567753  -0.531795  -0.803605   -1.4286
  0.583482  -0.862911  -0.711097  -0.661694  -0.415097    1.10867
  0.967851  -0.862911  -0.720653  -0.583754  -0.303253    1.39865
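A quick sanity check on what this standardization does: after the transform, every column has mean ≈ 0 and standard deviation ≈ 1. A minimal sketch on toy data, using only the standard library:

```julia
using Statistics

# Toy matrix: 4 observations × 2 features (second column is 10× the first)
X = [1.0 10.0; 2.0 20.0; 3.0 30.0; 4.0 40.0]
Z = (X .- mean(X, dims=1)) ./ std(X, dims=1)   # same transform as above

# Every column of Z now has mean ≈ 0 and std ≈ 1
mean(Z, dims=1), std(Z, dims=1)
```

Standardization matters here because the features are on very different scales (weight in pounds vs. acceleration in seconds); without it, the largest-scale feature would dominate the principal components.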

The PCA implementation in MultivariateStats expects each column to be a separate observation, so we need to transpose the original matrix.

In [ ]:
data'
Out[0]:
6×392 adjoint(::Matrix{Float64}) with eltype Float64:
 -0.697747  -1.08212   -0.697747  …   1.09597    0.583482   0.967851
  1.48205    1.48205    1.48205      -0.862911  -0.862911  -0.862911
  1.07591    1.48683    1.18103      -0.567753  -0.711097  -0.720653
  0.663285   1.57258    1.18288      -0.531795  -0.661694  -0.583754
  0.619748   0.842258   0.539692     -0.803605  -0.415097  -0.303253
 -1.28362   -1.46485   -1.64609   …  -1.4286     1.10867    1.39865

Let's fit a principal component model. The maxoutdim parameter sets the output dimension of the data; for two-dimensional visualization we set it to 2.

In [ ]:
p = fit(PCA, data', maxoutdim=2)
Out[0]:
PCA(indim = 6, outdim = 2, principalratio = 0.9194828785333574)

Pattern matrix (unstandardized loadings):
───────────────────────
         PC1        PC2
───────────────────────
1  -0.873037  -0.20899
2   0.942277   0.126601
3   0.97054    0.092613
4   0.94995   -0.141833
5   0.941156   0.244211
6  -0.638795   0.761967
───────────────────────

Importance of components:
─────────────────────────────────────────────
                                PC1       PC2
─────────────────────────────────────────────
SS Loadings (Eigenvalues)  4.78827   0.728631
Variance explained         0.798044  0.121439
Cumulative variance        0.798044  0.919483
Proportion explained       0.867927  0.132073
Cumulative proportion      0.867927  1.0
─────────────────────────────────────────────
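The "Variance explained" and "Cumulative variance" rows above come from the eigenvalues of the data covariance matrix. A from-scratch sketch of that computation on synthetic data (the same idea, not the MultivariateStats internals):

```julia
using LinearAlgebra, Statistics

# Synthetic data: 200 observations × 4 features, two of them correlated
x = randn(200)
X = hcat(x, 0.8 .* x .+ 0.2 .* randn(200), randn(200), 0.5 .* randn(200))

λ = sort(eigvals(Symmetric(cov(X))), rev=true)   # variances along principal axes
explained  = λ ./ sum(λ)                         # per-component variance explained
cumulative = cumsum(explained)                   # analogue of "Cumulative variance"
```

In the fitted model above, the first two components together explain about 92% of the total variance (principalratio ≈ 0.919), which is why a 2D projection is a reasonable summary of these six features.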

The projection matrix can be obtained with the projection function.

In [ ]:
P = projection(p)
Out[0]:
6×2 Matrix{Float64}:
  0.398973  -0.244835
 -0.430615   0.148314
 -0.443531   0.108497
 -0.434122  -0.166158
 -0.430103   0.286095
  0.291926   0.892652

With the projection matrix in hand, we can apply it to an individual car as follows:

In [ ]:
P'*(data[1,:]-mean(p))
Out[0]:
2-element Vector{Float64}:
 -2.323001696522692
 -0.571351964264469

The entire dataset can also be converted using the transform function.

In [ ]:
Yte = MultivariateStats.transform(p, data')
Out[0]:
2×392 Matrix{Float64}:
 -2.323     -3.20196  -2.66658   -2.60214   …   1.22011  1.70921   1.86951
 -0.571352  -0.68187  -0.992744  -0.621975     -1.87471  0.632857  0.815607

We can also perform an inverse transformation from the two-dimensional space back to the original six-dimensional space using the reconstruct function. This time, however, the recovery is only approximate.

In [ ]:
Xr = reconstruct(p, Yte)
Out[0]:
6×392 Matrix{Float64}:
 -0.786928  -1.11055  -0.820834  …   0.945785   0.526984   0.546196
  0.91558    1.27768   1.00103      -0.803445  -0.64215   -0.684075
  0.968334   1.34619   1.075        -0.744559  -0.689425  -0.740696
  1.1034     1.50334   1.32257      -0.218179  -0.847159  -0.947116
  0.835669   1.18209   0.862883     -1.06112   -0.554079  -0.570742
 -1.18816   -1.54341  -1.66462   …  -1.31728    1.06388    1.27381

Let's estimate the reconstruction error after the inverse transformation.

In [ ]:
norm(Xr-data')
Out[0]:
13.743841055569009
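This error has a clean interpretation: for a PCA projection onto k components, the Frobenius norm of the residual equals the square root of the sum of the squared discarded singular values of the centered data. A small self-contained check on synthetic data, standard library only:

```julia
using LinearAlgebra, Statistics

X  = randn(50, 6)                 # 50 observations × 6 features
Xc = X .- mean(X, dims=1)         # center the columns
U, S, V = svd(Xc)

k   = 2
Xk  = Xc * V[:, 1:k] * V[:, 1:k]' # rank-k PCA reconstruction
err = norm(Xc - Xk)               # Frobenius norm of the residual
# err equals sqrt(sum(S[k+1:end] .^ 2)) up to rounding
```

In other words, the 13.74 above is exactly the part of the (standardized) data that lives in the four discarded directions.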

Let's visualize the results with a scatter plot:

In [ ]:
p1 = Plots.scatter(Yte[1,car_origin.=="USA"],Yte[2,car_origin.=="USA"],color=1,label="USA")
Plots.xlabel!(p1,"First component")
Plots.ylabel!(p1,"Second component")
Plots.scatter!(p1,Yte[1,car_origin.=="Japan"],Yte[2,car_origin.=="Japan"],color=2,label="Japan")
Plots.scatter!(p1,Yte[1,car_origin.=="Europe"],Yte[2,car_origin.=="Europe"],color=3,label="Europe")
display(p1)
display(p1)

Visualization reveals three distinct clusters.

Let's fit a PCA model that projects the data into three-dimensional space, apply the transformation, and visualize the results in a 3D scatter plot.

In [ ]:
p = fit(PCA,data',maxoutdim=3)
Yte = MultivariateStats.transform(p, data')
p2 = Plots.scatter3d(Yte[1,:],Yte[2,:],Yte[3,:],color=uniqueids,legend=false)
display(p2)

Nonlinear dimensionality reduction while maintaining local relationships (t-SNE)

Let's apply the t-SNE algorithm to reduce the data dimension to two components and visualize the result on a scatter plot.

In [ ]:
Y2 = tsne(data, 2, 30, 1000, verbose=false, eta=200.0)  # positional args in TSne.jl: output dims, PCA pre-reduction dims, max iterations
In [ ]:
p3 = Plots.scatter(Y2[:,1], Y2[:,2], 
color=uniqueids, legend=false, size=(400, 300), 
markersize=3, title="t-SNE visualization")
Plots.xlabel!(p3,"First component")
Plots.ylabel!(p3,"Second component")
display(p3)

The same cluster structure is visible here, although the plot itself looks different because of how the t-SNE algorithm works.

Nonlinear projection method based on the theory of manifolds (UMAP)

Let's compute the pairwise correlation matrix between observations and apply UMAP to reduce the dimension to two components.

In [ ]:
L = cor(data,data,dims=2)
emb = umap(L, 2)
Out[0]:
2×392 Matrix{Float64}:
 9.05422  8.63462  8.63881  9.05669  8.83635  …  -3.6627   -6.24698  -6.4616
 2.78768  2.53555  3.36606  2.57112  3.25251      5.45083  -1.99136  -1.71899

Let's visualize the resulting UMAP projections on a scatter plot.

In [ ]:
p4 = Plots.scatter(emb[1,:],emb[2,:],color=uniqueids,legend=false)
Plots.xlabel!(p4,"First component")
Plots.ylabel!(p4,"Second component")
display(p4)

The UMAP algorithm can also work with alternative measures of pairwise similarity between objects. Let's compute the Euclidean distances between all pairs of observations, negate them to obtain a similarity matrix, and apply UMAP to it.

In [ ]:
L = pairwise(Euclidean(), data, data,dims=1) 
emb = umap(-L, 2)
Out[0]:
2×392 Matrix{Float64}:
 -6.60819  -8.88751  -7.2234   -7.1497   …   5.48309   2.84922   2.73662
  5.93806   3.48894   5.82943   5.80849     -4.54102  -3.01769  -3.13033
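As a sanity check on what pairwise(Euclidean(), data, data, dims=1) produces, the same Euclidean distance matrix can be computed with the standard library alone (toy data here, not the car dataset):

```julia
using LinearAlgebra

X = [0.0 0.0; 3.0 4.0; 6.0 8.0]   # 3 observations × 2 features
n = size(X, 1)
D = [norm(X[i, :] .- X[j, :]) for i in 1:n, j in 1:n]   # n×n distance matrix

# D is symmetric with zeros on the diagonal;
# D[1, 2] is the 3-4-5 right triangle → 5.0
```

With dims=1, Distances.jl treats rows as observations, so the result for our data is a 392×392 matrix, one row/column per car.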

We visualize the UMAP projections on a scatter plot.

In [ ]:
p5 = Plots.scatter(emb[1,:],emb[2,:],color=uniqueids, legend=false)
Plots.xlabel!(p5,"First component")
Plots.ylabel!(p5,"Second component")
display(p5)

Conclusion

In this example, three dimensionality reduction methods were applied to a dataset of vehicle specifications. Each algorithm projected the original multidimensional feature space onto a plane for visual analysis of the data structure.

The results show a high degree of consistency between the approaches. Three clusters are clearly visible in all visualizations, with American-made cars forming two large clusters, while Japanese and European models form separate, partly mixed groups.

This pattern suggests that the technical characteristics of American cars are more variable and fall into several distinct types, while Japanese and European models of the period under review are more uniform in their parameters.

Dimensionality reduction is thus an effective data analysis tool that reveals hidden structure and relationships between objects inaccessible to direct examination of the original features. The agreement between the results of the different methods confirms the stability of the identified patterns and the reliability of the conclusions drawn.