Dimensionality reduction in machine learning
Introduction
Dimensionality reduction is a family of methods for transforming a high-dimensional feature space into a space of much smaller dimension while preserving the meaningful structure of the source data. It is widely used for visualizing high-dimensional data, making it possible to find groups of similar objects and uncover hidden patterns.
The relevance of dimensionality reduction methods stems from several factors. First, they make it possible to represent high-dimensional data graphically in two- or three-dimensional space, which greatly simplifies the interpretation of analysis results. Second, they help reveal the cluster structure of a set of observations, which may go unnoticed when the original features are examined individually.
In this example, we will look at three different algorithms:
- PCA (Principal Component Analysis), a method for linear projection of data onto principal components;
- t-SNE (t-distributed Stochastic Neighbor Embedding), a nonlinear dimensionality reduction algorithm that preserves local relationships;
- UMAP (Uniform Manifold Approximation and Projection), a nonlinear projection method based on manifold theory.
Initial data
Let's load the necessary libraries.
# EngeePkg.purge()
import Pkg
Pkg.add(["UMAP", "Plots", "XLSX", "VegaDatasets", "DataFrames", "MultivariateStats", "RDatasets", "StatsBase", "Statistics", "LinearAlgebra", "ScikitLearn", "MLBase", "Distances", "TSne", "PyCall"])
using UMAP, Plots, XLSX, VegaDatasets, DataFrames, MultivariateStats, RDatasets, StatsBase, Statistics, LinearAlgebra, ScikitLearn, MLBase, Distances, TSne, PyCall
For the analysis, we will use the data set from the VegaDatasets package, which includes the parameters of 406 different car models.
C = DataFrame(VegaDatasets.dataset("cars"))
Let's drop the rows with missing data, extract the numeric feature columns as a matrix, and display the column names of the dataset.
dropmissing!(C)
M = Matrix(C[:,2:7])
names(C)
Principal Component Analysis (PCA)
First, we encode the car origin labels and standardize the data: each feature is centered and scaled to unit variance.
car_origin = C[:,:Origin]
carmap = labelmap(car_origin)
uniqueids = labelencode(carmap,car_origin)
data = M
data = (data .- mean(data,dims = 1))./ std(data,dims=1)
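As a quick sanity check, the standardization above can be verified on a small synthetic matrix (a stand-in for the car data, not the dataset itself):

```julia
using Statistics

# Synthetic data: 5 observations (rows) × 3 features (columns)
X = [1.0 10.0 100.0;
     2.0 20.0  90.0;
     3.0 30.0 110.0;
     4.0 40.0  95.0;
     5.0 50.0 105.0]

# Column-wise standardization, as done for the car data above
Z = (X .- mean(X, dims=1)) ./ std(X, dims=1)

# After standardization every column has mean ≈ 0 and std ≈ 1
@show mean(Z, dims=1)
@show std(Z, dims=1)
```

This matters here because the car features live on very different scales (weight in thousands, acceleration in seconds); without standardization the largest-scale feature would dominate the principal components.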
The PCA algorithm interprets each column as a separate observation, so we will need to transpose the original matrix.
data'
Let's fit a model using the principal component method. The maxoutdim parameter sets the target dimension of the data. For two-dimensional visualization, we set its value to 2.
p = fit(PCA, data', maxoutdim=2)
The projection matrix can be obtained with the projection function.
P = projection(p)
Having the projection matrix, we can apply it to an individual car as follows:
P'*(data[1,:]-mean(p))
You can also transform the entire dataset using the transform function.
Yte = MultivariateStats.transform(p, data')
We can also perform the inverse transformation from the two-dimensional space back to the original six-dimensional space using the reconstruct function. This reconstruction, however, is only approximate, since the discarded components are lost.
Xr = reconstruct(p, Yte)
Let's estimate the reconstruction error after the inverse transformation.
norm(Xr-data')
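Under the hood, this PCA pipeline is equivalent to a singular value decomposition of the centered data, and the reconstruction error above is exactly the energy of the discarded components. A minimal stdlib-only sketch on a synthetic matrix (not the car data):

```julia
using LinearAlgebra, Statistics

# Synthetic data: 6 observations (rows) × 4 features (columns)
X = [ 2.0  0.0  1.0 -1.0;
      0.0  1.0 -1.0  2.0;
      1.0 -1.0  2.0  0.0;
     -1.0  2.0  0.0  1.0;
      2.0 -2.0  1.0  1.0;
      0.0  1.0 -1.0 -2.0]
Xc = X .- mean(X, dims=1)          # center the data

# SVD of the centered matrix: right singular vectors are the principal axes
U, S, V = svd(Xc)
k = 2                              # target dimension
P = V[:, 1:k]                      # analogue of projection(p)
Y = Xc * P                         # analogue of MultivariateStats.transform
Xr = Y * P'                        # analogue of reconstruct

# The Frobenius reconstruction error equals the norm of the discarded
# singular values
err = norm(Xr - Xc)
@show err
@show norm(S[k+1:end])             # the two values coincide
```

This makes the "approximate reconstruction" statement precise: the error is zero only when the data actually lies in a k-dimensional subspace.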
Let's visualize the results with a scatter plot:
p1 = Plots.scatter(Yte[1,car_origin.=="USA"],Yte[2,car_origin.=="USA"],color=1,label="USA")
Plots.xlabel!(p1,"First principal component")
Plots.ylabel!(p1,"Second principal component")
Plots.scatter!(p1,Yte[1,car_origin.=="Japan"],Yte[2,car_origin.=="Japan"],color=2,label="Japan")
Plots.scatter!(p1,Yte[1,car_origin.=="Europe"],Yte[2,car_origin.=="Europe"],color=3,label="Europe")
display(p1)
Visualization reveals three distinct clusters.
Let's train the PCA model to project data into three-dimensional space, perform the transformation, and visualize the results on a three-dimensional graph.
p = fit(PCA,data',maxoutdim=3)
Yte = MultivariateStats.transform(p, data')
p2 = Plots.scatter3d(Yte[1,:],Yte[2,:],Yte[3,:],color=uniqueids,legend=false)
display(p2)
Nonlinear dimensionality reduction preserving local relationships (t-SNE)
Let's apply the t-SNE algorithm to reduce the data to two components and visualize the result on a scatter plot. In the call below, the positional arguments are the target dimension (2), the optional PCA pre-reduction dimension (30), and the number of iterations (1000).
Y2 = tsne(data, 2, 30, 1000, verbose=false, eta=200.0)
p3 = Plots.scatter(Y2[:,1], Y2[:,2],
color=uniqueids, legend=false, size=(400, 300),
markersize=3, title="t-SNE visualization")
Plots.xlabel!(p3,"First t-SNE component")
Plots.ylabel!(p3,"Second t-SNE component")
display(p3)
The same cluster structure is visible here, although the plot itself looks different because t-SNE preserves local rather than global structure.
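The reason t-SNE behaves this way is that it works with neighborhood probabilities rather than raw distances: each point converts its distances to all other points into a conditional distribution p(j|i), with the bandwidth tuned per point so that the distribution's entropy matches the chosen perplexity. A stdlib-only sketch of that first step, using a fixed bandwidth instead of the per-point binary search the real algorithm performs:

```julia
using LinearAlgebra

# Toy 2-D points: two tight pairs, far apart from each other
X = [0.0 0.0; 0.1 0.0; 5.0 5.0; 5.1 5.0]
n = size(X, 1)

# Squared pairwise distances between rows
D2 = [sum(abs2, X[i, :] .- X[j, :]) for i in 1:n, j in 1:n]

# Conditional neighbor probabilities p(j|i) with a fixed bandwidth σ
# (real t-SNE tunes σ per point to match the perplexity)
σ = 1.0
P = exp.(-D2 ./ (2σ^2))
for i in 1:n
    P[i, i] = 0.0                 # a point is not its own neighbor
    P[i, :] ./= sum(P[i, :])      # normalize each row to a distribution
end

# Nearby points receive almost all of each other's probability mass
@show P[1, 2] > P[1, 3]
```

Because these probabilities decay exponentially with distance, only local neighborhoods carry weight, which is why distances between well-separated clusters in a t-SNE plot are not meaningful.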
Nonlinear projection method based on the theory of manifolds (UMAP)
Let's compute the matrix of pairwise correlations between observations and apply UMAP to reduce it to two components.
L = cor(data,data,dims=2)
emb = umap(L, 2)
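Note what cor(data, data, dims=2) produces: with dims=2 the correlations are taken between rows (observations), not columns (features), so the result is an n×n observation-similarity matrix, which is what UMAP then embeds. A small stdlib check on a toy matrix:

```julia
using Statistics

# Toy data: 4 observations (rows) × 3 features (columns)
X = [1.0 2.0 3.0;
     2.0 4.0 6.0;
     3.0 1.0 2.0;
     1.0 3.0 1.0]

# dims=2 treats each row as a variable → correlations between observations
L = cor(X, X, dims=2)

@show size(L)     # (4, 4): one row/column per observation
@show L[1, 2]     # row 2 is proportional to row 1 → correlation 1
```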
Let's visualize the obtained UMAP projections on a scatter plot.
p4 = Plots.scatter(emb[1,:],emb[2,:],color=uniqueids,legend=false)
Plots.xlabel!(p4,"First UMAP component")
Plots.ylabel!(p4,"Second UMAP component")
display(p4)
The UMAP algorithm allows alternative ways of measuring pairwise distances between objects. Let's compute the Euclidean distances between all pairs of observations and apply UMAP to the negated distance matrix, so that more similar objects correspond to larger values.
L = pairwise(Euclidean(), data, data,dims=1)
emb = umap(-L, 2)
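The pairwise(Euclidean(), data, data, dims=1) call from Distances.jl returns the matrix of Euclidean distances between all pairs of rows. A stdlib-only equivalent on a small matrix, to make its shape and properties explicit:

```julia
using LinearAlgebra

# Toy data: 3 observations (rows) × 2 features (columns)
X = [0.0 0.0;
     3.0 4.0;
     6.0 8.0]
n = size(X, 1)

# Equivalent of pairwise(Euclidean(), X, X, dims=1)
L = [norm(X[i, :] .- X[j, :]) for i in 1:n, j in 1:n]

@show L[1, 2]          # 5.0: the 3-4-5 right triangle
@show issymmetric(L)   # distance matrices are symmetric
@show all(L[i, i] == 0 for i in 1:n)   # zero diagonal
```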
Let's visualize the UMAP projections on a scatter plot.
p5 = Plots.scatter(emb[1,:],emb[2,:],color=uniqueids, legend=false)
Plots.xlabel!(p5,"First UMAP component")
Plots.ylabel!(p5,"Second UMAP component")
display(p5)
Conclusion
In this example, three dimensionality reduction methods were applied to a dataset of vehicle specifications. Each algorithm projected the original multidimensional feature space onto a plane for visual analysis of the data structure.
The results demonstrate a high degree of consistency across the different approaches. Three clusters are clearly visible in all visualizations: American-made cars form two large clusters, while Japanese and European models together form a separate, partially mixed group.
This pattern suggests that the technical characteristics of American cars are more variable and fall into several distinct types, while Japanese and European models of the period are more uniform in their parameters.
Thus, dimensionality reduction is an effective data analysis tool that reveals hidden structures and relationships between objects that cannot be seen by directly examining the original features. The consistency of the results across methods confirms the stability of the identified patterns and the reliability of the conclusions drawn.