Creating scatter plots using grouped data
Introduction
This example shows how to create scatter plots using grouped sample data. A scatter plot is a simple graph of how one variable depends on another.
The scatter function creates scatter plots. We will also write a function, gplotmatrix, that creates a matrix of such plots showing the relationship between several pairs of variables. Different marker symbols indicate group membership, which gives grouped versions of these plots. Grouped plots are useful for checking whether the values of two variables, or the relationship between them, are the same in each group.
Initial data
First, install and load the required packages.
import Pkg
Pkg.add(["PlotlyKaleido", "StatsPlots", "DataFrames", "StatsBase", "CSV", "RDatasets", "Statistics", "Random"])
using StatsPlots, DataFrames, StatsBase, CSV, RDatasets, Statistics, Random
plotly()
Suppose we need to study the weight and mileage of cars from three different years of manufacture.
Import and display the car data set.
cars = dataset("datasets", "mtcars")
The data set contains no information about model years, so let's assume they are 1970, 1976, and 1982 and add them manually.
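n = nrow(cars)
# Number of observations assigned to each year (integer division)
observations_per_year = div(n, 3)
release_years = repeat([1970, 1976, 1982], inner=observations_per_year)
# If n is not divisible by 3, pad with values from the start of the vector
if length(release_years) < n
    append!(release_years, release_years[1:(n - length(release_years))])
end
# Assign the years to the cars in random order
shuffle!(release_years)
cars[!, :ReleaseYear] = release_years;
cars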
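The year assignment above can also be wrapped in a small reusable helper. The following is a minimal sketch, not part of the original example: the name balanced_labels is hypothetical, and it uses only the Random standard library. Unlike the code above, it spreads the remainder so that group sizes differ by at most one.

```julia
using Random

# Hypothetical helper: assign `labels` to `n` observations as evenly as
# possible, then shuffle so that group membership is random.
function balanced_labels(n::Integer, labels::AbstractVector; rng=Random.default_rng())
    k = length(labels)
    # Repeat the label cycle enough times to cover n, then cut to length n.
    out = repeat(labels, outer=cld(n, k))[1:n]
    return shuffle!(rng, out)
end

# 32 is the number of rows in mtcars; the seed is arbitrary.
years = balanced_labels(32, [1970, 1976, 1982]; rng=MersenneTwister(1))

# Count how many observations fall into each group.
counts = Dict(y => count(==(y), years) for y in unique(years))
```

Passing an explicit rng makes the random assignment reproducible between runs, which is convenient when the resulting plots need to be compared.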
Let's build a scatter plot of fuel economy (miles per gallon) against car weight.
p1 = scatter(cars.WT, cars.MPG,
    group=cars.ReleaseYear,             # one series per release year
    markershape=[:x :o :square],        # one marker shape per group
    markercolor=[:blue :green :red],    # one color per group
    xlabel="Weight", ylabel="Miles per gallon",
    title="Fuel economy versus weight",
    legend_title="Release year",
    legend=:best,
    markersize=7)
display(p1)
The scatter function creates a scatter plot in which each group is represented by its own symbol.
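To see how group values end up paired with marker shapes, the sketch below reproduces the pairing by hand. It rests on an assumption worth checking against your Plots.jl version: that groups are taken in sorted order of their unique values, so the first shape goes to the smallest year. The variable names are illustrative only.

```julia
# Group values as they might appear in the data (row order is arbitrary).
years = [1982, 1970, 1976, 1970, 1982]
shapes = [:x, :circle, :square]

# Assumption: groups are ordered by their sorted unique values.
levels = sort(unique(years))
shape_of = Dict(zip(levels, shapes))

# Row indices belonging to each group, i.e. the points drawn as one series.
rows_of = Dict(g => findall(==(g), years) for g in levels)
```

Under this assumption the shuffled order of the rows does not matter: the legend and the markers follow the sorted group values.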
The cars data set contains other variables describing various characteristics of the cars. We can explore several of them in one window by creating a matrix of plots.
Let's define the variables of interest and a function that builds the plot matrix.
xvars = [:WT, :Disp, :HP]
yvars = [:MPG, :QSec]
function gplotmatrix(df, xvars, yvars, group)
    nx = length(xvars)
    ny = length(yvars)
    # Empty grid: one row per y variable, one column per x variable
    plt = plot(layout=(ny, nx), size=(800, 600), dpi=150)
    xlabels = ["Weight", "Engine displacement", "Power (hp)"]
    ylabels = ["Miles per gallon", "Quarter-mile time (s)"]
    colors = [:blue, :green, :red]
    icons = [:x, :o, :square]
    designations = ["1970", "1976", "1982"]
    for i in 1:ny
        for j in 1:nx
            # Subplots are numbered row by row
            subplot_idx = (i - 1) * nx + j
            for (k, grp) in enumerate([1970, 1976, 1982])
                mask = df[!, group] .== grp
                scatter!(plt, df[mask, xvars[j]], df[mask, yvars[i]],
                    subplot=subplot_idx,
                    marker=icons[k],
                    color=colors[k],
                    label=designations[k],
                    markersize=6,
                    alpha=0.7,
                    legend=(i == 1 && j == 1) ? :best : false)  # legend only in the first panel
            end
            # Label only the bottom row and the left column
            if i == ny
                xlabel!(plt.subplots[subplot_idx], xlabels[j])
            end
            if j == 1
                ylabel!(plt.subplots[subplot_idx], ylabels[i])
            end
        end
    end
    return plt
end
Now display the plot matrix.
matrix_plot = gplotmatrix(cars, xvars, yvars, :ReleaseYear)
display(matrix_plot)
Each panel of the matrix shows how one variable depends on another. For example, the upper-left panel suggests that the lighter the car, the higher its fuel economy in miles per gallon.
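A visual impression like this can be backed up with a number per group. The sketch below computes the Pearson correlation within each group using only the Statistics standard library. The helper name groupwise_cor is hypothetical, and the sample vectors are made up purely for illustration; they are not taken from mtcars.

```julia
using Statistics

# Hypothetical helper: Pearson correlation of x and y within each group in g.
function groupwise_cor(x::AbstractVector, y::AbstractVector, g::AbstractVector)
    result = Dict{eltype(g), Float64}()
    for level in sort(unique(g))
        mask = g .== level              # rows that belong to this group
        result[level] = cor(x[mask], y[mask])
    end
    return result
end

# Made-up numbers for illustration only.
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [3.0, 2.0, 1.0, 1.0, 2.0, 3.0]
g = ["a", "a", "a", "b", "b", "b"]

# Group "a" is perfectly decreasing, group "b" perfectly increasing.
r = groupwise_cor(x, y, g)
```

With the tutorial's data, x would be cars.WT, y would be cars.MPG, and g the release-year column. Correlations that differ strongly between groups would hint that the relationship is not the same in every group.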
Conclusion
Visualizing grouped data in this way is a useful analysis tool in both statistics and machine learning.
The matrix of diagrams allows you to evaluate the uniformity of relationships between variables in different groups. This helps determine whether the model needs to include the effects of the interaction between quantitative and categorical variables, which is crucial for the correct specification of statistical models.
In machine learning, such visualization helps to identify group structure and class imbalance in the data, which is especially important for algorithms that are sensitive to the data distribution. It can also guide the choice of model complexity, from a single global dependency to separate models trained on subgroups, which helps reduce the risk of overfitting.
Thus, grouped diagrams serve as a bridge between primary data analysis and the construction of formal models, contributing to the creation of more accurate and interpretable solutions.