Creating scatter plots from grouped data

Introduction

This example shows how to create scatter plots using grouped sample data. A scatter plot is a simple graph of how one variable depends on another.

The function scatter creates scatter plots. We will write a function gplotmatrix that creates a matrix of such plots showing the relationship between several pairs of variables. Different marker symbols indicate group membership, which produces grouped versions of these plots. This is useful for determining whether the values of two variables, or the relationship between them, are the same in each group.

Initial data

We will import and attach the necessary libraries.

import Pkg 
Pkg.add(["PlotlyKaleido", "StatsPlots", "DataFrames", "StatsBase", "CSV", "RDatasets", "Statistics", "Random"])
using StatsPlots, DataFrames, StatsBase, CSV, RDatasets, Statistics, Random
plotly()

Suppose we need to study the weight and fuel economy (miles per gallon) of cars from three different years of manufacture.

We import and display a data set about cars.

cars = dataset("datasets", "mtcars")

Since the dataset contains no information about release years, let's assume that they are 1970, 1976, and 1982. We will add this data manually, assigning the years to the cars at random.

n = nrow(cars)
per_year = div(n, 3)  # observations per release year
release_years = repeat([1970, 1976, 1982], inner=per_year)
# Pad with the first values if n is not divisible by 3
if length(release_years) < n
    append!(release_years, release_years[1:(n - length(release_years))])
end
shuffle!(release_years)  # assign the years to the cars at random
cars[!, :ReleaseYear] = release_years;
cars
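
The labels above are built by repeating the three years, padding the vector, and shuffling it. As a minimal sketch of the same idea, the hypothetical helper assign_groups below (not part of the original example) cycles through the labels so that group sizes differ by at most one, and accepts an explicit random number generator so that the assignment can be reproduced. The seed 42 is an arbitrary choice made for illustration.

```julia
using Random

# Hypothetical helper (an assumption, not part of the original example):
# build a near-balanced vector of `n` group labels and shuffle it.
function assign_groups(n::Integer, labels::AbstractVector; rng=Random.default_rng())
    # Cycle through the labels so that group sizes differ by at most one
    groups = [labels[mod1(i, length(labels))] for i in 1:n]
    return shuffle!(rng, groups)
end

labels = [1970, 1976, 1982]
years = assign_groups(32, labels; rng=MersenneTwister(42))

# Count how many cars fall into each assumed release year
counts = Dict(y => count(==(y), years) for y in labels)
```

With 32 cars and three labels this yields groups of 11, 11, and 10 cars. The padding in the code above differs slightly: it gives both leftover cars to 1970.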

Let's build a scatter plot of fuel economy against car weight.

p1 = scatter(cars.WT, cars.MPG,
        group=cars.ReleaseYear,
        markershape=[:x :circle :square],
        markercolor=[:blue :green :red],
        xlabel="Weight", ylabel="Fuel economy (MPG)",
        title="Dependence of fuel economy on weight",
        legend_title="Year of release",
        legend=:best,
        markersize=7)
display(p1)

The function scatter creates a scatter plot in which each group is represented by its own marker symbol.

The cars data set contains other variables describing various characteristics of cars. We can explore several of them in one window by creating a matrix of scatter plots.

Let's define the variables to plot and write a function that builds the plot matrix.

xvars = [:WT, :Disp, :HP] 
yvars = [:MPG, :QSec]   

function gplotmatrix(df, xvars, yvars, group)
    nx = length(xvars)
    ny = length(yvars)
    plt = plot(layout=(ny, nx), size=(800, 600), dpi=150)
    # Axis labels are hard-coded for the variables used in this example
    xlabels = ["Weight", "Engine capacity", "Power (hp)"]
    ylabels = ["Fuel economy (MPG)", "Acceleration time (seconds)"]
    # Marker style and legend label for each release year
    years = [1970, 1976, 1982]
    colors = [:blue, :green, :red]
    icons = [:x, :circle, :square]
    designations = ["1970", "1976", "1982"]

    for i in 1:ny        # rows of the matrix (y variables)
        for j in 1:nx    # columns of the matrix (x variables)
            subplot_idx = (i-1)*nx + j

            for (k, grp) in enumerate(years)
                mask = df[!, group] .== grp  # rows that belong to this group
                scatter!(plt, df[mask, xvars[j]], df[mask, yvars[i]],
                        subplot=subplot_idx,
                        marker=icons[k],
                        color=colors[k],
                        label=designations[k],
                        markersize=6,
                        alpha=0.7,
                        # Show the legend only in the upper-left subplot
                        legend=(i==1 && j==1) ? :best : false)
            end

            # Label only the bottom row and the left column
            if i == ny
                xlabel!(plt.subplots[subplot_idx], xlabels[j])
            end
            if j == 1
                ylabel!(plt.subplots[subplot_idx], ylabels[i])
            end
        end
    end

    return plt
end

gplotmatrix (generic function with 1 method)

Now we display the plot matrix.

matrix_plot = gplotmatrix(cars, xvars, yvars, :ReleaseYear)
display(matrix_plot)

The plot matrix shows how each of the selected y variables depends on each of the x variables. For example, the upper-left plot suggests that the lighter the car, the higher its fuel economy (MPG).
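
The visual impression from the upper-left plot can be checked numerically. The sketch below computes the Pearson correlation between weight and MPG, overall and within each assumed release year, using only the first 13 rows of the table printed in this example. Because the release years were assigned at random, the per-group values are purely illustrative. The variable names wt, mpg, year, r_all, and r_by_year are introduced here and are not part of the original code.

```julia
using Statistics

# Rows 1-13 of the displayed table (WT, MPG, and the assumed release year)
wt   = [2.62, 2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 3.15, 3.44, 3.44, 4.07, 3.73]
mpg  = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3]
year = [1982, 1970, 1970, 1976, 1970, 1982, 1976, 1976, 1982, 1970, 1982, 1970, 1976]

# Correlation over all 13 cars
r_all = cor(wt, mpg)

# Correlation within each assumed release year
r_by_year = Dict(y => cor(wt[year .== y], mpg[year .== y]) for y in sort(unique(year)))
```

For these rows every coefficient is negative, which agrees with the downward trend visible in the plot.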

Conclusion

Visualizing grouped data in this way is a useful analysis tool in statistics and machine learning.

The plot matrix lets you assess whether the relationships between variables are consistent across groups. This helps determine whether a model should include interaction effects between quantitative and categorical variables, which matters for specifying statistical models correctly.
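
One rough way to probe for such an interaction is to fit a separate straight line in each group and compare the slopes. The sketch below does this with ordinary least squares on rows 21-32 of the table printed in this example. The helper fit_line is hypothetical and introduced only for illustration, and since the release years were assigned at random, the slopes say nothing about real model years. The 1982 group has only two cars in these rows, so its line simply passes through both points.

```julia
# Rows 21-32 of the displayed table (WT, MPG, and the assumed release year)
wt   = [2.465, 3.52, 3.435, 3.84, 3.845, 1.935, 2.14, 1.513, 3.17, 2.77, 3.57, 2.78]
mpg  = [21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]
year = [1970, 1976, 1976, 1970, 1976, 1976, 1970, 1970, 1982, 1970, 1982, 1970]

# Hypothetical helper: least-squares fit of y ≈ intercept + slope * x
function fit_line(x::AbstractVector, y::AbstractVector)
    X = hcat(ones(length(x)), x)  # design matrix with an intercept column
    a, b = X \ y                  # least-squares solution
    return (intercept=a, slope=b)
end

# Slope of MPG on weight within each assumed release year
slopes = Dict(y => fit_line(wt[year .== y], mpg[year .== y]).slope
              for y in sort(unique(year)))
```

If the slopes differed markedly between groups, an interaction term between weight and release year might be worth considering. Similar slopes would suggest that a single common slope is adequate.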

In machine learning, such visualization helps to identify the group structure of the data and class imbalance, which is especially important for algorithms that are sensitive to the data distribution. It also helps you choose an appropriate level of model complexity, from a single global relationship to separate models for subgroups, reducing the risk of overfitting.

Grouped plots thus serve as a bridge between exploratory data analysis and the construction of formal models, supporting more accurate and interpretable solutions.

Row	Model	MPG	Cyl	Disp	HP	DRat	WT	QSec	VS	AM	Gear	Carb
	String31	Float64	Int64	Float64	Int64	Float64	Float64	Float64	Int64	Int64	Int64	Int64
1	Mazda RX4	21.0	6	160.0	110	3.9	2.62	16.46	0	1	4	4
2	Mazda RX4 Wag	21.0	6	160.0	110	3.9	2.875	17.02	0	1	4	4
3	Datsun 710	22.8	4	108.0	93	3.85	2.32	18.61	1	1	4	1
4	Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1
5	Hornet Sportabout	18.7	8	360.0	175	3.15	3.44	17.02	0	0	3	2
6	Valiant	18.1	6	225.0	105	2.76	3.46	20.22	1	0	3	1
7	Duster 360	14.3	8	360.0	245	3.21	3.57	15.84	0	0	3	4
8	Merc 240D	24.4	4	146.7	62	3.69	3.19	20.0	1	0	4	2
9	Merc 230	22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
10	Merc 280	19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
11	Merc 280C	17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
12	Merc 450SE	16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
13	Merc 450SL	17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
21	Toyota Corona	21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1
22	Dodge Challenger	15.5	8	318.0	150	2.76	3.52	16.87	0	0	3	2
23	AMC Javelin	15.2	8	304.0	150	3.15	3.435	17.3	0	0	3	2
24	Camaro Z28	13.3	8	350.0	245	3.73	3.84	15.41	0	0	3	4
25	Pontiac Firebird	19.2	8	400.0	175	3.08	3.845	17.05	0	0	3	2
26	Fiat X1-9	27.3	4	79.0	66	4.08	1.935	18.9	1	1	4	1
27	Porsche 914-2	26.0	4	120.3	91	4.43	2.14	16.7	0	1	5	2
28	Lotus Europa	30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2
29	Ford Pantera L	15.8	8	351.0	264	4.22	3.17	14.5	0	1	5	4
30	Ferrari Dino	19.7	6	145.0	175	3.62	2.77	15.5	0	1	5	6
31	Maserati Bora	15.0	8	301.0	335	3.54	3.57	14.6	0	1	5	8
32	Volvo 142E	21.4	4	121.0	109	4.11	2.78	18.6	1	1	4	2

Row	Model	MPG	Cyl	Disp	HP	DRat	WT	QSec	VS	AM	Gear	Carb	ReleaseYear
	String31	Float64	Int64	Float64	Int64	Float64	Float64	Float64	Int64	Int64	Int64	Int64	Int64
1	Mazda RX4	21.0	6	160.0	110	3.9	2.62	16.46	0	1	4	4	1982
2	Mazda RX4 Wag	21.0	6	160.0	110	3.9	2.875	17.02	0	1	4	4	1970
3	Datsun 710	22.8	4	108.0	93	3.85	2.32	18.61	1	1	4	1	1970
4	Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1	1976
5	Hornet Sportabout	18.7	8	360.0	175	3.15	3.44	17.02	0	0	3	2	1970
6	Valiant	18.1	6	225.0	105	2.76	3.46	20.22	1	0	3	1	1982
7	Duster 360	14.3	8	360.0	245	3.21	3.57	15.84	0	0	3	4	1976
8	Merc 240D	24.4	4	146.7	62	3.69	3.19	20.0	1	0	4	2	1976
9	Merc 230	22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	1982
10	Merc 280	19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	1970
11	Merc 280C	17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	1982
12	Merc 450SE	16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	1970
13	Merc 450SL	17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	1976
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
21	Toyota Corona	21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1	1970
22	Dodge Challenger	15.5	8	318.0	150	2.76	3.52	16.87	0	0	3	2	1976
23	AMC Javelin	15.2	8	304.0	150	3.15	3.435	17.3	0	0	3	2	1976
24	Camaro Z28	13.3	8	350.0	245	3.73	3.84	15.41	0	0	3	4	1970
25	Pontiac Firebird	19.2	8	400.0	175	3.08	3.845	17.05	0	0	3	2	1976
26	Fiat X1-9	27.3	4	79.0	66	4.08	1.935	18.9	1	1	4	1	1976
27	Porsche 914-2	26.0	4	120.3	91	4.43	2.14	16.7	0	1	5	2	1970
28	Lotus Europa	30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2	1970
29	Ford Pantera L	15.8	8	351.0	264	4.22	3.17	14.5	0	1	5	4	1982
30	Ferrari Dino	19.7	6	145.0	175	3.62	2.77	15.5	0	1	5	6	1970
31	Maserati Bora	15.0	8	301.0	335	3.54	3.57	14.6	0	1	5	8	1982
32	Volvo 142E	21.4	4	121.0	109	4.11	2.78	18.6	1	1	4	2	1970