Creating scatter plots from grouped data

Introduction

This example shows how to create scatter plots using grouped sample data. A scatter plot is a simple graph of how one variable depends on another.

The scatter function creates scatter plots. We will also write a function, gplotmatrix, that builds a matrix of such plots showing the relationships between several pairs of variables. Different marker symbols indicate group membership, which gives grouped versions of these plots. This is useful for checking whether the values of two variables, or the relationship between them, are the same in each group.

Initial data

We will import and attach the necessary libraries.

In [ ]:
import Pkg 
Pkg.add(["PlotlyKaleido", "StatsPlots", "DataFrames", "StatsBase", "CSV", "RDatasets", "Statistics", "Random"])
using StatsPlots, DataFrames, StatsBase, CSV, RDatasets, Statistics, Random
plotly()

Suppose we need to study the weight and mileage of cars from three different years of manufacture.

Load and display the car dataset.

In [ ]:
cars = dataset("datasets", "mtcars")
Out[0]:
32×12 DataFrame
7 rows omitted
Row  Model              MPG      Cyl    Disp     HP     DRat     WT       QSec     VS     AM     Gear   Carb
     String31           Float64  Int64  Float64  Int64  Float64  Float64  Float64  Int64  Int64  Int64  Int64
1    Mazda RX4          21.0     6      160.0    110    3.9      2.62     16.46    0      1      4      4
2    Mazda RX4 Wag      21.0     6      160.0    110    3.9      2.875    17.02    0      1      4      4
3    Datsun 710         22.8     4      108.0    93     3.85     2.32     18.61    1      1      4      1
4    Hornet 4 Drive     21.4     6      258.0    110    3.08     3.215    19.44    1      0      3      1
5    Hornet Sportabout  18.7     8      360.0    175    3.15     3.44     17.02    0      0      3      2
6    Valiant            18.1     6      225.0    105    2.76     3.46     20.22    1      0      3      1
7    Duster 360         14.3     8      360.0    245    3.21     3.57     15.84    0      0      3      4
8    Merc 240D          24.4     4      146.7    62     3.69     3.19     20.0     1      0      4      2
9    Merc 230           22.8     4      140.8    95     3.92     3.15     22.9     1      0      4      2
10   Merc 280           19.2     6      167.6    123    3.92     3.44     18.3     1      0      4      4
11   Merc 280C          17.8     6      167.6    123    3.92     3.44     18.9     1      0      4      4
12   Merc 450SE         16.4     8      275.8    180    3.07     4.07     17.4     0      0      3      3
13   Merc 450SL         17.3     8      275.8    180    3.07     3.73     17.6     0      0      3      3
⋮
21   Toyota Corona      21.5     4      120.1    97     3.7      2.465    20.01    1      0      3      1
22   Dodge Challenger   15.5     8      318.0    150    2.76     3.52     16.87    0      0      3      2
23   AMC Javelin        15.2     8      304.0    150    3.15     3.435    17.3     0      0      3      2
24   Camaro Z28         13.3     8      350.0    245    3.73     3.84     15.41    0      0      3      4
25   Pontiac Firebird   19.2     8      400.0    175    3.08     3.845    17.05    0      0      3      2
26   Fiat X1-9          27.3     4      79.0     66     4.08     1.935    18.9     1      1      4      1
27   Porsche 914-2      26.0     4      120.3    91     4.43     2.14     16.7     0      1      5      2
28   Lotus Europa       30.4     4      95.1     113    3.77     1.513    16.9     1      1      5      2
29   Ford Pantera L     15.8     8      351.0    264    4.22     3.17     14.5     0      1      5      4
30   Ferrari Dino       19.7     6      145.0    175    3.62     2.77     15.5     0      1      5      6
31   Maserati Bora      15.0     8      301.0    335    3.54     3.57     14.6     0      1      5      8
32   Volvo 142E         21.4     4      121.0    109    4.11     2.78     18.6     1      1      4      2

The dataset contains no information about release years, so let's assume three of them: 1970, 1976, and 1982. We will add this data manually, assigning the years to the cars in random order.

In [ ]:
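# Number of cars and the base number of observations per year
n = nrow(cars)
obs_per_year = div(n, 3)
# Repeat each year obs_per_year times, then pad up to n entries
release_years = repeat([1970, 1976, 1982], inner=obs_per_year)
if length(release_years) < n
    append!(release_years, release_years[1:(n - length(release_years))])
end
# Assign the years to the cars in random order
shuffle!(release_years)
cars[!, :Release_Year] = release_years;
cars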
Out[0]:
32×13 DataFrame
7 rows omitted
Row  Model              MPG      Cyl    Disp     HP     DRat     WT       QSec     VS     AM     Gear   Carb   Release_Year
     String31           Float64  Int64  Float64  Int64  Float64  Float64  Float64  Int64  Int64  Int64  Int64  Int64
1    Mazda RX4          21.0     6      160.0    110    3.9      2.62     16.46    0      1      4      4      1982
2    Mazda RX4 Wag      21.0     6      160.0    110    3.9      2.875    17.02    0      1      4      4      1970
3    Datsun 710         22.8     4      108.0    93     3.85     2.32     18.61    1      1      4      1      1970
4    Hornet 4 Drive     21.4     6      258.0    110    3.08     3.215    19.44    1      0      3      1      1976
5    Hornet Sportabout  18.7     8      360.0    175    3.15     3.44     17.02    0      0      3      2      1970
6    Valiant            18.1     6      225.0    105    2.76     3.46     20.22    1      0      3      1      1982
7    Duster 360         14.3     8      360.0    245    3.21     3.57     15.84    0      0      3      4      1976
8    Merc 240D          24.4     4      146.7    62     3.69     3.19     20.0     1      0      4      2      1976
9    Merc 230           22.8     4      140.8    95     3.92     3.15     22.9     1      0      4      2      1982
10   Merc 280           19.2     6      167.6    123    3.92     3.44     18.3     1      0      4      4      1970
11   Merc 280C          17.8     6      167.6    123    3.92     3.44     18.9     1      0      4      4      1982
12   Merc 450SE         16.4     8      275.8    180    3.07     4.07     17.4     0      0      3      3      1970
13   Merc 450SL         17.3     8      275.8    180    3.07     3.73     17.6     0      0      3      3      1976
⋮
21   Toyota Corona      21.5     4      120.1    97     3.7      2.465    20.01    1      0      3      1      1970
22   Dodge Challenger   15.5     8      318.0    150    2.76     3.52     16.87    0      0      3      2      1976
23   AMC Javelin        15.2     8      304.0    150    3.15     3.435    17.3     0      0      3      2      1976
24   Camaro Z28         13.3     8      350.0    245    3.73     3.84     15.41    0      0      3      4      1970
25   Pontiac Firebird   19.2     8      400.0    175    3.08     3.845    17.05    0      0      3      2      1976
26   Fiat X1-9          27.3     4      79.0     66     4.08     1.935    18.9     1      1      4      1      1976
27   Porsche 914-2      26.0     4      120.3    91     4.43     2.14     16.7     0      1      5      2      1970
28   Lotus Europa       30.4     4      95.1     113    3.77     1.513    16.9     1      1      5      2      1970
29   Ford Pantera L     15.8     8      351.0    264    4.22     3.17     14.5     0      1      5      4      1982
30   Ferrari Dino       19.7     6      145.0    175    3.62     2.77     15.5     0      1      5      6      1970
31   Maserati Bora      15.0     8      301.0    335    3.54     3.57     14.6     0      1      5      8      1982
32   Volvo 142E         21.4     4      121.0    109    4.11     2.78     18.6     1      1      4      2      1970
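
Because the shuffle is performed without a fixed seed, the year assignment, and therefore the table above, changes from run to run. Below is a minimal sketch of a reproducible variant, assuming a seeded MersenneTwister generator from the standard Random library. The helper name assign_years and the seed value 42 are illustrative and are not part of the original notebook. Unlike the cell above, it cycles through the years, so the group sizes differ slightly from those in the table.

```julia
using Random

# Hypothetical helper (not from the notebook): build a shuffled vector of
# `n` year labels, cycling through `years` so the groups stay balanced.
function assign_years(n::Integer, years::Vector{Int}; seed::Integer = 42)
    rng = MersenneTwister(seed)   # fixed seed: the same order on every run
    labels = [years[mod1(i, length(years))] for i in 1:n]
    return shuffle!(rng, labels)
end

release_years = assign_years(32, [1970, 1976, 1982])

# Group sizes: how many cars fall into each year
group_sizes = Dict(y => count(==(y), release_years) for y in [1970, 1976, 1982])
println(group_sizes)
```

The resulting vector can be assigned to the year column of the data frame in place of the shuffled vector. Checking the group sizes at this point is also a quick way to spot imbalance between groups before plotting.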

Let's build a scatter plot of fuel economy (MPG) against car weight.

In [ ]:
p1 = scatter(cars.WT, cars.MPG,
        group=cars.Release_Year,
        markershape=[:x :o :square],
        markercolor=[:blue :green :red],
        xlabel="Weight", ylabel="Fuel economy (MPG)",
        title="Fuel economy versus weight",
        legend_title="Release year",
        legend=:best,
        markersize=7)
display(p1)

The scatter function creates a scatter plot in which each group is represented by its own symbol.

The cars dataset contains other variables that describe various characteristics of the cars. We can explore several of them in one window by creating a matrix of plots.

Let's define a function that builds the plot matrix.

In [ ]:
xvars = [:WT, :Disp, :HP]   # columns plotted on the x-axes
yvars = [:MPG, :QSec]       # columns plotted on the y-axes

# Draw a ny×nx grid of scatter plots (each y variable against each x variable),
# marking the points according to the values in the `group` column.
function gplotmatrix(df, xvars, yvars, group)
    nx = length(xvars)
    ny = length(yvars)
    plt = plot(layout=(ny, nx), size=(800, 600), dpi=150)
    xlabels = ["Weight", "Engine displacement", "Power (hp)"]
    ylabels = ["Fuel economy (MPG)", "Quarter-mile time (s)"]

    for i in 1:ny
        for j in 1:nx
            # Subplots are numbered row by row
            subplot_idx = (i-1)*nx + j
            colors = [:blue, :green, :red]
            icons = [:x, :o, :square]
            labels = ["1970", "1976", "1982"]

            for (k, grp) in enumerate([1970, 1976, 1982])
                # Select the rows that belong to the current group
                mask = df[!, group] .== grp
                scatter!(df[mask, xvars[j]], df[mask, yvars[i]],
                        subplot=subplot_idx,
                        marker=icons[k],
                        color=colors[k],
                        label=labels[k],
                        markersize=6,
                        alpha=0.7,
                        legend=(i==1 && j==1) ? :best : false)  # legend only in the first panel
            end

            # Label the x-axis on the bottom row and the y-axis in the first column
            if i == ny
                xlabel!(plt.subplots[subplot_idx], xlabels[j])
            end
            if j == 1
                ylabel!(plt.subplots[subplot_idx], ylabels[i])
            end
        end
    end

    return plt
end
Out[0]:
gplotmatrix (generic function with 1 method)

Now let's display the plot matrix.

In [ ]:
matrix_plot = gplotmatrix(cars, xvars, yvars, :Release_Year)
display(matrix_plot)

The plot matrix shows how each variable in the rows depends on each variable in the columns. For example, the upper-left panel suggests that the lighter the car, the higher its fuel economy (MPG).
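
The visual impression from the upper-left panel can be checked numerically. The sketch below computes the Pearson correlation between weight and MPG using only the standard Statistics library. To stay self-contained it hard-codes the first seven rows of the table shown earlier instead of reading the full data frame, so the value is only illustrative.

```julia
using Statistics

# WT and MPG for the first seven rows of the mtcars table shown earlier
# (Mazda RX4 through Duster 360); a subset chosen only to keep the example short.
wt  = [2.62, 2.875, 2.32, 3.215, 3.44, 3.46, 3.57]
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3]

r = cor(wt, mpg)   # Pearson correlation coefficient
println("cor(WT, MPG) = ", round(r, digits = 2))
```

A negative coefficient is consistent with the plot: heavier cars tend to have lower MPG. On the full data frame the same check would be cor(cars.WT, cars.MPG).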

Conclusion

Visualizing grouped data in this way is a useful analysis tool in statistics and machine learning.

A plot matrix lets you judge whether the relationships between variables are consistent across groups. This helps determine whether a model needs to include interaction effects between quantitative and categorical variables, which matters for specifying statistical models correctly.
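
One rough way to compare relationships across groups is to fit a separate least-squares slope in each group. The sketch below does this for MPG against weight with five cars per year, taken from the output table above. The helper name ols_slope is an assumption introduced for illustration, and because the year labels in this example were assigned at random, the numbers demonstrate the technique rather than any real effect.

```julia
using Statistics

# Hypothetical helper: least-squares slope of y on x, cov(x, y) / var(x)
ols_slope(x, y) = cov(x, y) / var(x)

# (WT, MPG) values for five cars per year, copied from the output table above.
# The years were assigned at random, so the slopes are illustrative only.
groups = Dict(
    1970 => ([2.875, 2.32, 3.44, 3.44, 4.07], [21.0, 22.8, 18.7, 19.2, 16.4]),
    1976 => ([3.215, 3.57, 3.19, 3.73, 3.52], [21.4, 14.3, 24.4, 17.3, 15.5]),
    1982 => ([2.62, 3.46, 3.15, 3.44, 3.17],  [21.0, 18.1, 22.8, 17.8, 15.8]),
)

slopes = Dict(year => ols_slope(wt, mpg) for (year, (wt, mpg)) in groups)

for year in sort(collect(keys(slopes)))
    println(year, ": slope = ", round(slopes[year], digits = 2))
end
```

Similar slopes across groups suggest one common relationship. Markedly different slopes would hint that an interaction term may be worth considering, although with so few points per group any such difference should be treated with caution.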

In machine learning, such plots help reveal group structure and class imbalance in the data, which is especially important for algorithms that are sensitive to the data distribution. They also help in choosing an appropriate level of model complexity, from a single global relationship to separate models trained on subgroups, which reduces the risk of overfitting.

Thus, grouped plots serve as a bridge between exploratory data analysis and formal modeling, and contribute to more accurate and interpretable solutions.