Multiple linear regression: predicting car fuel consumption

This example builds a multiple linear regression model that predicts a car's fuel consumption from its engine power and mass, using the least squares method. The source data comes from the mtcars dataset.

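The basic formula of multiple linear regression:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

where:

$y$ is the fuel consumption (L/100 km),

$x_1$ is the normalized engine power,

$x_2$ is the normalized vehicle weight,

$\beta_0$ is the intercept (constant term),

$\beta_1$ is the coefficient for power,

$\beta_2$ is the coefficient for mass,

$\varepsilon$ is the random error.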
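
The matrix computations later in the notebook are easier to follow with the same model written per observation and in matrix form. The block below is a notational sketch: the observation index $i$, the sample size $n$ (32 for mtcars), and the bold vector symbols are introduced here for illustration and do not appear in the original text.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i,
\qquad i = 1, \dots, n

\mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\varepsilon},
\qquad
X = \begin{pmatrix}
1 & x_{11} & x_{12} \\
\vdots & \vdots & \vdots \\
1 & x_{n1} & x_{n2}
\end{pmatrix},
\qquad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix}
```

The leading column of ones in $X$ is what carries the intercept $\beta_0$.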

The mtcars dataset is a classic data set describing 32 cars of the 1973-74 model years, taken from Motor Trend magazine. It contains fuel consumption, engine power, weight, number of cylinders, and other vehicle specifications.

This dataset is widely used in academic courses on statistics, econometrics, and machine learning as a standard example for demonstrating regression analysis.

Installing and loading the required packages:

In [ ]:
import Pkg
Pkg.add(["RDatasets", "Plots", "Statistics", "DataFrames"])
   Resolving package versions...
  No Changes to `~/.project/Project.toml`
  No Changes to `~/.project/Manifest.toml`
In [ ]:
using RDatasets        # Loading the data
using Plots            # Visualization
using Statistics       # Statistical functions
using DataFrames       # Working with tables

Loading and converting data:

In [ ]:
cars = dataset("datasets", "mtcars")  # Load the built-in dataset

# Select the predictors and the target variable
cars[:, "weight_tons"] = cars[:, "WT"] .* 0.453592  # Weight: thousands of pounds to tonnes
X_raw = cars[:, ["HP", "weight_tons"]]   # Power (hp) and weight (tonnes)
cars[:, "fuel_l100km"] = 235.22 ./ cars[:, "MPG"]  # Consumption: miles per gallon to L/100 km
y = cars[:, "fuel_l100km"]               # Fuel consumption (L/100 km)

# Set the column names explicitly for clarity
rename!(X_raw, [:HP, :weight_tons])
Out[0]:
32×2 DataFrame
 Row │ HP     weight_tons
     │ Int64  Float64
─────┼────────────────────
   1 │   110     1.18841
   2 │   110     1.30408
   3 │    93     1.05233
   4 │   110     1.4583
   5 │   175     1.56036
   6 │   105     1.56943
   7 │   245     1.61932
   8 │    62     1.44696
   9 │    95     1.42881
  10 │   123     1.56036
  11 │   123     1.56036
  12 │   180     1.84612
  13 │   180     1.6919
  ⋮  │   ⋮          ⋮
  21 │    97     1.1181
  22 │   150     1.59664
  23 │   150     1.55809
  24 │   245     1.74179
  25 │   175     1.74406
  26 │    66     0.877701
  27 │    91     0.970687
  28 │   113     0.686285
  29 │   264     1.43789
  30 │   175     1.25645
  31 │   335     1.61932
  32 │   109     1.26099
            7 rows omitted

We load the classic mtcars dataset and convert the US units to metric ones for easier interpretation. Weight is converted from thousands of pounds to metric tons (multiplied by 0.453592), and fuel economy in miles per gallon is converted to fuel consumption in liters per 100 km (235.22 divided by the mpg value). Engine power and weight serve as the predictors of fuel consumption.
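
To show where the two conversion factors come from, here is a minimal sketch that derives them from standard unit definitions. The constant and function names (`KG_PER_LB`, `MPG_FACTOR`, `mpg_to_l100km`, `klb_to_tonnes`) are hypothetical helpers, not part of the notebook, and the sketch assumes US gallons. The sample values 21.0 mpg and 2.62 thousand pounds are back-computed from the first row of the outputs shown in this notebook.

```julia
# Standard unit definitions (US gallon assumed)
const KG_PER_LB    = 0.45359237     # kilograms in one pound
const L_PER_US_GAL = 3.785411784    # liters in one US gallon
const KM_PER_MILE  = 1.609344       # kilometers in one mile

# L/100 km = 100 * (liters per gallon) / (km per mile) / mpg
const MPG_FACTOR = 100 * L_PER_US_GAL / KM_PER_MILE   # about 235.21

# Fuel economy (mpg) to fuel consumption (L/100 km); note the reciprocal
mpg_to_l100km(mpg) = MPG_FACTOR / mpg

# Thousands of pounds to tonnes: 1000 lb * kg/lb = kg, and 1000 kg = 1 t
klb_to_tonnes(wt_klb) = wt_klb * KG_PER_LB

println(mpg_to_l100km(21.0))   # about 11.2 L/100 km
println(klb_to_tonnes(2.62))   # about 1.19 t
```

The notebook rounds the first factor to 235.22, which changes the results only in the fourth significant digit. Because the conversion takes a reciprocal, the model is linear in consumption (L/100 km) rather than in mpg.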

Normalization and preparation of matrices:

In [ ]:
normalize(x) = (x .- mean(x)) ./ std(x)
X = hcat(ones(nrow(X_raw)), normalize(X_raw.HP), normalize(X_raw.weight_tons))

y = float(y)  # Convert to floating-point numbers
Out[0]:
32-element Vector{Float64}:
 11.200952380952382
 11.200952380952382
 10.316666666666666
 10.99158878504673
 12.578609625668449
 12.995580110497237
 16.44895104895105
  9.64016393442623
 10.316666666666666
 12.251041666666667
 13.214606741573034
 14.34268292682927
 13.596531791907514
  ⋮
 10.94046511627907
 15.175483870967742
 15.475000000000001
 17.685714285714283
 12.251041666666667
  8.616117216117216
  9.046923076923077
  7.737500000000001
 14.887341772151897
 11.94010152284264
 15.681333333333333
 10.99158878504673

We apply z-score normalization to the predictors, bringing them to standard form (mean = 0, standard deviation = 1). This makes the coefficients directly comparable with each other and improves the conditioning of the problem. The matrix X is formed by prepending a column of ones for the intercept.
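
As a small self-contained check of the normalization step, the sketch below applies the same z-score formula to a handful of illustrative values and confirms the resulting mean and standard deviation. The names `zscore`, `hp_demo`, and `to_z` are assumptions made for this example, and the numbers are not the full dataset.

```julia
using Statistics

# Same formula as in the notebook: center, then scale by the sample
# standard deviation (std divides by n - 1 by default)
zscore(x) = (x .- mean(x)) ./ std(x)

hp_demo = [110.0, 93.0, 175.0, 245.0, 62.0]   # illustrative values only
z_demo = zscore(hp_demo)

println("mean: ", mean(z_demo))   # zero up to rounding error
println("std:  ", std(z_demo))    # one up to rounding error

# A new raw value has to be scaled with the statistics of the data
# the model was fitted on, not with its own
to_z(x_new, x_train) = (x_new - mean(x_train)) / std(x_train)
println(to_z(150.0, hp_demo))
```

The last helper matters when predicting for a new car: its power and weight must be normalized with the mean and standard deviation of the fitting sample before being multiplied by the coefficients.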

Least squares solution:

In [ ]:
# OLS implementation (closed-form solution)
β = inv(X'X) * X'y  # (XᵀX)⁻¹Xᵀy
Out[0]:
3-element Vector{Float64}:
 12.755331278523013
  1.2062112633104227
  2.643362422121514

We apply the classical OLS formula to obtain the regression coefficients. This formula minimizes the sum of squared residuals and gives the unique closed-form solution whenever the columns of X are linearly independent, so that XᵀX is invertible. The result β contains three coefficients: the intercept, the coefficient for power, and the coefficient for mass.
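
The explicit inverse is convenient for exposition, but in practice it is rarely formed. As a hedged illustration on synthetic data, the sketch below compares the normal-equations solution with Julia's backslash operator, which solves rectangular least squares problems via a QR factorization. Everything here is an assumption made for the demo: the `_demo` variable names, the seed, and the coefficients in `β_true`.

```julia
using LinearAlgebra, Random

Random.seed!(1)                            # reproducible synthetic data
n_demo = 32
X_demo = hcat(ones(n_demo), randn(n_demo), randn(n_demo))  # intercept + 2 predictors
β_true = [12.0, 1.0, 2.5]                  # hypothetical coefficients
y_demo = X_demo * β_true .+ 0.1 .* randn(n_demo)

# Normal equations solved as a linear system, without forming the inverse
β_normal = (X_demo' * X_demo) \ (X_demo' * y_demo)

# Least squares directly on the rectangular system (QR-based)
β_qr = X_demo \ y_demo

println(maximum(abs.(β_normal .- β_qr)))   # close to zero

# Forming XᵀX squares the condition number of the problem
println(cond(X_demo' * X_demo), " vs ", cond(X_demo)^2)
```

For a small, well-scaled problem like this one both routes agree to rounding error. The QR route is generally preferred when predictors are nearly collinear, because it avoids squaring the condition number.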

Predictions and evaluation of model quality:

In [ ]:
y_pred = X * β
residuals = y - y_pred

# Quality metrics
mse = mean(residuals.^2)
rmse = sqrt(mse)
r2 = 1 - sum(residuals.^2) / sum((y .- mean(y)).^2);

We compute the predicted fuel consumption and the residuals, that is, the differences between the actual and predicted values. We then calculate the key quality metrics: the mean squared error (MSE), its square root (RMSE), which expresses the error in the original units, and the coefficient of determination (R²), which measures the share of variance explained by the model.
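
The three metrics can be checked by hand on a tiny example. The sketch below wraps the same formulas in a helper; the function name `regression_metrics` and the four made-up observations are assumptions for illustration only.

```julia
using Statistics

# MSE, RMSE, and R² for observed values and predictions
function regression_metrics(y_obs, y_hat)
    res = y_obs .- y_hat                          # residuals
    ss_res = sum(res .^ 2)                        # residual sum of squares
    ss_tot = sum((y_obs .- mean(y_obs)) .^ 2)     # total sum of squares
    mse = ss_res / length(y_obs)                  # same as mean(res .^ 2)
    return (mse = mse, rmse = sqrt(mse), r2 = 1 - ss_res / ss_tot)
end

y_obs = [10.0, 12.0, 14.0, 16.0]   # made-up observations
y_hat = [11.0, 12.0, 13.0, 16.0]   # made-up predictions
m = regression_metrics(y_obs, y_hat)

# Residuals are -1, 0, 1, 0: ss_res = 2, ss_tot = 20
println(m.mse)    # 2 / 4 = 0.5
println(m.rmse)   # sqrt(0.5), about 0.707
println(m.r2)     # 1 - 2 / 20, about 0.9
```

The MSE here divides by the number of observations n, as in the notebook. An unbiased estimate of the residual variance would instead divide by n minus the number of fitted coefficients (n - 3 for this model).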

Visualizing the multiple regression:

In [ ]:
# 3D visualization for both features
p1 = plot(title="Multiple regression", legend=:none, size=(800, 600))
scatter!(X_raw.HP, X_raw.weight_tons, y, marker=:circle, color=:blue,
         xlabel="Power (hp)", ylabel="Mass (tonnes)", zlabel="Fuel consumption (L/100 km)")
# Regression plane: raw hp and wt are normalized with the sample mean and std
surface!(sort(unique(X_raw.HP)), sort(unique(X_raw.weight_tons)),
         (hp, wt) -> β[1] + β[2]*(hp - mean(X_raw.HP))/std(X_raw.HP) + β[3]*(wt - mean(X_raw.weight_tons))/std(X_raw.weight_tons),
         alpha=0.5)

display(p1)

The three-dimensional graph shows the effect of both variables on fuel consumption. The blue dots represent actual observations, while the translucent surface represents the predictions of the regression model.

In [ ]:
# Print the results
println("\nRegression analysis results:")
println("==================================================")
println("Model coefficients:")
println(" - Intercept (b0): ", round(β[1], digits=4), " L/100 km")
println(" - Power (b1):     ", round(β[2], digits=4), " L/100 km per 1 SD of power")
println(" - Mass (b2):      ", round(β[3], digits=4), " L/100 km per 1 SD of mass")
println("\nQuality metrics:")
println(" - MSE:  ", round(mse, digits=4), " (L/100 km)²")
# @markdown Regression analysis result:
println(" - RMSE: ", round(rmse, digits=4), " L/100 km")
println(" - R²:   ", round(r2, digits=4))
println("==================================================")

# Interpretation of the coefficients
hp_std = std(X_raw.HP)
weight_std = std(X_raw.weight_tons)

println("\nInterpretation:")
println("- When power increases by 1 standard deviation (≈$(round(hp_std, digits=1)) hp),")
println("  fuel consumption changes by ", round(β[2], digits=2), " L/100 km")
println("- When mass increases by 1 standard deviation (≈$(round(weight_std, digits=2)) tonnes),")
println("  fuel consumption changes by ", round(β[3], digits=2), " L/100 km")
println("- Intercept (", round(β[1], digits=1), " L/100 km) - mean consumption at mean power and mass")
Regression analysis results:
==================================================
Model coefficients:
 - Intercept (b0): 12.7553 L/100 km
 - Power (b1):     1.2062 L/100 km per 1 SD of power
 - Mass (b2):      2.6434 L/100 km per 1 SD of mass

Quality metrics:
 - MSE:  2.2109 (L/100 km)²
 - RMSE: 1.4869 L/100 km
 - R²:   0.8471
==================================================

Interpretation:
- When power increases by 1 standard deviation (≈68.6 hp),
  fuel consumption changes by 1.21 L/100 km
- When mass increases by 1 standard deviation (≈0.44 tonnes),
  fuel consumption changes by 2.64 L/100 km
- Intercept (12.8 L/100 km) - mean consumption at mean power and mass
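
Coefficients per standard deviation are easy to compare but awkward to apply to a specific car. A possible way to express them in raw units (per hp and per tonne) is sketched below on synthetic data, so that it runs without the dataset. The `_demo` names, the seed, and the generating coefficients are all assumptions for this illustration.

```julia
using Statistics, Random

Random.seed!(2)
n_demo  = 32
hp_demo = 50 .+ 250 .* rand(n_demo)        # synthetic power, hp
wt_demo = 0.7 .+ 1.2 .* rand(n_demo)       # synthetic mass, tonnes
y_demo  = 3 .+ 0.02 .* hp_demo .+ 5 .* wt_demo .+ 0.3 .* randn(n_demo)

z(x) = (x .- mean(x)) ./ std(x)

# Fit on standardized predictors, as in the notebook
βz = hcat(ones(n_demo), z(hp_demo), z(wt_demo)) \ y_demo

# Back-transform: divide each slope by its std, then shift the intercept
b1 = βz[2] / std(hp_demo)                  # L/100 km per hp
b2 = βz[3] / std(wt_demo)                  # L/100 km per tonne
b0 = βz[1] - b1 * mean(hp_demo) - b2 * mean(wt_demo)

# Fitting directly on the raw predictors gives the same model
β_raw = hcat(ones(n_demo), hp_demo, wt_demo) \ y_demo
println(maximum(abs.([b0, b1, b2] .- β_raw)))   # close to zero
```

Applied to the rounded values printed above, the same division gives roughly 1.21 / 68.6 ≈ 0.018 L/100 km per hp and 2.64 / 0.44 ≈ 6 L/100 km per tonne.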

Conclusions:

In this example, we used the least squares method to build a multiple linear regression model on the classic mtcars dataset, which describes 32 cars of the 1973-74 model years.

The dependent variable was fuel consumption (L/100 km), and the independent variables were the normalized engine power and vehicle weight. The fitted model reached R² = 0.8471 with an RMSE of 1.4869 L/100 km, and the coefficient for mass (2.6434 L/100 km per standard deviation) was roughly twice the coefficient for power (1.2062).