Multiple linear regression: predicting car fuel consumption
This example builds a multiple linear regression model that predicts a car's fuel consumption from its engine power and weight using the least squares method. The source data is taken from the mtcars dataset.
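The basic formula of multiple linear regression:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$
where:
$y$ is the fuel consumption (l/100 km),
$x_1$ is the normalized engine power,
$x_2$ is the normalized vehicle weight,
$\beta_0$ is the intercept (constant term),
$\beta_1$ is the coefficient for power,
$\beta_2$ is the coefficient for weight,
$\varepsilon$ is the random error.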
The mtcars dataset is a classic dataset describing 32 cars of the 1973-74 model years, taken from Motor Trend magazine. It contains fuel consumption, engine power, weight, number of cylinders, and other vehicle specifications.
This dataset is widely used in academic courses on statistics, econometrics, and machine learning as a standard example for demonstrating regression analysis.
Installing and loading the required libraries:
import Pkg
Pkg.add(["RDatasets", "Plots", "Statistics", "DataFrames"])
using RDatasets # Loading the data
using Plots # Visualization
using Statistics # Statistical functions
using DataFrames # Working with tables
Loading and converting data:
cars = dataset("datasets", "mtcars") # Load the built-in dataset
# Convert US units to metric
cars[:, "weight_tons"] = cars[:, "WT"] .* 0.453592 # Weight: from thousands of pounds to metric tons
cars[:, "fuel_l100km"] = 235.22 ./ cars[:, "MPG"] # Consumption: from miles/gallon to l/100km
# Select the predictors and the target variable
X_raw = cars[:, ["HP", "weight_tons"]] # Power (hp) and weight (tons)
y = cars[:, "fuel_l100km"] # Fuel consumption (l/100km)
# Set the predictor column names explicitly (they already match, so this is a no-op)
rename!(X_raw, [:HP, :weight_tons])
We load the classic mtcars dataset and convert the US units of measurement to metric ones to make the results easier to read. Weight is converted from thousands of pounds to metric tons (multiplied by 0.453592), and fuel consumption is converted from miles per gallon to liters per 100 km (235.22 divided by the mpg value). Power and weight are selected as the predictors of fuel consumption.
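The two conversions can be checked by hand on a single row. The sketch below uses the Mazda RX4 entry of mtcars (21.0 mpg, weight 2.62 thousand pounds); the helper names `klb_to_tons` and `mpg_to_l100km` are hypothetical and introduced only for this illustration.

```julia
# Hypothetical helpers, not part of the original example.

# Weight in thousands of pounds -> metric tons
# (1000 lb * 0.453592 kg/lb = 453.592 kg = 0.453592 t)
klb_to_tons(w) = w * 0.453592

# Miles per US gallon -> liters per 100 km (an inverse relationship).
# 235.22 is the constant used in this example; the exact value is about 235.215.
mpg_to_l100km(mpg) = 235.22 / mpg

# Mazda RX4: 21.0 mpg, 2.62 thousand pounds
println(round(mpg_to_l100km(21.0), digits=2))  # 11.2
println(round(klb_to_tons(2.62), digits=3))    # 1.188
```

Because consumption is inversely proportional to mpg, a car with half the mpg has twice the consumption in l/100 km.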
Normalization and preparation of matrices:
normalize(x) = (x .- mean(x)) ./ std(x)
X = hcat(ones(nrow(X_raw)), normalize(X_raw.HP), normalize(X_raw.weight_tons))
y = float(y) # Ensure floating-point values
We apply z-score normalization to the predictors, bringing them to standard form (mean = 0, standard deviation = 1). This improves numerical stability and makes the coefficients comparable with each other. We then form the matrix X by prepending a column of ones for the intercept.
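As a quick sanity check of the normalization step, the sketch below applies the same formula to a few illustrative power values (not the full dataset) and confirms that the result has mean close to 0 and standard deviation close to 1. The name `zscore` is a hypothetical stand-in for the `normalize` function above.

```julia
using Statistics

# Same formula as `normalize` in the example, under a hypothetical name
zscore(x) = (x .- mean(x)) ./ std(x)

hp_sample = [110.0, 93.0, 175.0, 245.0, 62.0]  # illustrative values only
z = zscore(hp_sample)

println(abs(mean(z)) < 1e-12)   # mean is zero up to rounding error
println(isapprox(std(z), 1.0))  # standard deviation is one
```

Note that `std` in Julia's Statistics module computes the sample standard deviation (dividing by n - 1) by default, so the normalized values have unit sample standard deviation.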
Least squares solution:
# OLS implementation (closed-form solution)
β = inv(X'X) * X'y # (XᵀX)⁻¹Xᵀy
We use the closed-form OLS formula (the normal equations) to obtain the regression coefficients. It minimizes the sum of squared residuals and gives a unique solution provided that XᵀX is invertible, that is, the columns of X are linearly independent. The result β contains three coefficients: the intercept, the coefficient for power, and the coefficient for weight.
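For comparison, here is a minimal sketch on made-up, noise-free data showing that the normal-equations formula and Julia's backslash operator give the same coefficients. For a rectangular matrix, `X \ y` solves the least squares problem through a QR factorization, which is generally preferred numerically over forming an explicit inverse. All names ending in `_demo` are assumptions for this illustration.

```julia
# Synthetic predictors (made up for this sketch)
x1_demo = [-1.2, -0.5, 0.0, 0.4, 1.3, 0.9, -0.9]
x2_demo = [0.3, -1.1, 0.8, 1.5, -0.2, -0.7, -0.6]

# Noise-free target with known coefficients [2.0, 1.5, -0.5]
y_demo = 2.0 .+ 1.5 .* x1_demo .- 0.5 .* x2_demo

# Design matrix with a leading column of ones for the intercept
X_demo = hcat(ones(length(x1_demo)), x1_demo, x2_demo)

β_normal = inv(X_demo' * X_demo) * (X_demo' * y_demo)  # normal equations
β_qr = X_demo \ y_demo                                 # QR-based least squares

println(β_normal ≈ β_qr)  # both approaches agree up to rounding
println(round.(β_qr, digits=6))
```

On a small, well-conditioned problem like the one in this example the two approaches are practically indistinguishable; the difference matters mainly when the predictors are strongly correlated.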
Predictions and evaluation of model quality:
y_pred = X * β
residuals = y - y_pred
# Quality metrics
mse = mean(residuals.^2)
rmse = sqrt(mse)
r2 = 1 - sum(residuals.^2) / sum((y .- mean(y)).^2);
We compute the predicted fuel consumption and the residuals, that is, the differences between the actual and predicted values. We then compute the key quality metrics: the mean squared error (MSE), its square root (RMSE), which measures accuracy in the original units, and the coefficient of determination (R²), which gives the share of variance explained by the model.
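The metrics are easy to verify by hand on a tiny example. The sketch below uses four made-up observations and predictions (the `_demo` names and values are assumptions for illustration) chosen so that the arithmetic is exact.

```julia
using Statistics

y_true_demo = [10.0, 12.0, 14.0, 16.0]  # made-up observations
y_hat_demo  = [11.0, 11.0, 15.0, 15.0]  # made-up predictions

res_demo = y_true_demo .- y_hat_demo    # [-1, 1, -1, 1]

mse_demo  = mean(res_demo .^ 2)         # (1 + 1 + 1 + 1) / 4 = 1.0
rmse_demo = sqrt(mse_demo)              # 1.0

ss_res = sum(res_demo .^ 2)                            # 4.0
ss_tot = sum((y_true_demo .- mean(y_true_demo)) .^ 2)  # 9 + 1 + 1 + 9 = 20.0
r2_demo = 1 - ss_res / ss_tot                          # 1 - 4/20 = 0.8

println((mse_demo, rmse_demo, r2_demo))
```

An R² of 0.8 here means that the predictions account for 80% of the variation of the observations around their mean.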
Visualization of the multiple regression:
# 3D visualization over both features
p1 = plot(title="Multiple regression", legend=:none, size=(800, 600))
scatter!(X_raw.HP, X_raw.weight_tons, y, marker=:circle, color=:blue,
xlabel="Power (hp)", ylabel="Weight (tons)", zlabel="Consumption (l/100km)")
# Regression surface: predictors are normalized with the same mean and std as in training
surface!(sort(unique(X_raw.HP)), sort(unique(X_raw.weight_tons)),
(hp, wt) -> β[1] + β[2]*(hp - mean(X_raw.HP))/std(X_raw.HP) + β[3]*(wt - mean(X_raw.weight_tons))/std(X_raw.weight_tons),
alpha=0.5)
display(p1)
The three-dimensional graph shows the effect of both variables on fuel consumption. The blue dots represent actual observations, while the translucent surface represents the predictions of the regression model.
# Output of results
println("\nRegression analysis results:")
println("==================================================")
println("Coefficients of the model:")
println(" - Constant (b0): ", round(β[1], digits=4), " l/100km")
println(" - Power (b1): ", round(β[2], digits=4), " l/100km per 1 std of power")
println(" - Weight (b2): ", round(β[3], digits=4), " l/100km per 1 std of weight")
println("\nMetrics of quality:")
println(" - MSE: ", round(mse, digits=4), " (l/100km)²")
println(" - RMSE: ", round(rmse, digits=4), " l/100km")
println(" - R²: ", round(r2, digits=4))
println("==================================================")
# Interpretation of coefficients
hp_std = std(X_raw.HP)
weight_std = std(X_raw.weight_tons)
println("\nInterpretation:")
println("- With an increase in power by 1 standard deviation (≈$(round(hp_std, digits=1)) hp),")
println(" fuel consumption changes by ", round(β[2], digits=2), " l/100km")
println("- With an increase in weight by 1 standard deviation (≈$(round(weight_std, digits=2)) tons),")
println(" fuel consumption changes by ", round(β[3], digits=2), " l/100km")
println("- Intercept (", round(β[1], digits=1), " l/100km): average consumption at average power and weight")
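Coefficients expressed per standard deviation can be converted back to the original units by dividing each slope by the standard deviation of its predictor and adjusting the intercept accordingly. The sketch below illustrates this on made-up data standing in for power, weight, and consumption (all `_demo` names and values are assumptions, not mtcars rows) and checks the result against a fit on the raw predictors.

```julia
using Statistics

# Made-up data for illustration only
hp_demo = [60.0, 95.0, 110.0, 150.0, 180.0, 245.0]  # power, hp
wt_demo = [0.9, 1.1, 1.2, 1.6, 1.5, 1.7]            # weight, tons
fuel_demo = [7.5, 9.8, 10.9, 13.6, 14.1, 17.2]      # consumption, l/100km

zscore(x) = (x .- mean(x)) ./ std(x)  # hypothetical name for the z-score helper

# Fit on standardized predictors, as in the example
Xz = hcat(ones(length(fuel_demo)), zscore(hp_demo), zscore(wt_demo))
βz = Xz \ fuel_demo

# Convert to original units
b_hp = βz[2] / std(hp_demo)  # l/100km per 1 hp
b_wt = βz[3] / std(wt_demo)  # l/100km per 1 ton
b0 = βz[1] - b_hp * mean(hp_demo) - b_wt * mean(wt_demo)

# A fit on the raw predictors gives the same coefficients
Xraw = hcat(ones(length(fuel_demo)), hp_demo, wt_demo)
βraw = Xraw \ fuel_demo

println([b0, b_hp, b_wt] ≈ βraw)
println(βz[1] ≈ mean(fuel_demo))  # with centered predictors the intercept is the mean of the target
```

The second check reflects the interpretation printed above: since the normalized predictors have zero mean, the intercept equals the average consumption.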
Conclusions:
In this example, we applied the least squares method to build a multiple linear regression model on the classic mtcars dataset, which contains the characteristics of 32 cars of the 1973-74 model years.
The dependent variable was the fuel consumption of cars (l/100 km), and the normalized engine power and vehicle weight were the independent variables.