Engee documentation
Notebook

Determining influential features for a random forest model

Introduction

In modern data analysis, the task of predicting continuous quantities occupies a central place in a variety of applied fields, from economics and bioinformatics to automotive engineering and energy. One of the most widely used methods for solving such problems is Random Forest, an ensemble machine learning algorithm proposed by Leo Breiman in 2001.

A random forest is an ensemble of regression trees, each trained on its own random sample drawn from the source data. The basic building block is a regression tree: a hierarchical decision tree whose nodes successively split the feature space into more homogeneous regions. Ensembling rests on the idea that combining many simple models yields more accurate and more stable predictions than any single one of them.
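The sampling step can be sketched as follows. This is a minimal illustration that assumes bootstrap sampling (drawing rows with replacement), the scheme used in Breiman's formulation. The variable names and the sample size are hypothetical and are not part of the notebook.

```julia
using Random

rng = MersenneTwister(1)      # fixed seed so the sketch is reproducible
n_obs = 10                    # illustrative number of observations

# Bootstrap sample: draw n_obs row indices with replacement
in_bag = rand(rng, 1:n_obs, n_obs)

# Rows that were never drawn are "out-of-bag" for this tree
out_of_bag = setdiff(1:n_obs, in_bag)

println("in-bag rows:     ", sort(in_bag))
println("out-of-bag rows: ", out_of_bag)
```

Each tree sees a different in-bag set, and its out-of-bag rows can later serve as held-out data for that tree.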

Missing values in the data pose a separate problem when building decision trees. The surrogate split mechanism addresses it: when the optimal split on the primary feature cannot be applied because the value is missing, the algorithm automatically selects an alternative feature whose split mimics the original one as closely as possible.

This example demonstrates a strategy for choosing the split criterion when building a regression random forest. The analysis identifies the key predictors that contribute most to the model's predictive ability, which justifies including them in the final training dataset.

Initial data

Load the required files and libraries.

In [ ]:
# EngeePkg.purge()
# import Pkg
# Pkg.add(["DataFrames", "XLSX", "CategoricalArrays", "MLJ", "MLJDecisionTreeInterface", "StableRNGs", "EvoTrees", "DecisionTree", "Statistics", "Random", "PyPlot"])
using DataFrames, XLSX, CategoricalArrays, MLJ, MLJDecisionTreeInterface, StableRNGs, EvoTrees, DecisionTree, Statistics, Random, PyPlot
foreach(include, filter(contains(r"\.jl$"), readdir()))

A dataset containing the characteristics of passenger cars is used for the analysis. The study builds a regression model that predicts fuel consumption from the following parameters:

  • number of cylinders;
  • engine displacement;
  • power;
  • vehicle weight;
  • acceleration time;
  • model year;
  • country of origin.

In [ ]:
X = XLSX.readdata("автомобили.xlsx", "Sheet1", "A:G")
X = DataFrame(X[2:end, :], Symbol.(X[1, :])) 
Out[0]:
406×7 DataFrame
381 rows omitted
 Row  Цилиндров  Объём  Мощность  Масса  Разгон  Год  Страна
      Any        Any    Any       Any    Any     Any  Any
   1  8          307    130       3504   12      70   USA
   2  8          350    165       3693   11.5    70   USA
   3  8          318    150       3436   11      70   USA
   4  8          304    150       3433   12      70   USA
   5  8          302    140       3449   10.5    70   USA
   6  8          429    198       4341   10      70   USA
   7  8          454    220       4354   9       70   USA
   8  8          440    215       4312   8.5     70   USA
   9  8          455    225       4425   10      70   USA
  10  8          390    190       3850   8.5     70   USA
  11  4          133    115       3090   17.5    70   France
  12  8          350    165       4142   11.5    70   USA
  13  8          351    153       4034   11      70   USA
   ⋮
 395  6          181    110       2945   16.4    82   USA
 396  6          262    85        3015   17      82   USA
 397  4          156    92        2585   14.5    82   USA
 398  6          232    112       2835   14.7    82   USA
 399  4          144    96        2665   13.9    82   Japan
 400  4          135    84        2370   13      82   USA
 401  4          151    90        2950   17.3    82   USA
 402  4          140    86        2790   15.6    82   USA
 403  4          97     52        2130   24.6    82   Germany
 404  4          135    84        2295   11.6    82   USA
 405  4          120    79        2625   18.6    82   USA
 406  4          119    82        2720   19.4    82   USA

Load the fuel consumption data.

In [ ]:
расход_ = XLSX.readdata("expenditure.xlsx", "Sheet1", "A:A")
Expense = parse.(Float64, расход_[2:end])
Out[0]:
406-element Vector{Float64}:
  18.0
  15.0
  18.0
  16.0
  17.0
  15.0
  14.0
  14.0
  14.0
  15.0
 NaN
 NaN
 NaN
   ⋮
  25.0
  38.0
  26.0
  22.0
  32.0
  36.0
  27.0
  27.0
  44.0
  32.0
  28.0
  31.0

Fuel consumption values are missing (NaN) for some vehicles. The missing values are accounted for in the calculations below.

Determining the number of unique feature values

Let's determine the number of unique values of each feature from the dataset.

In [ ]:
for column in names(X)
    try
        X[!, column] = [getdata(val) for val in X[!, column]]
    catch e
        X[!, column] = categorical(X[!, column])
    end
end
unique_counts = [length(unique(skipmissing(X[!, column]))) for column in names(X)]
Out[0]:
7-element Vector{Int64}:
   5
  83
  94
 356
  96
  13
   7

Let's compare the number of unique values using a bar chart.

In [ ]:
graph1 = Plots.bar(1:length(unique_counts), unique_counts,
    title = "Number of unique values",
    ylabel = "Unique values",
    xticks = (1:length(unique_counts), names(X)),
    ylims = (0, maximum(unique_counts) * 1.1),
    xrotation = 45,
    legend = false,
    bar_width = 0.7,
    color = :steelblue)
display(graph1)

The chart shows that the features differ substantially in their number of unique values. This disparity can bias the estimates when a standard algorithm selects the split variables at the nodes of the random forest trees. For that reason, the ensemble of regression trees should be built with the relationships between the features taken into account.
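One way to see where such a bias can come from: an exhaustive split search on a numeric feature can try one threshold between each pair of adjacent distinct values, so a feature with k unique values offers up to k − 1 candidate splits. The sketch below counts them for two illustrative columns. The helper name candidate_splits and the sample values are assumptions made for this illustration only.

```julia
# Upper bound on the number of distinct thresholds an exhaustive
# split search can try on one numeric feature
candidate_splits(values) = max(length(unique(values)) - 1, 0)

# Illustrative values only, not the full dataset
cylinders_demo = [8, 8, 4, 6, 4, 8]
weight_demo    = [3504, 3693, 2130, 2945, 2370, 4341]

println("cylinders: ", candidate_splits(cylinders_demo), " candidate splits")
println("weight:    ", candidate_splits(weight_demo), " candidate splits")
```

A feature with more candidate splits has more chances to produce a good-looking split purely by chance, which is why it may be favored.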

Building an ensemble of regression trees

To estimate feature importance, we train an ensemble of regression trees that takes the relationships between the features into account. First, let's create the training sample.

In [ ]:
X_matrix = zeros(Float64, nrow(X), ncol(X))
col_names = names(X)

for (j, col) in enumerate(eachcol(X))
    if eltype(col) <: String || eltype(col) <: CategoricalValue
        unique_vals = unique(col)
        val_to_num = Dict(val => i for (i, val) in enumerate(unique_vals))
        X_matrix[:, j] = [Float64(val_to_num[x]) for x in col]
    else
        X_matrix[:, j] = Float64.(col)
    end
end

train_idx = .!isnan.(Expense)
X_train = X_matrix[train_idx, :]
y_train = Expense[train_idx]

println("Training sample size: $(size(X_train, 1)) rows")
println("Number of features: $(size(X_train, 2))")
Training sample size: 398 rows
Number of features: 7

Now let's train the ensemble.

In [ ]:
Random.seed!(1)
n_trees = 200  # number of trees in the ensemble
trees, yHat_train = build_forest(y_train, X_train, 0, n_trees, 0.632, -1, 5, 2)
valid_pred_idx = .!isnan.(yHat_train)
if sum(valid_pred_idx) > 0
    R2 = cor(y_train[valid_pred_idx], yHat_train[valid_pred_idx])^2
    println("R² = ", R2)
end
R² = 0.8713331235365577

The coefficient of determination R² ≈ 0.87 indicates that the model explains about 87% of the variance of the target variable around its mean.
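The cell above computes R² as the squared correlation between the observed and predicted values. Another common definition is 1 − SS_res / SS_tot, which never exceeds the squared correlation. A minimal sketch comparing the two on made-up numbers (the vectors y_demo and yhat_demo are illustrative and do not come from the dataset):

```julia
using Statistics

y_demo    = [18.0, 15.0, 16.0, 25.0, 32.0]   # illustrative observed values
yhat_demo = [17.0, 16.0, 17.0, 24.0, 30.0]   # illustrative predictions

# Squared Pearson correlation, as in the notebook cell
r2_corr = cor(y_demo, yhat_demo)^2

# Variance-based definition: 1 - residual sum of squares / total sum of squares
ss_res = sum((y_demo .- yhat_demo) .^ 2)
ss_tot = sum((y_demo .- mean(y_demo)) .^ 2)
r2_var = 1 - ss_res / ss_tot

println("squared correlation: ", r2_corr)
println("1 - SS_res/SS_tot:   ", r2_var)
```

The two values coincide only when the predictions have no systematic offset or scaling error, so it is worth stating which definition a reported R² uses.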

Estimating feature influence

Feature influence is estimated by permuting out-of-bag observations across the ensemble's trees.
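The general idea of permutation-based influence can be sketched as below. This is a simplified illustration under stated assumptions, not the notebook's implementation: it shuffles each column of the full data rather than out-of-bag rows per tree, and the names permutation_scores, toy_predict, X_toy, and y_toy are hypothetical. The toy predictor stands in for a trained forest.

```julia
using Random, Statistics

# Mean squared error between targets and predictions
mse(y, yhat) = mean((y .- yhat) .^ 2)

# Average increase in MSE after shuffling each feature column.
# `predict` is any function mapping a feature matrix to predictions.
function permutation_scores(predict, X::AbstractMatrix, y::AbstractVector;
                            n_repeats::Int = 5, rng = MersenneTwister(1))
    baseline = mse(y, predict(X))
    scores = zeros(size(X, 2))
    for j in axes(X, 2)
        increase = 0.0
        for _ in 1:n_repeats
            Xp = copy(X)
            Xp[:, j] = shuffle(rng, X[:, j])   # break the link between feature j and y
            increase += mse(y, predict(Xp)) - baseline
        end
        scores[j] = increase / n_repeats
    end
    return scores
end

# Toy data: the target depends strongly on column 1, weakly on column 2, not on column 3
data_rng = MersenneTwister(42)
X_toy = rand(data_rng, 200, 3)
y_toy = 3 .* X_toy[:, 1] .+ 0.5 .* X_toy[:, 2]
toy_predict(M) = 3 .* M[:, 1] .+ 0.5 .* M[:, 2]

scores = permutation_scores(toy_predict, X_toy, y_toy)
println(scores)
```

A feature the model does not use keeps the error unchanged when shuffled, so its score stays at zero, while shuffling an influential feature raises the error noticeably.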

In [ ]:
importance = permutation_importance(trees, X_train, y_train, 5);

The variable importance is a 1×7 vector containing the influence estimates for the original features. A notable property of these estimates is the absence of bias toward features with many unique values. Let's compare the resulting feature influence scores.

In [ ]:
graph2 = Plots.bar(importance, 
    title = "Feature influence scores",
    xlabel = "Features",
    ylabel = "Influence",
    xticks = (1:length(importance), names(X)),
    ylims = (0, maximum(importance) * 1.1),
    xrotation = 45,
    legend = false,
    bar_width = 0.7,
    color = :steelblue)
display(graph2)

Higher values of estimates correspond to more influential predictors. According to the bar chart, the year of manufacture of the car has the greatest predictive value, followed by the weight of the car.

Let's display the estimated associations between the features as a heat map.

In [ ]:
graph3 = imshow(predAssociation, cmap="viridis", aspect="auto", interpolation="nearest")
title("Predictive association between features")
colorbar(label="Association")
PyPlot.xticks(0:length(col_names)-1, col_names, rotation=45, ha="right")
PyPlot.yticks(0:length(col_names)-1, col_names)
display(graph3)
PyObject <matplotlib.image.AxesImage object at 0x7f317892a990>

The predictive measure of association indicates how similar two decision rules are in the way they split the observations. The best surrogate split is the one for which this measure is largest.

The matrix elements indicate the strength of the relationship between features: larger values correspond to a stronger association between the corresponding pair of features.
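The notebook does not give the formula, so the sketch below assumes a commonly used definition: λ = (min(P_L, P_R) − (1 − P_LL − P_RR)) / min(P_L, P_R), where P_L and P_R are the shares of observations sent left and right by the primary split, and P_LL and P_RR are the shares sent the same way by both splits. The function name and the Boolean-vector encoding are hypothetical choices for this illustration.

```julia
# Predictive measure of association between a primary split and a candidate surrogate.
# Each argument is a Boolean vector: true means the observation goes to the left child.
function predictive_association(left_primary::AbstractVector{Bool},
                                left_surrogate::AbstractVector{Bool})
    n = length(left_primary)
    n == length(left_surrogate) || throw(ArgumentError("vectors must have equal length"))

    p_left  = count(left_primary) / n
    p_right = 1 - p_left
    p_ll = count(left_primary .& left_surrogate) / n            # both send left
    p_rr = count((.!left_primary) .& (.!left_surrogate)) / n    # both send right

    # Error rate of the trivial rule "send everything to the larger child"
    baseline = min(p_left, p_right)
    return (baseline - (1 - p_ll - p_rr)) / baseline
end

primary   = [true, true, true, false, false, false]
surrogate = [true, true, false, false, false, true]

println(predictive_association(primary, primary))     # identical splits
println(predictive_association(primary, surrogate))   # partial agreement
```

Under this definition a value of 1 means the surrogate reproduces the primary split exactly, and values at or below 0 mean it does no better than sending every observation to the larger child.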

Conclusion

This example demonstrates an approach to building a regression random forest with an emphasis on correctly selecting influential features. A set of car characteristics is used to predict fuel consumption, a typical task in the automotive industry, where model accuracy affects engineering decisions.
The surrogate split mechanism made it possible to handle missing data correctly and to construct a matrix of feature associations that gives a meaningful interpretation of the dependency structure in the data.

The methodology is general and applies across a wide range of fields, from mechanical engineering to bioinformatics. The main practical conclusion is that accounting for the nature of the variables when choosing the split criterion can improve forecast accuracy and yield conclusions about feature significance that are free from statistical bias.