Engee documentation
Notebook

Classification

This example shows how to perform classification using naive Bayesian classifiers and decision trees. Suppose you have a dataset containing observations with measurements of different variables (called predictors) and their known class labels. When you obtain predictor values for new observations, can you determine which classes these observations are likely to belong to? This is the problem of classification.

An internet connection is required to run this demonstration.

Fisher's Irises

Fisher's iris dataset consists of measurements of sepal length and width, as well as petal length and width, for 150 iris specimens: 50 from each of three species. Download the data and see how sepal size differs between the species. You can use the first two columns, which contain the sepal measurements.

Installing libraries for statistical analysis

To work with the statistical data, the libraries listed below need to be installed.

In [ ]:
Pkg.add(["RDatasets", "NaiveBayes", "StatsBase", "StatsPlots", "DecisionTree", "ScikitLearn"])
# RDatasets    - the Fisher iris dataset
# NaiveBayes   - the library with the Bayes classifiers
# StatsBase    - predict for the Bayes classifier
# StatsPlots   - scatter plots
# DecisionTree - the decision tree library
# ScikitLearn  - fit!/predict interface and cross-validation

Loading the installed and auxiliary libraries: for fetching the dataset, for plotting, and for training the classifiers.

In [ ]:
using NaiveBayes
using RDatasets
using StatsBase
using Random
using StatsPlots
using Plots
using DecisionTree
using ScikitLearn: fit!, predict
using ScikitLearn.CrossValidation: cross_val_score

Forming the predictor matrix and the vector of class labels.

In [ ]:
features, labels = load_data("iris")    # 150×4 feature matrix and 150-element label vector
features = float.(features);            # convert the features to Float64
labels   = string.(labels);             # convert the labels to String

Selecting a Plots backend for displaying graphs, and loading the dataset used to show the distribution of observations by sepal width and length (SepalWidth, SepalLength).

In [ ]:
plotlyjs();                            # interactive PlotlyJS backend for Plots
iris = dataset("datasets", "iris");    # load the iris data as a DataFrame

The first five rows of the dataset with the predictors and the class column.

In [ ]:
first(iris[:,1:5],5)
Out[0]:

5 rows × 5 columns

   SepalLength  SepalWidth  PetalLength  PetalWidth  Species
   Float64      Float64     Float64      Float64     Cat…
 1 5.1          3.5         1.4          0.2         setosa
 2 4.9          3.0         1.4          0.2         setosa
 3 4.7          3.2         1.3          0.2         setosa
 4 4.6          3.1         1.5          0.2         setosa
 5 5.0          3.6         1.4          0.2         setosa

Graph of the distribution of observations by sepal width and length.

In [ ]:
@df iris scatter(:SepalLength, :SepalWidth, group = :Species)
Out[0]:

Preparing a predictor matrix of hypothetical observations; it will be needed to build a scatter plot showing how the classifier partitions the predictor plane into classes.

In [ ]:
h = 10000;                      # number of hypothetical observations
x1 = rand(4:0.1:8, h)           # random sepal lengths
x1t = x1'
x2 = rand(2:0.1:4.5, h)         # random sepal widths
x2t = x2'
x12 = vcat(x1t, x2t);           # 2×h predictor matrix

Naive Bayes classifier

A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) assumptions that the predictors are independent.

The advantage of the naive Bayes classifier is the small amount of data required for training, parameter estimation, and classification.
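
Conceptually, the classifier picks the class c that maximizes P(c) · P(x1 | c) · … · P(xp | c). Below is a minimal sketch of this decision rule for Gaussian features; it is purely illustrative, not the NaiveBayes.jl implementation, and the function names are hypothetical.

using Statistics

# log P(x | c) under the naive assumption: independent Gaussian features,
# so the product of densities becomes a sum of log densities
log_likelihood(x, mu, sigma) = sum(@. -0.5 * log(2pi * sigma^2) - (x - mu)^2 / (2 * sigma^2))

# choose the class with the highest log prior + log likelihood;
# mus, sigmas and priors are Dicts keyed by class, estimated from the training data
function nb_predict(x, classes, mus, sigmas, priors)
    scores = [log(priors[c]) + log_likelihood(x, mus[c], sigmas[c]) for c in classes]
    return classes[argmax(scores)]
end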

Preprocessing the data for training the Bayes classifier model.

In [ ]:
Xx = Matrix(iris[:, 1:2])';                  # 2×150 predictor matrix (sepal length and width)
Yy = [species for species in iris[:, 5]];    # class labels
p, n = size(Xx)                              # number of predictors and of observations
train_frac = 0.8                             # fraction of the data used for training
k = floor(Int, train_frac * n)
idxs = randperm(n)                           # random permutation of the observation indices
train_idxs = idxs[1:k];
test_idxs = idxs[k+1:end];

Defining the model structure with the GaussianNB constructor, training it with the fit method, and computing the accuracy on the test set.

In [ ]:
modelNB = GaussianNB(unique(Yy), p)                # model over the class set with p predictors
fit(modelNB, Xx[:, train_idxs], Yy[train_idxs])    # train on the training subset
accuracyNB = count(StatsBase.predict(modelNB, Xx[:, test_idxs]) .== Yy[test_idxs]) / length(test_idxs)
println("Accuracy: $accuracyNB")
Accuracy: 0.7333333333333333

Generating predictions for the new observations.

In [ ]:
predNB = fill("", 0);                                # define an empty array
predNB = NaiveBayes.predict(modelNB, x12[:, 1:h]);   # fill it with the predicted classes

Assembling the predictor-prediction matrix for the new observations.

In [ ]:
x_pred_NB = hcat(x12', predNB);              # h×3 matrix: two predictors and the predicted class
pred_df_NB = DataFrame(x_pred_NB, :auto);    # wrap in a DataFrame for plotting

Displaying how the classifier partitions the predictor plane into classes corresponding to the iris species.

In [ ]:
gr()    # switch the Plots backend to GR
@df pred_df_NB scatter(:x1, :x2, group = :x3, markersize = 7)
Out[0]:

Decision tree

A decision tree is a set of simple rules such as "if sepal length is less than 5.45, classify the specimen as setosa". Decision trees are also non-parametric because they do not require any assumptions about the distribution of variables in each class.

A decision tree is constructed by sequentially and recursively partitioning the training set into subsets, applying a decision rule at each node, as sketched below.
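
As a rough illustration of a single partitioning step (a minimal sketch; this is not the DecisionTree.jl internals, and the function names are hypothetical), one can scan candidate thresholds on a feature and keep the split with the lowest weighted Gini impurity:

# Gini impurity of a label vector: 1 minus the sum of squared class frequencies
gini(y) = 1 - sum((count(==(c), y) / length(y))^2 for c in unique(y))

# scan thresholds on one feature and return the one minimizing the
# weighted impurity of the two resulting subsets
function best_split(x::Vector{Float64}, y)
    best_t, best_g = NaN, Inf
    for t in sort(unique(x))
        left, right = y[x .< t], y[x .>= t]
        (isempty(left) || isempty(right)) && continue
        g = (length(left) * gini(left) + length(right) * gini(right)) / length(y)
        if g < best_g
            best_t, best_g = t, g
        end
    end
    return best_t, best_g
end

# e.g. best_split(float.(iris.SepalLength), iris.Species)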

Defining the model structure with the DecisionTreeClassifier method, training it with fit!, and computing the accuracy with 8-fold cross-validation.

Displaying the decision tree.

In [ ]:
modelDT = DecisionTreeClassifier(max_depth=10)                       # defining the model structure
fit!(modelDT, features[:,1:2], labels)                               # training the model
print_tree(modelDT, 5)                                               # displaying the decision tree (to depth 5)
accuracy = cross_val_score(modelDT, features[:,1:2], labels, cv=8)   # computing the accuracy
Feature 1 < 5.55 ?
├─ Feature 2 < 2.8 ?
    ├─ Feature 1 < 4.95 ?
        ├─ Feature 2 < 2.35 ?
            ├─ Iris-setosa : 1/1
            └─ Feature 2 < 2.45 ?
                ├─ Iris-versicolor : 1/1
                └─ Iris-virginica : 1/1
        └─ Iris-versicolor : 9/9
    └─ Feature 1 < 5.35 ?
        ├─ Iris-setosa : 39/39
        └─ Feature 2 < 3.2 ?
            ├─ Iris-versicolor : 1/1
            └─ Iris-setosa : 7/7
└─ Feature 2 < 3.7 ?
    ├─ Feature 1 < 6.25 ?
        ├─ Feature 1 < 5.75 ?
            ├─ Feature 2 < 2.85 ?
                ├─ 
                └─ Iris-versicolor : 5/5
            └─ Feature 2 < 2.95 ?
                ├─ 
                └─ 
        └─ Feature 1 < 7.05 ?
            ├─ Feature 2 < 2.4 ?
                ├─ Iris-versicolor : 1/1
                └─ 
            └─ Iris-virginica : 10/10
    └─ Feature 1 < 6.75 ?
        ├─ Iris-setosa : 3/3
        └─ Iris-virginica : 2/2
Out[0]:
8-element Vector{Float64}:
 0.6190476190476191
 0.7619047619047619
 0.5
 0.7222222222222222
 0.7222222222222222
 0.7222222222222222
 0.7777777777777778
 0.7222222222222222
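
The eight numbers above are the per-fold accuracies. A single summary figure can be obtained by averaging them (a usage sketch based on the accuracy vector from the cell above):

using Statistics
println("Mean cross-validated accuracy: $(round(mean(accuracy), digits=3))")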

This cluttered-looking tree uses a series of rules of the form "Feature 1 < 5.55" to classify each sample into one of the final (leaf) nodes. To determine the species of an observed iris, start with the top condition and apply the rules in turn. If the observation satisfies a rule, take the upper branch; if not, take the lower branch. Eventually you reach a leaf node that assigns the observation one of the three species.
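
For example, tracing the first rules of the printed tree by hand for one new observation (the thresholds are copied from the output above; the variable names are hypothetical):

sl, sw = 5.0, 3.6                        # sepal length and width of a new iris
if sl < 5.55                             # top condition: Feature 1 < 5.55
    if sw >= 2.8 && sl < 5.35            # lower branch of "Feature 2 < 2.8", then "Feature 1 < 5.35"
        println("Iris-setosa")           # the leaf "Iris-setosa : 39/39"
    else
        println("apply the remaining rules of this subtree")
    end
else
    println("continue with the lower subtree: Feature 2 < 3.7 ...")
end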

Generating predictions for the new observations with the decision tree.

In [ ]:
predDT = fill("", 0);    # define an empty array
# form the prediction vector for the decision tree, one observation at a time
for i in 1:h
    predDT = vcat(predDT, predict(modelDT, x12[:,i]))
end

Assembling the predictor-prediction matrix for the new observations.

In [ ]:
x_pred_DT = hcat(x12', predDT);              # h×3 matrix: two predictors and the predicted class
pred_df_DT = DataFrame(x_pred_DT, :auto);    # wrap in a DataFrame for plotting

Displaying how the classifier partitions the predictor plane into classes corresponding to the iris species.

In [ ]:
@df pred_df_DT scatter(:x1, :x2, group = :x3, markersize = 7)
Out[0]:

Conclusion

In this example, a classification problem was solved with a naive Bayes classifier and a decision tree, demonstrating the use of the DecisionTree and NaiveBayes libraries for training the classifiers.

Even with a fairly small dataset, both classifiers showed good accuracy and partitioned the predictor plane into classes.