
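Performance comparison of ensemble methods in Julia and Matlab

Introduction

Modern machine learning tasks require not only high model accuracy but also efficient use of computing resources. The choice of tools directly affects development speed and training time, which is especially critical when working with large volumes of data and with ensemble methods, which are known for being resource-intensive.

Ensemble methods are machine learning methods that combine several base models (such as decision trees) to produce predictions that are more accurate and stable than those of any single model.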
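
To make the idea of combining base models concrete, here is a minimal sketch of bagging with majority voting, using only the Julia standard library. It is not how MLJ, DecisionTree, or fitcensemble are implemented: the `Stump` type and the `stump_predict`, `fit_stump`, and `vote` helpers are hypothetical names introduced for illustration, and a one-feature threshold rule stands in for a full decision tree.

```julia
using Random, Statistics

# Hypothetical weak learner: a one-feature threshold rule ("decision stump").
struct Stump
    feature::Int
    threshold::Float64
    sign::Int   # +1: predict 1 when x > threshold; -1: the opposite
end

stump_predict(s::Stump, x::AbstractVector) = (x[s.feature] > s.threshold ? 1 : -1) * s.sign

# Fit one stump on a bootstrap sample, using each feature's sample mean as the threshold.
function fit_stump(X, y, rng)
    idx = rand(rng, 1:size(X, 1), size(X, 1))   # bootstrap: draw rows with replacement
    Xb, yb = X[idx, :], y[idx]
    best, best_acc = Stump(1, 0.0, 1), -1.0
    for f in 1:size(X, 2), s in (1, -1)
        stump = Stump(f, mean(Xb[:, f]), s)
        acc = mean(stump_predict(stump, Xb[i, :]) == yb[i] for i in eachindex(yb))
        if acc > best_acc
            best, best_acc = stump, acc
        end
    end
    return best
end

# Ensemble prediction: majority vote over all stumps (labels are 1 and -1).
vote(stumps, x) = sum(stump_predict(s, x) for s in stumps) >= 0 ? 1 : -1

rng = MersenneTwister(1)
X = [randn(rng, 200, 2) .+ [2.0 0.0]; randn(rng, 200, 2) .- [2.0 0.0]]   # two synthetic classes
y = [fill(1, 200); fill(-1, 200)]
stumps = [fit_stump(X, y, rng) for _ in 1:25]   # 25 base models, each on its own bootstrap sample
accuracy = mean(vote(stumps, X[i, :]) == y[i] for i in 1:size(X, 1))
println("Training accuracy of the voted ensemble: ", accuracy)
```

A random forest follows the same pattern, with full decision trees as the base models and, typically, a random subset of features considered at each split.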

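This example compares three approaches to training a random-forest classification model:

MLJ in Julia,
DecisionTree in Julia,
and the fitcensemble function in the Matlab kernel.

All three approaches solve the same binary classification problem on synthetic data, which allows an objective comparison of their performance.

Importing libraries

Let's load the required libraries.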

In [ ]:
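import Pkg
Pkg.add(["Random", "Distributions", "LinearAlgebra", "Statistics", "Plots", "DataFrames", "DecisionTree", "MLJ", "MLJDecisionTreeInterface"])
# Plots is needed for gr()/scatter below; DataFrames is needed for the MLJ example
using Random, Distributions, LinearAlgebra, Statistics, Plots, DataFrames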

Data generation

Create ten red and ten blue base points. Note that Julia allows emoji characters in variable names.

In [ ]:
🔴 = rand(MvNormal([0.0, 1.0], 1.0I), 10)'
🔵 = rand(MvNormal([1.0, 0.0], 1.0I), 10)' 
Out[0]:
10×2 adjoint(::Matrix{Float64}) with eltype Float64:
  1.68076   -0.39267
 -1.04332   -0.524557
  0.403396   0.711854
  1.40658    0.840808
 -0.271765   0.203544
  0.706984   0.700587
  2.65888    0.695686
 -0.822003  -0.871902
  0.170193  -0.256672
  0.884981  -0.725199

Let's plot the base points on the coordinate plane.

In [ ]:
gr()
plot1 = scatter(🔴[:, 1], 🔴[:, 2], color=:red, marker=:circle, label="Red", legend=:topright)
scatter!(🔵[:, 1], 🔵[:, 2], color=:blue, marker=:circle, label="Blue")
display(plot1)

Generate 50,000 points of each color, each centered on a randomly chosen base point.

In [ ]:
N = 50000

🔴 = 🔴[rand(1:10, N), :] + randn(N, 2) .* sqrt(0.02)
🔵 = 🔵[rand(1:10, N), :] + randn(N, 2) .* sqrt(0.02)

plot2 = scatter(🔴[:, 1], 🔴[:, 2], color=:red, marker=:circle, markersize=1, label="Red", alpha=0.36, legend=:topright)
scatter!(🔵[:, 1], 🔵[:, 2], color=:blue, marker=:circle, markersize=1, label="Blue", alpha=0.36)
display(plot2)

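Merge the data and create class labels for the classification task. Build a single label vector in which red points get the label 1 and blue points get the label -1.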

In [ ]:
data = [🔴; 🔵]
labels = ones(2 * N)
labels[N+1:2*N] .= -1
# red: 1, blue: -1
display(data)
100000×2 Matrix{Float64}:
 -1.17727     0.706835
 -1.57707     0.774681
  0.178602   -1.29926
 -1.01894     1.71275
 -0.0370812  -0.457162
  0.0653013  -0.420637
 -1.09482     1.26002
 -1.60749     0.494051
  0.410089    0.599331
  0.186491   -1.19344
 -0.91909     0.618676
 -1.10581     1.17909
  0.0183241   2.10268
  ⋮          
  0.430693    0.724032
  1.72878    -0.155705
 -1.02294    -0.301795
 -0.722527   -0.963421
  0.0536655  -0.328961
  1.5339     -0.571084
  0.865277   -0.797651
  2.67771     0.722167
 -0.971874   -0.562102
 -0.630223   -0.546642
 -0.179057    0.319442
  0.231351    0.864316

Comparison of classification models

MLJ

Let's train the model using the tools of the MLJ (Machine Learning in Julia) library.

In [ ]:
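using MLJ, MLJDecisionTreeInterface, DataFrames
Tree = @load DecisionTreeClassifier pkg=DecisionTree
elapsed = @elapsed begin
    # Bagged ensemble of 100 unpruned trees; labels 1/-1 are recoded to 1/2 and coerced to Multiclass
    model = MLJ.fit!(machine(EnsembleModel(model=Tree(max_depth=-1), n=100, bagging_fraction=1.0, rng=1234),
        DataFrame(data, :auto), coerce(ifelse.(labels .== 1.0, 1, 2), Multiclass)))
end

display(model)
println("Training time: ", elapsed, " seconds")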
[ Info: For silent loading, specify `verbosity=0`. 
[ Info: Training machine(ProbabilisticEnsembleModel(model = DecisionTreeClassifier(max_depth = -1, …), …), …).
import MLJDecisionTreeInterface ✔
Training ensemble: 100%[==================================================] Time: 0:00:31

trained Machine; caches model-specific representations of data
  model: ProbabilisticEnsembleModel(model = DecisionTreeClassifier(max_depth = -1, …), …)
  args: 
    1:	Source @591 ⏎ Table{AbstractVector{ScientificTypesBase.Continuous}}
    2:	Source @049 ⏎ AbstractVector{Multiclass{2}}
Training time: 33.510742957 seconds

Training time of the model with MLJ: 33.51 seconds.
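
When reading timings like this one, keep in mind that `@elapsed` measures wall-clock time for the whole expression, so the first call to a function in a fresh Julia session also includes its just-in-time compilation. The source does not say whether the cells were warmed up before timing. The sketch below illustrates the effect with a hypothetical `work` function standing in for a training call; it is not part of the original benchmark.

```julia
# Hypothetical workload standing in for a training call.
work(n) = sum(sqrt(i) for i in 1:n)

first_call  = @elapsed work(10^6)   # includes JIT compilation of `work`
second_call = @elapsed work(10^6)   # runs the already compiled code
println("first call: ", first_call, " s, second call: ", second_call, " s")
```

For a stricter comparison, one option is to run each training call once as a warm-up and report the time of a second run.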

Matlab fitcensemble

Let's train the model in the Matlab kernel using the fitcensemble function and measure the execution time.

In [ ]:
using MATLAB

cdata = data
grp = labels
@mput cdata grp N

mat"""
tic
mdl = fitcensemble(cdata, grp, 'Method', 'Bag');
stime = toc;
disp(mdl)
"""
@mget(stime)
println("Training time: ", stime, " seconds")
  ClassificationBaggedEnsemble
             ResponseName: 'Y'
    CategoricalPredictors: []
               ClassNames: [-1 1]
           ScoreTransform: 'none'
          NumObservations: 100000
               NumTrained: 100
                   Method: 'Bag'
             LearnerNames: {'Tree'}
     ReasonForTermination: 'Terminated normally after completing the requested number of training cycles.'
                  FitInfo: []
       FitInfoDescription: 'None'
                FResample: 1
                  Replace: 1
         UseObsForLearner: [100000x100 logical]


Training time: 29.569731 seconds

Training time of the model with Matlab: 29.57 seconds.

DecisionTree

Let's train the model using the tools of the DecisionTree library.

In [ ]:
using DecisionTree
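elapsed = @elapsed begin
    # 2 features per split, 100 trees, each tree trained on a full-size bootstrap sample
    model = build_forest(labels, data, 2, 100, 1.0, rng=Random.GLOBAL_RNG)
end
display(model)
println("Training time: ", elapsed, " seconds")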
Ensemble of Decision Trees
Trees:      100
Avg Leaves: 2290.95
Avg Depth:  30.27
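Training time: 6.06898517 seconds

Training time of the model with DecisionTree: 6.07 seconds. This is the best result in the study.

Conclusion

In this study, an ensemble of 100 decision trees was trained on synthetic data. The comparison of the three approaches shows a significant difference in performance.
The native Julia DecisionTree library (6.07 s) showed the best performance, roughly five times faster than Matlab fitcensemble (29.57 s) and MLJ (33.51 s), which makes it a strong choice for high-load tasks and prototyping.
The results indicate that Engee, when using the DecisionTree library, delivers a significant performance gain over Matlab while remaining compatible with its syntax. The classification approaches presented here apply to tasks such as computer vision, predictive analytics, medical diagnostics, and financial modeling, where high accuracy and fast processing of large amounts of data are required.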
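
The relative speeds can be checked directly from the three training times reported above. The short sketch below only performs that arithmetic; the `times` and `ratios` names are introduced here for illustration, and the figures come from a single run of each approach.

```julia
# Training times in seconds, as reported in the sections above (single run each).
times = Dict("MLJ" => 33.51, "Matlab fitcensemble" => 29.57, "DecisionTree" => 6.07)

fastest = minimum(values(times))
# Ratio of each training time to the fastest one, rounded to two decimals.
ratios = Dict(name => round(t / fastest; digits = 2) for (name, t) in times)

for (name, r) in sort(collect(ratios); by = last)
    println(rpad(name, 22), r, "x")
end
```

On these numbers, Matlab fitcensemble takes about 4.87 times as long as DecisionTree and MLJ about 5.52 times as long, while MLJ and Matlab differ from each other by only about 13%.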