PyInv

Notes on programming, notes on overseas investing

Part 2: Ensemble Model - Gradient Boosted Decision Trees (GBDT)

Python Python-Machine Learning

Overview

Continuing with Machine Learning ensemble models, this post covers Gradient Boosted Decision Trees.

Overview
Gradient-Boosted Decision Trees (GBDT)
- Pros and Cons
  - Pros
  - Cons
- Structure
- Main Parameters
- Usage

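Gradient-Boosted Decision Trees (GBDT)

A sequence of many small Decision Trees, where each tree corrects the errors of the trees built before it. Random Forests, by contrast, build many Decision Trees in parallel, each independent of the others.
One of the most widely used Machine Learning models.
As described below, the learning rate is the most important parameter.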
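
To make the sequential, error-correcting idea concrete, here is a minimal pure-Python sketch of boosting for squared-error regression on a single feature. It is not how scikit-learn implements GBDT; the function names (`fit_stump`, `fit_boosting`, `predict_boosting`) and the toy parabola data are assumptions made purely for illustration. Each one-split tree (a "stump") is fit to the residuals left by the stumps before it, and its contribution is scaled by the learning rate.

```python
# Minimal gradient boosting for squared-error regression, using one-split
# "stumps" as the weak learners. Pure Python; all names are illustrative.

def fit_stump(xs, residuals):
    """Find the threshold on a 1-D feature that best splits the residuals."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Sum of squared errors if each side predicts its own mean.
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, n_estimators=50, learning_rate=0.1):
    """Fit stumps sequentially; each one targets the current residuals."""
    base = sum(ys) / len(ys)              # initial prediction: the mean
    predictions = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add a shrunken copy of the new stump to the running prediction.
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for x, p in zip(xs, predictions)]
        mse = sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)
        history.append(mse)
    return base, stumps, history


def predict_boosting(model, x, learning_rate=0.1):
    base, stumps, _ = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


xs = [i / 10 for i in range(40)]
ys = [(x - 2.0) ** 2 for x in xs]         # toy target: a parabola
model = fit_boosting(xs, ys, n_estimators=50, learning_rate=0.1)
history = model[2]
print(f"MSE after 1 stump: {history[0]:.3f}, after 50 stumps: {history[-1]:.3f}")
```

The training error falls as stumps are added, because each new stump only has to model what the earlier ones got wrong.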

Pros and Cons

Pros

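Among off-the-shelf models, it often achieves the highest accuracy across a wide range of problems
Memory usage is modest and prediction is fast once the model is trained
Does not require normalization of the features
Like Decision Trees, it can use a mixture of feature types without preprocessing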

Cons

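Like Random Forests, the model is hard to interpret
Requires tuning of the learning rate and other parameters
Training is computationally expensive
Like Decision Trees, it does not handle high-dimensional features well, as in Text Classification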
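
The point in the Pros list about normalization can be checked with a small experiment. The sketch below fits the same classifier on raw features and on rescaled features and compares the predictions. The dataset mirrors the one in the Usage section; the scale factors and the variable names are assumptions chosen for illustration. Tree splits depend only on the ordering of feature values, so rescaling a feature by a positive constant should not change the fitted partitions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_blobs(n_samples=100, n_features=2, centers=8,
                  cluster_std=1.3, random_state=4)
y = y % 2

# Rescale each feature by a different positive factor. Powers of two are
# used so that the rescaling is exact in floating point.
X_scaled = X * np.array([1024.0, 8.0])

# Same random_state so both models make the same random choices.
clf_raw = GradientBoostingClassifier(random_state=0).fit(X, y)
clf_scaled = GradientBoostingClassifier(random_state=0).fit(X_scaled, y)

same = np.array_equal(clf_raw.predict(X), clf_scaled.predict(X_scaled))
print("Identical predictions after rescaling:", same)
```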

Structure

sklearn.ensemble.GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, presort='deprecated', validation_fraction=0.1, n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)

from sklearn.ensemble import GradientBoostingClassifier

Main Parameters

Tune n_estimators and learning_rate together.

Parameters	Description
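learning_rate	How strongly each tree corrects the errors of the previous trees. High: more complex model, prone to overfitting. Low: simpler model. (Default = 0.1)
n_estimators	Number of Decision Trees built in sequence. (Default = 100)
max_depth	Maximum depth of each tree. Usually a small value (3-5). (Default = 3)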
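
As a rough illustration of why these two parameters are tuned together, the sketch below fits the classifier with a few (learning_rate, n_estimators) pairs and reports training and test accuracy. The dataset mirrors the one in the Usage section; the specific pairs and the `results` dictionary are assumptions chosen for illustration, not recommended settings. A low learning rate generally needs more trees to reach the same fit.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=100, n_features=2, centers=8,
                  cluster_std=1.3, random_state=4)
y = y % 2
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# (learning_rate, n_estimators) pairs to compare; illustrative values only.
settings = [(0.01, 100), (0.01, 1000), (0.1, 100), (1.0, 100)]

results = {}
for learning_rate, n_estimators in settings:
    clf = GradientBoostingClassifier(learning_rate=learning_rate,
                                     n_estimators=n_estimators,
                                     max_depth=2, random_state=0)
    clf.fit(X_train, y_train)
    results[(learning_rate, n_estimators)] = (clf.score(X_train, y_train),
                                              clf.score(X_test, y_test))

for (learning_rate, n_estimators), (train_acc, test_acc) in results.items():
    print(f"learning_rate={learning_rate:<5} n_estimators={n_estimators:<5} "
          f"train={train_acc:.2f} test={test_acc:.2f}")
```

A large gap between training and test accuracy suggests overfitting, which is the usual signal to lower learning_rate or max_depth.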

Usage

%matplotlib notebook
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_blobs
from adspy_shared_utilities import plot_class_regions_for_classifier_subplot  # local helper module (not part of scikit-learn)
import matplotlib.pyplot as plt

X_D2, y_D2 = make_blobs(n_samples = 100, n_features = 2,
                       centers = 8, cluster_std = 1.3,
                       random_state = 4)
y_D2 = y_D2 % 2

X_train, X_test, y_train, y_test  = train_test_split(X_D2, y_D2, random_state=0)
fig, subaxes = plt.subplots(1,1,figsize = (6,6))

clf = GradientBoostingClassifier(learning_rate=0.01, max_depth=2).fit(X_train, y_train)
title = 'GBDT, complex binary dataset, learning_rate=0.01, max_depth=2'
plot_class_regions_for_classifier_subplot(clf, X_train, y_train, X_test, y_test, title, subaxes)

plt.show()
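
The plotting helper above comes from a separate module that has to be available locally. If it is not, a rough substitute can be written with NumPy alone. The sketch below is an assumption about what such a helper needs to do, not a copy of the original function: `class_region_grid` and `plot_class_regions` are hypothetical names, and the colors and markers are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split


def class_region_grid(clf, X, step=0.05, margin=1.0):
    """Predict the class on a regular grid covering the 2-D data in X."""
    x_min, x_max = X[:, 0].min() - margin, X[:, 0].max() + margin
    y_min, y_max = X[:, 1].min() - margin, X[:, 1].max() + margin
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))
    zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    return xx, yy, zz


def plot_class_regions(clf, X_train, y_train, X_test, y_test, title, ax):
    """Draw predicted regions plus training (o) and test (^) points on ax."""
    xx, yy, zz = class_region_grid(clf, np.vstack([X_train, X_test]))
    ax.contourf(xx, yy, zz, alpha=0.3, cmap="coolwarm")
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap="coolwarm",
               marker="o", edgecolor="black", label="train")
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap="coolwarm",
               marker="^", edgecolor="black", label="test")
    ax.set_title(title)
    ax.legend()


X, y = make_blobs(n_samples=100, n_features=2, centers=8,
                  cluster_std=1.3, random_state=4)
y = y % 2
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(learning_rate=0.01, max_depth=2,
                                 random_state=0).fit(X_train, y_train)
xx, yy, zz = class_region_grid(clf, X)
print("Grid shape:", zz.shape)
print("Training accuracy:", clf.score(X_train, y_train))
print("Test accuracy:", clf.score(X_test, y_test))
```

To draw the figure, create an axis with `fig, ax = plt.subplots(figsize=(6, 6))`, call `plot_class_regions(clf, X_train, y_train, X_test, y_test, title, ax)`, and finish with `plt.show()`.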

[Figure: class regions of the GBDT classifier on the complex binary dataset]