Part 1: Ensemble Model - What Are Random Forests?

Ensemble Model
Random Forests

Ensemble Model

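A bundle of many Machine Learning Models. Even if each individual model has problems such as overfitting, the ensemble as a whole is stable.
The training samples for each model are drawn by Bootstrap sampling (sampling with replacement).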
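To make the bootstrap step concrete, here is a minimal sketch using NumPy. The toy array of ten indices is made up for illustration; it only shows what "sampling with replacement" does to a dataset, not how any particular library implements it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10)  # toy "dataset": ten sample indices (illustrative only)

# A bootstrap sample draws n items WITH replacement from n items,
# so some items appear more than once and others are left out.
bootstrap_idx = rng.integers(0, len(X), size=len(X))
bootstrap_sample = X[bootstrap_idx]

# Items that were never drawn are usually called "out-of-bag".
out_of_bag = np.setdiff1d(X, bootstrap_sample)

print("bootstrap sample:", bootstrap_sample)
print("out-of-bag items:", out_of_bag)
```

Each model in the ensemble would be trained on its own bootstrap sample, so every model sees a slightly different view of the data.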

Random Forests

A bundle of many Decision Trees. Each tree is built with different randomness (controlled by random_state), so each tree ends up as a different model.
singapp.hatenablog.com

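Regression: the average of the predictions of the individual trees
Classification: the average of the probabilities returned by the individual trees; the class with the highest averaged probability is the final prediction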
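The averaging described above can be checked against scikit-learn directly. This is a sketch on synthetic data (the datasets and the choice of 25 trees are arbitrary); it relies on the fitted forest exposing its trees through the estimators_ attribute.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: average the per-tree class probabilities, then take the argmax.
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(Xc, yc)
tree_probas = np.stack([tree.predict_proba(Xc) for tree in clf.estimators_])
mean_proba = tree_probas.mean(axis=0)
manual_pred = clf.classes_[mean_proba.argmax(axis=1)]

print("probabilities match:", np.allclose(mean_proba, clf.predict_proba(Xc)))
print("predictions match:  ", np.array_equal(manual_pred, clf.predict(Xc)))

# Regression: average the per-tree predictions.
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
reg = RandomForestRegressor(n_estimators=25, random_state=0).fit(Xr, yr)
tree_preds = np.stack([tree.predict(Xr) for tree in reg.estimators_])

print("regression matches: ", np.allclose(tree_preds.mean(axis=0), reg.predict(Xr)))
```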

Pros and Cons

Pros

No feature Normalization and little Parameter tuning required
Simple, with very good performance on a wide range of problems
Easy to parallelize

Cons

The resulting model is hard to interpret
Not well suited to very high-dimensional tasks such as text classification

Structure

Classification: RandomForestClassifier

sklearn.ensemble.RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)

Regression: RandomForestRegressor

sklearn.ensemble.RandomForestRegressor(n_estimators=100, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None)
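The signatures above are a snapshot of one scikit-learn release, and defaults can differ between versions. A small sketch for checking what the installed version actually uses, via the standard get_params() method (the four parameter names printed here are just a selection):

```python
import sklearn
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

print("scikit-learn version:", sklearn.__version__)

# get_params() returns the constructor arguments of an estimator as a dict,
# so an unconfigured instance shows the defaults of the installed version.
clf_params = RandomForestClassifier().get_params()
reg_params = RandomForestRegressor().get_params()

for name in ("n_estimators", "criterion", "max_features", "max_depth"):
    print(f"{name:>13}: classifier={clf_params[name]!r}, regressor={reg_params[name]!r}")
```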

Parameters

Mostly the same as for a Decision Tree, but max_features is the most important one.

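Parameter	Description
max_features	Number of features considered at each split. Set to 1: only one randomly chosen feature is used, so the trees become more diverse and more complex. Set close to the number of features: the trees behave like ordinary Decision Trees (Default = 'auto')
n_estimators	Number of trees in the ensemble (Default = 100)
max_depth	Maximum depth of each tree (Default = None)
n_jobs	Number of CPU cores to use during training (Default = None)
random_state	Fix this to the same value every time for reproducibility (Default = None)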
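As a rough illustration of the parameters in the table, the sketch below compares a few max_features values with cross-validation. The synthetic dataset, the candidate values (1, 3, 10) and n_estimators=50 are arbitrary choices for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data with 10 features, 4 of them informative (illustrative only).
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

scores = {}
for max_features in (1, 3, 10):
    clf = RandomForestClassifier(
        n_estimators=50,            # number of trees in the ensemble
        max_features=max_features,  # features considered at each split
        n_jobs=-1,                  # use all available CPU cores
        random_state=0,             # fixed seed for reproducibility
    )
    scores[max_features] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"max_features={max_features:>2}: mean CV accuracy = {scores[max_features]:.3f}")
```

With max_features=10 every split sees all features, so the only remaining randomness comes from the bootstrap samples.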

Usage

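import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Assumption: the plotting helper comes from the course module
# adspy_shared_utilities.py on the import path; it is not part of scikit-learn.
from adspy_shared_utilities import plot_class_regions_for_classifier_subplot

# Eight blobs in 2D
X_D2, y_D2 = make_blobs(n_samples=100, n_features=2,
                        centers=8, cluster_std=1.3,
                        random_state=4)
# Assumption: fold the eight blob labels into two classes so the data
# matches the "binary dataset" in the plot title
y_D2 = y_D2 % 2

X_train, X_test, y_train, y_test = train_test_split(X_D2, y_D2, random_state=0)
fig, subaxes = plt.subplots(1, 1, figsize=(6, 6))

# Random forest with default settings
clf = RandomForestClassifier().fit(X_train, y_train)
title = 'Random Forest Classifier, complex binary dataset, default settings'

# Draw the decision regions together with the train and test points
plot_class_regions_for_classifier_subplot(clf, X_train, y_train, X_test,
                                          y_test, title, subaxes)

plt.show()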

[Figure: Random Forest Classifier, complex binary dataset, default settings]
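The plotting helper used in the Usage code is not part of scikit-learn, so here is a self-contained sketch that draws a similar decision-region plot with plain matplotlib. It is not the helper's actual implementation: the grid resolution, colors, the `y % 2` binarization, the Agg backend and the output file name are all assumptions made for this example.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); drop this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=100, n_features=2, centers=8,
                  cluster_std=1.3, random_state=4)
y = y % 2  # assumption: fold eight blobs into two classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Predict the class on a dense grid covering the data to get the regions.
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200),
)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, ax = plt.subplots(figsize=(6, 6))
ax.contourf(xx, yy, Z, alpha=0.3)
ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, marker="o",
           edgecolor="k", label="train")
ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, marker="^",
           edgecolor="k", label="test")
ax.set_title("Random Forest Classifier, decision regions (sketch)")
ax.legend()

# Hypothetical output location, chosen only for this example.
out_path = os.path.join(tempfile.gettempdir(), "random_forest_regions.png")
fig.savefig(out_path)

print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", clf.score(X_test, y_test))
```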

PyInv

Notes on programming and overseas investing