Part 1: What Are Ensemble Models and Random Forests?
Ensemble Model
A bundle of many machine learning models. Even if each individual model suffers from problems such as overfitting, the ensemble as a whole remains stable.
The training samples for each model are collected by bootstrap sampling (random sampling with replacement).
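Bootstrap sampling can be sketched in a few lines of NumPy (a toy illustration, not scikit-learn's internal implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # a toy dataset of 10 samples

# A bootstrap sample draws n points *with replacement* from the original n,
# so some points appear more than once and others are left out entirely.
boot = rng.choice(data, size=len(data), replace=True)
print(sorted(boot))
```

Each tree in the ensemble gets its own bootstrap sample like this, which is one source of diversity among the trees.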
Random Forests
A bundle of many decision trees. Each tree is grown on a different bootstrap sample and a different random subset of features, so the trees end up as different models (the randomness is controlled by random_state).
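To see that the fitted trees really differ, you can inspect the estimators_ attribute of a fitted forest (a minimal sketch on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)
forest = RandomForestClassifier(n_estimators=3, random_state=0).fit(X, y)

# estimators_ holds the individual DecisionTreeClassifier objects; bootstrap
# sampling and random feature selection make their structures differ.
print([tree.get_depth() for tree in forest.estimators_])
```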
Regression: average the predictions of the individual trees
Classification: average the class probabilities returned by the individual trees, then predict the class with the highest averaged probability
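For classification, this averaging can be checked directly: scikit-learn's predict_proba is the mean of the per-tree probabilities (a sketch on synthetic data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# Average the class probabilities of the individual trees by hand...
manual = np.mean([tree.predict_proba(X) for tree in clf.estimators_], axis=0)

# ...and compare with the forest's own predict_proba.
print(np.allclose(manual, clf.predict_proba(X)))
```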
Pros and Cons
Pros
- Works well without feature normalization and with little parameter tuning
- Simple to use and gives very good performance on a wide range of problems
- Easy to parallelize
Cons
- The resulting model is hard to interpret
- Not well suited to high-dimensional, sparse data such as text classification
Structure
- Classification: RandomForestClassifier
sklearn.ensemble.RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
- Regression: RandomForestRegressor
sklearn.ensemble.RandomForestRegressor(n_estimators=100, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None)
Parameters
Mostly the same as for a decision tree, but max_features is the most important one.
| Parameter | Description |
| --- | --- |
| max_features | Number of features considered at each split. Set to 1, each split uses a single randomly chosen feature, so the trees differ more and the combined decision boundary can be more complex; set close to the total number of features, each tree behaves much like an ordinary decision tree. (Default = 'auto') |
| n_estimators | Number of trees in the ensemble (Default = 100 since scikit-learn 0.22; previously 10) |
| max_depth | Maximum depth of each tree (Default = None) |
| n_jobs | Number of CPU cores to use during training |
| random_state | Fix to a specific value to make results reproducible |
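The effect of max_features can be illustrated on synthetic data (a hypothetical comparison; the exact scores depend on the dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_features=1: each split considers one random feature -> very diverse trees.
# max_features=20: every split sees all features -> trees resemble a plain decision tree.
for mf in (1, 20):
    clf = RandomForestClassifier(n_estimators=100, max_features=mf,
                                 n_jobs=-1, random_state=0).fit(X_train, y_train)
    print(f'max_features={mf}: test accuracy {clf.score(X_test, y_test):.3f}')
```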
Usage
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# plot_class_regions_for_classifier_subplot is assumed to come from the course's
# adspy_shared_utilities helper module; it is not part of scikit-learn.
from adspy_shared_utilities import plot_class_regions_for_classifier_subplot

X_D2, y_D2 = make_blobs(n_samples=100, n_features=2, centers=8,
                        cluster_std=1.3, random_state=4)
y_D2 = y_D2 % 2  # fold the 8 blobs into 2 classes -> the "complex binary dataset"

X_train, X_test, y_train, y_test = train_test_split(X_D2, y_D2, random_state=0)

fig, subaxes = plt.subplots(1, 1, figsize=(6, 6))
clf = RandomForestClassifier().fit(X_train, y_train)
title = 'Random Forest Classifier, complex binary dataset, default settings'
plot_class_regions_for_classifier_subplot(clf, X_train, y_train,
                                          X_test, y_test, title, subaxes)
plt.show()
```
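A matching regression example (a sketch using a synthetic dataset from make_regression; any tabular regression data works the same way):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree predicts a value; the forest returns the average of those predictions.
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
print('test R^2:', reg.score(X_test, y_test))
```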