PyInv

Notes on programming and overseas investing

Python Machine Learning Models (continued)

Models I forgot to cover last time:

Logistic Regression, Decision Tree

singapp.hatenablog.com

Logistic Regression

A kind of nonlinear regression. Feeding in the explanatory variables yields a value between 0 and 1, which can be read as a probability.
This is used for binary classification: anything above a chosen probability threshold is classified as 1.
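The idea above can be sketched directly: a linear score is squashed through the sigmoid into a probability, then thresholded. The weights and sample below are made-up values for illustration only.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative (hypothetical) weights, bias, and one sample
w, b = np.array([1.5, -2.0]), 0.3
x = np.array([0.8, 0.1])

p = sigmoid(w @ x + b)   # probability of class 1
label = int(p >= 0.5)    # binary classification by thresholding at 0.5
```

scikit-learn performs the same thresholding internally; `predict_proba` exposes the probabilities and `predict` returns the thresholded labels.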

Structure

sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

from sklearn.linear_model import LogisticRegression


Parameters

Parameter: Description

penalty
  Regularization norm, 'l1' or 'l2'. Regularization is needed because features with larger raw scales would otherwise dominate the model's weights. L1 uses absolute values and drives the weights of unimportant features to zero.
  (Default: 'l2')
C
  Inverse of regularization strength; works in the opposite direction from alpha in Ridge/Lasso.
  Lower: more regularization, simpler model
  Larger: less regularization, more complex model -> overfitting
  (Default: 1.0)
random_state
  Seed for the random number generator.
  Use the same seed each run for reproducible results.
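The effect of C can be seen by comparing the fitted coefficients: a small C (strong regularization) shrinks them toward zero compared with a large C. The synthetic dataset below is only for demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Smaller C = stronger regularization = coefficients shrink toward zero
small_c = LogisticRegression(C=0.01).fit(X, y)
large_c = LogisticRegression(C=100.0).fit(X, y)

print(np.abs(small_c.coef_).sum())  # smaller total coefficient magnitude
print(np.abs(large_c.coef_).sum())  # larger total coefficient magnitude
```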

Sample

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X_fruits, y_fruits: feature matrix and labels prepared beforehand
X_train, X_test, y_train, y_test = train_test_split(X_fruits, y_fruits, random_state=0)
clf = LogisticRegression(C=1.0).fit(X_train, y_train)




Decision Tree

Splits the data into groups through successive conditional branches. A split scores well when a single branch cleanly isolates one group; this is measured by impurity.
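As a concrete sketch of impurity, here is the Gini measure (scikit-learn's default split criterion) computed by hand: 1 minus the sum of squared class proportions, so a pure node scores 0.

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions.
    # 0.0 means the node contains only one class (perfectly pure).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

gini([0, 0, 0, 0])   # pure node -> 0.0
gini([0, 0, 1, 1])   # 50/50 split -> 0.5 (maximally impure for 2 classes)
```

The tree picks, at each node, the split that reduces this impurity the most.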

Structure

sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort='deprecated', ccp_alpha=0.0)

from sklearn.tree import DecisionTreeClassifier


Main Parameters

Parameter: Description

criterion
  Split quality measure, 'gini' or 'entropy'
  (Default: 'gini')
max_depth
  Maximum depth of the tree; too deep leads to overfitting
  (Default: None)
min_samples_split
  Minimum number of samples required to split a node
  (Default: 2)
min_samples_leaf
  Minimum number of samples required at a leaf node
  (Default: 1)
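The overfitting risk from max_depth can be demonstrated on the iris dataset: an unconstrained tree fits the training set perfectly, while capping the depth trades training accuracy for a simpler model.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

# Unconstrained tree: memorizes the training set (risk of overfitting)
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Depth-capped tree: simpler, slightly lower training accuracy
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

print(deep.score(X_train, y_train))     # perfect on training data
print(shallow.score(X_train, y_train))  # below perfect on training data
```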

For the other parameters, see
scikit-learn.org


Sample

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split


iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=3)
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=8,
                             random_state=0).fit(X_train, y_train)

Visualization

sklearn.tree.export_graphviz(decision_tree, out_file=None, max_depth=None, feature_names=None, class_names=None, label='all', filled=False, leaves_parallel=False, impurity=True, node_ids=False, proportion=False, rotate=False, rounded=False, special_characters=False, precision=3)

import pydotplus
from io import StringIO  # sklearn.externals.six is deprecated; use the stdlib StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz

# Write the fitted tree (clf from the sample above) to DOT format, then render it
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, feature_names=iris.feature_names, max_depth=3)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("graph.pdf")
Image(graph.create_png())



Feature importance

An attribute that returns how important each feature was in the fitted tree.

clf.feature_importances_
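For example, on the iris tree fitted above, the importances form one value per feature and sum to 1.0:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# One importance value per feature; the values sum to 1.0
for name, imp in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```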