Python Machine Learning Models (Continued)
Models I forgot to cover in the previous post:
Logistic Regression, Decision Tree
singapp.hatenablog.com
Logistic Regression
A regression model used for classification: it passes a linear combination of the explanatory variables through the (nonlinear) sigmoid function and returns a value between 0 and 1, which can be interpreted as a probability.
This is used for binary classification: predict class 1 when that probability exceeds a chosen threshold.
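To make the probability output and thresholding concrete, here is a minimal sketch; make_classification is assumed purely as an illustrative data source:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data, purely for illustration
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X[:5])[:, 1]  # estimated P(y=1) for the first 5 samples
print(proba)
print((proba >= 0.5).astype(int))       # thresholding at 0.5 gives class labels
print(clf.predict(X[:5]))               # predict() applies the same 0.5 threshold
```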
Structure
sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
from sklearn.linear_model import LogisticRegression
Parameters
| Parameter | Description |
| --- | --- |
| penalty | Type of regularization applied to the coefficients: 'l1' or 'l2'. Without a penalty, coefficients can grow to match the raw scale of each explanatory variable, so their size is constrained. L1 penalizes the absolute values of the coefficients and drives the weights of uninformative features to exactly zero, yielding a sparse model. (Default: 'l2') |
| C | Inverse of the regularization strength; it works in the opposite direction of alpha in Ridge/Lasso. Smaller: more regularization, simpler model. Larger: less regularization, more complex model, risk of overfitting. (Default: 1.0; see the sketch after the sample code below) |
| random_state | Seed for the random number generator; fix it to one value so results are reproducible. |
Sample
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X_fruits, y_fruits: feature matrix and class labels (the fruits dataset)
X_train, X_test, y_train, y_test = train_test_split(X_fruits, y_fruits, random_state=0)
this_C = 1.0  # inverse regularization strength (try e.g. 0.1, 1.0, 100.0)
clf = LogisticRegression(C=this_C).fit(X_train, y_train)
```
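To see the effect of C, here is a minimal sketch; the built-in breast cancer dataset is assumed purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in [0.01, 1.0, 100.0]:
    # max_iter is raised so the lbfgs solver converges on this data
    clf = LogisticRegression(C=C, max_iter=10000).fit(X_train, y_train)
    print(f"C={C}: train={clf.score(X_train, y_train):.3f}, "
          f"test={clf.score(X_test, y_test):.3f}")
```

Smaller C should give a simpler model; larger C fits the training data more closely and can overfit.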
Decision Tree
The tree splits the data into groups through successive conditional branches. A split scores well when it cleanly separates one group from the rest, which is measured by impurity (the lower the impurity of the resulting nodes, the better the split).
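To make the impurity idea concrete, a small sketch that computes the Gini impurity 1 - Σ p_i² of a node before and after a perfect split; the labels are made up for illustration:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2); 0 means the node is pure."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

parent = [0, 0, 0, 0, 1, 1, 1, 1]          # mixed node -> high impurity
left, right = [0, 0, 0, 0], [1, 1, 1, 1]   # perfect split -> both children pure
print(gini(parent))             # 0.5
print(gini(left), gini(right))  # 0.0 0.0
```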
Structure
sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort='deprecated', ccp_alpha=0.0)
from sklearn.tree import DecisionTreeClassifier
Main Parameters
| Parameter | Description |
| --- | --- |
| criterion | Function that measures split quality: 'gini' or 'entropy'. (Default: 'gini') |
| max_depth | Maximum depth of the tree; too deep leads to overfitting (see the sketch after this table). (Default: None) |
| min_samples_split | Minimum number of samples required to split an internal node. (Default: 2) |
| min_samples_leaf | Minimum number of samples required at a leaf node. (Default: 1) |
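As a quick illustration of how max_depth trades off fit against generalization, a sketch using the same iris data as the sample below:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

for depth in [1, 3, None]:  # None lets the tree grow until all leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train={clf.score(X_train, y_train):.3f}, "
          f"test={clf.score(X_test, y_test):.3f}")
```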
For the other parameters, see:
scikit-learn.org
Sample
```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=3)
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=8, random_state=0).fit(X_train, y_train)
```
Visualization
sklearn.tree.export_graphviz(decision_tree, out_file=None, max_depth=None, feature_names=None, class_names=None, label='all', filled=False, leaves_parallel=False, impurity=True, node_ids=False, proportion=False, rotate=False, rounded=False, special_characters=False, precision=3)
```python
import pydotplus
from io import StringIO  # in-memory text buffer for the DOT source
from IPython.display import Image
from sklearn.tree import export_graphviz

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, feature_names=iris.feature_names, max_depth=3)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("graph.pdf")   # save the tree as a PDF
Image(graph.create_png())      # display the tree inline in a notebook
```
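As a side note, scikit-learn 0.21+ also provides sklearn.tree.plot_tree, which renders the tree with matplotlib and avoids the graphviz/pydotplus dependency; a minimal sketch reusing clf and iris from the sample above:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig, ax = plt.subplots(figsize=(12, 8))
plot_tree(clf, feature_names=iris.feature_names,
          class_names=list(iris.target_names), filled=True, ax=ax)
plt.show()
```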
Feature importance
An attribute that returns how important each feature was to the fitted model:
clf.feature_importances_
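For example, pairing each importance with its feature name from the iris sample above (the values sum to 1):

```python
# clf and iris come from the Decision Tree sample above
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```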