Python Machine Learning Models (Continued)
Models I forgot to cover in the previous post:
Logistic Regression, Decision Tree
singapp.hatenablog.com
Logistic Regression
A regression model used for classification: it passes a linear combination of the explanatory variables through the (nonlinear) sigmoid function and returns a value between 0 and 1, which can be interpreted as a probability.
This is used for binary classification: predict class 1 when that probability exceeds a chosen threshold.
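To make the probability output and thresholding concrete, here is a minimal sketch; make_classification is assumed purely as an illustrative data source:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data, purely for illustration
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X[:5])[:, 1]  # estimated P(y=1) for the first 5 samples
print(proba)
print((proba >= 0.5).astype(int))       # thresholding at 0.5 gives class labels
print(clf.predict(X[:5]))               # predict() applies the same 0.5 threshold
```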
Structure
sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
from sklearn.linear_model import LogisticRegression
Parameters
| Parameter | Description |
| --- | --- |
| penalty | Type of regularization applied to the coefficients: 'l1' or 'l2'. Without a penalty, coefficients can grow to match the raw scale of each explanatory variable, so their size is constrained. L1 penalizes the absolute values of the coefficients and drives the weights of uninformative features to exactly zero, yielding a sparse model. (Default: 'l2') |
| C | Inverse of the regularization strength; it works in the opposite direction of alpha in Ridge/Lasso. Smaller: more regularization, simpler model. Larger: less regularization, more complex model, risk of overfitting. (Default: 1.0; see the sketch after the sample code below) |
| random_state | Seed for the random number generator; fix it to one value so results are reproducible. |
Sample
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X_fruits, y_fruits: feature matrix and class labels (the fruits dataset)
X_train, X_test, y_train, y_test = train_test_split(X_fruits, y_fruits, random_state=0)
this_C = 1.0  # inverse regularization strength (try e.g. 0.1, 1.0, 100.0)
clf = LogisticRegression(C=this_C).fit(X_train, y_train)
```
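To see the effect of C, here is a minimal sketch; the built-in breast cancer dataset is assumed purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in [0.01, 1.0, 100.0]:
    # max_iter is raised so the lbfgs solver converges on this data
    clf = LogisticRegression(C=C, max_iter=10000).fit(X_train, y_train)
    print(f"C={C}: train={clf.score(X_train, y_train):.3f}, "
          f"test={clf.score(X_test, y_test):.3f}")
```

Smaller C should give a simpler model; larger C fits the training data more closely and can overfit.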
Decision Tree
The tree splits the data into groups through successive conditional branches. A split scores well when it cleanly separates one group from the rest, which is measured by impurity (the lower the impurity of the resulting nodes, the better the split).
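To make the impurity idea concrete, a small sketch that computes the Gini impurity 1 - Σ p_i² of a node before and after a perfect split; the labels are made up for illustration:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2); 0 means the node is pure."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

parent = [0, 0, 0, 0, 1, 1, 1, 1]          # mixed node -> high impurity
left, right = [0, 0, 0, 0], [1, 1, 1, 1]   # perfect split -> both children pure
print(gini(parent))             # 0.5
print(gini(left), gini(right))  # 0.0 0.0
```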
Structure
sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort='deprecated', ccp_alpha=0.0)
from sklearn.tree import DecisionTreeClassifier
Main Parameters
| Parameter | Description |
| --- | --- |
| criterion | Function that measures split quality: 'gini' or 'entropy'. (Default: 'gini') |
| max_depth | Maximum depth of the tree; too deep leads to overfitting (see the sketch after this table). (Default: None) |
| min_samples_split | Minimum number of samples required to split an internal node. (Default: 2) |
| min_samples_leaf | Minimum number of samples required at a leaf node. (Default: 1) |
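As a quick illustration of how max_depth trades off fit against generalization, a sketch using the same iris data as the sample below:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

for depth in [1, 3, None]:  # None lets the tree grow until all leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train={clf.score(X_train, y_train):.3f}, "
          f"test={clf.score(X_test, y_test):.3f}")
```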
For the other parameters, see:
scikit-learn.org
Sample
```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=3)
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=8, random_state=0).fit(X_train, y_train)
```
Visualization
sklearn.tree.export_graphviz(decision_tree, out_file=None, max_depth=None, feature_names=None, class_names=None, label='all', filled=False, leaves_parallel=False, impurity=True, node_ids=False, proportion=False, rotate=False, rounded=False, special_characters=False, precision=3)
```python
import pydotplus
from io import StringIO  # in-memory text buffer for the DOT source
from IPython.display import Image
from sklearn.tree import export_graphviz

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, feature_names=iris.feature_names, max_depth=3)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("graph.pdf")   # save the tree as a PDF
Image(graph.create_png())      # display the tree inline in a notebook
```
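As a side note, scikit-learn 0.21+ also provides sklearn.tree.plot_tree, which renders the tree with matplotlib and avoids the graphviz/pydotplus dependency; a minimal sketch reusing clf and iris from the sample above:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig, ax = plt.subplots(figsize=(12, 8))
plot_tree(clf, feature_names=iris.feature_names,
          class_names=list(iris.target_names), filled=True, ax=ax)
plt.show()
```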
Feature importance
An attribute that returns how important each feature was to the fitted model:
clf.feature_importances_
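For example, pairing each importance with its feature name from the iris sample above (the values sum to 1):

```python
# clf and iris come from the Decision Tree sample above
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```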