Udacity Intro to Machine Learning Notes: Decision Trees

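Decision trees are the third supervised learning algorithm covered in the course.

Just as an SVM uses the kernel trick to turn a simple linear decision surface into a nonlinear one, a decision tree can build a nonlinear decision boundary out of many simple linear decision surfaces.

Baidu Baike: a decision tree is a predictive model that represents a mapping between object attributes and object values. Each node in the tree represents an object, each branch represents a possible attribute value, and each leaf node corresponds to the value of the object represented by the path from the root to that leaf.

A decision tree is drawn by splitting the coordinate data repeatedly and finding the dividing lines. In machine learning, a decision tree learning algorithm uses the data to find these decision boundaries automatically.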
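To make the idea of repeated splits concrete, here is a minimal hand-written sketch. The function name, thresholds, and labels are made up for illustration and are not taken from the course data; a learning algorithm would choose the thresholds automatically.

```python
def classify_point(grade, bumpiness):
    """Hand-written decision tree over two features in [0, 1].

    Each test compares one feature with a threshold, so every single
    split is a straight, axis-parallel line. Nesting the tests yields
    a boundary that no single straight line could produce.
    """
    # Thresholds below are illustrative assumptions, not learned values.
    if grade < 0.5:            # first split: the line grade = 0.5
        if bumpiness < 0.7:    # second split, applied only where grade < 0.5
            return "fast"
        return "slow"
    if bumpiness < 0.3:        # a different threshold where grade >= 0.5
        return "fast"
    return "slow"


points = [(0.2, 0.5), (0.2, 0.9), (0.8, 0.1), (0.8, 0.5)]
labels = [classify_point(g, b) for g, b in points]
print(labels)  # ['fast', 'slow', 'fast', 'slow']
```

The resulting regions are rectangles, which is why tree decision boundaries look like staircases in plots.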


Decision tree Python code (sklearn)

Link: http://scikit-learn.org/stable/modules/tree.html

>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
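Once fitted, the classifier can label new samples. The sketch below repeats the two-sample fit as a script and adds a prediction step; the query point [2.0, 2.0] is an arbitrary choice for illustration.

```python
from sklearn import tree

X = [[0, 0], [1, 1]]   # two training samples with two features each
Y = [0, 1]             # their class labels

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

# Predict the class of a new, unseen sample.
prediction = clf.predict([[2.0, 2.0]])
print(prediction)      # [1]

# Fraction of training samples of each class in the leaf the sample lands in.
probabilities = clf.predict_proba([[2.0, 2.0]])
print(probabilities)   # [[0. 1.]]
```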

Coding exercise: class_vis.py and prep_terrain_data.py are the same as in the Naive Bayes exercise.

studentMain.py

#!/usr/bin/python

""" lecture and example code for decision tree unit """

import sys
from class_vis import prettyPicture, output_image
from prep_terrain_data import makeTerrainData

import matplotlib.pyplot as plt
import numpy as np
import pylab as pl
from classifyDT import classify

features_train, labels_train, features_test, labels_test = makeTerrainData()

### the classify() function in classifyDT is where the magic
### happens--fill in this function in the file 'classifyDT.py'!
clf = classify(features_train, labels_train)


#### grader code, do not modify below this line

prettyPicture(clf, features_test, labels_test)
output_image("test.png", "png", open("test.png", "rb").read())

classifyDT.py

def classify(features_train, labels_train):
	### your code goes here--should return a trained decision tree classifier
	from sklearn import tree
	clf = tree.DecisionTreeClassifier()
	clf = clf.fit(features_train,labels_train)
	return clf


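Long, narrow regions in the decision boundary are a sign of overfitting.

class_vis.py and prep_terrain_data.py are unchanged.

Code for measuring decision tree accuracy (result: 0.908):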

import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData

import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

#################################################################################

########################## DECISION TREE #################################


#### your code goes here
from sklearn import tree
clf=tree.DecisionTreeClassifier()
clf = clf.fit(features_train,labels_train)

pre = clf.predict(features_test)

from sklearn.metrics import accuracy_score
acc = accuracy_score(labels_test,pre)
### be sure to compute the accuracy on the test set
    
def submitAccuracies():
  return {"acc":round(acc,3)}

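Decision tree parameters

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

min_samples_split is the minimum number of samples a node must contain to be split; the default is 2.

For each node at the bottom of the tree, min_samples_split decides whether it may be split further: a node with fewer samples than this is left as a leaf.

With min_samples_split=50 the accuracy is 91.2%; with min_samples_split=2 it is 90.8%.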
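The effect of min_samples_split on tree size can be checked directly. This is a rough sketch on synthetic data from make_moons, used here as a stand-in because the course's terrain data is not reproduced in these notes; the exact node counts depend on the data and the random seed.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the course's terrain data).
X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)

sizes = {}
for min_split in (2, 50):
    clf = DecisionTreeClassifier(min_samples_split=min_split, random_state=0)
    clf.fit(X, y)
    # tree_.node_count is the total number of nodes in the fitted tree.
    sizes[min_split] = clf.tree_.node_count
    print(min_split, sizes[min_split], clf.get_depth())
```

A larger min_samples_split stops splitting earlier, so the tree has fewer nodes and a smoother decision boundary.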


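Entropy and impurity

What entropy does: it controls where the decision tree splits the data.

Definition: entropy is a measure of impurity in a bunch of examples.

Building a decision tree means finding split points on the variables that produce subsets that are as pure as possible. In effect, the way a decision tree makes its decisions is a recursive repetition of this process.

Example: when a road segment has a speed limit, every point on the YES side is a red cross regardless of grade, so the subset the arrow points to is pure.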


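Entropy formula

entropy = -Σ_i p_i * log₂(p_i)

where p_i is the fraction of all samples that belong to class i.

Entropy is the opposite of purity; for two classes it ranges from 0 to 1.

If all samples belong to the same class, the entropy is 0. If the samples are split evenly between two classes, the entropy is 1.0, its maximum for two classes.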
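The two boundary cases can be checked with a few lines of plain Python. This is a minimal sketch of the formula; the helper name entropy and the slow/fast labels are chosen here for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    result = 0.0
    for count in Counter(labels).values():
        p = count / total            # fraction of samples in this class
        result -= p * math.log2(p)   # accumulate -p * log2(p)
    return result


pure = entropy(["slow", "slow", "slow", "slow"])
balanced = entropy(["slow", "slow", "fast", "fast"])
print(pure)      # 0.0
print(balanced)  # 1.0
```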


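Example: four data points, each with three features (grade, bumpiness, and whether there is a speed limit) plus the label, which is the car's speed.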


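The entropy of the four samples is 1.0, since two are slow and two are fast.

Information gain = entropy(parent) - [weighted average] entropy(children), where the children are the nodes produced by splitting the parent.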


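The decision tree algorithm maximizes information gain. This is how it chooses which feature to split on and, when a feature can take many different values, where to split it, so that the resulting branches are as pure as possible.

Information gain determines which variable to split on. Start by computing it for grade.

The parent node holds four samples. Splitting on grade puts ssf (slow, slow, fast) on the left and f (fast) on the right; the right node has entropy 0 because it contains only one class.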


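Next, compute the entropy of the left node, where p_slow = 2/3 and p_fast = 1/3:

entropy(left) = -2/3 * log₂(2/3) - 1/3 * log₂(1/3) ≈ 0.9183

Then take the weighted average of the child entropies, as the information gain formula requires: entropy(children) = 3/4 * 0.9183 + 1/4 * 0 ≈ 0.6887

The information gain from splitting on grade is therefore entropy(parent) - entropy(children) = 1.0 - 0.6887 ≈ 0.3113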

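Information gain from splitting on bumpiness:

From the data above, splitting on bumpiness gives a left and a right child node that each contain one slow and one fast sample (sf).

Compute the entropy of the bumpy and smooth nodes:

entropy(bumpy) = -1/2 * log₂(1/2) - 1/2 * log₂(1/2) = 1

entropy(smooth) = -1/2 * log₂(1/2) - 1/2 * log₂(1/2) = 1

Weighted average of the child entropies: entropy(children) = 2/4 * 1 + 2/4 * 1 = 1

Information gain = 1 - 1 = 0

Information gain from splitting on speed limit: 1. Both child nodes are pure, so their weighted entropy is 0 and the gain is 1 - 0 = 1.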
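The three gains can be reproduced in plain Python. The original data table was an image that did not survive, so the four rows below are an assumed reconstruction chosen to match the splits described in the text (grade: ssf and f; bumpiness: sf and sf; speed limit: gain 1). The helper names are also chosen here for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    result = 0.0
    for count in Counter(labels).values():
        p = count / total
        result -= p * math.log2(p)
    return result


def information_gain(rows, feature, target="speed"):
    """entropy(parent) minus the weighted average entropy of the children."""
    parent = [row[target] for row in rows]
    children = defaultdict(list)
    for row in rows:
        # Group the labels by the value this row takes for the feature.
        children[row[feature]].append(row[target])
    weighted = sum(len(c) / len(rows) * entropy(c) for c in children.values())
    return entropy(parent) - weighted


# Assumed reconstruction of the four samples (not copied from the source).
rows = [
    {"grade": "steep", "bumpiness": "bumpy",  "speed_limit": "yes", "speed": "slow"},
    {"grade": "steep", "bumpiness": "smooth", "speed_limit": "yes", "speed": "slow"},
    {"grade": "flat",  "bumpiness": "bumpy",  "speed_limit": "no",  "speed": "fast"},
    {"grade": "steep", "bumpiness": "smooth", "speed_limit": "no",  "speed": "fast"},
]

gains = {f: information_gain(rows, f) for f in ("grade", "bumpiness", "speed_limit")}
for feature, gain in gains.items():
    print(feature, round(gain, 4))
```

Under these assumptions the gains come out at about 0.311 for grade, 0 for bumpiness, and 1 for speed limit, so the tree would split on speed limit first.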

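Bias and variance

A high-bias machine learning algorithm practically ignores the training data; it has almost no capacity to learn anything from it. This is what bias means. So if you train a biased car, it behaves the same way no matter how it is trained.

At the other extreme, the car is extremely sensitive to the data and can only reproduce what it has seen before. That is a very high-variance algorithm: it reacts poorly to situations it has not seen, because it lacks the bias it needs to generalize to something new.

The goal is to tune the parameters to balance bias and variance, so that the algorithm can generalize yet remains open to the training data and adjusts the model to it.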
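One way to see the trade-off is to compare training and test accuracy for trees of different flexibility. This is a rough sketch on synthetic make_moons data, not the course data; the three configurations, including the max_depth=1 stump, are choices made here for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the course's terrain data).
X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

configs = {
    "stump": {"max_depth": 1},                 # very rigid: high bias
    "full tree": {},                           # grows until leaves are pure: high variance
    "min_split_50": {"min_samples_split": 50}, # somewhere in between
}

scores = {}
for name, params in configs.items():
    clf = DecisionTreeClassifier(random_state=0, **params)
    clf.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each configuration.
    scores[name] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(name, scores[name])
```

The full tree memorizes the training set, so a large gap between its training and test accuracy points to high variance, while a stump that scores poorly on both points to high bias.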

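Decision tree pros and cons

Pro: easy to use and easy to understand.

Con: prone to overfitting, especially on data with a large number of features, where a complex tree can overfit. Tune the parameters carefully to avoid this; a tree whose nodes hold only a single data point is almost certainly overfit.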

dt_author_id

#!/usr/bin/python

""" 
    This is the code to accompany the Lesson 3 (decision tree) mini-project.

    Use a Decision Tree to identify emails from the Enron corpus by author:    
    Sara has label 0
    Chris has label 1
"""
    
import sys
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess

### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

#########################################################
### your code goes here ###
from sklearn import tree
clf = tree.DecisionTreeClassifier(min_samples_split=40)
clf = clf.fit(features_train,labels_train)
acc = clf.score(features_test,labels_test)
print(acc)
print(len(features_train[0]))
#########################################################

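Accuracy: 0.9789

Find the number of features in your data. The data is stored as a numpy array in which the number of rows is the number of data points and the number of columns is the number of features. One line of code extracts this number: len(features_train[0]) returns 3785.

Open tools/email_preprocess.py and find the line selector = SelectPercentile(f_classif, percentile=10). Change percentile from 10 to 1.

- How many features are there now? 379
- What do you think SelectPercentile does? All else being equal, does a larger percentile give a more complex or a simpler decision tree? It keeps the top n% of features by score, and a larger percentile gives a more complex tree.
- Note how the training time differs depending on the number of features.
- What is the accuracy when percentile is 1? 0.967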
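The feature counts above follow from how SelectPercentile works: it scores every feature and keeps only the top percentage. The sketch below uses random synthetic data rather than the Enron emails, so the scores are meaningless and only the number of columns kept is of interest.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic stand-in data (assumption: not the Enron email features).
rng = np.random.RandomState(0)
X = rng.rand(200, 100)            # 200 samples, 100 features
y = rng.randint(0, 2, size=200)   # random binary labels

kept = {}
for percentile in (10, 1):
    selector = SelectPercentile(f_classif, percentile=percentile)
    X_reduced = selector.fit_transform(X, y)
    # The number of columns left is the number of features kept.
    kept[percentile] = X_reduced.shape[1]
    print(percentile, kept[percentile])
```

With fewer features there are fewer candidate splits, which is why a smaller percentile tends to give a simpler tree and a shorter training time.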
