Data Mining Algorithms (2) -- Decision Trees

Date: 2021-02-15 21:55:26

Index of data mining algorithm study notes
Data Mining Algorithms (1) -- K-Nearest Neighbors (KNN)
Data Mining Algorithms (2) -- Decision Trees
Data Mining Algorithms (3) -- Logistic Regression

Algorithm Introduction

First, let's look at the following data:

Age    Education    Income  Status   Purchase?
36-55  master's     high    single   will buy
18-35  high school  low     single   will buy
36-55  master's     low     single   won't buy
18-35  bachelor's   high    single   won't buy
gt 18  high school  low     single   will buy
18-35  bachelor's   high    married  won't buy

Suppose we want to classify this data: based on age, education, income, and marital status, decide whether a person will buy. Let's first split the rows above by marital status:

# Single
Age    Education    Income  Status   Purchase?
36-55  master's     high    single   will buy
18-35  high school  low     single   will buy
36-55  master's     low     single   won't buy
18-35  bachelor's   high    single   won't buy
gt 18  high school  low     single   will buy

# Married
Age    Education    Income  Status   Purchase?
18-35  bachelor's   high    married  won't buy

Next, we can split the single group further by income:

# High income
Age    Education    Income  Status   Purchase?
18-35  bachelor's   high    single   won't buy
36-55  master's     high    single   will buy

# Low income
Age    Education    Income  Status   Purchase?
18-35  high school  low     single   will buy
36-55  master's     low     single   won't buy
gt 18  high school  low     single   will buy

We then keep splitting each subset by the remaining features (such as age or education) until every group of rows is fully classified. The result is a tree-shaped structure.
(Figure: the decision tree built from the data above)

This splitting process is exactly how a decision tree works. At every split we have to pick one feature (age, education, income, or marital status) to split on. Which one should we pick? This is where entropy comes in as the selection criterion.

Information entropy is a measure of the uncertainty of a random variable. For a discrete random variable X with probability mass function p(x), the entropy of X is defined as H(X) = -Σ_x p(x) log₂ p(x). With a base-2 logarithm the entropy is measured in bits; with a base-e logarithm the unit is the nat. On average, it is the number of bits needed to describe the random variable.
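For example, the six sample rows above contain 3 "will buy" and 3 "won't buy" labels, so the entropy of the class label is

H = -(3/6)·log₂(3/6) - (3/6)·log₂(3/6) = 1 bit,

the maximum possible uncertainty for a two-class label.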

So how do we put entropy to work? That is the job of the ID3 algorithm.

ID3 starts with the original data set S as the root node. On each iteration it loops over every unused feature of S and computes the entropy H(S) (or the information gain IG(S)) that would result from splitting on it. It then selects the attribute with the smallest entropy (equivalently, the largest information gain) and partitions S by the values of that attribute (for example, high income and low income) to produce subsets of the data. The algorithm then recurses on each subset, considering only the attributes that have never been selected before.

The algorithm flow:
1. Compute the entropy of every attribute of the data set S.
2. Split S into subsets using the attribute that yields the smallest post-split entropy (equivalently, the largest information gain); a small numeric example follows this list.
3. Make a decision tree node containing that attribute.
4. Recurse on each subset with the remaining attributes.
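As a concrete illustration of steps 1 and 2, here is a small sketch that computes the information gain of splitting the six sample rows on Status. The counts are taken from the table above; on the full data set the numbers will of course differ.

import math

def entropy(counts):
    # entropy (in bits) of a label distribution, given as a list of counts
    total = sum(counts)
    return -sum(c / total * math.log(c / total, 2) for c in counts if c > 0)

base = entropy([3, 3])    # 3 will buy vs 3 won't buy -> 1.0 bit
# Status = single: 5 rows (3 buy, 2 not); Status = married: 1 row (0 buy, 1 not)
after = 5 / 6 * entropy([3, 2]) + 1 / 6 * entropy([0, 1])
print(base - after)       # information gain, roughly 0.19 bits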

ID3 does not guarantee an optimal solution; it can get stuck in a local optimum. It is greedy: on each iteration it picks the locally best attribute to split the data set.
ID3 can overfit the training data. To reduce overfitting, smaller decision trees should be preferred over larger ones. The algorithm usually produces small trees, but not always the smallest possible tree.
ID3 is hard to use with continuous data. If the values of an attribute are continuous, there are many more places to split the data on that attribute, and searching for the best split point can be time-consuming; a sketch of the usual threshold-based workaround follows below.
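The original post does not show how to handle a continuous attribute. The usual workaround (the approach taken by C4.5 and CART, not part of plain ID3) is to binarize it on a threshold: sort the values, try the midpoint between each pair of consecutive values, and keep the threshold with the lowest post-split entropy. A minimal sketch, using made-up numeric ages purely for illustration:

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log(c / total, 2) for c in Counter(labels).values())

def best_threshold(values, labels):
    # try the midpoint between each pair of consecutive sorted values
    pairs = sorted(zip(values, labels))
    best_t, best_h = None, float('inf')
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if h < best_h:
            best_t, best_h = t, h
    return best_t, best_h

ages = [19, 23, 30, 41, 52, 60]   # made-up ages
buys = ["won't buy", "won't buy", "won't buy", 'will buy', 'will buy', 'will buy']
print(best_threshold(ages, buys))  # (35.5, 0.0): a perfect split at age 35.5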

Code Implementation

Environment used for the code in this article:
python: 3.5.1
pandas: 0.19.2
sklearn: 0.18.1
Other environments may behave slightly differently.

Step 1: read the data.

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import operator
import math

# read the data
dataSet = pd.read_csv("./tree.txt", sep='\t')
labels = list(dataSet)   # names of all columns
labels = labels[0:-1]    # feature names only (drop the label column)
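tree.txt itself is not shown in the post (there is a download link at the end). If you just want to follow along, you can build a small DataFrame from the six sample rows in the table above and pass it to the same functions; this is only a stand-in for the real file:

import pandas as pd

# the six sample rows from the table above, as a stand-in for tree.txt
dataSet = pd.DataFrame(
    [["36-55", "master's",    "high", "single",  "will buy"],
     ["18-35", "high school", "low",  "single",  "will buy"],
     ["36-55", "master's",    "low",  "single",  "won't buy"],
     ["18-35", "bachelor's",  "high", "single",  "won't buy"],
     ["gt 18", "high school", "low",  "single",  "will buy"],
     ["18-35", "bachelor's",  "high", "married", "won't buy"]],
    columns=["Age", "Education", "Income", "Status", "Purchase?"])
labels = list(dataSet)[0:-1]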

Next, compute the entropy using the Shannon entropy formula given earlier.

def calcShannonEnt(dataFrame):
    numEntries = len(dataFrame)
    labelCounts = {}
    # count how many rows fall into each class label (the last column)
    for index, row in dataFrame.iterrows():
        currentLabel = row.iloc[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * math.log(prob, 2)  # log base 2
    return shannonEnt
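As a quick check: on the six sample rows sketched above (3 "will buy", 3 "won't buy") this returns 1.0, matching the hand calculation earlier; on the full tree.txt the value will differ.

print(calcShannonEnt(dataSet))  # 1.0 on the six sample rows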

Splitting the DataFrame:

def splitDataSet(dataFrame, column_name, value):
    # keep only the rows where column_name equals value, then drop that column
    retDataSet = dataFrame[dataFrame[column_name] == value]
    retDataSet = retDataSet.drop(column_name, axis=1)  # axis=1 drops a column, axis=0 drops rows
    return retDataSet
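For example, on the six sample rows, splitting out the married subset leaves a single row with the Status column removed:

print(splitDataSet(dataSet, 'Status', 'married'))
# one row (18-35, bachelor's, high, won't buy), without the Status column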

Choose the best feature to split on and return its column name:

def chooseBestFeatureToSplit(dataFrame):
    baseEntropy = calcShannonEnt(dataFrame)
    bestInfoGain = -100  # start below any possible gain so the first feature is always considered
    bestFeature = ""
    FeatureNames = list(dataFrame)
    for FeatureName in FeatureNames[0:-1]:  # the last column is the label
        featList = dataFrame[FeatureName]   # all values of this feature
        uniqueVals = set(featList)          # the distinct values of this feature
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataFrame, FeatureName, value)
            prob = subDataSet.shape[0] / dataFrame.shape[0]
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:  # keep the feature with the largest gain so far
            bestInfoGain = infoGain
            bestFeature = FeatureName
    return bestFeature  # return the column name


def majorityCnt(classList):
    # return the most frequent class label (majority vote)
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
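A quick sanity check on the six sample rows sketched earlier: Education separates the labels best (information gain 1 - 1/3 ≈ 0.67 bits), so the function picks it there. On the full tree.txt the author's run chose Status at the root, as the printed tree later in the post shows.

print(chooseBestFeatureToSplit(dataSet))  # 'Education' on the six sample rows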

# build the decision tree recursively
def createTree(dataFrame):
    classList = dataFrame.iloc[:, -1]  # the class labels (last column)
    if len(classList.unique()) == 1:   # stop when every remaining row has the same label
        return classList.unique()[0]
    if len(list(dataFrame)) == 2:      # only one feature column left besides the label: majority vote
        return majorityCnt(classList)
    bestFeatLabel = chooseBestFeatureToSplit(dataFrame)
    myTree = {bestFeatLabel: {}}
    featValues = dataFrame[bestFeatLabel]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        DataSubset = splitDataSet(dataFrame, bestFeatLabel, value)
        myTree[bestFeatLabel][value] = createTree(DataSubset)
    return myTree


def classify(inputTree, featLabels, testVec):
    keys_list = list(inputTree.keys())
    firstStr = keys_list[0]                 # the feature this node splits on
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)  # position of that feature in the test vector
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):       # an internal node: keep descending
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:                                   # a leaf: this is the predicted label
        classLabel = valueOfFeat
    return classLabel

# build the tree and classify a new record
tree = createTree(dataSet)
print(classify(tree, labels, ["36-55", "master's", "high", "single"]))

The resulting tree is stored in a nested dict; here it is pretty-printed to make the structure easier to read:

{
    'Status': {
        'married': {
            'Income': {
                'high': {
                    'Education': {
                        'highschool': 'willbuy',
                        "master's": 'willbuy',
                        "bachelor's": 'willbuy'
                    }
                },
                'low': {
                    'Education': {
                        'highschool': 'willbuy',
                        "master's": "won't buy",
                        "bachelor's": 'willbuy'
                    }
                }
            }
        },
        'single': {
            'Income': {
                'high': {
                    'Age': {
                        '36-55': 'willbuy',
                        'gt55': "won't buy",
                        'lt18': "won't buy"
                    }
                },
                'low': {
                    'Education': {
                        'highschool': "won't buy",
                        "bachelor's": 'willbuy'
                    }
                }
            }
        }
    }
}
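Since scikit-learn is listed in the environment above but never used, here is an optional sketch of the same task with sklearn's DecisionTreeClassifier. Note that sklearn builds binary CART-style trees and needs numeric inputs, so the categorical columns are one-hot encoded first; the resulting tree will not match the dict above exactly, and the column names below are simply taken from the sample table.

from sklearn.tree import DecisionTreeClassifier

# one-hot encode the categorical feature columns; keep the label column as-is
X = pd.get_dummies(dataSet.drop('Purchase?', axis=1))
y = dataSet['Purchase?']

clf = DecisionTreeClassifier(criterion='entropy')  # entropy criterion, in the spirit of ID3
clf.fit(X, y)

# classify the same record used above, encoded with the same dummy columns
test = pd.DataFrame([["36-55", "master's", "high", "single"]],
                    columns=['Age', 'Education', 'Income', 'Status'])
test = pd.get_dummies(test).reindex(columns=X.columns, fill_value=0)
print(clf.predict(test))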

Download link for the code and test data:
Link: http://pan.baidu.com/s/1hsOTpfm  password: en6g

References:
1. Machine Learning in Action (《机器学习实战》)
2. http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=1