This article walks through an example implementation of the random forest algorithm in Python, shared for your reference. The details are as follows:
Random forest is a very widely used classification and prediction algorithm in data mining, built on classification or regression decision trees as base learners. Some key points of the algorithm:
* From a dataset of size m, draw a bootstrap sample of the same size m (sampling with replacement);
* Randomly sample from the K features to form a feature subset; the subset size is commonly chosen as the square root or the natural logarithm of K;
* Each tree is grown to its full depth, without pruning;
* The prediction for each sample is produced by a majority vote over the predictions of all trees (for regression, the average of the trees' leaf values).
The documentation of the well-known Python machine learning package scikit-learn gives a fairly detailed introduction to this algorithm: http://scikit-learn.org/stable/modules/ensemble.html#random-forests
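For comparison with the hand-rolled version built later in this article, the same kind of model can be fit in a few lines with scikit-learn. This is only a minimal sketch: `X_train`, `y_train` and `X_test` are placeholders that would have to be prepared first, not objects from the pipeline below.

```python
# Minimal sketch with scikit-learn (X_train, y_train, X_test are placeholders,
# assumed to be numeric arrays/DataFrames prepared beforehand).
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_features='sqrt',  # size of the random feature subset tried at each split
    bootstrap=True,       # bootstrap-sample the training set for each tree
)
rf.fit(X_train, y_train)
predicted = rf.predict(X_test)
```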
For personal study and testing, I build and evaluate a model on the classic Kaggle 101 dataset of Titanic passengers. The competition page and the data download are here: https://www.kaggle.com/c/titanic
The sinking of the Titanic is one of the most famous maritime disasters in history. It suddenly struck me that I was no longer facing cold, lifeless numbers, but using data mining methods to study a concrete historical question, which made the exercise all the more interesting. Back to the point: the main goal of the model is to predict, from a series of features of each passenger, such as sex, age, cabin class, and port of embarkation, whether that passenger survived. It is a very typical binary classification problem. The field names and a few sample rows of the dataset are shown below:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S
It is worth explaining that SibSp stands for siblings/spouses, i.e. the number of siblings and spouses travelling with a passenger, and Parch stands for parents/children, i.e. the number of parents and children travelling with them.
The whole data processing and modeling workflow is given below, built on Ubuntu + Python 3.4 (the Anaconda scientific computing environment already bundles the commonly used packages such as pandas, numpy and sklearn, and is strongly recommended here).
I was too lazy to switch input methods while writing, so most of the comments were written in English; the Chinese ones were added afterwards :-)
```python
# -*- coding: utf-8 -*-
"""
@author: kim
"""
import math
import pandas as pd
from model import *  # load the base classifier code (tree_grow, model_prediction)

# ETL: same procedure for the training set and the test set
training = pd.read_csv('train.csv', index_col=0)
test = pd.read_csv('test.csv', index_col=0)
SexCode = pd.DataFrame([1, 0], index=['female', 'male'], columns=['Sexcode'])  # encode sex as 0/1
training = training.join(SexCode, how='left', on='Sex')
# drop the variables that do not enter the model: name, ticket number, port, cabin, raw sex
training = training.drop(['Name', 'Ticket', 'Embarked', 'Cabin', 'Sex'], axis=1)
test = test.join(SexCode, how='left', on='Sex')
test = test.drop(['Name', 'Ticket', 'Embarked', 'Cabin', 'Sex'], axis=1)
print('ETL IS DONE!')

# MODEL FITTING
# ===============PARAMETER ADJUSTMENT============
min_leaf = 1
min_dec_gini = 0.0001
n_trees = 5
n_fea = int(math.sqrt(len(training.columns) - 1))  # size of the random feature subset
# ===============================================
'''
BEST SCORE: 0.83
min_leaf=30
min_dec_gini=0.001
n_trees=20
'''

# ENSEMBLE BY RANDOM FOREST
FOREST = {}
tmp = list(training.columns)
tmp.pop(tmp.index('Survived'))
feaList = pd.Series(tmp)
for t in range(n_trees):
    feasample = feaList.sample(n=n_fea, replace=False)  # select a random feature subset
    fea = feasample.tolist()
    fea.append('Survived')
    subset = training.sample(n=len(training), replace=True)  # bootstrap sample with replacement
    subset = subset[fea]
    FOREST[t] = tree_grow(subset, 'Survived', min_leaf, min_dec_gini)  # save the tree

# MODEL PREDICTION
# ======================
currentdata = training
output = 'submission_rf_20151116_30_0.001_20'
# ======================
prediction = {}
for r in currentdata.index:  # one passenger at a time
    prediction_vote = {1: 0, 0: 0}
    row = currentdata.loc[[r]]  # a one-row DataFrame
    for n in range(n_trees):
        tree_dict = FOREST[n]  # a tree
        p = model_prediction(tree_dict, row)
        prediction_vote[p] += 1
    vote = pd.Series(prediction_vote)
    prediction[r] = vote.sort_values(ascending=False).index[0]  # the vote result
result = pd.Series(prediction, name='Survived_p')
# result.to_csv(output)
t = training.join(result, how='left')
accuracy = round(len(t[t['Survived'] == t['Survived_p']]) / len(t), 5)  # in-sample accuracy
print(accuracy)
```
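Note that the accuracy printed above is measured on the training set itself (`currentdata = training`), so it is optimistic. To produce a file for Kaggle, the same voting loop can be pointed at the test set and the votes written out. A minimal sketch, assuming the loop above has been re-run with `currentdata = test` so that `result` is indexed by PassengerId:

```python
# Sketch of writing a submission file (assumes the voting loop was re-run
# with currentdata = test, so `result` is indexed by PassengerId).
submission = result.rename('Survived').astype(int)
submission.to_csv(output + '.csv', header=True, index_label='PassengerId')
```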
The code above is the random forest itself. As described earlier, a random forest is an ensemble of decision trees. At every split, each tree uses the Gini index to measure the "impurity" of the current node: if partitioning the dataset at some split point of some feature produces the largest decrease in Gini (i.e. the largest reduction in the impurity of the target variable), that feature and split point are chosen as the current best split. The code is as follows:
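For a binary target, the impurity that `gini_cal` computes and the impurity decrease that `best_split_col` maximizes can be written as

$$\mathrm{Gini}(D) = 1 - p_0^2 - p_1^2, \qquad \Delta = \mathrm{Gini}(D) - \frac{|D_L|}{|D|}\,\mathrm{Gini}(D_L) - \frac{|D_R|}{|D|}\,\mathrm{Gini}(D_R),$$

where $p_0$ and $p_1$ are the proportions of the two classes in node $D$, and $D_L$, $D_R$ are the left and right subsets produced by a candidate split.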
```python
# -*- coding: utf-8 -*-
"""
@author: kim
"""
import pandas as pd
import numpy as np
import math


def tree_grow(dataframe, target, min_leaf, min_dec_gini):
    '''Recursively grow a CART-style classification tree stored as a nested dict.'''
    tree = {}  # a new (sub)tree
    is_not_leaf = (len(dataframe) > min_leaf)
    if is_not_leaf:
        fea, sp, gd = best_split_col(dataframe, target)
        if gd > min_dec_gini:
            tree['fea'] = fea
            tree['val'] = sp
            l, r = dataSplit(dataframe, fea, sp)
            # note: drop() returns a copy, so l and r still contain the split
            # feature and it may be reused further down the tree
            l.drop(fea, axis=1)
            r.drop(fea, axis=1)
            tree['left'] = tree_grow(l, target, min_leaf, min_dec_gini)
            tree['right'] = tree_grow(r, target, min_leaf, min_dec_gini)
        else:  # impurity decrease too small: return a leaf
            return leaf(dataframe[target])
    else:  # node too small: return a leaf
        return leaf(dataframe[target])
    return tree


def leaf(class_label):
    '''Return the majority class among the labels that reach this leaf.'''
    tmp = {}
    for i in class_label:
        if i in tmp:
            tmp[i] += 1
        else:
            tmp[i] = 1
    s = pd.Series(tmp).sort_values(ascending=False)
    return s.index[0]


def gini_cal(class_label):
    '''Gini impurity for a binary 0/1 label vector.'''
    p_1 = sum(class_label) / len(class_label)
    p_0 = 1 - p_1
    gini = 1 - (pow(p_0, 2) + pow(p_1, 2))
    return gini


def dataSplit(dataframe, split_fea, split_val):
    '''Split the rows on split_fea <= split_val (left) vs. > split_val (right).'''
    left_node = dataframe[dataframe[split_fea] <= split_val]
    right_node = dataframe[dataframe[split_fea] > split_val]
    return left_node, right_node


def best_split_col(dataframe, target_name):
    '''Search every feature and every unique value for the split with the largest Gini decrease.'''
    best_fea = ''
    best_split_point = 0
    col_list = list(dataframe.columns)
    col_list.remove(target_name)
    gini_0 = gini_cal(dataframe[target_name])  # impurity before splitting
    n = len(dataframe)
    gini_dec = -99999999
    for col in col_list:
        node = dataframe[[col, target_name]]
        unique = node.groupby(col).count().index  # unique values of this feature
        for split_point in unique:
            left_node, right_node = dataSplit(node, col, split_point)
            if len(left_node) > 0 and len(right_node) > 0:
                gini_col = gini_cal(left_node[target_name]) * (len(left_node) / n) \
                    + gini_cal(right_node[target_name]) * (len(right_node) / n)
                if (gini_0 - gini_col) > gini_dec:
                    gini_dec = gini_0 - gini_col  # decrease of impurity
                    best_fea = col
                    best_split_point = split_point
    return best_fea, best_split_point, gini_dec


def model_prediction(model, row):  # row is a one-row DataFrame
    '''Walk the nested-dict tree until a leaf label is reached.'''
    fea = model['fea']
    val = model['val']
    left = model['left']
    right = model['right']
    if row[fea].tolist()[0] <= val:  # value of the split feature for this row
        branch = left
    else:
        branch = right
    if isinstance(branch, dict):  # an internal node: keep descending
        prediction = model_prediction(branch, row)
    else:  # a leaf label
        prediction = branch
    return prediction
```
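As a quick sanity check of the two entry points, `tree_grow` and `model_prediction` can be exercised on a tiny hand-made DataFrame; the data below is made up purely for illustration:

```python
import pandas as pd
from model import tree_grow, model_prediction

# Made-up toy data: Fare perfectly separates the two classes.
toy = pd.DataFrame({'Fare': [5, 6, 7, 50, 60, 70],
                    'Survived': [0, 0, 0, 1, 1, 1]})
tree = tree_grow(toy, 'Survived', min_leaf=1, min_dec_gini=0.0001)
print(tree)                                  # e.g. {'fea': 'Fare', 'val': 7, 'left': 0, 'right': 1}
print(model_prediction(tree, toy.loc[[3]]))  # row with Fare=50 -> predicts 1
```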
In practice, the code above still leaves plenty of room for efficiency improvements. Even though the dataset is not very large, choosing larger input parameters, for example growing 100 trees, slows it down noticeably. Also, after submitting the predictions to Kaggle for evaluation, the accuracy on the test set turned out not to be very high, slightly lower than the accuracy obtained with the corresponding sklearn implementation (0.77512) :-( There are two main directions for improving accuracy: constructing new features, and tuning the parameters of the current model; an example of the first direction is sketched below.
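As one illustration of the feature-engineering direction, a derived feature that is often tried on this dataset is the size of the family travelling together; the column name FamilySize below is my own choice, not part of the pipeline above:

```python
# Hypothetical extra feature: family size = siblings/spouses + parents/children + self.
training['FamilySize'] = training['SibSp'] + training['Parch'] + 1
test['FamilySize'] = test['SibSp'] + test['Parch'] + 1
```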
This is only meant to get the discussion started; comments and suggestions on both the modeling approach and the implementation of the algorithm are very welcome.
I hope this article is helpful for your Python programming.
Original post: http://blog.csdn.net/lo_cima/article/details/50533010