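Bayesian Classification: Predicting San Francisco Crime Categories
This post works through an example project from the 七月算法 (Julyedu) course on naive Bayes classifiers; it is also a Kaggle competition. The goal is to train a model that predicts the category of a crime.
Environment: Windows 7 64-bit, Python 3.5
1. Loading the data
The data set covers 12 years of San Francisco crime records. It is stored as a CSV file, which can be loaded with pandas. An excerpt: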
Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425891675136,37.7745985956747
2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425891675136,37.7745985956747
2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.42436302145,37.8004143219856
2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.42699532676599,37.80087263276921
2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438737622757,37.771541172057795
2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM UNLOCKED AUTO,Wednesday,INGLESIDE,NONE,0 Block of TEDDY AV,-122.40325236121201,37.713430704116
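The excerpt shows the following fields:
Dates: date and time of the crime
Category: crime category
Descript: description of the crime
DayOfWeek: day of the week
PdDistrict: police district
Resolution: how the case was resolved
Address: street address or block where the crime occurred
X and Y: GPS coordinates (longitude and latitude)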
import pandas as pd
import numpy as np
train = pd.read_csv("C:\\data\\SanFrancisco\\train.csv",parse_dates=['Dates'])
test = pd.read_csv("C:\\data\\SanFrancisco\\test.csv",parse_dates=['Dates'])
train[0:6]
Dates Category Descript \
0 2015-05-13 23:53:00 WARRANTS WARRANT ARREST
1 2015-05-13 23:53:00 OTHER OFFENSES TRAFFIC VIOLATION ARREST
2 2015-05-13 23:33:00 OTHER OFFENSES TRAFFIC VIOLATION ARREST
3 2015-05-13 23:30:00 LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO
4 2015-05-13 23:30:00 LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO
5 2015-05-13 23:30:00 LARCENY/THEFT GRAND THEFT FROM UNLOCKED AUTO
DayOfWeek PdDistrict Resolution Address \
0 Wednesday NORTHERN ARREST, BOOKED OAK ST / LAGUNA ST
1 Wednesday NORTHERN ARREST, BOOKED OAK ST / LAGUNA ST
2 Wednesday NORTHERN ARREST, BOOKED VANNESS AV / GREENWICH ST
3 Wednesday NORTHERN NONE 1500 Block of LOMBARD ST
4 Wednesday PARK NONE 100 Block of BRODERICK ST
5 Wednesday INGLESIDE NONE 0 Block of TEDDY AV
X Y
0 -122.425892 37.774599
1 -122.425892 37.774599
2 -122.424363 37.800414
3 -122.426995 37.800873
4 -122.438738 37.771541
5 -122.403252 37.713431
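Resolution values such as "ARREST, BOOKED" contain a comma, so it is worth checking that they are read as a single field and that Dates really becomes a datetime column. A minimal sketch, using three rows copied from the excerpt above held in memory instead of the real train.csv (the names sample_csv and sample are only for illustration):

```python
import io
import pandas as pd

# Three rows copied from the excerpt; the real file is far larger.
sample_csv = '''Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425891675136,37.7745985956747
2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.42436302145,37.8004143219856
2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438737622757,37.771541172057795
'''

# parse_dates turns the Dates column into datetime values, as in the listing above.
sample = pd.read_csv(io.StringIO(sample_csv), parse_dates=['Dates'])

print(sample.shape)                      # rows and columns
print(sample['Resolution'].tolist())     # the quoted field stays in one piece
print(sample['Dates'].dt.hour.tolist())  # the hour is available through .dt
```

Because the quoted field is enclosed in straight double quotes, the default CSV parser keeps the embedded comma inside the Resolution value.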
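2. Feature preprocessing
The data contains many categorical and text columns, so the features need to be processed first. Since the goal is to predict the crime category, the factors related to a crime should be turned into numeric features as far as possible.
Dates: in the first few records shown above, every crime took place after 23:00, which is plausible, so the hour of day looks informative.
Category: this is the target and has to be encoded as integers.
Descript: this is written down after the crime has happened, so it is of no use for prediction.
DayOfWeek: closely tied to Dates; on weekends and holidays more people are outdoors, which also makes theft more likely.
PdDistrict and Resolution: these two have little to do with what motivates a crime. (The code below nevertheless uses PdDistrict, which has only a small number of distinct values, as the location feature in place of the free-text Address.)
Address: anyone familiar with American neighborhoods knows that some of them, for example low-income areas, have poorer public safety and a comparatively higher crime rate.
Next we preprocess the date, crime category, day of week and location features.
pandas' get_dummies() directly produces binary 0/1 indicator vectors (one-hot encoding).
scikit-learn's LabelEncoder (not part of pandas) assigns an integer code to each category.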
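To make the two encoders concrete, here is a small sketch on a made-up three-row frame (the frame toy and its values are invented for illustration; only the column names mirror the real data). LabelEncoder comes from scikit-learn's preprocessing module, and the prefix argument of get_dummies is an optional choice that the article's own code does not use:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with made-up rows; column names mirror the real data set.
toy = pd.DataFrame({
    'Dates': pd.to_datetime(['2015-05-13 23:53:00',
                             '2015-05-13 08:10:00',
                             '2015-05-12 23:05:00']),
    'Category': ['WARRANTS', 'LARCENY/THEFT', 'WARRANTS'],
    'DayOfWeek': ['Wednesday', 'Wednesday', 'Tuesday'],
})

# LabelEncoder sorts the distinct labels and maps them to 0..n_classes-1.
encoder = LabelEncoder()
labels = encoder.fit_transform(toy['Category'])
print(list(encoder.classes_))  # sorted class names
print(labels.tolist())         # integer code for each row

# get_dummies creates one indicator column per distinct value.
day_dummies = pd.get_dummies(toy['DayOfWeek'])
hour_dummies = pd.get_dummies(toy['Dates'].dt.hour, prefix='hour')
print(day_dummies.columns.tolist())
print(hour_dummies.columns.tolist())
```

Using a prefix gives the hour columns string names such as hour_23 rather than bare integers. That is a design choice, not something the original code does; as far as I know, recent scikit-learn releases reject data frames whose column names mix integers and strings, so string names are the safer option on newer installations.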
import pandas as pd
import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
# pd.set_option('display.notebook_repr_html',False)
# pd.set_option('display.max_columns',None)
# pd.set_option('display.max_rows',150)
# pd.set_option('display.max_seq_items',None)
# Encode the crime categories as integers with LabelEncoder
leCrime = preprocessing.LabelEncoder()
crime = leCrime.fit_transform(train.Category)
# One-hot encode the day of week, police district and hour
days = pd.get_dummies(train.DayOfWeek)
district = pd.get_dummies(train.PdDistrict)
hour = train.Dates.dt.hour
hour = pd.get_dummies(hour)
# Combine the features into one frame and attach the target
trainData = pd.concat([hour, days, district], axis=1)
trainData['crime']=crime
# Apply the same processing to the test data
days = pd.get_dummies(test.DayOfWeek)
district = pd.get_dummies(test.PdDistrict)
hour = test.Dates.dt.hour
hour = pd.get_dummies(hour)
testData = pd.concat([hour, days, district], axis=1)
trainData
(Output abridged: one row per training record and 42 columns, all 0/1 indicators except the last. Columns 0 to 23 encode the hour; Friday, Monday, Saturday, Sunday, Thursday, Tuesday and Wednesday encode the day of the week; BAYVIEW, CENTRAL, INGLESIDE, MISSION, NORTHERN, PARK, RICHMOND, SOUTHERN, TARAVAL and TENDERLOIN encode the police district; crime holds the integer label. For example, row 0 has a 1 in columns 23, Wednesday and NORTHERN, with crime = 37.)
[878049 rows x 42 columns]
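The test data is one-hot encoded separately from the training data, so the two matrices only line up if every hour, weekday and district occurs in both files. A hedged sketch of one way to guard against a mismatch, using made-up miniature frames (train_part, test_part and the helper one_hot are hypothetical names, not part of the original code):

```python
import pandas as pd

# Made-up values: the "test" sample happens to lack hour 23 and the PARK district.
train_part = pd.DataFrame({'hour': [0, 23, 23],
                           'PdDistrict': ['PARK', 'NORTHERN', 'PARK']})
test_part = pd.DataFrame({'hour': [0, 0],
                          'PdDistrict': ['NORTHERN', 'NORTHERN']})

def one_hot(frame):
    """One-hot encode the hour and district columns of a frame."""
    hours = pd.get_dummies(frame['hour'], prefix='hour')
    districts = pd.get_dummies(frame['PdDistrict'])
    return pd.concat([hours, districts], axis=1)

train_features = one_hot(train_part)
# Reindex the test matrix to the training columns; absent columns become all-zero.
test_features = one_hot(test_part).reindex(columns=train_features.columns,
                                           fill_value=0)

print(train_features.columns.tolist())
print(test_features.columns.tolist())
```

With the full data set every category very likely appears in both files, so this step may change nothing, but it makes the assumption explicit.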
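We can quickly pick out a few important features, build a baseline system, and then improve it step by step. To keep things simple here, we use only the day of the week and the police district as classifier inputs. We use scikit-learn's train_test_split function to obtain a training set and a validation set, fit both a naive Bayes model and a logistic regression model, and compare how they perform: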
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import log_loss
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
import time
# Use only the day of week and the police district as classifier inputs
features = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'BAYVIEW', 'CENTRAL', 'INGLESIDE', 'MISSION',
'NORTHERN', 'PARK', 'RICHMOND', 'SOUTHERN', 'TARAVAL', 'TENDERLOIN']
# Split into a training set (3/5) and a validation set (2/5)
training, validation = train_test_split(trainData, train_size=.60)
# Fit naive Bayes and compute its log loss on the validation set
model = BernoulliNB()
nbStart = time.time()
model.fit(training[features], training['crime'])
nbCostTime = time.time() - nbStart
predicted = np.array(model.predict_proba(validation[features]))
print("Naive Bayes training time: %f s" %(nbCostTime))
print("Naive Bayes log loss: %f" %(log_loss(validation['crime'],predicted)))
# Fit logistic regression and compute its log loss on the validation set
model = LogisticRegression(C=.01)
lrStart = time.time()
model.fit(training[features],training['crime'])
lrCostTime = time.time() - lrStart
predicted = np.array(model.predict_proba(validation[features]))
print("Logistic regression training time: %f s" %(lrCostTime))
print("Logistic regression log loss: %f" %(log_loss(validation['crime'], predicted)))
Naive Bayes training time: 0.477027 s
Naive Bayes log loss: 2.614108
Logistic regression training time: 58.954372 s
Logistic regression log loss: 2.621150
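With the current features and parameter settings, naive Bayes achieves a slightly lower log loss (2.614 versus 2.621). It is also clear that naive Bayes trains far faster than logistic regression (about 0.5 s versus about 59 s).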
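For readers unfamiliar with the metric: multiclass log loss is the mean negative logarithm of the probability the model assigned to the true class, so lower is better. The sketch below computes it by hand on made-up probabilities and also evaluates a uniform guess as a reference point. The helper name multiclass_log_loss is hypothetical, and the figure of 39 crime categories is an assumption about the Kaggle data set, not something computed here:

```python
import numpy as np

def multiclass_log_loss(y_true, proba):
    """Mean negative log of the probability assigned to each true class."""
    # Clip so that a probability of exactly 0 does not produce -inf.
    proba = np.clip(np.asarray(proba, dtype=float), 1e-15, 1.0)
    rows = np.arange(len(y_true))
    return float(-np.mean(np.log(proba[rows, y_true])))

# Made-up predictions for three samples over three classes.
y_true = np.array([0, 2, 1])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])
print(multiclass_log_loss(y_true, proba))

# Reference point: a uniform guess over k classes scores ln(k).
n_classes = 39  # assumption: number of crime categories in the data set
uniform = np.full((3, n_classes), 1.0 / n_classes)
print(multiclass_log_loss(y_true, uniform), np.log(n_classes))
```

Under that assumption a uniform guess would score about 3.66, so a log loss near 2.61 is a clear improvement over guessing blindly, while still leaving a lot of uncertainty in the predictions.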
Next we add the hour of day as a third categorical feature and repeat the comparison:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import log_loss
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
import time
# Day-of-week and police-district columns, as before
features = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'BAYVIEW', 'CENTRAL', 'INGLESIDE', 'MISSION',
'NORTHERN', 'PARK', 'RICHMOND', 'SOUTHERN', 'TARAVAL', 'TENDERLOIN']
# Add the 24 hour columns (named 0..23 by get_dummies)
hourFea = [x for x in range(0,24)]
features = features + hourFea
# Do not reassign `features` after this point, or the hour columns are dropped
# Split into a training set (3/5) and a validation set (2/5)
training, validation = train_test_split(trainData, train_size=.60)
# Fit naive Bayes and compute its log loss on the validation set
model = BernoulliNB()
nbStart = time.time()
model.fit(training[features], training['crime'])
nbCostTime = time.time() - nbStart
predicted = np.array(model.predict_proba(validation[features]))
print("Naive Bayes training time: %f s" %(nbCostTime))
print("Naive Bayes log loss: %f" %(log_loss(validation['crime'],predicted)))
# Fit logistic regression and compute its log loss on the validation set
model = LogisticRegression(C=.01)
lrStart = time.time()
model.fit(training[features],training['crime'])
lrCostTime = time.time() - lrStart
predicted = np.array(model.predict_proba(validation[features]))
print("Logistic regression training time: %f s" %(lrCostTime))
print("Logistic regression log loss: %f" %(log_loss(validation['crime'], predicted)))
Output recorded from the original run:
Naive Bayes training time: 0.478027 s
Naive Bayes log loss: 2.613777
Logistic regression training time: 58.734359 s
Logistic regression log loss: 2.621033
According to these figures, naive Bayes again has a slightly lower log loss than logistic regression and trains in a fraction of the time, which suggests that the model, although simple, remains effective. Note, however, that the recorded numbers are almost identical to those of the first experiment. The listing as originally published reset features to the weekday and district columns just before training, so this run most likely did not use the hour columns. The comparison should be rerun with the listing above before drawing conclusions about the three-feature setup.
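Part of the reason naive Bayes trains so quickly is that fitting it amounts to counting: how often each class occurs, and how often each binary feature is 1 within each class. The sketch below is a from-scratch illustration of a Bernoulli naive Bayes model in NumPy, assuming Laplace smoothing with alpha = 1 (the default of scikit-learn's BernoulliNB). It is not the library's implementation, and the function names and toy data are made up:

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and per-class feature probabilities by counting."""
    classes = np.unique(y)
    # One entry per class: number of samples, and how often each feature is 1.
    class_count = np.array([(y == c).sum() for c in classes], dtype=float)
    feature_count = np.array([X[y == c].sum(axis=0) for c in classes], dtype=float)
    log_prior = np.log(class_count / class_count.sum())
    # Laplace smoothing keeps the probabilities away from 0 and 1.
    theta = (feature_count + alpha) / (class_count[:, None] + 2.0 * alpha)
    return classes, log_prior, theta

def predict_proba_bernoulli_nb(X, log_prior, theta):
    """Posterior class probabilities under the feature-independence assumption."""
    log_like = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    joint = log_like + log_prior
    joint -= joint.max(axis=1, keepdims=True)  # stabilise the exponentials
    proba = np.exp(joint)
    return proba / proba.sum(axis=1, keepdims=True)

# Made-up binary features (two indicator columns) and two classes.
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])

classes, log_prior, theta = fit_bernoulli_nb(X, y)
proba = predict_proba_bernoulli_nb(X, log_prior, theta)
print(classes)
print(proba.round(3))
```

Fitting is a single pass over the data with no iterative optimisation, whereas logistic regression has to solve an optimisation problem over all classes, which is consistent with the large gap in training time reported above.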
References:
http://blog.csdn.net/han_xiaoyang/article/details/50629608