In the previous post we analysed the survival of the Titanic's passengers with conventional methods and concluded that a passenger's known details alone are not enough to judge whether he or she survived.
Part one: https://blog.csdn.net/wuzlun/article/details/80189766
Next, we analyse the same data with machine learning.
# Import the tools used in this post again
# With this magic command there is no need to call plt.show()
%matplotlib inline
import warnings
# Suppress warning messages
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=DeprecationWarning)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif']=['SimHei'] # render CJK labels correctly
plt.rcParams['axes.unicode_minus']=False # render minus signs correctly
# seaborn complements and extends matplotlib
import seaborn as sns
II. Machine learning
First, as before, we prepare the data.
# Load the downloaded training set (train.csv) and test set (test.csv)
# Combine the two sets so they can be cleaned together
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')
print('training set', train.shape, 'test set', test.shape)
# After combining, drop the identifier column PassengerId, which would only add noise
full = pd.concat([train, test], ignore_index=True) # train.append(test) is deprecated in recent pandas
full.drop('PassengerId', axis=1, inplace=True)
print('combined set', full.shape, '\n\ncombined table preview:')
full.head()
training set (891, 12) test set (418, 11) combined set (1309, 11) combined table preview:
Age | Cabin | Embarked | Fare | Name | Parch | Pclass | Sex | SibSp | Survived | Ticket | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | NaN | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 3 | male | 1 | 0.0 | A/5 21171 |
1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th… | 0 | 1 | female | 1 | 1.0 | PC 17599 |
2 | 26.0 | NaN | S | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | female | 0 | 1.0 | STON/O2. 3101282 |
3 | 35.0 | C123 | S | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 1 | female | 1 | 1.0 | 113803 |
4 | 35.0 | NaN | S | 8.0500 | Allen, Mr. William Henry | 0 | 3 | male | 0 | 0.0 | 373450 |
# Descriptive statistics for the numeric columns
full.describe()
Age | Fare | Parch | Pclass | SibSp | Survived | |
---|---|---|---|---|---|---|
count | 1046.000000 | 1308.000000 | 1309.000000 | 1309.000000 | 1309.000000 | 891.000000 |
mean | 29.881138 | 33.295479 | 0.385027 | 2.294882 | 0.498854 | 0.383838 |
std | 14.413493 | 51.758668 | 0.865560 | 0.837836 | 1.041658 | 0.486592 |
min | 0.170000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
25% | 21.000000 | 7.895800 | 0.000000 | 2.000000 | 0.000000 | 0.000000 |
50% | 28.000000 | 14.454200 | 0.000000 | 3.000000 | 0.000000 | 0.000000 |
75% | 39.000000 | 31.275000 | 0.000000 | 3.000000 | 1.000000 | 1.000000 |
max | 80.000000 | 512.329200 | 9.000000 | 3.000000 | 8.000000 | 1.000000 |
# Check each column's data type and its non-null count
full.info()
The combined table has 1309 rows.
Numeric columns:
1) Age has 1046 non-null values, so 1309-1046 = 263 are missing, a missing rate of 263/1309 ≈ 20%.
2) Fare has 1308 non-null values, so only 1 value is missing.
3) Survived is only present for the 891 training rows and needs no filling.
String columns:
1) Embarked has 1307 non-null values, so only 2 are missing.
2) Cabin has 295 non-null values, so 1309-295 = 1014 are missing, a missing rate of 1014/1309 ≈ 77.5%, which is substantial.
This points the way for the data-cleaning step: only once we know which columns are missing data can we treat them specifically.
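The missing counts and rates above can also be computed directly rather than read off `info()`. A minimal, self-contained sketch on a toy frame (hypothetical values; only the column names match the real table):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the combined table
df = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare':  [7.25, 71.28, np.nan, 8.05],
})
missing = df.isnull().sum()   # missing values per column
ratio = missing / len(df)     # missing rate per column
print(missing['Cabin'], ratio['Cabin'])  # 3 0.75
```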
Data cleaning
Many machine learning algorithms require that the features used for training contain no missing values, so the first task is to handle them.
1. Handling missing values
Principles:
1. For numeric columns, substitute the mean.
PS: part one showed that the ages are widely dispersed, so for Age we substitute the median instead.
2. For categorical columns, substitute the most frequent category.
3. Predict the missing values with a model, e.g. K-NN.
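Principle 3 (model-based imputation) is not used in this post, but a minimal sketch with scikit-learn's KNNImputer, on made-up numbers, looks like this:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical (Age, Fare) rows; the third Age is missing
X = np.array([[22.0,  7.25],
              [38.0, 71.28],
              [np.nan, 7.92],
              [35.0, 53.10]])
# Fill the missing Age with the mean Age of the 2 nearest rows (measured on Fare)
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_filled[2, 0])  # an Age between the observed values
```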
(1) Numeric columns
# Numeric columns
# Handle the training and test parts separately
sRow = 891 # the original training set has 891 rows
# Age (chained indexing such as full[0:sRow]['Age'].fillna(..., inplace=True) may fail silently, so use .loc)
full.loc[:sRow-1, 'Age'] = full.loc[:sRow-1, 'Age'].fillna(train['Age'].median())
full.loc[sRow:, 'Age'] = full.loc[sRow:, 'Age'].fillna(test['Age'].median())
# Fare
full.loc[:sRow-1, 'Fare'] = full.loc[:sRow-1, 'Fare'].fillna(train['Fare'].mean())
full.loc[sRow:, 'Fare'] = full.loc[sRow:, 'Fare'].fillna(test['Fare'].mean())
full.describe()
Age | Fare | Parch | Pclass | SibSp | Survived | |
---|---|---|---|---|---|---|
count | 1309.000000 | 1309.000000 | 1309.000000 | 1309.000000 | 1309.000000 | 891.000000 |
mean | 29.437487 | 33.297261 | 0.385027 | 2.294882 | 0.498854 | 0.383838 |
std | 12.915275 | 51.738919 | 0.865560 | 0.837836 | 1.041658 | 0.486592 |
min | 0.170000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
25% | 22.000000 | 7.895800 | 0.000000 | 2.000000 | 0.000000 | 0.000000 |
50% | 28.000000 | 14.454200 | 0.000000 | 3.000000 | 0.000000 | 0.000000 |
75% | 35.000000 | 31.275000 | 0.000000 | 3.000000 | 1.000000 | 1.000000 |
max | 80.000000 | 512.329200 | 9.000000 | 3.000000 | 8.000000 | 1.000000 |
(2) String columns
# Count passengers by embarkation port (Embarked)
print('training rows\n', full[0:sRow]['Embarked'].value_counts())
print('test rows\n', full[sRow:]['Embarked'].value_counts())
training rows
S 644
C 168
Q 77
Name: Embarked, dtype: int64
test rows
S 270
C 102
Q 46
Name: Embarked, dtype: int64
# Fill the missing Embarked values with 's' (note the lowercase letter: the one-hot encoding later therefore produces a separate Embarked_s column)
full['Embarked'].fillna('s', inplace=True)
# Cabin is missing too much data to impute sensibly, so fill it with 'U' for unknown
full['Cabin'].fillna( 'U', inplace=True)
Missing-value handling is done; let's check the result and the completed table.
# The table after missing-value treatment
full.info()
full.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1309 entries, 0 to 1308
Data columns (total 11 columns):
Age 1309 non-null float64
Cabin 1309 non-null object
Embarked 1309 non-null object
Fare 1309 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
dtypes: float64(3), int64(3), object(5)
memory usage: 112.6+ KB
Age | Cabin | Embarked | Fare | Name | Parch | Pclass | Sex | SibSp | Survived | Ticket | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | U | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 3 | male | 1 | 0.0 | A/5 21171 |
1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th… | 0 | 1 | female | 1 | 1.0 | PC 17599 |
2 | 26.0 | U | S | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | female | 0 | 1.0 | STON/O2. 3101282 |
3 | 35.0 | C123 | S | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 1 | female | 1 | 1.0 | 113803 |
4 | 35.0 | U | S | 8.0500 | Allen, Mr. William Henry | 0 | 3 | male | 0 | 0.0 | 373450 |
Looking at the table above, after missing-value treatment it is now complete. But as the analysis in part one suggested, these raw columns cannot be fed to a model directly, so next we apply feature engineering.
2. Feature extraction
(1) Data categories
Looking at the dtypes, the columns fall into three kinds. Categorical data will be converted to numbers and then one-hot encoded.
1. Numeric:
age (Age), ticket fare (Fare), siblings/spouses aboard (SibSp), parents/children aboard (Parch)
2. Time series: none
3. Categorical:
1) With explicit categories:
Sex: male, female
Embarked: departure port S = Southampton, England; first stop C = Cherbourg, France; second stop Q = Queenstown, Ireland
Pclass: 1 = 1st class, 2 = 2nd class, 3 = 3rd class
2) Strings from which features may be extracted, also treated as categorical:
passenger name (Name)
cabin number (Cabin)
ticket number (Ticket)
We start with the simple ones.
Sex
# Map the sex values to numbers
# male maps to 1, female maps to 0
sex_mapDict={'male':1, 'female':0}
# map applies a function or dict to every element of a Series
full['Sex']=full['Sex'].map(sex_mapDict)
full.head()
Age | Cabin | Embarked | Fare | Name | Parch | Pclass | Sex | SibSp | Survived | Ticket | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | U | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 3 | 1 | 1 | 0.0 | A/5 21171 |
1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th… | 0 | 1 | 0 | 1 | 1.0 | PC 17599 |
2 | 26.0 | U | S | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | 0 | 0 | 1.0 | STON/O2. 3101282 |
3 | 35.0 | C123 | S | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 1 | 0 | 1 | 1.0 | 113803 |
4 | 35.0 | U | S | 8.0500 | Allen, Mr. William Henry | 0 | 3 | 1 | 0 | 0.0 | 373450 |
Embarked
Encoding approach:
one-hot encode with get_dummies to produce a table of dummy variables, then add that table to full.
The values of Embarked are:
departure port: S = Southampton, England
first stop: C = Cherbourg, France
second stop: Q = Queenstown, Ireland
# DataFrame to hold the extracted feature
embarkedDf = pd.DataFrame()
embarkedDf = pd.get_dummies(full['Embarked'], prefix='Embarked') # column names are prefixed with Embarked
embarkedDf.head()
Embarked_C | Embarked_Q | Embarked_S | Embarked_s | |
---|---|---|---|---|
0 | 0 | 0 | 1 | 0 |
1 | 1 | 0 | 0 | 0 |
2 | 0 | 0 | 1 | 0 |
3 | 0 | 0 | 1 | 0 |
4 | 0 | 0 | 1 | 0 |
# Add the one-hot dummy variables to the Titanic dataset full
full = pd.concat([full, embarkedDf], axis=1)
# Drop the original Embarked column
full.drop('Embarked', axis=1, inplace=True)
full.head()
Age | Cabin | Fare | Name | Parch | Pclass | Sex | SibSp | Survived | Ticket | Embarked_C | Embarked_Q | Embarked_S | Embarked_s | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | U | 7.2500 | Braund, Mr. Owen Harris | 0 | 3 | 1 | 1 | 0.0 | A/5 21171 | 0 | 0 | 1 | 0 |
1 | 38.0 | C85 | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th… | 0 | 1 | 0 | 1 | 1.0 | PC 17599 | 1 | 0 | 0 | 0 |
2 | 26.0 | U | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | 0 | 0 | 1.0 | STON/O2. 3101282 | 0 | 0 | 1 | 0 |
3 | 35.0 | C123 | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 1 | 0 | 1 | 1.0 | 113803 | 0 | 0 | 1 | 0 |
4 | 35.0 | U | 8.0500 | Allen, Mr. William Henry | 0 | 3 | 1 | 0 | 0.0 | 373450 | 0 | 0 | 1 | 0 |
Passenger class (Pclass)
Encoding approach:
one-hot encode with get_dummies to produce a table of dummy variables, then add that table to full.
1 = 1st class, 2 = 2nd class, 3 = 3rd class
# DataFrame to hold the Pclass features
pclassDf = pd.DataFrame()
pclassDf = pd.get_dummies(full['Pclass'], prefix='Pclass') # column names are prefixed with Pclass
pclassDf.head()
Pclass_1 | Pclass_2 | Pclass_3 | |
---|---|---|---|
0 | 0 | 0 | 1 |
1 | 1 | 0 | 0 |
2 | 0 | 0 | 1 |
3 | 1 | 0 | 0 |
4 | 0 | 0 | 1 |
# Add the one-hot dummy variables to the Titanic dataset full
full = pd.concat([full, pclassDf], axis=1)
# Drop the original Pclass column
full.drop('Pclass', axis=1, inplace=True)
full.head()
Age | Cabin | Fare | Name | Parch | Sex | SibSp | Survived | Ticket | Embarked_C | Embarked_Q | Embarked_S | Embarked_s | Pclass_1 | Pclass_2 | Pclass_3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | U | 7.2500 | Braund, Mr. Owen Harris | 0 | 1 | 1 | 0.0 | A/5 21171 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
1 | 38.0 | C85 | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th… | 0 | 0 | 1 | 1.0 | PC 17599 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 26.0 | U | 7.9250 | Heikkinen, Miss. Laina | 0 | 0 | 0 | 1.0 | STON/O2. 3101282 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
3 | 35.0 | C123 | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 0 | 1 | 1.0 | 113803 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
4 | 35.0 | U | 8.0500 | Allen, Mr. William Henry | 0 | 1 | 0 | 0.0 | 373450 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
Name
Looking at the Name column, there is one very striking pattern: every passenger name contains a specific salutation, or title. Extracted, this makes a very useful new variable that can help with prediction.
For example:
Braund, Mr. Owen Harris
Heikkinen, Miss. Laina
Oliva y Ocana, Dona. Fermina
Peter, Master. Michael J
So we extract the title from the name.
# Define a function that extracts the title from a passenger's name
def getTitle(name):
    str1 = name.split(',')[1] # e.g. ' Mr. Owen Harris'
    str2 = str1.split('.')[0] # ' Mr'
    return str2.strip() # strip the surrounding spaces
# DataFrame to hold the extracted feature
titleDf = pd.DataFrame()
titleDf['Title'] = full['Name'].map(getTitle)
print('feature table:', titleDf.shape)
titleDf.head()
feature table: (1309, 1)
Title | |
---|---|
0 | Mr |
1 | Mrs |
2 | Miss |
3 | Mrs |
4 | Mr |
We define the following title categories:
Officer: military or professional officers
Royalty: nobility
Mr: married man
Mrs: married woman
Miss: young unmarried woman
Master: young boy (the salutation used for male children)
# Mapping from the title strings found in names to the defined title categories
title_mapDict = {
"Capt": "Officer",
"Col": "Officer",
"Major": "Officer",
"Jonkheer": "Royalty",
"Don": "Royalty",
"Sir" : "Royalty",
"Dr": "Officer",
"Rev": "Officer",
"the Countess":"Royalty",
"Dona": "Royalty",
"Mme": "Mrs",
"Mlle": "Miss",
"Ms": "Mrs",
"Mr" : "Mr",
"Mrs" : "Mrs",
"Miss" : "Miss",
"Master" : "Master",
"Lady" : "Royalty"
}
# map applies the dict to every element of the Series
titleDf['Title'] = titleDf['Title'].map(title_mapDict)
titleDf = pd.get_dummies(titleDf['Title'], prefix='Title')
titleDf.head()
Title_Master | Title_Miss | Title_Mr | Title_Mrs | Title_Officer | Title_Royalty | |
---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 0 | 0 | 1 | 0 | 0 | 0 |
# Add the one-hot dummy variables to the Titanic dataset full
full = pd.concat([full, titleDf], axis=1)
full.drop('Name', axis=1, inplace=True)
full.head()
Age | Cabin | Fare | Parch | Sex | SibSp | Survived | Ticket | Embarked_C | Embarked_Q | … | Embarked_s | Pclass_1 | Pclass_2 | Pclass_3 | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Title_Officer | Title_Royalty | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | U | 7.2500 | 0 | 1 | 1 | 0.0 | A/5 21171 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 38.0 | C85 | 71.2833 | 0 | 0 | 1 | 1.0 | PC 17599 | 1 | 0 | … | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 26.0 | U | 7.9250 | 0 | 0 | 0 | 1.0 | STON/O2. 3101282 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 35.0 | C123 | 53.1000 | 0 | 0 | 1 | 1.0 | 113803 | 0 | 0 | … | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 35.0 | U | 8.0500 | 0 | 1 | 0 | 0.0 | 373450 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
5 rows × 21 columns
Cabin
Use the first letter of the cabin number as the cabin category.
# DataFrame to hold the cabin features
cabinDf = pd.DataFrame()
full['Cabin'] = full['Cabin'].map(lambda c: c[0]) # keep only the first letter
cabinDf = pd.get_dummies(full['Cabin'], prefix='Cabin')
cabinDf.head()
Cabin_A | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
# Add the one-hot dummy variables to the Titanic dataset full
full = pd.concat([full, cabinDf], axis=1)
full.drop('Cabin', axis=1, inplace=True)
full.head()
Age | Fare | Parch | Sex | SibSp | Survived | Ticket | Embarked_C | Embarked_Q | Embarked_S | … | Title_Royalty | Cabin_A | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | 7.2500 | 0 | 1 | 1 | 0.0 | A/5 21171 | 0 | 0 | 1 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 38.0 | 71.2833 | 0 | 0 | 1 | 1.0 | PC 17599 | 1 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 26.0 | 7.9250 | 0 | 0 | 0 | 1.0 | STON/O2. 3101282 | 0 | 0 | 1 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 35.0 | 53.1000 | 0 | 0 | 1 | 1.0 | 113803 | 0 | 0 | 1 | … | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 35.0 | 8.0500 | 0 | 1 | 0 | 0.0 | 373450 | 0 | 0 | 1 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 29 columns
Family information
Count the family size and assign a family category.
family size = parents/children aboard (Parch) + siblings/spouses aboard (SibSp) + the passenger him/herself
Family categories:
Family_Single: family size = 1
Family_Small: 2 <= family size <= 4
Family_Large: family size >= 5
# DataFrame to hold the family information
familyDf = pd.DataFrame()
familyDf['FamilySize'] = full['Parch'] + full['SibSp'] + 1
# each conditional expression returns 1 when its condition holds, otherwise 0
familyDf['Family_Single'] = familyDf['FamilySize'].map(lambda s: 1 if s == 1 else 0)
familyDf['Family_Small'] = familyDf['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
familyDf['Family_Large'] = familyDf['FamilySize'].map(lambda s: 1 if 5 <= s else 0)
familyDf.head()
FamilySize | Family_Single | Family_Small | Family_Large | |
---|---|---|---|---|
0 | 2 | 0 | 1 | 0 |
1 | 2 | 0 | 1 | 0 |
2 | 1 | 1 | 0 | 0 |
3 | 2 | 0 | 1 | 0 |
4 | 1 | 1 | 0 | 0 |
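An equivalent way to derive the three family-size indicators, assuming the same thresholds, is pd.cut followed by get_dummies:

```python
import pandas as pd

# Same FamilySize values as the head() above
family = pd.DataFrame({'FamilySize': [2, 2, 1, 2, 1]})
# Bin the sizes into the three categories: (0,1] Single, (1,4] Small, (4,inf) Large
family['Category'] = pd.cut(family['FamilySize'],
                            bins=[0, 1, 4, float('inf')],
                            labels=['Single', 'Small', 'Large'])
dummies = pd.get_dummies(family['Category'], prefix='Family')
print(dummies.columns.tolist())  # Family_Single, Family_Small, Family_Large
```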
# Add the family indicator columns to the Titanic dataset full
full = pd.concat([full, familyDf], axis=1)
full.head()
Age | Fare | Parch | Sex | SibSp | Survived | Ticket | Embarked_C | Embarked_Q | Embarked_S | … | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | FamilySize | Family_Single | Family_Small | Family_Large | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | 7.2500 | 0 | 1 | 1 | 0.0 | A/5 21171 | 0 | 0 | 1 | … | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | 0 |
1 | 38.0 | 71.2833 | 0 | 0 | 1 | 1.0 | PC 17599 | 1 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 |
2 | 26.0 | 7.9250 | 0 | 0 | 0 | 1.0 | STON/O2. 3101282 | 0 | 0 | 1 | … | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
3 | 35.0 | 53.1000 | 0 | 0 | 1 | 1.0 | 113803 | 0 | 0 | 1 | … | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 |
4 | 35.0 | 8.0500 | 0 | 1 | 0 | 0.0 | 373450 | 0 | 0 | 1 | … | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
5 rows × 33 columns
Ticket
Use the letter prefix of the ticket number as the feature.
# Define a function that strips punctuation from a ticket number and extracts its letter prefix
def cleanTicket(ticket):
    ticket = ticket.replace('.', '')
    ticket = ticket.replace('/', '')
    ticket = ticket.split()
    ticket = map(lambda t: t.strip(), ticket)
    ticket = list(filter(lambda t: not t.isdigit(), ticket)) # keep only the non-numeric parts
    if len(ticket) > 0:
        return ticket[0]
    else:
        return 'Nfrefix' # placeholder label for purely numeric tickets
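As a quick sanity check of the prefix extraction, here is the same logic repeated in self-contained form and applied to a few tickets from the tables above:

```python
def clean_ticket(ticket):
    # Same steps as cleanTicket: drop '.' and '/', split on spaces, keep non-numeric parts
    parts = ticket.replace('.', '').replace('/', '').split()
    prefix = [t for t in parts if not t.isdigit()]
    return prefix[0] if prefix else 'Nfrefix'

print([clean_ticket(t) for t in ['A/5 21171', 'STON/O2. 3101282', '113803']])
# ['A5', 'STONO2', 'Nfrefix']
```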
# DataFrame to hold the ticket features
ticketDf = pd.DataFrame()
full['Ticket'] = full['Ticket'].map(cleanTicket)
ticketDf = pd.get_dummies(full['Ticket'], prefix='Ticket')
ticketDf.head()
Ticket_A | Ticket_A4 | Ticket_A5 | Ticket_AQ3 | Ticket_AQ4 | Ticket_AS | Ticket_C | Ticket_CA | Ticket_CASOTON | Ticket_FC | … | Ticket_SOPP | Ticket_SOTONO2 | Ticket_SOTONOQ | Ticket_SP | Ticket_STONO | Ticket_STONO2 | Ticket_STONOQ | Ticket_SWPP | Ticket_WC | Ticket_WEP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 37 columns
# Add the one-hot dummy variables to the Titanic dataset full
full = pd.concat([full, ticketDf], axis=1)
full.drop('Ticket', axis=1, inplace=True)
full.head()
Age | Fare | Parch | Sex | SibSp | Survived | Embarked_C | Embarked_Q | Embarked_S | Embarked_s | … | Ticket_SOPP | Ticket_SOTONO2 | Ticket_SOTONOQ | Ticket_SP | Ticket_STONO | Ticket_STONO2 | Ticket_STONOQ | Ticket_SWPP | Ticket_WC | Ticket_WEP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | 7.2500 | 0 | 1 | 1 | 0.0 | 0 | 0 | 1 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 38.0 | 71.2833 | 0 | 0 | 1 | 1.0 | 1 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 26.0 | 7.9250 | 0 | 0 | 0 | 1.0 | 0 | 0 | 1 | 0 | … | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 35.0 | 53.1000 | 0 | 0 | 1 | 1.0 | 0 | 0 | 1 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 35.0 | 8.0500 | 0 | 1 | 0 | 0.0 | 0 | 0 | 1 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 69 columns
All the features are now processed. As the table above shows, there are 69 of them in total, but we won't feed all 69 straight into a model: the dimensionality is still too high, and we should select features by how much they matter.
3. Feature selection
Feature selection means keeping the important features and ignoring those with little influence.
Before selecting, we collect the processed datasets.
The processed datasets
# the original training set has 891 rows
# labels, read directly from the raw file
targets = pd.read_csv('./data/train.csv', usecols=['Survived'])['Survived'].values
# targets = full.loc[0:890,'Survived'] # equivalent
# training features
trainDf = full.iloc[:891]
# test features
testDf = full.iloc[891:]
Method 1: Pearson correlation coefficients
# Inspect the Pearson correlation coefficients
corrDf = trainDf.corr()
corrDf
Age | Fare | Parch | Sex | SibSp | Survived | Embarked_C | Embarked_Q | Embarked_S | Embarked_s | … | Ticket_SOPP | Ticket_SOTONO2 | Ticket_SOTONOQ | Ticket_SP | Ticket_STONO | Ticket_STONO2 | Ticket_STONOQ | Ticket_SWPP | Ticket_WC | Ticket_WEP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Age | 1.000000 | 0.096688 | -0.172482 | 0.081163 | -0.233296 | -0.064910 | 0.030248 | -0.031415 | -0.014665 | 0.075229 | … | 0.019230 | 0.004150 | -0.023078 | -0.026692 | -0.004743 | -0.024435 | NaN | 0.040600 | -0.005417 | 0.095166 |
Fare | 0.096688 | 1.000000 | 0.216225 | -0.182333 | 0.159651 | 0.257307 | 0.269335 | -0.117216 | -0.166603 | 0.045646 | … | -0.026551 | -0.023569 | -0.065010 | -0.016229 | -0.057589 | -0.035872 | NaN | -0.020728 | -0.021149 | 0.041570 |
Parch | -0.172482 | 0.216225 | 1.000000 | -0.245489 | 0.414838 | 0.081629 | -0.011069 | -0.081228 | 0.063036 | -0.022467 | … | -0.027532 | -0.022467 | -0.061983 | -0.015878 | -0.055345 | -0.039002 | NaN | -0.022467 | 0.134682 | 0.044618 |
Sex | 0.081163 | -0.182333 | -0.245489 | 1.000000 | -0.114631 | -0.543351 | -0.082853 | -0.074115 | 0.125722 | -0.064296 | … | 0.002321 | 0.034990 | 0.078271 | 0.024728 | 0.086193 | -0.082890 | NaN | 0.034990 | -0.055216 | 0.002321 |
SibSp | -0.233296 | 0.159651 | 0.414838 | -0.114631 | 1.000000 | -0.035322 | -0.059528 | -0.026354 | 0.070941 | -0.022508 | … | -0.027582 | -0.022508 | -0.062097 | -0.015907 | -0.046612 | -0.001719 | NaN | -0.022508 | 0.026776 | 0.007576 |
Survived | -0.064910 | 0.257307 | 0.081629 | -0.543351 | -0.035322 | 1.000000 | 0.168240 | 0.003650 | -0.155660 | 0.060095 | … | -0.045876 | -0.037436 | -0.067404 | -0.026456 | 0.007887 | 0.019667 | NaN | 0.060095 | -0.062182 | -0.006036 |
Embarked_C | 0.030248 | 0.269335 | -0.011069 | -0.082853 | -0.059528 | 0.168240 | 1.000000 | -0.148258 | -0.778359 | -0.022864 | … | -0.028018 | -0.022864 | -0.063078 | -0.016158 | -0.056322 | -0.039691 | NaN | -0.022864 | -0.051357 | -0.028018 |
Embarked_Q | -0.031415 | -0.117216 | -0.081228 | -0.074115 | -0.026354 | 0.003650 | -0.148258 | 1.000000 | -0.496624 | -0.014588 | … | -0.017877 | -0.014588 | -0.040246 | -0.010310 | -0.035936 | -0.025324 | NaN | -0.014588 | -0.032768 | -0.017877 |
Embarked_S | -0.014665 | -0.166603 | 0.063036 | 0.125722 | 0.070941 | -0.155660 | -0.778359 | -0.496624 | 1.000000 | -0.076588 | … | 0.035996 | 0.029374 | 0.081040 | 0.020759 | 0.072361 | 0.050993 | NaN | 0.029374 | 0.065981 | 0.035996 |
Embarked_s | 0.075229 | 0.045646 | -0.022467 | -0.064296 | -0.022508 | 0.060095 | -0.022864 | -0.014588 | -0.076588 | 1.000000 | … | -0.002757 | -0.002250 | -0.006207 | -0.001590 | -0.005542 | -0.003905 | NaN | -0.002250 | -0.005053 | -0.002757 |
Pclass_1 | 0.323896 | 0.591711 | -0.017633 | -0.098013 | -0.054582 | 0.285904 | 0.296423 | -0.155342 | -0.170379 | 0.083847 | … | -0.032880 | -0.026831 | -0.074023 | -0.018962 | -0.066095 | -0.046578 | NaN | -0.026831 | -0.060268 | 0.102749 |
Pclass_2 | 0.015831 | -0.118557 | -0.000734 | -0.064746 | -0.055932 | 0.093349 | -0.125416 | -0.127301 | 0.192061 | -0.024197 | … | 0.066072 | -0.024197 | -0.066756 | -0.017100 | -0.059607 | -0.042005 | NaN | 0.092975 | 0.024606 | -0.029652 |
Pclass_3 | -0.291955 | -0.413333 | 0.015790 | 0.137143 | 0.092548 | -0.322308 | -0.153329 | 0.237449 | -0.009511 | -0.052550 | … | -0.025444 | 0.042811 | 0.118109 | 0.030255 | 0.105459 | 0.074318 | NaN | -0.052550 | 0.031902 | -0.064397 |
Title_Master | -0.373960 | 0.010908 | 0.267344 | 0.159934 | 0.349559 | 0.085221 | -0.035225 | 0.010478 | 0.025291 | -0.010283 | … | -0.012601 | -0.010283 | -0.028370 | -0.007267 | -0.025332 | -0.017851 | NaN | -0.010283 | -0.023098 | -0.012601 |
Title_Miss | -0.248767 | 0.120829 | 0.102514 | -0.691548 | 0.084945 | 0.332795 | 0.037613 | 0.168720 | -0.142412 | 0.034389 | … | -0.029652 | -0.024197 | -0.045206 | -0.017100 | -0.059607 | 0.093599 | NaN | -0.024197 | 0.077244 | 0.018210 |
Title_Mr | 0.180808 | -0.183766 | -0.333905 | 0.867334 | -0.250489 | -0.549199 | -0.072567 | -0.078338 | 0.118482 | -0.055767 | … | 0.010178 | 0.040342 | 0.093621 | 0.028510 | 0.099377 | -0.069002 | NaN | 0.040342 | -0.038911 | -0.029080 |
Title_Mrs | 0.166798 | 0.105665 | 0.221318 | -0.552686 | 0.059941 | 0.344935 | 0.066101 | -0.091121 | -0.005691 | 0.048498 | … | 0.031722 | -0.019338 | -0.053352 | -0.013667 | -0.047638 | 0.005683 | NaN | -0.019338 | -0.012963 | -0.023698 |
Title_Officer | 0.179927 | 0.010357 | -0.048211 | 0.089228 | -0.024712 | -0.031316 | -0.008034 | 0.012618 | -0.000180 | -0.006811 | … | -0.008346 | -0.006811 | -0.018790 | -0.004813 | -0.016777 | -0.011823 | NaN | -0.006811 | -0.015298 | 0.129364 |
Title_Royalty | 0.070654 | 0.015044 | -0.035583 | -0.007483 | -0.008384 | 0.033391 | 0.079020 | -0.023105 | -0.054171 | -0.003563 | … | -0.004366 | -0.003563 | -0.009830 | -0.002518 | -0.008777 | -0.006185 | NaN | -0.003563 | -0.008004 | -0.004366 |
Cabin_A | 0.121732 | 0.019549 | -0.040325 | 0.078271 | -0.046266 | 0.022287 | 0.093040 | -0.040246 | -0.055383 | -0.006207 | … | -0.007606 | -0.006207 | -0.017123 | -0.004386 | -0.015289 | -0.010775 | NaN | -0.006207 | -0.013941 | -0.007606 |
Cabin_B | 0.096080 | 0.386297 | 0.056498 | -0.109689 | -0.034538 | 0.175095 | 0.168642 | -0.072579 | -0.123057 | 0.200996 | … | -0.013716 | -0.011193 | -0.030880 | -0.007910 | -0.027572 | -0.019430 | NaN | -0.011193 | -0.025141 | 0.159633 |
Cabin_C | 0.115188 | 0.364318 | 0.030736 | -0.058649 | 0.029251 | 0.114652 | 0.113952 | -0.049776 | -0.066995 | -0.012631 | … | -0.015478 | -0.012631 | -0.034846 | -0.008926 | -0.031114 | -0.021926 | NaN | -0.012631 | -0.028371 | -0.015478 |
Cabin_D | 0.135674 | 0.098878 | -0.019125 | -0.079248 | -0.017575 | 0.150716 | 0.102977 | -0.060318 | -0.051139 | -0.009302 | … | -0.011399 | -0.009302 | -0.025663 | -0.006574 | -0.022914 | -0.016148 | NaN | -0.009302 | -0.020894 | -0.011399 |
Cabin_E | 0.120483 | 0.053717 | -0.016554 | -0.047003 | -0.036865 | 0.145321 | -0.015939 | -0.037897 | 0.038685 | -0.009155 | … | 0.092903 | -0.009155 | 0.021626 | -0.006470 | -0.022551 | -0.015892 | NaN | -0.009155 | -0.020563 | 0.092903 |
Cabin_F | -0.076393 | -0.033093 | 0.023694 | -0.008202 | 0.001706 | 0.057935 | -0.034726 | -0.004113 | 0.033537 | -0.005771 | … | -0.007073 | -0.005771 | -0.015923 | -0.004079 | -0.014217 | -0.010019 | NaN | -0.005771 | -0.012964 | -0.007073 |
Cabin_G | -0.075406 | -0.025180 | 0.072388 | -0.091031 | -0.001402 | 0.016040 | -0.032371 | -0.020654 | 0.041589 | -0.003185 | … | -0.003903 | -0.003185 | -0.008787 | -0.002251 | -0.007846 | -0.005529 | NaN | -0.003185 | -0.007155 | -0.003903 |
Cabin_T | 0.040285 | 0.002224 | -0.015878 | 0.024728 | -0.015907 | -0.026456 | -0.016158 | -0.010310 | 0.020759 | -0.001590 | … | -0.001948 | -0.001590 | -0.004386 | -0.001124 | -0.003917 | -0.002760 | NaN | -0.001590 | -0.003571 | -0.001948 |
Cabin_U | -0.240314 | -0.482075 | -0.036987 | 0.140391 | 0.040460 | -0.316912 | -0.208528 | 0.129572 | 0.110087 | -0.087042 | … | -0.014439 | 0.025846 | 0.050544 | 0.018266 | 0.063670 | 0.044868 | NaN | 0.025846 | 0.058056 | -0.106664 |
FamilySize | -0.245619 | 0.217138 | 0.783111 | -0.200988 | 0.890712 | 0.016639 | -0.046215 | -0.058592 | 0.079977 | -0.026608 | … | -0.032606 | -0.026608 | -0.073407 | -0.018804 | -0.059507 | -0.020659 | NaN | -0.026608 | 0.085586 | 0.027468 |
Family_Single | 0.171647 | -0.271832 | -0.583398 | 0.303646 | -0.584471 | -0.203367 | -0.095298 | 0.086464 | 0.024929 | 0.038510 | … | 0.047192 | 0.038510 | 0.106245 | 0.027216 | 0.074968 | -0.017280 | NaN | 0.038510 | -0.044131 | -0.071588 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
Ticket_CA | -0.062501 | -0.005363 | 0.228435 | -0.006179 | 0.357512 | -0.019137 | -0.105869 | -0.067548 | 0.136015 | -0.010417 | … | -0.012765 | -0.010417 | -0.028739 | -0.007362 | -0.025661 | -0.018084 | NaN | -0.010417 | -0.023399 | -0.012765 |
Ticket_CASOTON | -0.003507 | -0.014649 | -0.015878 | 0.024728 | -0.015907 | -0.026456 | -0.016158 | -0.010310 | 0.020759 | -0.001590 | … | -0.001948 | -0.001590 | -0.004386 | -0.001124 | -0.003917 | -0.002760 | NaN | -0.001590 | -0.003571 | -0.001948 |
Ticket_FC | 0.004221 | 0.013361 | -0.015878 | 0.024728 | 0.014507 | -0.026456 | -0.016158 | -0.010310 | 0.020759 | -0.001590 | … | -0.001948 | -0.001590 | -0.004386 | -0.001124 | -0.003917 | -0.002760 | NaN | -0.001590 | -0.003571 | -0.001948 |
Ticket_FCC | 0.038324 | -0.015359 | 0.039016 | -0.070383 | -0.008384 | 0.064285 | -0.036212 | -0.023105 | 0.046524 | -0.003563 | … | -0.004366 | -0.003563 | -0.009830 | -0.002518 | -0.008777 | -0.006185 | NaN | -0.003563 | -0.008004 | -0.004366 |
Ticket_Fa | -0.003507 | -0.016800 | -0.015878 | 0.024728 | -0.015907 | -0.026456 | -0.016158 | -0.010310 | 0.020759 | -0.001590 | … | -0.001948 | -0.001590 | -0.004386 | -0.001124 | -0.003917 | -0.002760 | NaN | -0.001590 | -0.003571 | -0.001948 |
Ticket_LINE | 0.014906 | -0.043544 | -0.031809 | 0.049539 | -0.031867 | -0.018481 | -0.032371 | -0.020654 | 0.041589 | -0.003185 | … | -0.003903 | -0.003185 | -0.008787 | -0.002251 | -0.007846 | -0.005529 | NaN | -0.003185 | -0.007155 | -0.003903 |
Ticket_LP | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Ticket_Nfrefix | -0.030025 | -0.173170 | -0.035762 | -0.032507 | -0.080768 | 0.001492 | -0.115615 | 0.172297 | -0.010083 | 0.027979 | … | -0.098535 | -0.080408 | -0.221835 | -0.056825 | -0.198077 | -0.139586 | NaN | -0.080408 | -0.180613 | -0.098535 |
Ticket_PC | 0.128823 | 0.486256 | -0.049451 | -0.073639 | -0.046244 | 0.147062 | 0.397139 | -0.082643 | -0.293812 | -0.012745 | … | -0.015618 | -0.012745 | -0.035162 | -0.009007 | -0.031396 | -0.022125 | NaN | -0.012745 | -0.028628 | -0.015618 |
Ticket_PP | -0.056706 | -0.021012 | 0.044618 | -0.038235 | -0.010003 | 0.033803 | -0.028018 | -0.017877 | 0.035996 | -0.002757 | … | -0.003378 | -0.002757 | -0.007606 | -0.001948 | -0.006791 | -0.004786 | NaN | -0.002757 | -0.006193 | -0.003378 |
Ticket_PPP | -0.001318 | -0.007835 | -0.022467 | -0.014653 | 0.020528 | 0.011329 | 0.098396 | -0.014588 | -0.076588 | -0.002250 | … | -0.002757 | -0.002250 | -0.006207 | -0.001590 | -0.005542 | -0.003905 | NaN | -0.002250 | -0.005053 | -0.002757 |
Ticket_SC | -0.031844 | -0.013636 | -0.015878 | -0.045439 | -0.015907 | 0.042470 | 0.069538 | -0.010310 | -0.054125 | -0.001590 | … | -0.001948 | -0.001590 | -0.004386 | -0.001124 | -0.003917 | -0.002760 | NaN | -0.001590 | -0.003571 | -0.001948 |
Ticket_SCA3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Ticket_SCA4 | -0.003507 | -0.016302 | -0.015878 | 0.024728 | -0.015907 | -0.026456 | -0.016158 | -0.010310 | 0.020759 | -0.001590 | … | -0.001948 | -0.001590 | -0.004386 | -0.001124 | -0.003917 | -0.002760 | NaN | -0.001590 | -0.003571 | -0.001948 |
Ticket_SCAH | 0.020719 | -0.012023 | -0.027532 | -0.038235 | 0.007576 | 0.033803 | 0.021514 | -0.017877 | -0.007287 | -0.002757 | … | -0.003378 | -0.002757 | -0.007606 | -0.001948 | -0.006791 | -0.004786 | NaN | -0.002757 | -0.006193 | -0.003378 |
Ticket_SCOW | 0.009373 | -0.013451 | -0.015878 | 0.024728 | -0.015907 | -0.026456 | -0.016158 | -0.010310 | 0.020759 | -0.001590 | … | -0.001948 | -0.001590 | -0.004386 | -0.001124 | -0.003917 | -0.002760 | NaN | -0.001590 | -0.003571 | -0.001948 |
Ticket_SCPARIS | -0.037643 | -0.016863 | 0.005189 | 0.039034 | -0.007625 | 0.008185 | 0.184602 | -0.027369 | -0.143687 | -0.004221 | … | -0.005172 | -0.004221 | -0.011644 | -0.002983 | -0.010397 | -0.007327 | NaN | -0.004221 | -0.009481 | -0.005172 |
Ticket_SCParis | -0.040572 | 0.002973 | 0.093228 | -0.020746 | 0.013831 | 0.016040 | 0.139310 | -0.020654 | -0.108433 | -0.003185 | … | -0.003903 | -0.003185 | -0.008787 | -0.002251 | -0.007846 | -0.005529 | NaN | -0.003185 | -0.007155 | -0.003903 |
Ticket_SOC | -0.045527 | 0.051055 | -0.039002 | 0.032015 | 0.035636 | -0.036769 | -0.039691 | -0.025324 | 0.050993 | -0.003905 | … | -0.004786 | -0.003905 | -0.010775 | -0.002760 | -0.009621 | -0.006780 | NaN | -0.003905 | -0.008772 | -0.004786 |
(Output truncated: the full `corrDf` correlation matrix is 69 rows × 69 columns. The trailing Ticket_* rows shown in the notebook are mostly near zero, and the Ticket_STONOQ row is entirely NaN.)
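For reference, a matrix like this comes from pandas' `corr()`. A minimal self-contained sketch, where the toy frame below stands in for the full dummy-encoded feature table used in the post:

```python
import pandas as pd

# Toy frame standing in for the dummy-encoded feature table (full_X in the post)
df = pd.DataFrame({'Survived': [0, 1, 1, 0],
                   'Fare': [7.25, 71.28, 7.92, 8.05],
                   'Pclass_1': [0, 1, 0, 0]})

# Pairwise Pearson correlations; shape is (n_features, n_features)
corrDf = df.corr()
print(corrDf.shape)  # (3, 3)
```

A column that is constant produces NaN correlations, which is why some rare Ticket_* dummies show NaN above.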
# Correlation of each feature with survival (Survived)
corrDf['Survived'].sort_values(ascending=False)
Survived 1.000000
Title_Mrs 0.344935
Title_Miss 0.332795
Pclass_1 0.285904
Family_Small 0.279855
Fare 0.257307
Cabin_B 0.175095
Embarked_C 0.168240
Cabin_D 0.150716
Ticket_PC 0.147062
Cabin_E 0.145321
Cabin_C 0.114652
Pclass_2 0.093349
Title_Master 0.085221
Parch 0.081629
Ticket_FCC 0.064285
Embarked_s 0.060095
Ticket_SWPP 0.060095
Cabin_F 0.057935
Ticket_SC 0.042470
Ticket_PP 0.033803
Ticket_SCAH 0.033803
Title_Royalty 0.033391
Cabin_A 0.022287
Ticket_STONO2 0.019667
FamilySize 0.016639
Cabin_G 0.016040
Ticket_SCParis 0.016040
Ticket_PPP 0.011329
Ticket_SCPARIS 0.008185
...
Ticket_SOP -0.026456
Ticket_Fa -0.026456
Ticket_SCOW -0.026456
Cabin_T -0.026456
Ticket_AS -0.026456
Ticket_FC -0.026456
Ticket_CASOTON -0.026456
Title_Officer -0.031316
SibSp -0.035322
Ticket_SOC -0.036769
Ticket_SOTONO2 -0.037436
Ticket_SOPP -0.045876
Ticket_WC -0.062182
Age -0.064910
Ticket_SOTONOQ -0.067404
Ticket_A4 -0.070234
Ticket_A5 -0.092199
Family_Large -0.125147
Embarked_S -0.155660
Family_Single -0.203367
Cabin_U -0.316912
Pclass_3 -0.322308
Sex -0.543351
Title_Mr -0.549199
Ticket_A NaN
Ticket_AQ3 NaN
Ticket_AQ4 NaN
Ticket_LP NaN
Ticket_SCA3 NaN
Ticket_STONOQ NaN
Name: Survived, Length: 69, dtype: float64
Judging by the size of each feature's correlation with survival (Survived), the following features matter most:
title (the titleDf frame built earlier), age (Age), passenger class (pclassDf), family size (familyDf), ticket fare (Fare), cabin (cabinDf), port of embarkation (embarkedDf), and sex (Sex).
Next, we perform feature selection.
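Method 1's selection amounts to keeping the feature groups with the strongest correlations and concatenating them into one frame. A minimal sketch, with hypothetical two-row stand-ins for the dummy frames (titleDf, pclassDf, …) built during preprocessing:

```python
import pandas as pd

# Hypothetical stand-ins for the dummy frames built in preprocessing
titleDf = pd.DataFrame({'Title_Mr': [1, 0], 'Title_Mrs': [0, 1]})
pclassDf = pd.DataFrame({'Pclass_1': [0, 1], 'Pclass_3': [1, 0]})
fare = pd.Series([7.25, 71.28], name='Fare')

# Keep only the feature groups that correlate most strongly with Survived
full_X = pd.concat([titleDf, pclassDf, fare], axis=1)
print(list(full_X.columns))
```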
Method 2: the random-forest classifier approach
# Import the machine-learning packages we need
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
# Random-forest classifier
# Drop the Survived column before fitting, otherwise the target leaks into the features
clf = RandomForestClassifier(n_estimators=100, max_features='sqrt')
clf = clf.fit(trainDf.drop('Survived', axis=1), targets)
# Inspect feature importances
features = pd.DataFrame()
features['feature'] = trainDf.drop('Survived', axis=1).columns
features['importance'] = clf.feature_importances_
features.sort_values(by=['importance'], ascending=True, inplace=True)
features.set_index('feature', inplace=True)
# Visualise (for barh, the x-axis carries the importance values)
features.plot(kind='barh', figsize=(25, 25), color='g')
plt.title('Feature importance ranking')
plt.xlabel('Importance')
plt.ylabel('Feature')
The chart shows that survival is strongly related to sex, title (Title_Mr carries the most weight, followed by Title_Mrs and Title_Miss), ticket fare, age, passenger class, and so on.
Next we perform feature selection with method 2.
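These importances can also drive automatic selection via `SelectFromModel` (imported above). A self-contained sketch, using synthetic stand-ins for trainDf and targets:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(0)
X = rng.rand(200, 10)            # stand-in for trainDf (without Survived)
y = (X[:, 0] > 0.5).astype(int)  # stand-in target: only column 0 matters

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# Keep only the features whose importance is at least the median importance
selector = SelectFromModel(clf, threshold='median', prefit=True)
X_reduced = selector.transform(X)
print(X_reduced.shape)
```

The informative column 0 survives the cut, while most of the noise columns are dropped.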
4. Building the model
Use the training data and a machine-learning algorithm to obtain a model, then evaluate that model on held-out data.
Method 1
From the preprocessed dataset we split off a training set (used to fit the model) and a validation set (used to evaluate it), using the train_test_split helper. We then choose an algorithm, train the model, evaluate it, and finally apply it to the test rows provided by Kaggle (features_1.iloc[891:] here).
# The validation split is 20% of the data
# X: feature matrix, Y: target vector
X = features_1.iloc[:891]  # the 891 rows with known labels (the original training set)
Y = targets
train_X, test_X, train_Y, test_Y = train_test_split(X, Y, test_size=.2)
# Print the dataset sizes
print('original features:', X.shape,
      'training features:', train_X.shape,
      'validation features:', test_X.shape)
print('original labels:', targets.shape,
      'training labels:', train_Y.shape,
      'validation labels:', test_Y.shape)
original features: (891, 29) training features: (712, 29) validation features: (179, 29)
original labels: (891,) training labels: (712,) validation labels: (179,)
Since this is a first contact with machine learning, we use the logistic-regression algorithm.
# Build the model: logistic regression
model_1 = LogisticRegression()
# Train the model
model_1.fit(train_X, train_Y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
# Evaluate the model
# score returns the model's accuracy on the given data
model_1.score(test_X, test_Y)
0.8212290502793296
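Note that a single 80/20 hold-out score can vary from split to split; `cross_val_score` (imported above) averages over several folds and gives a steadier estimate. A self-contained sketch with synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(1)
X = rng.rand(200, 5)                                    # stand-in for the 891x29 feature matrix
y = (X[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)  # noisy stand-in target

# Fit and score on 5 different train/validation folds
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean())  # mean accuracy across the 5 folds
```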
# Apply the model to the test data
pre_X = features_1.iloc[891:]
# Cast to int: Survived became float after the train/test merge, and Kaggle expects integer labels
predict_1 = model_1.predict(pre_X).astype(int)
# Save in the Kaggle submission format
passenger_id = pd.read_csv('./data/test.csv', usecols=['PassengerId'])['PassengerId'].values
predDf_1 = pd.DataFrame({'PassengerId': passenger_id,
                         'Survived': predict_1})
predDf_1.shape
# predDf_1.head()
predDf_1.to_csv('predict_result_1.csv', index=False)  # write the submission file
Method 2
The random-forest algorithm; I have not learned it yet.
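For completeness, here is a minimal sketch of what method 2 might look like, mirroring the method-1 workflow with a random forest; the data below are synthetic stand-ins for features_1 and targets:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(300, 8)             # stand-in for features_1.iloc[:891]
y = (X[:, 0] > 0.5).astype(int)  # stand-in for targets

# Same 80/20 split as method 1
train_X, test_X, train_Y, test_Y = train_test_split(X, y, test_size=0.2, random_state=0)

model_2 = RandomForestClassifier(n_estimators=100, random_state=0)
model_2.fit(train_X, train_Y)
print(model_2.score(test_X, test_Y))  # hold-out accuracy
```

As with the logistic-regression model, `predict` on the held-out Kaggle rows would then produce a submission in the same format.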