First Kaggle Experience: Predicting Titanic Passenger Survival with Machine Learning (Part 2)

Posted: 2023-01-23 20:05:54

In Part 1 we analysed the survival of Titanic passengers with conventional methods and concluded that a passenger's profile alone is not enough to tell whether he or she survived.

Part 1: https://blog.csdn.net/wuzlun/article/details/80189766

Next, let's analyse the same data with machine learning.

# Re-import the tools used in this part

# with this magic there is no need to call plt.show()
%matplotlib inline  

import warnings
# suppress warning messages
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=DeprecationWarning)

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif']=['SimHei'] # display Chinese labels correctly
plt.rcParams['axes.unicode_minus']=False # display minus signs correctly

# seaborn complements and extends matplotlib
import seaborn as sns  

II. Machine Learning

First, as before, we prepare the data.

# load the downloaded training data (train.csv) and test data (test.csv)
# merge the two sets so they can be cleaned in one pass

train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')
print('Training set', train.shape, 'Test set', test.shape)

# after merging, drop the PassengerId column (an identifier, not a feature)
full = pd.concat([train, test], ignore_index=True)
full.drop('PassengerId', axis=1, inplace=True)
print('Merged set', full.shape, '\n\nMerged table preview:')
full.head()
Training set (891, 12) Test set (418, 11) Merged set (1309, 11) Merged table preview:
Age Cabin Embarked Fare Name Parch Pclass Sex SibSp Survived Ticket
0 22.0 NaN S 7.2500 Braund, Mr. Owen Harris 0 3 male 1 0.0 A/5 21171
1 38.0 C85 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th… 0 1 female 1 1.0 PC 17599
2 26.0 NaN S 7.9250 Heikkinen, Miss. Laina 0 3 female 0 1.0 STON/O2. 3101282
3 35.0 C123 S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 1 female 1 1.0 113803
4 35.0 NaN S 8.0500 Allen, Mr. William Henry 0 3 male 0 0.0 373450
# descriptive statistics for the numeric columns
full.describe()
Age Fare Parch Pclass SibSp Survived
count 1046.000000 1308.000000 1309.000000 1309.000000 1309.000000 891.000000
mean 29.881138 33.295479 0.385027 2.294882 0.498854 0.383838
std 14.413493 51.758668 0.865560 0.837836 1.041658 0.486592
min 0.170000 0.000000 0.000000 1.000000 0.000000 0.000000
25% 21.000000 7.895800 0.000000 2.000000 0.000000 0.000000
50% 28.000000 14.454200 0.000000 3.000000 0.000000 0.000000
75% 39.000000 31.275000 0.000000 3.000000 1.000000 1.000000
max 80.000000 512.329200 9.000000 3.000000 8.000000 1.000000
# check each column's dtype and non-null count
full.info()

(The full.info() output appears here as an image in the original post; its numbers are summarised below.)

The merged table has 1309 rows.

Among the numeric columns:
1) Age has 1046 non-null values, so 1309 − 1046 = 263 are missing, a missing rate of 263/1309 ≈ 20%.
2) Fare has 1308 non-null values; only 1 value is missing.
3) Survived needs no treatment: its 418 missing values are exactly the test rows we must predict.

Among the string columns:
1) Embarked has 1307 non-null values; only 2 are missing.
2) Cabin has 295 non-null values, so 1309 − 295 = 1014 are missing, a missing rate of 1014/1309 ≈ 77.5% — a large gap.

This points the way for the next step: only once we know which columns have missing data can we treat them deliberately.
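The counts quoted above can be read off in one line. A minimal sketch (a toy frame stands in for `full` here):

```python
import pandas as pd
import numpy as np

# toy frame shaped like `full`: some columns have missing values
df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Cabin': [np.nan, 'C85', np.nan],
    'Embarked': ['S', 'C', np.nan],
})
missing = df.isnull().sum()        # missing count per column
missing_rate = missing / len(df)   # missing rate per column
print(missing_rate['Cabin'])
```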

Data Cleaning

Many machine learning algorithms cannot train on features that contain missing values, so the first task is to handle them.

1. Handling Missing Values

Rules of thumb for filling missing values:
1. Numeric columns: fill with the mean.
   PS: Part 1 showed that passenger ages are widely spread, so we use the median for Age instead.
2. Categorical columns: fill with the most frequent category.
3. Or predict the missing values with a model, e.g. K-NN.
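The first two rules can be sketched in a couple of lines (toy Series, not the author's cell; rule 3 would use something like sklearn.impute.KNNImputer):

```python
import pandas as pd
import numpy as np

s_num = pd.Series([1.0, np.nan, 3.0, 100.0])  # numeric column with an outlier
s_cat = pd.Series(['S', 'C', None, 'S'])      # categorical column

filled_median = s_num.fillna(s_num.median())  # numeric: median is robust to the 100.0 outlier
filled_mode = s_cat.fillna(s_cat.mode()[0])   # categorical: most frequent value
```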

(1) Numeric columns

# numeric columns
# fill the training and test portions separately
sRow = 891  # the original training set has 891 rows

# Age: fill with each portion's median
# (.loc avoids chained indexing such as full[0:sRow]['Age'],
#  which would modify a copy instead of full itself)
full.loc[:sRow-1, 'Age'] = full.loc[:sRow-1, 'Age'].fillna(train['Age'].median())
full.loc[sRow:, 'Age'] = full.loc[sRow:, 'Age'].fillna(test['Age'].median())

# Fare: fill with each portion's mean
full.loc[:sRow-1, 'Fare'] = full.loc[:sRow-1, 'Fare'].fillna(train['Fare'].mean())
full.loc[sRow:, 'Fare'] = full.loc[sRow:, 'Fare'].fillna(test['Fare'].mean())

full.describe()
Age Fare Parch Pclass SibSp Survived
count 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000 891.000000
mean 29.437487 33.297261 0.385027 2.294882 0.498854 0.383838
std 12.915275 51.738919 0.865560 0.837836 1.041658 0.486592
min 0.170000 0.000000 0.000000 1.000000 0.000000 0.000000
25% 22.000000 7.895800 0.000000 2.000000 0.000000 0.000000
50% 28.000000 14.454200 0.000000 3.000000 0.000000 0.000000
75% 35.000000 31.275000 0.000000 3.000000 1.000000 1.000000
max 80.000000 512.329200 9.000000 3.000000 8.000000 1.000000

(2) String columns

# count passengers per embarkation port (Embarked)
print('Training portion\n', full[0:sRow]['Embarked'].value_counts())
print('Test portion\n', full[sRow:]['Embarked'].value_counts())
Training portion
S    644
C    168
Q     77
Name: Embarked, dtype: int64
Test portion
S    270
C    102
Q     46
Name: Embarked, dtype: int64
# S is by far the most common port, so replace the missing Embarked values with 'S'
# (the value must be uppercase 'S' to match the existing categories)
full['Embarked'].fillna('S', inplace=True)

# Cabin has too many missing values; fill them with 'U' for Unknown
full['Cabin'].fillna( 'U', inplace=True)
With the missing values handled, let's check the result and the completed table.
# the table after missing-value handling
full.info()
full.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1309 entries, 0 to 1308
Data columns (total 11 columns):
Age         1309 non-null float64
Cabin       1309 non-null object
Embarked    1309 non-null object
Fare        1309 non-null float64
Name        1309 non-null object
Parch       1309 non-null int64
Pclass      1309 non-null int64
Sex         1309 non-null object
SibSp       1309 non-null int64
Survived    891 non-null float64
Ticket      1309 non-null object
dtypes: float64(3), int64(3), object(5)
memory usage: 112.6+ KB
Age Cabin Embarked Fare Name Parch Pclass Sex SibSp Survived Ticket
0 22.0 U S 7.2500 Braund, Mr. Owen Harris 0 3 male 1 0.0 A/5 21171
1 38.0 C85 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th… 0 1 female 1 1.0 PC 17599
2 26.0 U S 7.9250 Heikkinen, Miss. Laina 0 3 female 0 1.0 STON/O2. 3101282
3 35.0 C123 S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 1 female 1 1.0 113803
4 35.0 U S 8.0500 Allen, Mr. William Henry 0 3 male 0 0.0 373450

As the table above shows, the missing-value treatment has made the data complete. But columns such as Name, Cabin and Ticket are still raw strings, so the table cannot be fed to a model directly. Next we apply feature engineering.

2. Feature Extraction

(1) Column types

Looking at the dtypes, the columns fall into three kinds. Categorical columns will be recoded as numbers and then one-hot encoded.

1. Numeric:

Age, Fare, SibSp (siblings/spouses aboard), Parch (parents/children aboard)
2. Time series: none here
3. Categorical:

1) Directly categorical:
Sex: male / female
Embarked: port of departure S = Southampton, England; stop 1 C = Cherbourg, France; stop 2 Q = Queenstown, Ireland
Pclass: 1 = 1st class, 2 = 2nd class, 3 = 3rd class
2) Free-form strings from which categories may be extracted, so they also count as categorical:
Name
Cabin
Ticket

Let's start with the simplest.

Sex

# map the Sex values to numbers
# male → 1, female → 0

sex_mapDict={'male':1, 'female':0}
# map applies the lookup to every element of the Series
full['Sex']=full['Sex'].map(sex_mapDict)
full.head()
Age Cabin Embarked Fare Name Parch Pclass Sex SibSp Survived Ticket
0 22.0 U S 7.2500 Braund, Mr. Owen Harris 0 3 1 1 0.0 A/5 21171
1 38.0 C85 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th… 0 1 0 1 1.0 PC 17599
2 26.0 U S 7.9250 Heikkinen, Miss. Laina 0 3 0 0 1.0 STON/O2. 3101282
3 35.0 C123 S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 1 0 1 1.0 113803
4 35.0 U S 8.0500 Allen, Mr. William Henry 0 3 1 0 0.0 373450

Embarkation port (Embarked)

Encoding approach:
one-hot encode the column with get_dummies, producing a table of dummy variables, then attach that table to full.

The values of Embarked are:
port of departure: S = Southampton, England
stop 1: C = Cherbourg, France
stop 2: Q = Queenstown, Ireland

# hold the extracted features
embarkedDf = pd.DataFrame()
embarkedDf = pd.get_dummies(full['Embarked'], prefix='Embarked')  # column names get the prefix Embarked
embarkedDf.head()
Embarked_C Embarked_Q Embarked_S Embarked_s
0 0 0 1 0
1 1 0 0 0
2 0 0 1 0
3 0 0 1 0
4 0 0 1 0
# attach the one-hot dummy variables to the full Titanic dataset
full = pd.concat([full, embarkedDf], axis=1)
# drop the original Embarked column
full.drop('Embarked', axis=1, inplace=True)
full.head()
Age Cabin Fare Name Parch Pclass Sex SibSp Survived Ticket Embarked_C Embarked_Q Embarked_S Embarked_s
0 22.0 U 7.2500 Braund, Mr. Owen Harris 0 3 1 1 0.0 A/5 21171 0 0 1 0
1 38.0 C85 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th… 0 1 0 1 1.0 PC 17599 1 0 0 0
2 26.0 U 7.9250 Heikkinen, Miss. Laina 0 3 0 0 1.0 STON/O2. 3101282 0 0 1 0
3 35.0 C123 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 1 0 1 1.0 113803 0 0 1 0
4 35.0 U 8.0500 Allen, Mr. William Henry 0 3 1 0 0.0 373450 0 0 1 0

Passenger class (Pclass)

Encoding approach:
one-hot encode with get_dummies and attach the dummy table to full.
1 = 1st class, 2 = 2nd class, 3 = 3rd class

# hold the extracted Pclass features
pclassDf = pd.DataFrame()
pclassDf = pd.get_dummies(full['Pclass'], prefix='Pclass')  # column prefix Pclass
pclassDf.head()
Pclass_1 Pclass_2 Pclass_3
0 0 0 1
1 1 0 0
2 0 0 1
3 1 0 0
4 0 0 1
# attach the one-hot dummy variables to the full Titanic dataset
full = pd.concat([full, pclassDf], axis=1)
# drop the original Pclass column
full.drop('Pclass', axis=1, inplace=True)
full.head()
Age Cabin Fare Name Parch Sex SibSp Survived Ticket Embarked_C Embarked_Q Embarked_S Embarked_s Pclass_1 Pclass_2 Pclass_3
0 22.0 U 7.2500 Braund, Mr. Owen Harris 0 1 1 0.0 A/5 21171 0 0 1 0 0 0 1
1 38.0 C85 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th… 0 0 1 1.0 PC 17599 1 0 0 0 1 0 0
2 26.0 U 7.9250 Heikkinen, Miss. Laina 0 0 0 1.0 STON/O2. 3101282 0 0 1 0 0 0 1
3 35.0 C123 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 0 1 1.0 113803 0 0 1 0 1 0 0
4 35.0 U 8.0500 Allen, Mr. William Henry 0 1 0 0.0 373450 0 0 1 0 0 0 1

Name

A look at the Name column reveals something striking: every name contains a title, a form of address. Extracted as its own variable, it becomes very useful for prediction. For example:
Braund, Mr. Owen Harris
Heikkinen, Miss. Laina
Oliva y Ocana, Dona. Fermina
Peter, Master. Michael J

So we extract the title from each name.

# define a function that extracts the title from a passenger's name
def getTitle(name):
    str1 = name.split(',')[1]  # ' Mr. Owen Harris'
    str2 = str1.split('.')[0]  # ' Mr'
    return str2.strip()        # strip surrounding whitespace
# hold the extracted feature
titleDf = pd.DataFrame()
titleDf['Title'] = full['Name'].map(getTitle)
print('Feature table:', titleDf.shape)
titleDf.head()
Feature table: (1309, 1)
Title
0 Mr
1 Mrs
2 Miss
3 Mrs
4 Mr
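The extractor can be checked against the sample names listed earlier (the function is re-defined here so the snippet stands alone; it is the same logic as getTitle above):

```python
# take the part after the comma, then the part before the first period, then strip
def getTitle(name):
    return name.split(',')[1].split('.')[0].strip()

print(getTitle('Braund, Mr. Owen Harris'))        # Mr
print(getTitle('Oliva y Ocana, Dona. Fermina'))   # Dona
```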

We define the following title categories:

Officer: officers and professionals (Capt, Col, Major, Dr, Rev)
Royalty: nobility
Mr: married men
Mrs: married women
Miss: young unmarried women
Master: young boys

# mapping from the raw title strings to the defined categories
title_mapDict = {
                    "Capt":       "Officer",
                    "Col":        "Officer",
                    "Major":      "Officer",
                    "Jonkheer":   "Royalty",
                    "Don":        "Royalty",
                    "Sir" :       "Royalty",
                    "Dr":         "Officer",
                    "Rev":        "Officer",
                    "the Countess":"Royalty",
                    "Dona":       "Royalty",
                    "Mme":        "Mrs",
                    "Mlle":       "Miss",
                    "Ms":         "Mrs",
                    "Mr" :        "Mr",
                    "Mrs" :       "Mrs",
                    "Miss" :      "Miss",
                    "Master" :    "Master",
                    "Lady" :      "Royalty"
                    }
# map applies the lookup to every element of the Series
titleDf['Title'] = titleDf['Title'].map(title_mapDict)
titleDf = pd.get_dummies(titleDf['Title'], prefix='Title')
titleDf.head()
Title_Master Title_Miss Title_Mr Title_Mrs Title_Officer Title_Royalty
0 0 0 1 0 0 0
1 0 0 0 1 0 0
2 0 1 0 0 0 0
3 0 0 0 1 0 0
4 0 0 1 0 0 0
# attach the one-hot dummy variables to the full Titanic dataset
full = pd.concat([full, titleDf], axis=1)
full.drop('Name', axis=1, inplace=True)
full.head()
Age Cabin Fare Parch Sex SibSp Survived Ticket Embarked_C Embarked_Q Embarked_s Pclass_1 Pclass_2 Pclass_3 Title_Master Title_Miss Title_Mr Title_Mrs Title_Officer Title_Royalty
0 22.0 U 7.2500 0 1 1 0.0 A/5 21171 0 0 0 0 0 1 0 0 1 0 0 0
1 38.0 C85 71.2833 0 0 1 1.0 PC 17599 1 0 0 1 0 0 0 0 0 1 0 0
2 26.0 U 7.9250 0 0 0 1.0 STON/O2. 3101282 0 0 0 0 0 1 0 1 0 0 0 0
3 35.0 C123 53.1000 0 0 1 1.0 113803 0 0 0 1 0 0 0 0 0 1 0 0
4 35.0 U 8.0500 0 1 0 0.0 373450 0 0 0 0 0 1 0 0 1 0 0 0

5 rows × 21 columns

Cabin number (Cabin)

Use the first letter of the cabin number as the cabin category.

# hold the cabin features
cabinDf = pd.DataFrame()
full['Cabin'] = full['Cabin'].map(lambda c: c[0])  # first letter of the cabin number
cabinDf = pd.get_dummies(full['Cabin'], prefix='Cabin')
cabinDf.head()
Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U
0 0 0 0 0 0 0 0 0 1
1 0 0 1 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 1
3 0 0 1 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 1
# attach the one-hot dummy variables to the full Titanic dataset
full = pd.concat([full, cabinDf], axis=1)
full.drop('Cabin', axis=1, inplace=True)
full.head()
Age Fare Parch Sex SibSp Survived Ticket Embarked_C Embarked_Q Embarked_S Title_Royalty Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U
0 22.0 7.2500 0 1 1 0.0 A/5 21171 0 0 1 0 0 0 0 0 0 0 0 0 1
1 38.0 71.2833 0 0 1 1.0 PC 17599 1 0 0 0 0 0 1 0 0 0 0 0 0
2 26.0 7.9250 0 0 0 1.0 STON/O2. 3101282 0 0 1 0 0 0 0 0 0 0 0 0 1
3 35.0 53.1000 0 0 1 1.0 113803 0 0 1 0 0 0 1 0 0 0 0 0 0
4 35.0 8.0500 0 1 0 0.0 373450 0 0 1 0 0 0 0 0 0 0 0 0 1

5 rows × 29 columns

Family information

Count each passenger's family size and derive a family type.

Family size = Parch (parents/children aboard) + SibSp (siblings/spouses aboard) + the passenger themselves
Family types:
Family_Single: family size = 1
Family_Small: 2 ≤ family size ≤ 4
Family_Large: family size ≥ 5

# hold the family features
familyDf = pd.DataFrame()
familyDf[ 'FamilySize' ] = full[ 'Parch' ] + full[ 'SibSp' ] + 1

# each lambda returns 1 when its condition holds, otherwise 0
familyDf[ 'Family_Single' ] = familyDf[ 'FamilySize' ].map( lambda s : 1 if s == 1 else 0 )
familyDf[ 'Family_Small' ]  = familyDf[ 'FamilySize' ].map( lambda s : 1 if 2 <= s <= 4 else 0 )
familyDf[ 'Family_Large' ]  = familyDf[ 'FamilySize' ].map( lambda s : 1 if 5 <= s else 0 )

familyDf.head()
FamilySize Family_Single Family_Small Family_Large
0 2 0 1 0
1 2 0 1 0
2 1 1 0 0
3 2 0 1 0
4 1 1 0 0
# attach the family features to the full Titanic dataset
full = pd.concat([full, familyDf], axis=1)
full.head()
Age Fare Parch Sex SibSp Survived Ticket Embarked_C Embarked_Q Embarked_S Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U FamilySize Family_Single Family_Small Family_Large
0 22.0 7.2500 0 1 1 0.0 A/5 21171 0 0 1 0 0 0 0 0 1 2 0 1 0
1 38.0 71.2833 0 0 1 1.0 PC 17599 1 0 0 0 0 0 0 0 0 2 0 1 0
2 26.0 7.9250 0 0 0 1.0 STON/O2. 3101282 0 0 1 0 0 0 0 0 1 1 1 0 0
3 35.0 53.1000 0 0 1 1.0 113803 0 0 1 0 0 0 0 0 0 2 0 1 0
4 35.0 8.0500 0 1 0 0.0 373450 0 0 1 0 0 0 0 0 1 1 1 0 0

5 rows × 33 columns

Ticket number (Ticket)

Extract the letter prefix of the ticket number as a categorical feature.

# strip punctuation from the ticket number and extract the letter prefix
def cleanTicket(ticket):
    ticket = ticket.replace('.', '')
    ticket = ticket.replace('/', '')
    ticket = ticket.split()
    ticket = map(lambda t: t.strip(), ticket)
    ticket = list(filter(lambda t: not t.isdigit(), ticket))  # keep only non-numeric chunks
    if len(ticket) > 0:
        return ticket[0]
    else:
        return 'Nfrefix'  # purely numeric tickets get this "no prefix" label
# hold the ticket features
ticketDf = pd.DataFrame()
full['Ticket'] = full['Ticket'].map(cleanTicket)
ticketDf = pd.get_dummies(full['Ticket'], prefix='Ticket')
ticketDf.head()
Ticket_A Ticket_A4 Ticket_A5 Ticket_AQ3 Ticket_AQ4 Ticket_AS Ticket_C Ticket_CA Ticket_CASOTON Ticket_FC Ticket_SOPP Ticket_SOTONO2 Ticket_SOTONOQ Ticket_SP Ticket_STONO Ticket_STONO2 Ticket_STONOQ Ticket_SWPP Ticket_WC Ticket_WEP
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 37 columns

# attach the one-hot dummy variables to the full Titanic dataset
full = pd.concat([full, ticketDf], axis=1)
full.drop('Ticket', axis=1, inplace=True)
full.head()
Age Fare Parch Sex SibSp Survived Embarked_C Embarked_Q Embarked_S Embarked_s Ticket_SOPP Ticket_SOTONO2 Ticket_SOTONOQ Ticket_SP Ticket_STONO Ticket_STONO2 Ticket_STONOQ Ticket_SWPP Ticket_WC Ticket_WEP
0 22.0 7.2500 0 1 1 0.0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
1 38.0 71.2833 0 0 1 1.0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
2 26.0 7.9250 0 0 0 1.0 0 0 1 0 0 0 0 0 0 1 0 0 0 0
3 35.0 53.1000 0 0 1 1.0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
4 35.0 8.0500 0 1 0 0.0 0 0 1 0 0 0 0 0 0 0 0 0 0 0

5 rows × 69 columns

All the features are now processed; as the table above shows, there are 69 in total. We will not feed all 69 into a model — that is still too many — so we pick features by how much they matter.

3. Feature Selection

Feature selection means keeping the important features and ignoring those with little influence.
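As a sketch of the idea (toy data and a hypothetical 0.5 cutoff, not the thresholds used below): keep only columns whose absolute Pearson correlation with Survived exceeds the cutoff.

```python
import pandas as pd

# toy frame: the label plus two candidate features
df = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1],
    'Sex':      [1, 0, 0, 1, 0],   # perfectly (negatively) correlated with the label
    'Noise':    [0, 1, 0, 1, 1],   # only weakly correlated
})
corr = df.corr()['Survived'].drop('Survived')
selected = corr[corr.abs() > 0.5].index.tolist()
print(selected)
```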

Before selecting, let's carve out the processed datasets.

The processed datasets

# the original training set has 891 rows

# labels: read directly from the raw file
targets = pd.read_csv('./data/train.csv', usecols=['Survived'])['Survived'].values
# targets = full.loc[0:890,'Survived'] # equivalent

# training portion
trainDf = full.iloc[:891]

# test portion
testDf = full.iloc[891:]

Method one: the Pearson correlation coefficient

# inspect the Pearson correlation matrix
corrDf = trainDf.corr()
corrDf
Age Fare Parch Sex SibSp Survived Embarked_C Embarked_Q Embarked_S Embarked_s Ticket_SOPP Ticket_SOTONO2 Ticket_SOTONOQ Ticket_SP Ticket_STONO Ticket_STONO2 Ticket_STONOQ Ticket_SWPP Ticket_WC Ticket_WEP
Age 1.000000 0.096688 -0.172482 0.081163 -0.233296 -0.064910 0.030248 -0.031415 -0.014665 0.075229 0.019230 0.004150 -0.023078 -0.026692 -0.004743 -0.024435 NaN 0.040600 -0.005417 0.095166
Fare 0.096688 1.000000 0.216225 -0.182333 0.159651 0.257307 0.269335 -0.117216 -0.166603 0.045646 -0.026551 -0.023569 -0.065010 -0.016229 -0.057589 -0.035872 NaN -0.020728 -0.021149 0.041570
Parch -0.172482 0.216225 1.000000 -0.245489 0.414838 0.081629 -0.011069 -0.081228 0.063036 -0.022467 -0.027532 -0.022467 -0.061983 -0.015878 -0.055345 -0.039002 NaN -0.022467 0.134682 0.044618
Sex 0.081163 -0.182333 -0.245489 1.000000 -0.114631 -0.543351 -0.082853 -0.074115 0.125722 -0.064296 0.002321 0.034990 0.078271 0.024728 0.086193 -0.082890 NaN 0.034990 -0.055216 0.002321
SibSp -0.233296 0.159651 0.414838 -0.114631 1.000000 -0.035322 -0.059528 -0.026354 0.070941 -0.022508 -0.027582 -0.022508 -0.062097 -0.015907 -0.046612 -0.001719 NaN -0.022508 0.026776 0.007576
Survived -0.064910 0.257307 0.081629 -0.543351 -0.035322 1.000000 0.168240 0.003650 -0.155660 0.060095 -0.045876 -0.037436 -0.067404 -0.026456 0.007887 0.019667 NaN 0.060095 -0.062182 -0.006036
Embarked_C 0.030248 0.269335 -0.011069 -0.082853 -0.059528 0.168240 1.000000 -0.148258 -0.778359 -0.022864 -0.028018 -0.022864 -0.063078 -0.016158 -0.056322 -0.039691 NaN -0.022864 -0.051357 -0.028018
Embarked_Q -0.031415 -0.117216 -0.081228 -0.074115 -0.026354 0.003650 -0.148258 1.000000 -0.496624 -0.014588 -0.017877 -0.014588 -0.040246 -0.010310 -0.035936 -0.025324 NaN -0.014588 -0.032768 -0.017877
Embarked_S -0.014665 -0.166603 0.063036 0.125722 0.070941 -0.155660 -0.778359 -0.496624 1.000000 -0.076588 0.035996 0.029374 0.081040 0.020759 0.072361 0.050993 NaN 0.029374 0.065981 0.035996
Embarked_s 0.075229 0.045646 -0.022467 -0.064296 -0.022508 0.060095 -0.022864 -0.014588 -0.076588 1.000000 -0.002757 -0.002250 -0.006207 -0.001590 -0.005542 -0.003905 NaN -0.002250 -0.005053 -0.002757
Pclass_1 0.323896 0.591711 -0.017633 -0.098013 -0.054582 0.285904 0.296423 -0.155342 -0.170379 0.083847 -0.032880 -0.026831 -0.074023 -0.018962 -0.066095 -0.046578 NaN -0.026831 -0.060268 0.102749
Pclass_2 0.015831 -0.118557 -0.000734 -0.064746 -0.055932 0.093349 -0.125416 -0.127301 0.192061 -0.024197 0.066072 -0.024197 -0.066756 -0.017100 -0.059607 -0.042005 NaN 0.092975 0.024606 -0.029652
Pclass_3 -0.291955 -0.413333 0.015790 0.137143 0.092548 -0.322308 -0.153329 0.237449 -0.009511 -0.052550 -0.025444 0.042811 0.118109 0.030255 0.105459 0.074318 NaN -0.052550 0.031902 -0.064397
Title_Master -0.373960 0.010908 0.267344 0.159934 0.349559 0.085221 -0.035225 0.010478 0.025291 -0.010283 -0.012601 -0.010283 -0.028370 -0.007267 -0.025332 -0.017851 NaN -0.010283 -0.023098 -0.012601
Title_Miss -0.248767 0.120829 0.102514 -0.691548 0.084945 0.332795 0.037613 0.168720 -0.142412 0.034389 -0.029652 -0.024197 -0.045206 -0.017100 -0.059607 0.093599 NaN -0.024197 0.077244 0.018210
Title_Mr 0.180808 -0.183766 -0.333905 0.867334 -0.250489 -0.549199 -0.072567 -0.078338 0.118482 -0.055767 0.010178 0.040342 0.093621 0.028510 0.099377 -0.069002 NaN 0.040342 -0.038911 -0.029080
Title_Mrs 0.166798 0.105665 0.221318 -0.552686 0.059941 0.344935 0.066101 -0.091121 -0.005691 0.048498 0.031722 -0.019338 -0.053352 -0.013667 -0.047638 0.005683 NaN -0.019338 -0.012963 -0.023698
Title_Officer 0.179927 0.010357 -0.048211 0.089228 -0.024712 -0.031316 -0.008034 0.012618 -0.000180 -0.006811 -0.008346 -0.006811 -0.018790 -0.004813 -0.016777 -0.011823 NaN -0.006811 -0.015298 0.129364
Title_Royalty 0.070654 0.015044 -0.035583 -0.007483 -0.008384 0.033391 0.079020 -0.023105 -0.054171 -0.003563 -0.004366 -0.003563 -0.009830 -0.002518 -0.008777 -0.006185 NaN -0.003563 -0.008004 -0.004366
Cabin_A 0.121732 0.019549 -0.040325 0.078271 -0.046266 0.022287 0.093040 -0.040246 -0.055383 -0.006207 -0.007606 -0.006207 -0.017123 -0.004386 -0.015289 -0.010775 NaN -0.006207 -0.013941 -0.007606
Cabin_B 0.096080 0.386297 0.056498 -0.109689 -0.034538 0.175095 0.168642 -0.072579 -0.123057 0.200996 -0.013716 -0.011193 -0.030880 -0.007910 -0.027572 -0.019430 NaN -0.011193 -0.025141 0.159633
Cabin_C 0.115188 0.364318 0.030736 -0.058649 0.029251 0.114652 0.113952 -0.049776 -0.066995 -0.012631 -0.015478 -0.012631 -0.034846 -0.008926 -0.031114 -0.021926 NaN -0.012631 -0.028371 -0.015478
Cabin_D 0.135674 0.098878 -0.019125 -0.079248 -0.017575 0.150716 0.102977 -0.060318 -0.051139 -0.009302 -0.011399 -0.009302 -0.025663 -0.006574 -0.022914 -0.016148 NaN -0.009302 -0.020894 -0.011399
Cabin_E 0.120483 0.053717 -0.016554 -0.047003 -0.036865 0.145321 -0.015939 -0.037897 0.038685 -0.009155 0.092903 -0.009155 0.021626 -0.006470 -0.022551 -0.015892 NaN -0.009155 -0.020563 0.092903
Cabin_F -0.076393 -0.033093 0.023694 -0.008202 0.001706 0.057935 -0.034726 -0.004113 0.033537 -0.005771 -0.007073 -0.005771 -0.015923 -0.004079 -0.014217 -0.010019 NaN -0.005771 -0.012964 -0.007073
Cabin_G -0.075406 -0.025180 0.072388 -0.091031 -0.001402 0.016040 -0.032371 -0.020654 0.041589 -0.003185 -0.003903 -0.003185 -0.008787 -0.002251 -0.007846 -0.005529 NaN -0.003185 -0.007155 -0.003903
Cabin_T 0.040285 0.002224 -0.015878 0.024728 -0.015907 -0.026456 -0.016158 -0.010310 0.020759 -0.001590 -0.001948 -0.001590 -0.004386 -0.001124 -0.003917 -0.002760 NaN -0.001590 -0.003571 -0.001948
Cabin_U -0.240314 -0.482075 -0.036987 0.140391 0.040460 -0.316912 -0.208528 0.129572 0.110087 -0.087042 -0.014439 0.025846 0.050544 0.018266 0.063670 0.044868 NaN 0.025846 0.058056 -0.106664
FamilySize -0.245619 0.217138 0.783111 -0.200988 0.890712 0.016639 -0.046215 -0.058592 0.079977 -0.026608 -0.032606 -0.026608 -0.073407 -0.018804 -0.059507 -0.020659 NaN -0.026608 0.085586 0.027468
Family_Single 0.171647 -0.271832 -0.583398 0.303646 -0.584471 -0.203367 -0.095298 0.086464 0.024929 0.038510 0.047192 0.038510 0.106245 0.027216 0.074968 -0.017280 NaN 0.038510 -0.044131 -0.071588
Ticket_CA -0.062501 -0.005363 0.228435 -0.006179 0.357512 -0.019137 -0.105869 -0.067548 0.136015 -0.010417 -0.012765 -0.010417 -0.028739 -0.007362 -0.025661 -0.018084 NaN -0.010417 -0.023399 -0.012765
Ticket_CASOTON -0.003507 -0.014649 -0.015878 0.024728 -0.015907 -0.026456 -0.016158 -0.010310 0.020759 -0.001590 -0.001948 -0.001590 -0.004386 -0.001124 -0.003917 -0.002760 NaN -0.001590 -0.003571 -0.001948
Ticket_FC 0.004221 0.013361 -0.015878 0.024728 0.014507 -0.026456 -0.016158 -0.010310 0.020759 -0.001590 -0.001948 -0.001590 -0.004386 -0.001124 -0.003917 -0.002760 NaN -0.001590 -0.003571 -0.001948
Ticket_FCC 0.038324 -0.015359 0.039016 -0.070383 -0.008384 0.064285 -0.036212 -0.023105 0.046524 -0.003563 -0.004366 -0.003563 -0.009830 -0.002518 -0.008777 -0.006185 NaN -0.003563 -0.008004 -0.004366
Ticket_Fa -0.003507 -0.016800 -0.015878 0.024728 -0.015907 -0.026456 -0.016158 -0.010310 0.020759 -0.001590 -0.001948 -0.001590 -0.004386 -0.001124 -0.003917 -0.002760 NaN -0.001590 -0.003571 -0.001948
Ticket_LINE 0.014906 -0.043544 -0.031809 0.049539 -0.031867 -0.018481 -0.032371 -0.020654 0.041589 -0.003185 -0.003903 -0.003185 -0.008787 -0.002251 -0.007846 -0.005529 NaN -0.003185 -0.007155 -0.003903
Ticket_LP NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Ticket_Nfrefix -0.030025 -0.173170 -0.035762 -0.032507 -0.080768 0.001492 -0.115615 0.172297 -0.010083 0.027979 -0.098535 -0.080408 -0.221835 -0.056825 -0.198077 -0.139586 NaN -0.080408 -0.180613 -0.098535
Ticket_PC 0.128823 0.486256 -0.049451 -0.073639 -0.046244 0.147062 0.397139 -0.082643 -0.293812 -0.012745 -0.015618 -0.012745 -0.035162 -0.009007 -0.031396 -0.022125 NaN -0.012745 -0.028628 -0.015618
Ticket_PP -0.056706 -0.021012 0.044618 -0.038235 -0.010003 0.033803 -0.028018 -0.017877 0.035996 -0.002757 -0.003378 -0.002757 -0.007606 -0.001948 -0.006791 -0.004786 NaN -0.002757 -0.006193 -0.003378
Ticket_PPP -0.001318 -0.007835 -0.022467 -0.014653 0.020528 0.011329 0.098396 -0.014588 -0.076588 -0.002250 -0.002757 -0.002250 -0.006207 -0.001590 -0.005542 -0.003905 NaN -0.002250 -0.005053 -0.002757
Ticket_SC -0.031844 -0.013636 -0.015878 -0.045439 -0.015907 0.042470 0.069538 -0.010310 -0.054125 -0.001590 -0.001948 -0.001590 -0.004386 -0.001124 -0.003917 -0.002760 NaN -0.001590 -0.003571 -0.001948
Ticket_SCA3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Ticket_SCA4 -0.003507 -0.016302 -0.015878 0.024728 -0.015907 -0.026456 -0.016158 -0.010310 0.020759 -0.001590 -0.001948 -0.001590 -0.004386 -0.001124 -0.003917 -0.002760 NaN -0.001590 -0.003571 -0.001948
Ticket_SCAH 0.020719 -0.012023 -0.027532 -0.038235 0.007576 0.033803 0.021514 -0.017877 -0.007287 -0.002757 -0.003378 -0.002757 -0.007606 -0.001948 -0.006791 -0.004786 NaN -0.002757 -0.006193 -0.003378
Ticket_SCOW 0.009373 -0.013451 -0.015878 0.024728 -0.015907 -0.026456 -0.016158 -0.010310 0.020759 -0.001590 -0.001948 -0.001590 -0.004386 -0.001124 -0.003917 -0.002760 NaN -0.001590 -0.003571 -0.001948
Ticket_SCPARIS -0.037643 -0.016863 0.005189 0.039034 -0.007625 0.008185 0.184602 -0.027369 -0.143687 -0.004221 -0.005172 -0.004221 -0.011644 -0.002983 -0.010397 -0.007327 NaN -0.004221 -0.009481 -0.005172
Ticket_SCParis -0.040572 0.002973 0.093228 -0.020746 0.013831 0.016040 0.139310 -0.020654 -0.108433 -0.003185 -0.003903 -0.003185 -0.008787 -0.002251 -0.007846 -0.005529 NaN -0.003185 -0.007155 -0.003903
Ticket_SOC -0.045527 0.051055 -0.039002 0.032015 0.035636 -0.036769 -0.039691 -0.025324 0.050993 -0.003905 -0.004786 -0.003905 -0.010775 -0.002760 -0.009621 -0.006780 NaN -0.003905 -0.008772 -0.004786
Ticket_SOP 0.055741 -0.013282 -0.015878 0.024728 -0.015907 -0.026456 -0.016158 -0.010310 0.020759 -0.001590 -0.001948 -0.001590 -0.004386 -0.001124 -0.003917 -0.002760 NaN -0.001590 -0.003571 -0.001948
Ticket_SOPP 0.019230 -0.026551 -0.027532 0.002321 -0.027582 -0.045876 -0.028018 -0.017877 0.035996 -0.002757 1.000000 -0.002757 -0.007606 -0.001948 -0.006791 -0.004786 NaN -0.002757 -0.006193 -0.003378
Ticket_SOTONO2 0.004150 -0.023569 -0.022467 0.034990 -0.022508 -0.037436 -0.022864 -0.014588 0.029374 -0.002250 -0.002757 1.000000 -0.006207 -0.001590 -0.005542 -0.003905 NaN -0.002250 -0.005053 -0.002757
Ticket_SOTONOQ -0.023078 -0.065010 -0.061983 0.078271 -0.062097 -0.067404 -0.063078 -0.040246 0.081040 -0.006207 -0.007606 -0.006207 1.000000 -0.004386 -0.015289 -0.010775 NaN -0.006207 -0.013941 -0.007606
Ticket_SP -0.026692 -0.016229 -0.015878 0.024728 -0.015907 -0.026456 -0.016158 -0.010310 0.020759 -0.001590 -0.001948 -0.001590 -0.004386 1.000000 -0.003917 -0.002760 NaN -0.001590 -0.003571 -0.001948
Ticket_STONO -0.004743 -0.057589 -0.055345 0.086193 -0.046612 0.007887 -0.056322 -0.035936 0.072361 -0.005542 -0.006791 -0.005542 -0.015289 -0.003917 1.000000 -0.009621 NaN -0.005542 -0.012448 -0.006791
Ticket_STONO2 -0.024435 -0.035872 -0.039002 -0.082890 -0.001719 0.019667 -0.039691 -0.025324 0.050993 -0.003905 -0.004786 -0.003905 -0.010775 -0.002760 -0.009621 1.000000 NaN -0.003905 -0.008772 -0.004786
Ticket_STONOQ NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Ticket_SWPP 0.040600 -0.020728 -0.022467 0.034990 -0.022508 0.060095 -0.022864 -0.014588 0.029374 -0.002250 -0.002757 -0.002250 -0.006207 -0.001590 -0.005542 -0.003905 NaN 1.000000 -0.005053 -0.002757
Ticket_WC -0.005417 -0.021149 0.134682 -0.055216 0.026776 -0.062182 -0.051357 -0.032768 0.065981 -0.005053 -0.006193 -0.005053 -0.013941 -0.003571 -0.012448 -0.008772 NaN -0.005053 1.000000 -0.006193
Ticket_WEP 0.095166 0.041570 0.044618 0.002321 0.007576 -0.006036 -0.028018 -0.017877 0.035996 -0.002757 -0.003378 -0.002757 -0.007606 -0.001948 -0.006791 -0.004786 NaN -0.002757 -0.006193 1.000000

69 rows × 69 columns

# correlation of every feature with Survived
corrDf['Survived'].sort_values(ascending =False)
    Survived          1.000000
    Title_Mrs         0.344935
    Title_Miss        0.332795
    Pclass_1          0.285904
    Family_Small      0.279855
    Fare              0.257307
    Cabin_B           0.175095
    Embarked_C        0.168240
    Cabin_D           0.150716
    Ticket_PC         0.147062
    Cabin_E           0.145321
    Cabin_C           0.114652
    Pclass_2          0.093349
    Title_Master      0.085221
    Parch             0.081629
    Ticket_FCC        0.064285
    Embarked_s        0.060095
    Ticket_SWPP       0.060095
    Cabin_F           0.057935
    Ticket_SC         0.042470
    Ticket_PP         0.033803
    Ticket_SCAH       0.033803
    Title_Royalty     0.033391
    Cabin_A           0.022287
    Ticket_STONO2     0.019667
    FamilySize        0.016639
    Cabin_G           0.016040
    Ticket_SCParis    0.016040
    Ticket_PPP        0.011329
    Ticket_SCPARIS    0.008185
                        ...   
    Ticket_SOP       -0.026456
    Ticket_Fa        -0.026456
    Ticket_SCOW      -0.026456
    Cabin_T          -0.026456
    Ticket_AS        -0.026456
    Ticket_FC        -0.026456
    Ticket_CASOTON   -0.026456
    Title_Officer    -0.031316
    SibSp            -0.035322
    Ticket_SOC       -0.036769
    Ticket_SOTONO2   -0.037436
    Ticket_SOPP      -0.045876
    Ticket_WC        -0.062182
    Age              -0.064910
    Ticket_SOTONOQ   -0.067404
    Ticket_A4        -0.070234
    Ticket_A5        -0.092199
    Family_Large     -0.125147
    Embarked_S       -0.155660
    Family_Single    -0.203367
    Cabin_U          -0.316912
    Pclass_3         -0.322308
    Sex              -0.543351
    Title_Mr         -0.549199
    Ticket_A               NaN
    Ticket_AQ3             NaN
    Ticket_AQ4             NaN
    Ticket_LP              NaN
    Ticket_SCA3            NaN
    Ticket_STONOQ          NaN
    Name: Survived, Length: 69, dtype: float64

Ranking the features by their correlation with Survived, the following matter most:

title (the titleDf built earlier), age (Age), passenger class (pclassDf), family size (familyDf), fare (Fare), cabin (cabinDf), embarkation port (embarkedDf) and sex (Sex).

These are the features we will select.
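The post never shows the cell that assembles these selections into features_1, the matrix used for training below, but the list above and the later printed shape (891, 29) imply a column-wise concat of exactly these pieces. A reconstruction sketch under that assumption — toy stand-ins with the same column counts replace the real frames (the actual cell would pass titleDf, pclassDf, familyDf, cabinDf, embarkedDf and full's Age, Fare, Sex columns):

```python
import numpy as np
import pandas as pd

# toy stand-ins matching the frames built above: titleDf has 6 dummies,
# pclassDf 3, familyDf 4 (FamilySize + 3 flags), cabinDf 9, and this run
# produced 4 Embarked dummies
n = 5
titleDf    = pd.DataFrame(np.zeros((n, 6)))
pclassDf   = pd.DataFrame(np.zeros((n, 3)))
familyDf   = pd.DataFrame(np.zeros((n, 4)))
cabinDf    = pd.DataFrame(np.zeros((n, 9)))
embarkedDf = pd.DataFrame(np.zeros((n, 4)))
age  = pd.Series(np.zeros(n), name='Age')
fare = pd.Series(np.zeros(n), name='Fare')
sex  = pd.Series(np.zeros(n), name='Sex')

# side-by-side concat: 6 + 1 + 3 + 4 + 1 + 9 + 4 + 1 = 29 columns
features_1 = pd.concat([titleDf, age, pclassDf, familyDf,
                        fare, cabinDf, embarkedDf, sex], axis=1)
print(features_1.shape)  # (5, 29)
```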

Method two: random-forest feature importances

# import the machine-learning packages
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
# random forest classifier
clf = RandomForestClassifier(n_estimators=100, max_features='sqrt')
# fit on the features only: Survived is the label and must not leak into the inputs
clf = clf.fit(trainDf.drop('Survived', axis=1), targets)
# inspect the feature importances
features = pd.DataFrame()
features['feature'] = trainDf.drop('Survived', axis=1).columns
features['importance'] = clf.feature_importances_
features.sort_values(by=['importance'], ascending=True, inplace=True)
features.set_index('feature', inplace=True)
# visualise
features.plot(kind='barh', figsize=(25, 25), color='g')
plt.title('Feature importance ranking')
plt.xlabel('importance')
plt.ylabel('feature')

(Bar chart of the feature importances, shown as an image in the original post.)

The chart shows that survival is most strongly related to sex, title (Title_Mr ranking highest, then Title_Mrs, then Title_Miss), fare, age and cabin class.

Both methods point to the same features, so we can move on to building a model with them.


4. Building the Model

Fit a model on the training data with a machine-learning algorithm, then evaluate it on held-out data.

Method one

From the processed training portion we split off a training set (for fitting) and a validation set (for evaluation) using the train_test_split helper. We then pick an algorithm, fit the model, evaluate it, and finally apply it to Kaggle's test data (here testDf) to produce predictions.

# hold out 20% of the rows as the validation set
X = features_1.iloc[:891]  # features for the labelled rows
Y = targets
train_X, test_X, train_Y, test_Y = train_test_split(X, Y, test_size=.2)

# report the sizes
print ('full features:', X.shape, 
       'training features:', train_X.shape ,
       'validation features:', test_X.shape)

print ('full labels:', targets.shape, 
       'training labels:', train_Y.shape ,
       'validation labels:', test_Y.shape)
full features: (891, 29) training features: (712, 29) validation features: (179, 29)
full labels: (891,) training labels: (712,) validation labels: (179,)

Since this is a first encounter with machine learning, we use logistic regression.

# create the model: logistic regression
model_1 = LogisticRegression()

# fit it
model_1.fit( train_X , train_Y )
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
# evaluate the model
# score returns the model's accuracy on the validation set
model_1.score(test_X , test_Y )
0.8212290502793296
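For a classifier, score is plain accuracy: the fraction of validation rows labelled correctly. That can be checked by hand (toy arrays, no model needed):

```python
import numpy as np

# accuracy is the mean of the element-wise "prediction == truth" vector,
# which is what model_1.score(test_X, test_Y) computes for a classifier
pred = np.array([1, 0, 1, 1, 0])
truth = np.array([1, 0, 0, 1, 0])
accuracy = (pred == truth).mean()
print(accuracy)  # 0.8
```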
# apply the model to the test rows
pre_X = features_1.iloc[891:]
predict_1 = model_1.predict(pre_X)

# save in Kaggle's submission format
passenger_id = pd.read_csv('./data/test.csv', usecols=['PassengerId'])['PassengerId'].values
predDf_1 = pd.DataFrame( { 'PassengerId': passenger_id , 
                         'Survived': predict_1 } )
predDf_1.shape
# predDf_1.head()
predDf_1.to_csv('predict_result_1.csv', index=False) # write the submission file

Method two

The random-forest approach — which I have not learned yet.

See this expert's write-up instead.
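The unfinished method two would mirror method one: fit a RandomForestClassifier on the same training split, then predict and write the submission. A sketch with toy data standing in for the real features_1/targets (an assumption, not the author's code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy stand-in data: 100 rows, 4 features, a learnable rule for the label
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)

# same pattern as method one, with a random forest instead of logistic regression
model_2 = RandomForestClassifier(n_estimators=100, random_state=0)
model_2.fit(X, y)
train_acc = model_2.score(X, y)   # accuracy on the fitted data
```

On the real data, the same predict / to_csv steps as in method one would then produce the submission file.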