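The previous article covered how to fill in missing values; this one records the feature engineering work in practice.
Feature engineering here mainly means transforming variables: categorical variables are re-encoded, and numeric variables can be transformed with functions.
This article is divided into the following parts:
1. Categorical variables
2. New variables
3. Numeric variables
4. Final assembly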
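Categorical variables
Categorical variables are mainly re-encoded. For example, the quality and condition ratings in the Cond and Qual columns are re-encoded as ordinal values (0, 1, 2, 3, 4, 5).
The code is as follows: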
all_data = all_data.replace({
    'Utilities': {'AllPub': 1, 'NoSeWa': 0, 'NoSewr': 0, 'ELO': 0},
    'Street': {'Pave': 1, 'Grvl': 0},
    'FireplaceQu': {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NoFireplace': 0},
    'Fence': {'GdPrv': 2, 'GdWo': 2, 'MnPrv': 1, 'MnWw': 1, 'NoFence': 0},
    'ExterQual': {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1},
    'ExterCond': {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1},
    'BsmtQual': {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NoBsmt': 0},
    'BsmtExposure': {'Gd': 3, 'Av': 2, 'Mn': 1, 'No': 0, 'NoBsmt': 0},
    'BsmtCond': {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NoBsmt': 0},
    'GarageQual': {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NoGarage': 0},
    'GarageCond': {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NoGarage': 0},
    'KitchenQual': {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1},
    'Functional': {'Typ': 0, 'Min1': 1, 'Min2': 1, 'Mod': 2, 'Maj1': 3, 'Maj2': 4, 'Sev': 5, 'Sal': 6},
})
all_data = all_data.replace({'CentralAir': {'Y': 1, 'N': 0}})
all_data = all_data.replace({'PavedDrive': {'Y': 1, 'P': 0, 'N': 0}})
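This mainly uses the replace function. Mapping through a dictionary with the map function gives the same result here, provided every value in the column appears in the dictionary: map turns unmapped values into NaN, whereas replace leaves them unchanged.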
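To make the comparison between replace and map concrete, here is a minimal sketch on a toy Series. The data and the 'Unknown' value are made up for illustration and are not part of the original dataset.

```python
import pandas as pd

# Toy column (hypothetical); 'Unknown' is deliberately missing from the mapping.
street = pd.Series(['Pave', 'Grvl', 'Unknown'])
mapping = {'Pave': 1, 'Grvl': 0}

replaced = street.replace(mapping)  # values without a mapping are left unchanged
mapped = street.map(mapping)        # values without a mapping become NaN

print(replaced.tolist())
print(mapped.tolist())
```

When the dictionary covers every value in the column, the two calls produce the same encoding.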
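New variables
New variables are derived from existing ones. For example, for the Cond and Qual variables, ratings such as Ex and Gd that indicate good quality can be grouped into one class, and ratings such as Fa and Po into another. The grouping is based on how each feature affected SalePrice in the box plots and histograms from the exploratory analysis: levels that raise SalePrice go into one class, and the rest go into the other.
First, all quality and condition variables are split into poor and good.
The code is as follows:
# OverallQual and OverallCond: distance below or above the midpoint 5, clipped at 0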
overall_poor_qu = all_data.OverallQual.copy()
overall_poor_qu = 5 - overall_poor_qu
overall_poor_qu[overall_poor_qu<0] = 0
overall_poor_qu.name = 'overall_poor_qu'
overall_good_qu = all_data.OverallQual.copy()
overall_good_qu = overall_good_qu - 5
overall_good_qu[overall_good_qu<0] = 0
overall_good_qu.name = 'overall_good_qu'
overall_poor_cond = all_data.OverallCond.copy()
overall_poor_cond = 5 - overall_poor_cond
overall_poor_cond[overall_poor_cond<0] = 0
overall_poor_cond.name = 'overall_poor_cond'
overall_good_cond = all_data.OverallCond.copy()
overall_good_cond = overall_good_cond - 5
overall_good_cond[overall_good_cond<0] = 0
overall_good_cond.name = 'overall_good_cond'
# ExterQual and ExterCond: ratings below TA (3) are flagged as poor, ratings above TA as good
exter_poor_qu = all_data.ExterQual.copy()
exter_poor_qu[exter_poor_qu<3] = 1
exter_poor_qu[exter_poor_qu>=3] = 0
exter_poor_qu.name = 'exter_poor_qu'
exter_good_qu = all_data.ExterQual.copy()
exter_good_qu[exter_good_qu<=3] = 0
exter_good_qu[exter_good_qu>3] = 1
exter_good_qu.name = 'exter_good_qu'
exter_poor_cond = all_data.ExterCond.copy()
exter_poor_cond[exter_poor_cond<3] = 1
exter_poor_cond[exter_poor_cond>=3] = 0
exter_poor_cond.name = 'exter_poor_cond'
exter_good_cond = all_data.ExterCond.copy()
exter_good_cond[exter_good_cond<=3] = 0
exter_good_cond[exter_good_cond>3] = 1
exter_good_cond.name = 'exter_good_cond'
#BsmtCond
bsmt_poor_cond = all_data.BsmtCond.copy()
bsmt_poor_cond[bsmt_poor_cond<3] = 1
bsmt_poor_cond[bsmt_poor_cond>=3] = 0
bsmt_poor_cond.name = 'bsmt_poor_cond'
bsmt_good_cond = all_data.BsmtCond.copy()
bsmt_good_cond[bsmt_good_cond<=3] = 0
bsmt_good_cond[bsmt_good_cond>3] = 1
bsmt_good_cond.name = 'bsmt_good_cond'
# GarageQual and GarageCond
garage_poor_qu = all_data.GarageQual.copy()
garage_poor_qu[garage_poor_qu<3] = 1
garage_poor_qu[garage_poor_qu>=3] = 0
garage_poor_qu.name = 'garage_poor_qu'
garage_good_qu = all_data.GarageQual.copy()
garage_good_qu[garage_good_qu<=3] = 0
garage_good_qu[garage_good_qu>3] = 1
garage_good_qu.name = 'garage_good_qu'
garage_poor_cond = all_data.GarageCond.copy()
garage_poor_cond[garage_poor_cond<3] = 1
garage_poor_cond[garage_poor_cond>=3] = 0
garage_poor_cond.name = 'garage_poor_cond'
garage_good_cond = all_data.GarageCond.copy()
garage_good_cond[garage_good_cond<=3] = 0
garage_good_cond[garage_good_cond>3] = 1
garage_good_cond.name = 'garage_good_cond'
#KitchenQual
kitchen_poor_qu = all_data.KitchenQual.copy()
kitchen_poor_qu[kitchen_poor_qu<3] = 1
kitchen_poor_qu[kitchen_poor_qu>=3] = 0
kitchen_poor_qu.name = 'kitchen_poor_qu'
kitchen_good_qu = all_data.KitchenQual.copy()
kitchen_good_qu[kitchen_good_qu<=3] = 0
kitchen_good_qu[kitchen_good_qu>3] = 1
kitchen_good_qu.name = 'kitchen_good_qu'
# Concatenate the new variables created above
qu_list = pd.concat((overall_poor_qu, overall_good_qu, overall_poor_cond, overall_good_cond, exter_poor_qu,
                     exter_good_qu, exter_poor_cond, exter_good_cond, bsmt_poor_cond, bsmt_good_cond, garage_poor_qu,
                     garage_good_qu, garage_poor_cond, garage_good_cond, kitchen_poor_qu, kitchen_good_qu), axis=1)
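The poor/good flags above all follow the same two patterns, so they could be condensed into small helpers. The sketch below is only an illustration on toy data: the helper name poor_good_flags and the column-name suffixes are hypothetical and differ from the names used in the article.

```python
import pandas as pd

def poor_good_flags(scores, name, threshold=3):
    """Hypothetical helper: 0/1 flags for scores below and above `threshold`."""
    poor = (scores < threshold).astype(int).rename(name + '_poor')
    good = (scores > threshold).astype(int).rename(name + '_good')
    return pd.concat([poor, good], axis=1)

# Toy ratings using the 1-5 encoding (Po=1 ... Ex=5).
exter_qual = pd.Series([1, 2, 3, 4, 5])
flags = poor_good_flags(exter_qual, 'exter_qu')

# The OverallQual / OverallCond pattern: distance from the midpoint 5, clipped at 0.
overall = pd.Series([1, 5, 9])
overall_poor = (5 - overall).clip(lower=0)
overall_good = (overall - 5).clip(lower=0)

print(flags)
print(overall_poor.tolist(), overall_good.tolist())
```

A rating of exactly 3 (TA) is neither poor nor good, which matches the thresholds in the code above.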
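Next, several other variables are regrouped by category.
The code is as follows:
# HeatingQC: 1 if the heating quality is Fa or Po, otherwise 0
bad_heating = all_data.HeatingQC.replace({'Ex': 0, 'Gd': 0, 'TA': 0, 'Fa': 1, 'Po': 1})
bad_heating.name = 'bad_heating'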
#MasVnrType
MasVnrType_Any = all_data.MasVnrType.replace({'BrkCmn': 1, 'BrkFace': 1, 'CBlock': 1, 'Stone': 1, 'None': 0})
MasVnrType_Any.name = 'MasVnrType_Any'
#SaleCondition
SaleCondition_PriceDown = all_data.SaleCondition.replace({'Abnorml': 1, 'Alloca': 1, 'AdjLand': 1, 'Family': 1,
                                                          'Normal': 0, 'Partial': 0})
SaleCondition_PriceDown.name = 'SaleCondition_PriceDown'
# Neighborhood (I personally think Crawfor and Somerst should not be singled out here, but oddly enough the results are considerably better this way.)
Neighborhood_Good = pd.DataFrame(np.zeros((all_data.shape[0],1)), columns=['Neighborhood_Good'])
Neighborhood_Good[all_data.Neighborhood=='NridgHt'] = 1
Neighborhood_Good[all_data.Neighborhood=='Crawfor'] = 1
Neighborhood_Good[all_data.Neighborhood=='StoneBr'] = 1
Neighborhood_Good[all_data.Neighborhood=='Somerst'] = 1
Neighborhood_Good[all_data.Neighborhood=='NoRidge'] = 1
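# Convert the month sold: value_counts shows that more houses are sold in months 4, 5, 6 and 7
#price_category = price_category.to_sparse()  # build a sparse matrix (to_sparse was removed in pandas 1.0)
season = all_data.MoSold.replace({1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0})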
season.name = 'season'
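Binary flags such as Neighborhood_Good and season can also be written with isin. This is a sketch on made-up rows, not the article's code; the toy frame and the lower-case variable names are assumptions for illustration.

```python
import pandas as pd

# Toy rows (hypothetical values).
toy = pd.DataFrame({
    'Neighborhood': ['NridgHt', 'OldTown', 'Somerst'],
    'MoSold': [5, 12, 7],
})

good_neighborhoods = ['NridgHt', 'Crawfor', 'StoneBr', 'Somerst', 'NoRidge']
neighborhood_good = toy['Neighborhood'].isin(good_neighborhoods).astype(int)

# Months 4 to 7 are the busy selling season.
season = toy['MoSold'].isin([4, 5, 6, 7]).astype(int)

print(neighborhood_good.tolist(), season.tolist())
```

Because the flags are derived directly from the source columns, they keep the same index as the data, which avoids alignment surprises when concatenating later.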
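Next, the year-related data (year built, year remodeled, year sold) are transformed.
The code is as follows:
Xremoded = (all_data['YearBuilt']!=all_data['YearRemodAdd'])*1 # remodel year differs from the build year, i.e. the house has been remodeled
Xrecentremoded = (all_data['YearRemodAdd']>=all_data['YrSold'])*1 # remodeled in the year of sale (or later)
XnewHouse = (all_data['YearBuilt']>=all_data['YrSold'])*1 # built in the year of sale (or later), i.e. sold as a new house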
XHouseAge = 2010 - all_data['YearBuilt']
XTimeSinceSold = 2010 - all_data['YrSold']
XYearSinceRemodel = all_data['YrSold'] - all_data['YearRemodAdd']
Xremoded.name='Xremoded'
Xrecentremoded.name='Xrecentremoded'
XnewHouse.name='XnewHouse'
XTimeSinceSold.name='XTimeSinceSold'
XYearSinceRemodel.name='XYearSinceRemodel'
XHouseAge.name='XHouseAge'
year_list = pd.concat((Xremoded,Xrecentremoded,XnewHouse,XHouseAge,XTimeSinceSold,XYearSinceRemodel),axis=1)
# Split the years 1871-2010 into 7 groups of 20 years each
year_map = pd.concat(pd.Series('YearGroup' + str(i+1), index=range(1871+i*20,1891+i*20)) for i in range(0, 7))
all_data.GarageYrBlt = all_data.GarageYrBlt.map(year_map)
all_data.loc[all_data['GarageYrBlt'].isnull(), 'GarageYrBlt'] = 'NoGarage'
all_data.YearBuilt = all_data.YearBuilt.map(year_map)
all_data.YearRemodAdd = all_data.YearRemodAdd.map(year_map)
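The same 20-year grouping can be sketched with pd.cut, which may be easier to read than building a lookup Series. The toy years below are made up; the bin edges are chosen so that the groups match the ranges 1871-1890, 1891-1910, and so on up to 1991-2010.

```python
import pandas as pd

# Toy years (hypothetical values).
years = pd.Series([1871, 1890, 1891, 1950, 2010])

bins = list(range(1870, 2011, 20))  # edges 1870, 1890, ..., 2010 -> 7 right-closed bins
labels = ['YearGroup' + str(i + 1) for i in range(7)]
groups = pd.cut(years, bins=bins, labels=labels)

print(groups.astype(str).tolist())
```

Years outside the bin range come back as NaN from pd.cut, so they would still need a fill step such as the 'NoGarage' assignment above.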
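Finally, a price-category variable is added based on SalePrice, using a support vector machine for classification.
The code is as follows:
# Classify SalePrice with a support vector machine
from sklearn.svm import SVC
svm = SVC(C=100, gamma=0.0001, kernel='rbf')
# Label each training row: pc1 (below 150000), pc2 (150000 up to 220000), pc3 (220000 or above)
pc = pd.Series('pc1', index=train.index)
pc[train.SalePrice >= 150000] = 'pc2'
pc[train.SalePrice >= 220000] = 'pc3'
columns_for_pc = ['Exterior1st', 'Exterior2nd', 'RoofMatl', 'Condition1', 'Condition2', 'BldgType']  # exterior covering / roof material / conditions / dwelling type
X_t = pd.get_dummies(train.loc[:, columns_for_pc], sparse=True)
svm.fit(X_t, pc) #Training
pc_pred = svm.predict(X_t)
# Predict the price category for every row (training and test data)
price_category = pd.DataFrame(np.zeros((all_data.shape[0],1)), columns=['pc'])
X_all = pd.get_dummies(all_data.loc[:, columns_for_pc], sparse=True)
# Align with the training columns in case a category appears in only one of the two frames
X_all = X_all.reindex(columns=X_t.columns, fill_value=0)
pc_pred = svm.predict(X_all)
price_category[pc_pred=='pc2'] = 1
price_category[pc_pred=='pc3'] = 2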
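The three price bands used for the training labels (below 150000, 150000 up to 220000, and 220000 or above) can also be produced in a single step with pd.cut. A minimal sketch on made-up prices:

```python
import numpy as np
import pandas as pd

# Toy sale prices (hypothetical values), including both thresholds.
sale_price = pd.Series([100000, 150000, 219999, 220000, 500000])

# right=False makes each bin include its left edge, matching the ">=" thresholds.
pc = pd.cut(sale_price,
            bins=[-np.inf, 150000, 220000, np.inf],
            labels=['pc1', 'pc2', 'pc3'],
            right=False)

print(pc.astype(str).tolist())
```

The result is a categorical Series; converting it with astype(str) gives plain string labels for a classifier.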
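Finally, some variables that look numeric but are really categorical are converted to the string type.
The code is as follows:
# Convert MoSold (month sold) and MSSubClass to categorical (string) variables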
all_data['MoSold'] = all_data['MoSold'].apply(str)
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)
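Numeric variables
First, the numeric variables are scaled by their upper quartile: each column is divided by its 75th percentile, so every value becomes a multiple of the upper quartile, and columns whose 75th percentile is 0 are skipped to avoid dividing by zero. (The author has not fully understood the rationale behind this transformation; it feels as if the main values are kept as an influence factor while the minor ones are dropped.)
The code is as follows:
data_num = all_data.select_dtypes(include=[np.number])
t = data_num.quantile(.75)  # upper quartile: the value at the 75% position once the values are sorted in ascending order
use_75_scater = t[t != 0].index  # skip columns whose upper quartile is 0
all_data[use_75_scater] = all_data[use_75_scater]/all_data[use_75_scater].quantile(.75)  # each value as a multiple of the upper quartile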
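To see what the quartile scaling does, here is a small sketch on a toy frame. The column values are made up; PoolArea is chosen so that its 75th percentile is 0 and the column is therefore left untouched.

```python
import pandas as pd

# Toy numeric columns (hypothetical values).
toy = pd.DataFrame({
    'GrLivArea': [1000.0, 1500.0, 2000.0, 2500.0, 3000.0],
    'PoolArea': [0.0, 0.0, 0.0, 0.0, 500.0],
})

q75 = toy.quantile(0.75)        # GrLivArea -> 2500.0, PoolArea -> 0.0
cols = q75[q75 != 0].index      # only columns with a non-zero upper quartile
toy[cols] = toy[cols] / q75[cols]

print(toy)
```

After scaling, a GrLivArea of 1.2 reads as "1.2 times the upper quartile", which puts the different area columns on a comparable scale.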
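Next comes the usual treatment for numeric variables: check how close each one is to a normal distribution, and transform the heavily skewed ones to bring them closer to normal.
The code is as follows:
# Check the skewness of the numeric variables
from scipy.stats import skew  # skewness (asymmetry of the distribution; 0 for a normal distribution)
skewness = data_num.apply(lambda x: skew(x))
skewness = skewness.sort_values(ascending=False)
skewness = skewness[abs(skewness)>0.5]
print("{} numeric variables need to be transformed".format(skewness.shape[0]))
# In the original run, 38 numeric variables needed the transformation
# Apply a Box-Cox transform to the variables whose absolute skewness exceeds 0.5
from scipy.special import boxcox1p
#train_numberic[skewness.index]=np.log1p(train_numberic[skewness.index])
# In practice, boxcox1p gave better results than log1p.
all_data[skewness.index]=boxcox1p(all_data[skewness.index],0.15)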
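For reference, boxcox1p(x, lmbda) computes ((1 + x)**lmbda - 1) / lmbda, and reduces to log1p(x) when lmbda is 0. The sketch below checks both facts and compares skewness before and after on a made-up, right-skewed array; the value 0.15 is the one used in the article.

```python
import numpy as np
from scipy.special import boxcox1p
from scipy.stats import skew

# Toy right-skewed data (hypothetical values).
x = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])
lam = 0.15

manual = ((1.0 + x) ** lam - 1.0) / lam
matches_formula = np.allclose(boxcox1p(x, lam), manual)
matches_log1p = np.allclose(boxcox1p(x, 0.0), np.log1p(x))

skew_raw = skew(x)
skew_transformed = skew(boxcox1p(x, lam))

print(matches_formula, matches_log1p)
print(skew_raw, skew_transformed)
```

On this toy array the transformed values are less skewed than the raw ones; how much a given lambda helps depends on the data.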
Don't forget to transform SalePrice as well.
# Log-transform SalePrice with log1p
train["SalePrice"] = np.log1p(train["SalePrice"])
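Because the target is now log1p(SalePrice), predictions from a model trained on it are on the log scale and need to be converted back with expm1 before they are reported as prices. A minimal round-trip sketch on made-up prices:

```python
import numpy as np

# Toy prices (hypothetical values).
prices = np.array([100000.0, 150000.0, 220000.0])

log_prices = np.log1p(prices)    # what the model is trained on
restored = np.expm1(log_prices)  # inverse transform back to the price scale

print(restored)
```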
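Final assembly
The variable transformations are done. The next step is to combine all the variables and convert every categorical variable into dummy variables.
The code is as follows:
# Convert categorical variables to dummy variables and fill missing values with the column mean
X = pd.get_dummies(all_data)
X = X.fillna(X.mean())
# Drop a few dummy features whose 0/1 values are heavily imbalanced
X = X.drop('RoofMatl_ClyTile', axis=1)
X = X.drop('Condition2_PosN', axis=1)
X = X.drop('MSZoning_C (all)', axis=1)
X = X.drop('MSSubClass_SubClass_160', axis=1)
# Final concatenation of the data with the new variables
# Note: newer_dwelling is not defined anywhere in this article; it has to be created beforehand
X = pd.concat((X, newer_dwelling, season,year_list,qu_list,
               bad_heating, MasVnrType_Any, price_category, SaleCondition_PriceDown,Neighborhood_Good), axis=1)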
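One way to spot heavily imbalanced dummy columns like the ones dropped above is to look at the share of ones in each column. This is only a sketch on made-up data; the 2% cutoff is an arbitrary assumption, not a value taken from the article.

```python
import pandas as pd

# Toy column (hypothetical): 99 houses with one roof material, 1 with another.
toy = pd.DataFrame({'RoofMatl': ['CompShg'] * 99 + ['ClyTile']})

dummies = pd.get_dummies(toy)
share_of_ones = dummies.mean()  # fraction of rows where each dummy is 1

# Assumed cutoff: flag dummies that are 1 in fewer than 2% of rows.
rare = share_of_ones[share_of_ones < 0.02].index.tolist()

print(rare)
```

Whether such columns should actually be dropped is a modeling decision; the list is just a starting point for inspection.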
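The last step is something I would not have thought of on my own: using existing data to strengthen certain features. The area of a house is a decisive factor in its price, and on top of the area, better conditions push the price up further. The area-related variables can therefore be multiplied by quality, year, heating and other variables, so that good conditions amplify the area advantage and poor ones weaken it.
The code is as follows:
# Combine the area variables with quality / remodel / heating variables: good conditions amplify the area advantage, poor ones weaken it
from itertools import product, chain
def poly(X):
    areas = ['LotArea','TotalBsmtSF', 'GrLivArea', 'GarageArea', 'BsmtUnfSF']
    # qu_list.columns is the same as qu_list.axes[1]; get_values() was removed in pandas 1.0
    # Every name below must exist as a numeric column of X, since recent pandas raises KeyError in .loc
    # for a missing label. HeatingQC is not numerically encoded in the code shown here, so check it first.
    others = chain(qu_list.columns, year_list.columns,
                   ['OverallQual', 'OverallCond', 'ExterQual', 'ExterCond', 'BsmtCond', 'GarageQual', 'GarageCond',
                    'KitchenQual', 'HeatingQC', 'bad_heating', 'MasVnrType_Any'])
    for a, t in product(areas, others):
        # prod(axis=1) multiplies the two columns within each row; axis=0 would multiply down each column
        x = X.loc[:, [a, t]].prod(axis=1)
        x.name = a + '_' + t
        yield x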
Finally, all the feature variables are concatenated and then split back into training and test data.
The code is as follows:
XP = pd.concat(poly(X), axis=1)
X = pd.concat((X, XP), axis=1)
X_train = X[:train.shape[0]]
X_test = X[train.shape[0]:]
y = train.SalePrice
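The area-times-condition idea can be tried on a tiny frame to see what the generated columns look like. This is a sketch with made-up values and only one condition column; the toy frame and the name modifiers are assumptions for illustration.

```python
import pandas as pd
from itertools import product

# Toy features (hypothetical values): two scaled areas and one 0/1 quality flag.
toy = pd.DataFrame({
    'GrLivArea': [1.0, 1.2, 0.8],
    'LotArea': [0.9, 1.1, 1.0],
    'kitchen_good_qu': [1, 0, 1],
})

areas = ['GrLivArea', 'LotArea']
modifiers = ['kitchen_good_qu']

# One new column per (area, modifier) pair: the row-wise product of the two.
interactions = pd.concat(
    [(toy[a] * toy[m]).rename(a + '_' + m) for a, m in product(areas, modifiers)],
    axis=1,
)

print(interactions)
```

With a 0/1 flag the product keeps the area only for rows where the flag is set; with an ordinal score it scales the area by the score.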
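This completes the feature engineering. Strictly speaking there is one more step, removing outliers, but outliers should be identified during the exploratory analysis, so it is not covered further here.
Thanks again to @Kuangmeng for the source code at https://github.com/kuangmeng/HousePrices, and to the authors of other Kernels, such as Stacked Regressions : Top 4% on LeaderBoard and Comprehensive data exploration with Python.
My main takeaways from the missing-value handling and feature engineering in this Kaggle project are:
(1) The choice of function for transforming skewed variables toward normality affects the results; across several attempts, boxcox1p did somewhat better;
(2) When handling missing values, the relationships between variables need to be considered; simply filling with the median, mode or mean is not enough;
(3) Feature engineering sometimes takes some imagination, such as multiplying pairs of columns at the end, a technique I would never have thought of before;
(4) Missing-value handling and feature engineering have a very large effect on the results, and both, feature engineering in particular, take patience and repeated experimentation; sometimes one small change shifts the results considerably.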