Kaggle Bike Sharing Demand Prediction: Competition Walkthrough

Date: 2021-07-18 20:02:19

Author: 大树

Last updated: 01.20

Email: 59888745@qq.com

Topics: data processing, machine learning

Back to main index: 2017 Study Notes and Summary

In [ ]:
Kaggle hosts many interesting competitions that are worth trying when you have time. One of them is about predicting Hong Kong horse races; if your model's predictions are accurate enough, you can actually make money. A Hong Kong newspaper once reported that a university professor won HK$50 million at the races using statistical modeling, so machine learning and deep learning should be able to push betting accuracy even higher. Good luck, and keep studying!


The Kaggle bike sharing demand competition is a continuous-value prediction problem,
i.e. what we call a regression problem in machine learning. Let's work through it together.

The data comes from a city bike rental system: two years of hourly rental records from Washington, D.C. The training set consists of the first 19 days of each month; the test set covers day 20 onward of each month, whose counts we have to predict ourselves.

Kaggle bike sharing demand competition: https://www.kaggle.com/c/bike-sharing-demand
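
As a quick sanity check (a minimal sketch, assuming the training CSV has been downloaded as kaggle_bike_competition_train.csv), we can confirm the train/test boundary described above:

In [ ]:
import pandas as pd

df = pd.read_csv('kaggle_bike_competition_train.csv', header=0)
# the training set should only contain days 1..19 of each month
days = pd.DatetimeIndex(df['datetime']).day
print(days.min(), days.max())   # expected: 1 19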

1. Load the data
2. Explore the data
3. Extract feature columns
4. Prepare the training and test sets
5. Model selection: first fit a reasonable baseline model, then analyze and improve step by step
6. Hyperparameter tuning: use grid search to find the best parameters
7. Score predictions with the tuned model
In [61]:
#load the data and review the fields and data types
import pandas as pd

df_train = pd.read_csv('kaggle_bike_competition_train.csv',header=0)
df_train.head(5)
df_train.dtypes
Out[61]:
datetime       object
season          int64
holiday         int64
workingday      int64
weather         int64
temp          float64
atemp         float64
humidity        int64
windspeed     float64
casual          int64
registered      int64
count           int64
dtype: object
In [10]:
#check the number of rows and columns
df_train.shape
Out[10]:
(10886, 12)
In [8]:
#check for missing values -- none are found
df_train.count()
Out[8]:
datetime      10886
season        10886
holiday       10886
workingday    10886
weather       10886
temp          10886
atemp         10886
humidity      10886
windspeed     10886
casual        10886
registered    10886
count         10886
dtype: int64
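
An equivalent, slightly more direct check (a minimal sketch reusing the same df_train) counts the missing values per column; all zeros confirms there is no missing data:

In [ ]:
# number of nulls per column -- should be 0 everywhere
print(df_train.isnull().sum())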
In [34]:
#process the datetime column -- it carries a lot of information, since everything varies with time
df_train.head()
df_train['hour']=pd.DatetimeIndex(df_train.datetime).hour
df_train['day']=pd.DatetimeIndex(df_train.datetime).dayofweek
df_train['month']=pd.DatetimeIndex(df_train.datetime).month

#other method
# df_train['dt']=pd.to_datetime(df_train['datetime'])
# df_train['day_of_week']=df_train['dt'].apply(lambda x:x.dayofweek)
# df_train['day_of_month']=df_train['dt'].apply(lambda x:x.day)

df_train.head()
Out[34]:
 
  datetime season holiday workingday weather temp atemp humidity windspeed casual registered count hour day month
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16 0 5 1
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40 1 5 1
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32 2 5 1
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13 3 5 1
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1 4 5 1
In [42]:
#extract the relevant feature columns
#note: drop(..., inplace=True) returns None, so its result must not be assigned:
# df_train.drop(['datetime','casual','registered'], axis=1, inplace=True)
df_train = df_train[['season','holiday','workingday','weather','temp','atemp',
'humidity','windspeed','count','month','day','hour']]
df_train.head(5)
Out[42]:
 
  season holiday workingday weather temp atemp humidity windspeed count month day hour
0 1 0 0 1 9.84 14.395 81 0.0 16 1 5 0
1 1 0 0 1 9.02 13.635 80 0.0 40 1 5 1
2 1 0 0 1 9.02 13.635 80 0.0 32 1 5 2
3 1 0 0 1 9.84 14.395 75 0.0 13 1 5 3
4 1 0 0 1 9.84 14.395 75 0.0 1 1 5 4
In [43]:
df_train.shape
Out[43]:
(10886, 12)
In [ ]:
Prepare the training and test data:
1. df_train_target: the target, i.e. the count column
2. df_train_data: the data used to produce features
In [51]:
df_train_target = df_train['count'].values 
print(df_train_target.shape)
df_train_data = df_train.drop(['count'],axis =1).values
print(df_train_data.shape)
 
(10886,)
(10886, 11)
In [ ]:
Models
As usual we use cross-validation (the held-out fold is about 20% of the data) to gauge model performance.
We will try Support Vector Regression, Ridge Regression, and Random Forest Regression;
each model is run over 3 splits and we look at the averaged results.
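Note that for scikit-learn regressors, .score() returns the coefficient of determination R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)², so 1.0 is a perfect fit and a score near 0 means the model barely beats predicting the mean. A minimal sketch of the equivalent computation with sklearn.metrics.r2_score (toy numbers, not competition data):

In [ ]:
from sklearn.metrics import r2_score
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 2.0, 8.0])
# identical to what estimator.score(X, y) reports for a regressor
print(r2_score(y_true, y_pred))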
In [63]:
from sklearn import linear_model
from sklearn import svm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, GridSearchCV, train_test_split, learning_curve

# split the data into train/test folds (3 shuffled splits, 20% held out each time)
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

# run each model over the splits

print("Ridge Regression")
for train, test in cv.split(df_train_data):
    reg = linear_model.Ridge().fit(df_train_data[train], df_train_target[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        reg.score(df_train_data[train], df_train_target[train]),
        reg.score(df_train_data[test], df_train_target[test])))

print("Support Vector Regression / SVR(kernel='rbf', C=10, gamma=.001)")
for train, test in cv.split(df_train_data):
    reg = svm.SVR(kernel='rbf', C=10, gamma=.001).fit(df_train_data[train], df_train_target[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        reg.score(df_train_data[train], df_train_target[train]),
        reg.score(df_train_data[test], df_train_target[test])))

print("Random Forest Regression / RandomForestRegressor(n_estimators=100)")
for train, test in cv.split(df_train_data):
    reg = RandomForestRegressor(n_estimators=100).fit(df_train_data[train], df_train_target[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        reg.score(df_train_data[train], df_train_target[train]),
        reg.score(df_train_data[test], df_train_target[test])))
 
Ridge Regression
train score: 0.339, test score: 0.332

train score: 0.330, test score: 0.370

train score: 0.342, test score: 0.320

Support Vector Regression / SVR(kernel='rbf', C=10, gamma=.001)
train score: 0.417, test score: 0.408

train score: 0.406, test score: 0.452

train score: 0.419, test score: 0.390

Random Forest Regression / RandomForestRegressor(n_estimators=100)
train score: 0.981, test score: 0.867

train score: 0.981, test score: 0.880

train score: 0.981, test score: 0.869
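
One hedged observation: SVR with an RBF kernel is sensitive to feature scales, and the columns here range from 0/1 flags to humidity values near 100, so the SVR scores above may understate what it can do. A minimal sketch (not part of the original notebook) that standardizes the features inside a Pipeline before refitting on the same splits:

In [ ]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scale every feature to zero mean / unit variance before the RBF kernel
svr_scaled = make_pipeline(StandardScaler(), svm.SVR(kernel='rbf', C=10, gamma=.001))
for train, test in cv.split(df_train_data):
    svr_scaled.fit(df_train_data[train], df_train_target[train])
    print("test score: {0:.3f}".format(
        svr_scaled.score(df_train_data[test], df_train_target[test])))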

In [ ]:
Random Forest Regression gives the best result.
Its parameters may still not be optimal, though; we can use grid search to try combinations and find the best ones.
In [67]:
X = df_train_data
y = df_train_target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

tuned_parameters = [{'n_estimators': [10, 100, 500, 550]}]

scores = ['r2']

for score in scores:

    print(score)

    clf = GridSearchCV(RandomForestRegressor(), tuned_parameters, cv=5, scoring=score)
    clf.fit(X_train, y_train)

    print("Best parameters found:")
    print("")
    # best_estimator_ returns the best estimator chosen by the search
    print(clf.best_estimator_)
    print("")
    print("Scores for each setting:")
    print("")
    # cv_results_ holds, per parameter setting, the mean and the standard
    # deviation of the test scores over the cross-validation folds
    results = clf.cv_results_
    for mean, std, params in zip(results['mean_test_score'],
                                 results['std_test_score'],
                                 results['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std / 2, params))
    print("")
 
r2
Best parameters found:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=550, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=False)

Scores for each setting:

0.846 (+/-0.006) for {'n_estimators': 10}
0.862 (+/-0.005) for {'n_estimators': 100}
0.863 (+/-0.005) for {'n_estimators': 500}
0.864 (+/-0.005) for {'n_estimators': 550}
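
Finally, step 7 of the plan: score predictions with the tuned model. A minimal sketch reusing the X_test / y_test split from above (GridSearchCV refits best_estimator_ on the full training split by default):

In [ ]:
best = clf.best_estimator_
# R² on the 20% hold-out that the grid search never saw
print("held-out test score: {0:.3f}".format(best.score(X_test, y_test)))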

In [ ]:
Grid search makes parameter hunting quite convenient, and we should also check whether the model is overfitting or underfitting; a learning-curve sketch for that follows below.
We find that n_estimators = 500 or 550 fits best.
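
To check over/underfitting directly, the learning_curve helper imported earlier can compare training and cross-validation scores as the training set grows. A minimal sketch, assuming matplotlib is available:

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

train_sizes, train_scores, valid_scores = learning_curve(
    RandomForestRegressor(n_estimators=100), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))

# a persistent gap between the curves suggests overfitting;
# two low, converged curves suggest underfitting
plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='train')
plt.plot(train_sizes, valid_scores.mean(axis=1), 'o-', label='cross-validation')
plt.xlabel('training examples')
plt.ylabel('R^2 score')
plt.legend(loc='best')
plt.show()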