Getting Started with Kaggle Competitions: Notes

1. Bike Sharing Demand

Kaggle: https://www.kaggle.com/c/bike-sharing-demand

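Goal: predict the number of bike rentals from features such as date, time, weather, and temperature.

Preprocessing: 1. From the timestamp (year, month, day, hour, minute, second), extract the year, month, day of the week, and hour.

2. season and weather are categorical codes, so encode them as dummy variables (one-hot encoding).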
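The two preprocessing steps can be sketched on a few toy rows. This is a minimal illustration, assuming the column names `datetime`, `season`, and `weather` from the competition data; the row values and the `features` frame are made up for the example.

```python
import pandas as pd

# Toy rows mimicking the competition's columns; the values are made up.
df = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-07-04 17:00:00", "2012-12-19 08:00:00"],
    "season": [1, 3, 4],
    "weather": [1, 2, 3],
})

# Parse the timestamp once, then read its parts through the .dt accessor.
ts = pd.to_datetime(df["datetime"])
features = pd.DataFrame({
    "year": ts.dt.year,
    "month": ts.dt.month,
    "dayofweek": ts.dt.dayofweek,  # Monday=0 ... Sunday=6
    "hour": ts.dt.hour,
})

# One-hot encode the categorical codes; prefixes keep the column names distinct.
season = pd.get_dummies(df["season"], prefix="season")
weather = pd.get_dummies(df["weather"], prefix="weather")
features = pd.concat([features, season, weather], axis=1)

print(features.columns.tolist())
```

The prefixes matter because season and weather both use the codes 1 to 4, so unprefixed dummy columns would share the same names.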

Model selection:

This is a regression problem: 1. RandomForestRegressor

2. GradientBoostingRegressor
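Before submitting, the two regressors can be compared on a hold-out split. The sketch below is a minimal illustration on synthetic data: the `rmsle` helper, the made-up demand curve, and the two toy features are assumptions, not part of the original notes. Each model is fitted on log1p of the target and scored with RMSLE, the metric this competition is evaluated on.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Synthetic stand-in for the real features: hour of day and temperature.
rng = np.random.default_rng(0)
hour = rng.integers(0, 24, size=500)
temp = rng.uniform(0, 35, size=500)
X = np.column_stack([hour, temp])
# Made-up demand curve with an evening peak plus noise, clipped to be non-negative.
y = np.clip(
    50 + 100 * np.exp(-((hour - 17) ** 2) / 8) + 3 * temp + rng.normal(0, 10, size=500),
    0,
    None,
)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}
scores = {}
for name, model in models.items():
    # Fit on log1p(count) and invert with expm1 before scoring.
    model.fit(X_tr, np.log1p(y_tr))
    pred = np.expm1(model.predict(X_va))
    scores[name] = rmsle(y_va, pred)
    print(name, round(scores[name], 4))
```

Which model scores better on the real data depends on the features and hyperparameters, so the numbers from this toy setup should not be read as a ranking.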

# -*- coding: utf-8 -*-

import csv

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

train = pd.read_csv('data/train.csv')

test = pd.read_csv('data/test.csv')

# Select features

selected_features = ['datetime', 'season', 'holiday',

                'workingday', 'weather', 'temp', 'atemp', 'humidity', 'windspeed']

#X_train = train[selected_features]

Y_train = train["count"]

result = test["datetime"]

# Feature processing

month = pd.DatetimeIndex(train.datetime).month

day = pd.DatetimeIndex(train.datetime).dayofweek

hour = pd.DatetimeIndex(train.datetime).hour

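# One-hot encode the categorical columns; the prefixes keep the two sets of column names distinct

season = pd.get_dummies(train.season, prefix='season')

weather = pd.get_dummies(train.weather, prefix='weather')

X_train = pd.concat([season, weather], axis=1)

X_test = pd.concat([pd.get_dummies(test.season, prefix='season'), pd.get_dummies(test.weather, prefix='weather')], axis=1)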

X_train['month'] = month

X_test['month'] = pd.DatetimeIndex(test.datetime).month

X_train['day'] = day

X_test['day'] = pd.DatetimeIndex(test.datetime).dayofweek

X_train['hour'] = hour

X_test['hour'] = pd.DatetimeIndex(test.datetime).hour

X_train['holiday'] = train['holiday']

X_test['holiday'] = test['holiday']

X_train['workingday'] = train['workingday']

X_test['workingday'] = test['workingday']

X_train['temp'] = train['temp']

X_test['temp'] = test['temp']

X_train['humidity'] = train['humidity']

X_test['humidity'] = test['humidity']

X_train['windspeed'] = train['windspeed']

X_test['windspeed'] = test['windspeed']

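from sklearn.ensemble import GradientBoostingRegressor

# The original n_estimators and max_depth values are missing from the source, so scikit-learn's defaults are used here

clf = GradientBoostingRegressor()

# Fit on log1p(count) so that the expm1 below maps the predictions back to counts

clf.fit(X_train, np.log1p(Y_train))

result = clf.predict(X_test)

result = np.expm1(result)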

df=pd.DataFrame({'datetime':test['datetime'], 'count':result})

df.to_csv('results1.csv', index = False, columns=['datetime','count'])

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()

rf.fit(X_train, Y_train)

y_predict = rf.predict(X_test).astype(int)

df = pd.DataFrame({'datetime': test.datetime, 'count': y_predict})

df.to_csv('result2.csv', index=False, columns=['datetime', 'count'])

#predictions_file = open("RandomForestRegssor.csv", "wb")

#open_file_object = csv.writer(predictions_file)

#open_file_object.writerow(["datetime", "count"])

#open_file_object.writerows(zip(res_time, y_predict))

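2. Daily News for Stock Market Prediction

Using historical data, namely each day's 25 most-clicked news headlines together with whether the stock market rose or fell that day, predict future market movements.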

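Approach 1:

1. Merge the 25 headlines into a single document, preprocess each word (remove special characters and words containing digits, drop stop words, lowercase, and stem), then extract features with TF-IDF and train an SVM.

2. Extract features with word2vec.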
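The TF-IDF plus SVM step can be sketched with scikit-learn. This is a minimal illustration, not the linked implementation: the headlines, the labels, and the `clean` helper are hypothetical, only two headlines per day are used instead of 25, and stemming is left out to keep the example free of extra dependencies.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data: each row holds one day's headlines, label 1 = market up.
daily_headlines = [
    ["Markets rally as 2 banks report record profits", "Tech shares surge!"],
    ["Oil prices crash amid recession fears", "Banks report heavy losses"],
    ["Strong jobs data lifts shares", "Profits surge at retailers"],
    ["Recession fears deepen", "Heavy losses hit oil producers"],
]
labels = [1, 0, 1, 0]


def clean(text):
    """Lowercase, drop words containing digits, and strip special characters."""
    text = text.lower()
    text = re.sub(r"\S*\d\S*", " ", text)  # words that contain a digit
    text = re.sub(r"[^a-z\s]", " ", text)  # anything that is not a letter or whitespace
    return re.sub(r"\s+", " ", text).strip()


# Merge each day's headlines into one document.
docs = [clean(" ".join(day)) for day in daily_headlines]

# TfidfVectorizer drops English stop words; stemming is omitted in this sketch.
model = make_pipeline(TfidfVectorizer(stop_words="english"), SVC(kernel="linear"))
model.fit(docs, labels)

print(model.predict([clean("Shares surge on record profits")]))
```

Wrapping the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is learned only from the training documents and reused unchanged at prediction time.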

Implementation:

https://github.com/yjfiejd/News_predict
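For the word2vec variant, the notes do not say how per-word vectors become one feature vector per day. One common option is to average the vectors of the words in the document. The sketch below assumes that approach and is not taken from the linked repository; the tiny `embeddings` table stands in for a trained word2vec model, and `average_vector` is a hypothetical helper.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained word2vec model.
embeddings = {
    "shares": np.array([0.9, 0.1, 0.0]),
    "surge": np.array([0.8, 0.3, 0.1]),
    "losses": np.array([-0.7, 0.2, 0.1]),
    "recession": np.array([-0.9, 0.0, 0.2]),
}
DIM = 3


def average_vector(tokens, table, dim):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    known = [table[t] for t in tokens if t in table]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


docs = [
    "shares surge today".split(),
    "recession losses mount".split(),
]
# One row per document, ready to pass to a classifier such as an SVM.
X = np.vstack([average_vector(doc, embeddings, DIM) for doc in docs])
print(X.shape)
```

Words missing from the table are skipped, so a document made up entirely of unknown words falls back to a zero vector.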

