Event Recommendation Engine Challenge: Step-by-Step Walkthrough, Step 7

I. Before You Start

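This article builds on:

  Event Recommendation Engine Challenge: Step-by-Step Walkthrough, Step 1

  Event Recommendation Engine Challenge: Step-by-Step Walkthrough, Step 2

  Event Recommendation Engine Challenge: Step-by-Step Walkthrough, Step 3

  Event Recommendation Engine Challenge: Step-by-Step Walkthrough, Step 4

  Event Recommendation Engine Challenge: Step-by-Step Walkthrough, Step 5

  Event Recommendation Engine Challenge: Step-by-Step Walkthrough, Step 6

Readers should work through those six articles first.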

II. Model Building and Prediction

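Once the features from the previous steps have been built, there are many ways to train a model and make predictions. Here we use SGDClassifier from sklearn with a logistic loss and an L2 penalty. In practice xgboost gives better results: most of our features are dense floating-point values, which suit GBDT-style models well.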
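To make concrete what an SGD-trained linear classifier with logistic loss and an L2 penalty does, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: the function names `sgd_logistic` and `predict`, the fixed learning rate, the number of epochs, and the toy data are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sgd_logistic(X, y, lr=0.1, alpha=1e-4, epochs=200, seed=0):
    """Fit weights w and bias b by stochastic gradient descent on the
    logistic loss with an L2 penalty (alpha / 2) * ||w||^2."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            err = sigmoid(score) - y[i]  # gradient of the log loss w.r.t. the score
            w = [wj - lr * (err * xj + alpha * wj) for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b


def predict(w, b, x):
    """Return 1 when the linear score is positive, otherwise 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


# Toy, linearly separable data (hypothetical, not taken from the challenge)
X = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
y = [0, 0, 1, 1]
w, b = sgd_logistic(X, y)
predictions = [predict(w, b, x) for x in X]
```

Each update uses a single sample, which is why this family of models scales to large training sets, and also why the result depends on the learning rate and on the scale of the features.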

Pay attention to cross-validation: we use 10-fold cross-validation here.
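As a sketch of how an unshuffled k-fold split partitions the rows, the pure-Python generator below produces contiguous folds whose sizes differ by at most one, which matches the fold sizes documented for scikit-learn's `KFold`. The name `kfold_indices` and the sample count of 23 are hypothetical and chosen only for illustration.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled folds.

    The first n_samples % n_splits folds get one extra sample, so fold
    sizes differ by at most one.
    """
    if not 2 <= n_splits <= n_samples:
        raise ValueError("need 2 <= n_splits <= n_samples")
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        # Everything outside the held-out block is used for training
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


folds = list(kfold_indices(n_samples=23, n_splits=10))
fold_sizes = [len(test_idx) for _, test_idx in folds]
print(fold_sizes)  # [3, 3, 3, 2, 2, 2, 2, 2, 2, 2]
```

Every row is held out exactly once. Because the folds are contiguous blocks, any ordering in the input file carries over into the folds, so shuffling may be worth considering when the rows are sorted.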

import pandas as pd
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold
import warnings

warnings.filterwarnings('ignore')

# Feature columns built in the previous steps
FEATURES = ['invited', 'user_reco', 'evt_p_reco', 'evt_c_reco',
            'user_pop', 'frnd_infl', 'evt_pop']


def load_train():
    """Load the training features X and the labels y (1 = interested, 0 = not interested)."""
    trainDf = pd.read_csv('data_train.csv')
    # Plain ndarrays are used because recent scikit-learn versions reject np.matrix
    X = trainDf[FEATURES].to_numpy()
    y = trainDf['interested'].to_numpy()
    return X, y


def train():
    """
    Train a classifier on the features we built; the target is 1 (interested) or 0 (not interested).
    """
    X, y = load_train()
    # loss='log_loss' is logistic regression; scikit-learn versions before 1.1 call it loss='log'
    clf = SGDClassifier(loss='log_loss', penalty='l2')
    clf.fit(X, y)
    return clf


def validate():
    """
    10-fold cross-validation; prints the accuracy of each fold and the average accuracy.
    """
    X, y = load_train()
    kfold = KFold(n_splits=10, shuffle=False)
    accuracies = []
    for run, (trainIdx, testIdx) in enumerate(kfold.split(X, y)):
        clf = SGDClassifier(loss='log_loss', penalty='l2')
        clf.fit(X[trainIdx], y[trainIdx])
        # Fraction of held-out rows whose predicted label matches the true label
        accuracy = np.mean(clf.predict(X[testIdx]) == y[testIdx])
        accuracies.append(accuracy)
        print('accuracy(run %d) : %f' % (run, accuracy))
    print('average accuracy : %f' % np.mean(accuracies))


def test(clf):
    """
    Read the test data and make predictions with the classifier.
    """
    origTestDf = pd.read_csv('test.csv')
    # data_test.csv is assumed to hold exactly the training feature columns, in the same order
    testDf = pd.read_csv('data_test.csv')
    Xp = testDf.to_numpy()
    resultDf = pd.DataFrame({
        'user': origTestDf['user'].to_numpy(),
        'event': origTestDf['event'].to_numpy(),
        'outcome': clf.predict(Xp),         # predicted label, 0 or 1
        'dist': clf.decision_function(Xp),  # signed score relative to the decision boundary
    })
    resultDf.to_csv('result.csv', index=False)


if __name__ == '__main__':
    clf = train()
    validate()
    test(clf)
    print('done')
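The `dist` column written by `test` holds the value of `decision_function`, a signed score relative to the decision boundary. For a binary linear classifier the predicted label is the positive class when that score is above zero, and for a model trained with logistic loss the sigmoid of the score gives the estimated probability of the positive class. The sketch below illustrates that relationship in plain Python; the helper name `score_to_outcome` and the sample rows are hypothetical.

```python
import math


def score_to_outcome(score):
    """Map a decision score to (label, probability of the positive class)."""
    label = 1 if score > 0 else 0
    # Numerically stable sigmoid
    if score >= 0:
        probability = 1.0 / (1.0 + math.exp(-score))
    else:
        e = math.exp(score)
        probability = e / (1.0 + e)
    return label, probability


# Hypothetical (user, event, dist) rows, shaped like the lines of result.csv
rows = [("u1", "e1", -2.0), ("u1", "e2", 0.0), ("u2", "e3", 1.5)]
results = [(user, event) + score_to_outcome(dist) for user, event, dist in rows]
for user, event, label, probability in results:
    print(user, event, label, round(probability, 3))
```

Unlike the 0/1 `outcome`, the score preserves how confident the model is, which makes it the more useful column when candidate events have to be compared with one another.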

III. Acknowledgments

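This article draws on an earlier write-up. Thanks to its author for sharing, although I think it contains a few small problems.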
