Goal: use a Spark machine learning model to predict the rating a user would give to a movie (mid).
1. Training data format (user uid, movie mid, rating)
$ more train.csv
0,0,2
0,8,4
0,13,1
0,18,3
0,34,3
0,38,4
0,44,5
0,59,2
0,115,5
0,555,2
0,568,4
0,588,3
1,38,3
1,44,5
1,59,3
1,115,2
1,555,1
1,568,2
1,588,3
...
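As a quick sanity check on this format, each line can be parsed into a (uid, mid, rating) tuple before anything is handed to Spark. This is a minimal sketch in plain Python; the helper name parse_rating_line is hypothetical and not part of the original post.

```python
def parse_rating_line(line):
    """Parse 'uid,mid,rating' into (int, int, float); raise ValueError if malformed."""
    fields = line.strip().split(',')
    if len(fields) != 3:
        raise ValueError('expected 3 comma-separated fields, got %d: %r'
                         % (len(fields), line))
    uid, mid, rating = fields
    return int(uid), int(mid), float(rating)


# The first rows of train.csv as listed above.
sample = ['0,0,2', '0,8,4', '0,13,1']
parsed = [parse_rating_line(line) for line in sample]
print(parsed)
```

The same function could be passed to an RDD's map() in place of the inline lambdas, which would make malformed rows fail with a readable message.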
2. Prediction data format (user uid, movie mid)
$ more test.csv
0,12960
1,12726
1,11463
...
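In a real setting, the labeled data should be split into a training set and a held-out test set: train the model on the former, then measure its error on the latter. This walkthrough skips that split and predicts directly on test.csv.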
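The hold-out evaluation described above could look roughly like the following. It is a plain-Python sketch on a handful of rows from train.csv; the function names split_ratings and rmse, the 80/20 ratio, and the seed are all assumptions made for illustration. On a cluster, the same split is usually done with the RDD method randomSplit rather than in driver memory.

```python
import math
import random


def split_ratings(ratings, test_fraction=0.2, seed=42):
    """Shuffle a copy of `ratings` and return (train, held_out)."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError('need two non-empty sequences of equal length')
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# (uid, mid, rating) rows taken from the train.csv listing.
rows = [(0, 0, 2.0), (0, 8, 4.0), (0, 13, 1.0), (0, 18, 3.0), (0, 34, 3.0),
        (0, 38, 4.0), (0, 44, 5.0), (0, 59, 2.0), (0, 115, 5.0), (0, 555, 2.0)]
train, held_out = split_ratings(rows)
print(len(train), len(held_out))
```

After training on train, the model's predictions for the held_out pairs would be compared with their true ratings via rmse; a lower value means a better fit.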
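Upload both files to the Hadoop HDFS directory /user/hdfs/cai/.
The example runs on Spark on YARN from the interactive pyspark shell, where sc is already defined; if you prefer spark-submit, wrap the code in a script yourself.
# Read Training Data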
user_data = sc.textFile('/user/hdfs/cai/train.csv')
# user_data.first()
# user_data.count()
# rating_data = user_data.map(lambda line:line.split(','))
# ratings = rating_data.map(lambda fields: int(fields[2]))
# ratings.stats()
#import Rating, ALS
from pyspark.mllib.recommendation import Rating, ALS
rawRatings = user_data.map(lambda line:line.split(','))
ratings = rawRatings.map(lambda x: Rating(int(x[0]),int(x[1]),float(x[2])))
print(ratings.take(5))
# Train the model: ALS.train(ratings, rank, iterations, lambda_)
# model = ALS.train(ratings, 20, 5, 0.05)
model = ALS.train(ratings, 50, 10, 0.1)
userFeatures = model.userFeatures()
print(userFeatures.take(2))
#Read Test Data and split
predict_data = sc.textFile('/user/hdfs/cai/test.csv')
predicts = predict_data.map(lambda line:line.split(','))
predictdata = predicts.map(lambda x: (int(x[0]),int(x[1])))
# print(model.userFeatures().count())
# print(model.productFeatures().count())
# print(len(userFeatures.first()[1]))
# Predict a rating for every test pair; 546196 is the total number of rows in the test data
predict_data_all = []
for predict_num in predictdata.take(546196):
    predictRating = model.predict(predict_num[0], predict_num[1])
    # Round the prediction to 4 decimal places
    Rating_new = round(float(predictRating), 4)
    # Append the result to the predict_data_all list
    # print(predict_num[0], predict_num[1], Rating_new)
    predict_data_all.append((predict_num[0], predict_num[1], Rating_new))

# Create a ParallelCollectionRDD from the collected predictions
ardd = sc.parallelize(predict_data_all)
# print(predictRating)
# print(Rating_new)
#Save data to HDFS
ardd.saveAsTextFile('/user/hdfs/cai/spark_out/')
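The loop above pulls every test pair back to the driver and scores them one at a time. A possible alternative, sketched here, is to let the model score the whole RDD with predictAll, which returns an RDD of Rating(user, product, rating) records. The helper names format_prediction and predict_all_and_save are hypothetical. As far as I know, predictAll silently drops pairs whose user or movie never appeared in the training data, and the output order is not guaranteed, so the result may have fewer rows than test.csv.

```python
def format_prediction(user, product, rating, ndigits=4):
    """Return one output row with the rating rounded to `ndigits` places."""
    return (int(user), int(product), round(float(rating), ndigits))


def predict_all_and_save(model, pairs_rdd, out_path):
    """Score every (uid, mid) pair on the cluster and write the rows out.

    `model` is assumed to be the MatrixFactorizationModel returned by
    ALS.train, and `pairs_rdd` an RDD of (uid, mid) tuples.
    """
    predictions = model.predictAll(pairs_rdd)
    rows = predictions.map(
        lambda r: format_prediction(r.user, r.product, r.rating))
    rows.saveAsTextFile(out_path)
    return rows
```

In the pyspark session above this would be called as predict_all_and_save(model, predictdata, '/user/hdfs/cai/spark_out/'), replacing the for loop, the parallelize step, and the final save.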
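The hard part is choosing the model parameters, as in model = ALS.train(ratings, 20, 5, 0.05), where the three positional arguments are the rank (number of latent factors), the number of iterations, and the regularization parameter lambda. With suitable values the approach can be used in a production environment and works well; this recommendation predictor placed somewhere in the 40s in a competition.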
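One common way to look for suitable parameters is a small grid search over rank, iterations, and lambda, keeping the combination with the lowest error on held-out ratings. The sketch below shows only the search loop. The function name grid_search is hypothetical, and toy_score is an arbitrary stand-in so the example runs without a cluster; it is not a real measurement and says nothing about which ALS settings are actually best.

```python
import itertools


def grid_search(evaluate, ranks, iteration_counts, lambdas):
    """Try every parameter combination; return (best_params, best_score).

    `evaluate(rank, iterations, lambda_)` must return an error value
    where lower is better, for example a held-out RMSE.
    """
    best_params, best_score = None, float('inf')
    for params in itertools.product(ranks, iteration_counts, lambdas):
        score = evaluate(*params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(rank, iterations, lambda_):
    # Stand-in for "train ALS with these values, return held-out RMSE".
    # The formula is arbitrary and exists only to make the example runnable.
    return abs(rank - 20) / 100.0 + abs(iterations - 5) / 100.0 + abs(lambda_ - 0.05)


# Candidate values taken from the two ALS.train calls in this post.
best, score = grid_search(toy_score, [20, 50], [5, 10], [0.05, 0.1])
print(best, score)
```

In a real run, evaluate would call ALS.train on the training split with the given values and return the error on the held-out split, which makes each grid point cost one full training job.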