Movie Recommendation System Based on Spark MLlib and Model-Based Collaborative Filtering (Part 2): Code Implementation

Date: 2021-08-06 20:39:21

Continued from: Movie Recommendation System Based on Spark MLlib and Model-Based Collaborative Filtering (Part 1)


1.  Turn off the flood of INFO messages (less log output keeps the shell screen clean and readable)

      sc.setLogLevel("WARN")

2.   Import the classes from the recommendation package, load the data, and parse it into an RDD[Rating]

① Import the recommendation package; recommendation._ means importing every class in that package
scala> import org.apache.spark.mllib.recommendation._
import org.apache.spark.mllib.recommendation._

② Load the data and parse it with pattern matching; user, product, and rating have types Int, Int, and Double, so each field needs to be converted
scala> val data = sc.textFile("/root/cccc.txt").map(_.split(",") match {case Array (user,product,rating) => Rating (user.toInt,product.toInt,rating.toDouble)})
data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating] = MapPartitionsRDD[29] at map at <console>:24
Alternatively (the one-liner in the original notes did not compile; the corrected form first binds the split result to a variable, and a more defensive parsing sketch follows the notes below):
val data = sc.textFile("/root/cccc.txt").map { line => val f = line.split(","); Rating(f(0).toInt, f(1).toInt, f(2).toDouble) }
/** Instead of pattern matching, an if check would also work (case is essentially another form of if) **/
[Extra] .first shows the first record of the data; .count returns the number of records:
scala> data.first
res24: org.apache.spark.mllib.recommendation.Rating = Rating(1,1,5.0)

scala> data.count
res25: Long = 16
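As a side note (a sketch, not from the original session): /root/cccc.txt is assumed to hold comma-separated user,product,rating lines such as 1,1,5.0, and a slightly more defensive parse can drop blank or malformed lines instead of throwing a MatchError. The name dataSafe is only for illustration:

// hypothetical variant: keep only lines that split into exactly three fields
val dataSafe = sc.textFile("/root/cccc.txt")
  .map(_.split(","))
  .filter(_.length == 3)
  .map(f => Rating(f(0).toInt, f(1).toInt, f(2).toDouble))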


Part 2: Set the parameters and build the ALS model

Use the built-in training function:
ALS.train(data, rank, iterations, lambda)
Meaning of each argument:
ALS.train(data, number of latent factors, number of iterations, regularization parameter)


In detail: rank: the dimension k of the latent factors; rank is usually chosen between 8 and 20
      iterations: the number of iterations
      lambda: the regularization parameter, which guards against overfitting. [Rule of thumb] λ is usually increased by factors of 3: 0.01, 0.03, 0.1, 0.3, 1, 3, ... (a small grid-search sketch over rank and lambda follows below)
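To illustrate these rules of thumb, here is a minimal grid-search sketch (not from the original notes): it assumes a random train/test split of the ratings and keeps the (rank, lambda) pair with the lowest test MSE. With only 16 ratings in this toy dataset the split is purely illustrative.

// minimal sketch: try a few rank/lambda candidates and keep the pair with
// the lowest mean squared error on a held-out split (illustrative values)
val Array(trainSet, testSet) = data.randomSplit(Array(0.8, 0.2))
val testPairs  = testSet.map(r => (r.user, r.product))
val testActual = testSet.map(r => ((r.user, r.product), r.rating))
val results = for (rank <- Seq(8, 10, 12); lambda <- Seq(0.01, 0.03, 0.1, 0.3)) yield {
  val m    = ALS.train(trainSet, rank, 20, lambda)
  val pred = m.predict(testPairs).map(r => ((r.user, r.product), r.rating))
  val mse  = testActual.join(pred).map { case (_, (a, p)) => math.pow(a - p, 2) }.mean()
  (rank, lambda, mse)
}
results.minBy(_._3)   // (rank, lambda, mse) with the lowest test error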

/** To build the ALS model the suggested defaults are fine: rank = 10, 20 iterations, lambda = 0.01 **/
val rank = 10; val iterations = 20; val lambda = 0.01
val model = ALS.train(data,rank,iterations,lambda)
Or:
val model = ALS.train(data,8,10,0.01)

After it runs you can see the MatrixFactorizationModel!
scala> val model = ALS.train(data,rank,iterations,lambda)
model: org.apache.spark.mllib.recommendation.MatrixFactorizationModel = org.apache.spark.mllib.recommendation.MatrixFactorizationModel@3667e643
So how do you inspect the internals of a black box like MatrixFactorizationModel? (See the last module below.)

And what are the dimensions of the rating matrix? (See the second-to-last module below.)
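Before moving on: the tab-completion list in the last module also shows a save method, so the trained model can be persisted and reloaded (a sketch; the path here is a placeholder, not from the original notes):

// save the factor matrices to a placeholder path and load them back
model.save(sc, "/root/als_model")
val sameModel = MatrixFactorizationModel.load(sc, "/root/als_model")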
 

Part 3: Make predictions
Part 4: Join the predicted results with the original ratings

Yesterday's approach:

val usersProducts = data.map{case Rating(user,product,rating) =>(user,product)}

val ratingAndPredictions = data.map{case Rating(user,product,rating) => ((user,product),rating)}.join(model.predict(usersProducts).map{case Rating(user,product,rating)=>((user,product),rating)})


Look at its structure to aid understanding:
scala> usersProducts.collect
res27: Array[(Int, Int)] = Array((1,1), (1,2), (1,3), (1,4), (2,1), (2,2), (2,3), (2,4), (3,1), (3,2), (3,3), (3,4), (4,1), (4,2), (4,3), (4,4))

The output format is ((user, product), (actual rating, predicted rating)):
scala> ratingAndPredictions.collect
res28: Array[((Int, Int), (Double, Double))] = Array(((1,4),(1.0,0.9999058733819626)), ((3,1),(1.0,0.9998962746677607)), ((2,3),(5.0,4.994863065698205)), ((1,2),(1.0,0.9999058733819626)), ((2,1),(5.0,4.994863065698205)), ((4,4),(5.0,4.994911307454755)), ((1,1),(5.0,4.994863065698205)), ((4,2),(5.0,4.994911307454755)), ((2,2),(1.0,0.9999058733819626)), ((4,1),(1.0,0.9998962746677607)), ((2,4),(1.0,0.9999058733819626)), ((3,2),(5.0,4.994911307454755)), ((3,4),(5.0,4.994911307454755)), ((3,3),(1.0,0.9998962746677607)), ((4,3),(1.0,0.9998962746677607)), ((1,3),(5.0,4.994863065698205)))




Today's in-class approach (same idea as yesterday's):

scala> val usersProducts = data.map{case Rating(user,product,rating) =>(user,product)}
usersProducts: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[210] at map at <console>:26

The key of each pair is itself a tuple:
scala> model.predict(usersProducts).map(x => ((x.user,x.product),x.rating))
res6: org.apache.spark.rdd.RDD[((Int, Int), Double)] = MapPartitionsRDD[219] at map at <console>:31

Both the key and the value are 2-tuples:
scala> model.predict(usersProducts).map(x => ((x.user,x.product),x.rating)).join(data.map(x => ((x.user,x.product),x.rating)))
res7: org.apache.spark.rdd.RDD[((Int, Int), (Double, Double))] = MapPartitionsRDD[232] at join at <console>:31

Take the first record and look at it; the output format is ((user, product), (predicted rating, actual rating)):
scala> model.predict(usersProducts).map(x => ((x.user,x.product),x.rating)).join(data.map(x => ((x.user,x.product),x.rating))).take(1)
res8: Array[((Int, Int), (Double, Double))] = Array(((1,1),(4.996339908089835,5.0)))

Print one user's actual and predicted values and take the difference, then do the same for all users (a sketch follows after the next snippet):
scala> res8(0)._2
res11: (Double, Double) = (4.996339908089835,5.0)
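A minimal sketch for the task above (the variable names predAndActual and mse are mine, not from the notes): compute the per-(user, product) difference and the overall mean squared error, using the same join as in today's approach (predicted rating first, actual rating second).

// join predictions with actual ratings: ((user, product), (predicted, actual))
val predAndActual = model.predict(usersProducts).map(x => ((x.user, x.product), x.rating))
  .join(data.map(x => ((x.user, x.product), x.rating)))

// per-pair difference between actual and predicted rating
predAndActual.map { case ((user, product), (pred, actual)) => (user, product, actual - pred) }
  .collect.foreach(println)

// overall mean squared error across all pairs
val mse = predAndActual.map { case (_, (pred, actual)) => math.pow(actual - pred, 2) }.mean()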


[Second-to-last module]
What are the dimensions of the rating matrix?
It is a 4x4 matrix (matrix = users x products); a sketch that rebuilds the full predicted matrix follows the verification below.

Verification:
scala> data.map(x => x.user)
res1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[210] at map at <console>:27

scala> data.map(x => x.user).distinct.count
res3: Long = 4

scala> data.map(x => x.product).distinct.count
res4: Long = 4
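As a sketch (not in the original notes), the full 4x4 grid of predictions can be rebuilt by predicting over the cartesian product of the distinct user ids and product ids:

// all (user, product) pairs, then the model's prediction for each
val allPairs = data.map(_.user).distinct.cartesian(data.map(_.product).distinct)
model.predict(allPairs).collect.sortBy(r => (r.user, r.product)).foreach(println)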

[Last module]
How do you inspect the internals of a black box like MatrixFactorizationModel?

scala> model.
asInstanceOf isInstanceOf predict productFeatures
rank recommendProducts recommendProductsForUsers recommendUsers
recommendUsersForProducts save toString userFeatures

Print the first entry of the user features:
scala> model.userFeatures take 1
res1: Array[(Int, Array[Double])] = Array((3,Array(-0.21575616300106049, -0.5715493559837341, 0.012001494877040386, 0.050375282764434814, 0.1884985715150833, 0.6539813280105591, -0.023888511583209038, 0.355787068605423)))

Print the first entry of the product features:
scala> model.productFeatures take 1
res2: Array[(Int, Array[Double])] = Array((3,Array(-2.5677483081817627, -1.7736809253692627, -0.8949224948883057, 3.5357284545898438, 1.3151004314422607, -1.8309783935546875, -2.596622943878174, 0.4328916370868683)))

To get user 3's predicted rating for product 3, take the inner product of the two feature vectors (a sanity check against model.predict follows at the end of this module):
scala> val user3 = res1
user3: Array[(Int, Array[Double])] = Array((3,Array(-0.21575616300106049, -0.5715493559837341, 0.012001494877040386, 0.050375282764434814, 0.1884985715150833, 0.6539813280105591, -0.023888511583209038, 0.355787068605423)))

scala> val product3 = res2
product3: Array[(Int, Array[Double])] = Array((3,Array(-2.5677483081817627, -1.7736809253692627, -0.8949224948883057, 3.5357284545898438, 1.3151004314422607, -1.8309783935546875, -2.596622943878174, 0.4328916370868683)))

From here it is just plain Scala:
//First take the first element of the array; it is a 2-tuple, and ._2 picks out its second component
scala> val user3 = res1(0)._2
user3: Array[Double] = Array(-0.21575616300106049, -0.5715493559837341, 0.012001494877040386, 0.050375282764434814, 0.1884985715150833, 0.6539813280105591, -0.023888511583209038, 0.355787068605423)

scala> val product3 = res2(0)._2
product3: Array[Double] = Array(-2.5677483081817627, -1.7736809253692627, -0.8949224948883057, 3.5357284545898438, 1.3151004314422607, -1.8309783935546875, -2.596622943878174, 0.4328916370868683)

Zip the two vectors together, multiply element-wise, and sum — that is the inner product:
scala> (user3 zip product3).map(x => x._1 * x._2).sum

(Note: the line actually run in class used x._1 + x._2, which adds the components instead of multiplying them, so the value it printed, -3.9307828275486827, is not the predicted rating for user 3 and product 3; the corrected expression above should be used.)
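As a sanity check (a sketch; predict(user, product) and recommendProducts(user, num) are standard MatrixFactorizationModel methods, as the tab-completion list above shows), the hand-computed inner product can be compared with the model's own prediction, and the model can produce top-N recommendations directly:

// the model's own prediction for (user 3, product 3); should be close to
// the inner product of user3 and product3 computed above
model.predict(3, 3)

// top 2 recommended products for user 3, returned as an Array[Rating]
// sorted by predicted rating
model.recommendProducts(3, 2)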
