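Implementing two recommendation algorithms
1. Neighborhood-based methods (collaborative filtering): user-based and item-based.
2. Latent-factor methods (matrix factorization): SVD.
Both are implemented with surprise, a Python library for recommender systems.
surprise is a SciKit (distributed as scikit-surprise). It is simple to use and supports several families of recommendation algorithms: basic algorithms, collaborative filtering, and matrix factorization (latent factor models).
surprise documentation: https://surprise.readthedocs.io/en/stable/getting_started.html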
import os
import io
import collections

import pandas as pd
from surprise import Dataset, KNNBaseline, SVD, accuracy, Reader
from surprise.model_selection import cross_validate, train_test_split

# Collaborative filtering
# Load movielens-100k, a classic public recommender-system dataset.
# If it is not available locally, surprise asks whether to download it.
data = Dataset.load_builtin('ml-100k')

# Alternatively, load the dataset from a local file.
# Path to the dataset file
file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')
# Reader describes the text format: line_format names the fields (columns), sep is the separator
reader = Reader(line_format='user item rating timestamp', sep='\t')
# Load the dataset
data = Dataset.load_from_file(file_path, reader=reader)

data_df = pd.read_csv(file_path, sep='\t', header=None,
                      names=['user', 'item', 'rating', 'timestamp'])
item_df = pd.read_csv(os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.item'),
                      sep='|', encoding='ISO-8859-1', header=None,
                      names=['mid', 'mtitle'] + list(range(22)))
# Convert every column to str, because surprise uses string raw ids for data read from a file
data_df = data_df.astype(str)
item_df = item_df.astype(str)
# Map movie id to movie title
item_dict = dict(zip(item_df['mid'], item_df['mtitle']))
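About the dataset: the ratings were collected from the movie site movielens.umn.edu over seven months, from September 19, 1997 to April 22, 1998.
Inspecting the dataset: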
root@c:~$ cd ~/.surprise_data/ml-100k/ml-100k
root@c:ml-100k$ ls
allbut.pl  u1.base  u2.test  u4.base  u5.test  ub.base  u.genre  u.occupation
mku.sh     u1.test  u3.base  u4.test  ua.base  ub.test  u.info   u.user
README     u2.base  u3.test  u5.base  ua.test  u.data   u.item
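The most important files are u.data and u.item.
u.data holds 100,000 ratings given by 943 users to 1,682 movies. Every user has rated at least 20 movies. The columns are user id, movie id, rating, and timestamp.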
root@c:ml-100k$ sed -n '1,5p' u.data
196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
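To make the column layout concrete, here is a minimal sketch that parses rows like the ones above using only the standard library. The names Rating and parse_rating_line are illustrative, not part of surprise or pandas, and the sample string simply repeats two of the rows shown.

```python
from collections import namedtuple

# One row of u.data: user id, movie id, rating, Unix timestamp.
Rating = namedtuple("Rating", ["user", "item", "rating", "timestamp"])


def parse_rating_line(line):
    """Split one tab-separated u.data line into a Rating record."""
    user, item, rating, timestamp = line.rstrip("\n").split("\t")
    # Ids are kept as strings, matching the string raw ids used elsewhere in this post.
    return Rating(user, item, float(rating), int(timestamp))


sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"
rows = [parse_rating_line(line) for line in sample.splitlines()]
print(rows[0])
```

Keeping the ids as strings is the reason the functions later in this post are called with '1' and '2' rather than integers.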
u.item holds the details of each movie. The first two columns are the movie id and the movie title.
root@c:ml-100k$ sed -n '1,5p' u.item
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
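The trailing 0/1 columns in u.item are genre flags. The sketch below splits one line on '|' and decodes those flags. The genre order in GENRES is an assumption taken from the dataset's u.genre file, so check it against your local copy; parse_item_line is an illustrative helper, not a library function.

```python
# Genre flag order, assumed to follow the dataset's u.genre file (verify locally).
GENRES = [
    "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]


def parse_item_line(line):
    """Return (movie id, title, genre names) for one pipe-separated u.item line."""
    fields = line.rstrip("\n").split("|")
    movie_id, title = fields[0], fields[1]
    # Fields 2-4 are release date, video release date and IMDb URL;
    # the genre flags start at field 5.
    flags = fields[5:5 + len(GENRES)]
    genres = [name for name, flag in zip(GENRES, flags) if flag == "1"]
    return movie_id, title, genres


line = ("1|Toy Story (1995)|01-Jan-1995||"
        "http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)"
        "|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0")
print(parse_item_line(line))
```

Only the first two fields are needed for the id-to-title mapping used in this post; the genres are decoded here just to show what the remaining columns mean.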
User-based collaborative filtering:
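# Similarity configuration for the collaborative filtering algorithms
# user-based
user_based_sim_option = {'name': 'pearson_baseline', 'user_based': True}
# item-based
item_based_sim_option = {'name': 'pearson_baseline', 'user_based': False}

# Recommend n movies to a user with user-based collaborative filtering:
# find the 10 most similar users, then add the movies they rated highly to the recommendation list.
def get_similar_users_recommendations(uid, n=10):
    # Build the training set; here it is the entire dataset
    trainset = data.build_full_trainset()
    # kNN collaborative filtering that takes a baseline rating into account.
    # The keyword must be sim_options; a misspelled keyword such as sim_option
    # is accepted silently and the default similarity is used instead.
    algo = KNNBaseline(sim_options=user_based_sim_option)
    # Fit on the training set
    algo.fit(trainset)
    # Convert the raw id to surprise's inner id
    inner_id = algo.trainset.to_inner_uid(uid)
    # get_neighbors returns the 10 most similar users
    neighbors = algo.get_neighbors(inner_id, k=10)
    neighbors_uid = (algo.trainset.to_raw_uid(x) for x in neighbors)
    recommendations = set()
    # Add the movies each neighbor rated 5 to the recommendation list
    for user in neighbors_uid:
        if len(recommendations) >= n:
            break
        item = data_df[data_df['user'] == user]
        item = item[item['rating'] == '5']['item']
        for i in item:
            recommendations.add(item_dict[i])
    print('\nrecommendations for user %s:' % uid)
    for i, j in enumerate(list(recommendations)):
        if i >= n:
            break
        print(j)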
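Recommending 10 movies to the user with id 1:
In []: get_similar_users_recommendations('1', 10)
Out[]: Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.

recommendations for user %s:
Lawrence of Arabia (1962)
Full Monty, The (1997)
Winter Guest, The (1997)
Air Force One (1997)
Hoop Dreams (1994)
Game, The (1997)
English Patient, The (1996)
Mrs. Brown (Her Majesty, Mrs. Brown) (1997)
Contact (1997)
Liar Liar (1997)
The output is shown as it was recorded. Its log line reports the msd similarity and its header prints a literal %s, so this run used surprise's default similarity settings (KNNBaseline only reads the keyword sim_options) and a print call without its format argument. A run with pearson_baseline actually applied may return different titles.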
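For intuition about what the neighborhood step does, here is a small pure-Python sketch of user-based recommendation on made-up data. It is not the function above and not surprise's implementation: the ratings dictionary is hypothetical, the similarity follows the 1 / (msd + 1) form that surprise documents for its MSD measure, and unlike the function above it skips movies the target user has already rated.

```python
# Toy ratings: user -> {item: rating}. Hypothetical data, not taken from ml-100k.
ratings = {
    "u1": {"A": 5, "B": 3, "C": 4},
    "u2": {"A": 5, "B": 3, "C": 4, "D": 5},
    "u3": {"A": 1, "B": 5, "E": 5},
    "u4": {"A": 4, "B": 3, "C": 5, "D": 2, "F": 5},
}


def msd_similarity(a, b):
    """1 / (mean squared difference + 1) over the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    msd = sum((a[i] - b[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)


def nearest_users(target, k):
    """Return the k users most similar to target, most similar first."""
    scored = [(msd_similarity(ratings[target], r), u)
              for u, r in ratings.items() if u != target]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [u for _, u in scored[:k]]


def recommend(target, k=2, n=10):
    """Collect items the k nearest users rated 5 and the target has not rated."""
    seen = set(ratings[target])
    picks = []
    for user in nearest_users(target, k):
        for item, rating in sorted(ratings[user].items()):
            if rating == 5 and item not in seen and item not in picks:
                picks.append(item)
    return picks[:n]


print(nearest_users("u1", 2))  # ['u2', 'u4']
print(recommend("u1"))         # ['D', 'F']
```

Here u2 agrees with u1 on every shared item, so it is the closest neighbor, and its 5-star movie D comes first in the list.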
Item-based collaborative filtering:
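# Find the n movies most similar to a given movie with item-based collaborative filtering.
def get_similar_items(iid, n=10):
    trainset = data.build_full_trainset()
    # The keyword must be sim_options; otherwise 'user_based': False is never applied
    # and the neighbors returned are users rather than movies.
    algo = KNNBaseline(sim_options=item_based_sim_option)
    algo.fit(trainset)
    # Convert the raw movie id to surprise's inner id
    inner_id = algo.trainset.to_inner_iid(iid)
    # get_neighbors returns the n most similar movies
    neighbors = algo.get_neighbors(inner_id, k=n)
    neighbors_iid = (algo.trainset.to_raw_iid(x) for x in neighbors)
    recommendations = [item_dict[x] for x in neighbors_iid]
    print('\nten movies most similar to the %s:' % item_dict[iid])
    for i in recommendations:
        print(i)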
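The ten movies most similar to the movie with id 2 (GoldenEye (1995)):
In []: get_similar_items('2')
Out[]: Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.

ten movies most similar to the GoldenEye (1995):
Evil Dead II (1987)
Hoop Dreams (1994)
Speed (1994)
Grand Day Out, A (1992)
Ed Wood (1994)
Adventures of Pinocchio, The (1996)
Highlander (1986)
Unforgiven (1992)
Down Periscope (1996)
Bullets Over Broadway (1994)
The output is shown as it was recorded. The log line reports the msd similarity, which means this run used surprise's default user-based settings rather than the item-based pearson_baseline configuration (KNNBaseline only reads the keyword sim_options). These titles should therefore not be read as true item-to-item neighbors, and a run with the item-based options applied will most likely differ.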
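Item-based filtering is the same neighborhood idea with the roles swapped: two movies are similar when the users who rated both gave them similar scores. The sketch below uses a plain Pearson correlation on made-up data. Note that this is not surprise's pearson_baseline, which centers ratings on baseline estimates and applies shrinkage; item_ratings, pearson, and most_similar are illustrative names.

```python
from math import sqrt

# Toy ratings: item -> {user: rating}. Hypothetical data, not taken from ml-100k.
item_ratings = {
    "A": {"u1": 5, "u2": 4, "u3": 1, "u4": 2},
    "B": {"u1": 4, "u2": 5, "u3": 2, "u4": 1},
    "C": {"u1": 1, "u2": 2, "u3": 5, "u4": 4},
    "D": {"u1": 3, "u2": 3, "u3": 4},
}


def pearson(a, b):
    """Pearson correlation of two items over the users who rated both."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[u] for u in common) / len(common)
    mean_b = sum(b[u] for u in common) / len(common)
    cov = sum((a[u] - mean_a) * (b[u] - mean_b) for u in common)
    var_a = sum((a[u] - mean_a) ** 2 for u in common)
    var_b = sum((b[u] - mean_b) ** 2 for u in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def most_similar(target, n):
    """Return the n items most correlated with target, most similar first."""
    scored = [(pearson(item_ratings[target], r), i)
              for i, r in item_ratings.items() if i != target]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:n]]


print(most_similar("A", 1))
```

In this toy data, A and B are liked and disliked by the same users, so B is A's nearest neighbor, while C is rated in exactly the opposite pattern and ends up last.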
Matrix factorization with SVD:
# SVD: predict every user's rating for the movies they have not rated,
# then store each user's n highest-scoring movies in a dictionary.
def get_recommendations_dict(n=10):
    trainset = data.build_full_trainset()
    # Test set: every (user, movie) pair that has no rating
    testset = trainset.build_anti_testset()
    # Use the SVD algorithm
    algo = SVD()
    algo.fit(trainset)
    # Predict
    predictions = algo.test(testset)
    # Root mean squared error. In the anti-testset the "true" rating is filled
    # with the global mean, so this value is not an accuracy measure on real ratings.
    print("RMSE: %s" % accuracy.rmse(predictions))
    # Dictionary holding each user's n highest-scoring movies
    user_recommendations = collections.defaultdict(list)
    for uid, iid, r_ui, est, details in predictions:
        user_recommendations[uid].append((iid, est))
    for uid, user_ratings in user_recommendations.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        user_recommendations[uid] = user_ratings[:n]
    return user_recommendations

# Get the 10 highest-scoring movies for every user
user_recommendations = get_recommendations_dict(10)

# Print the titles of the movies recommended to a user
def rec_for_user(uid):
    print("recommendations for user %s:" % uid)
    for i in user_recommendations[uid]:
        print(item_dict[i[0]])
Recommending 10 movies to the user with id 1:
In []: rec_for_user('1')
Out[]: recommendations for user 1:
L.A. Confidential (1997)
Secrets & Lies (1996)
Ran (1985)
Lawrence of Arabia (1962)
One Flew Over the Cuckoo's Nest (1975)
Raise the Red Lantern (1991)
In the Name of the Father (1993)
City of Lost Children, The (1995)
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)
Faust (1994)
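To show what the latent-factor model is fitting, here is a minimal pure-Python sketch of biased matrix factorization trained with stochastic gradient descent. The prediction form, global mean plus user bias plus item bias plus a dot product of factor vectors, follows surprise's description of its SVD algorithm, but this is not the library's code: train_mf, rmse, top_n, the toy ratings, and all hyperparameter values are illustrative assumptions.

```python
import random
from math import sqrt

# Toy (user, item, rating) triples. Hypothetical data, not taken from ml-100k.
toy = [
    ("u1", "A", 5), ("u1", "B", 4), ("u1", "C", 1),
    ("u2", "A", 4), ("u2", "B", 5), ("u2", "D", 2),
    ("u3", "C", 5), ("u3", "D", 4), ("u3", "A", 1),
    ("u4", "B", 1), ("u4", "C", 4), ("u4", "D", 5),
]


def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_ui ~ mu + b_u + b_i + q_i . p_u by SGD and return a predict function."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Gradient step on the regularized squared error
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return predict


def rmse(predict, ratings):
    """Root mean squared error of predict over known ratings."""
    return sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings))


def top_n(predict, ratings, user, n):
    """Rank the items the user has not rated by predicted score."""
    seen = {i for u, i, _ in ratings if u == user}
    items = sorted({i for _, i, _ in ratings})
    scored = [(predict(user, i), i) for i in items if i not in seen]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:n]]


global_mean = sum(r for _, _, r in toy) / len(toy)
baseline_rmse = rmse(lambda u, i: global_mean, toy)
predict = train_mf(toy)
trained_rmse = rmse(predict, toy)
print(baseline_rmse, trained_rmse)
print(top_n(predict, toy, "u1", 2))
```

The comparison against the global-mean baseline is computed on the training ratings themselves, so it only shows that the model fits the data; a held-out split would be needed to estimate real accuracy.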