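Chapter 2: Making Use of User Behavior Data
- Introduction to user behavior data
In its simplest form, user behavior data exists as logs.
Feedback forms of behavior data and how they compare: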
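- User behavior analysis
Analyzing the data before designing an algorithm makes the algorithm design more targeted.
Distribution of user activity and item popularity: both follow a power-law (long-tail) distribution.
Relationship between user activity and item popularity: as a general rule, the longer someone has been a user, the more they tend to browse unpopular items.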
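To make the two quantities above concrete, here is a minimal sketch that counts user activity and item popularity from a list of (user, item) records. The function names and the toy log are hypothetical and only for illustration. If the data is long-tailed, plotting each count value k against the number of items (or users) with that count on log-log axes should give a roughly straight line.

```python
from collections import Counter

def activity_and_popularity(data):
    """Count records per user (activity) and records per item (popularity)."""
    user_activity = Counter(user for user, _ in data)
    item_popularity = Counter(item for _, item in data)
    return user_activity, item_popularity

def count_distribution(counts):
    """Map each count value k to how many users (or items) have exactly k records."""
    return Counter(counts.values())

# Hypothetical toy log of (user, item) pairs
data = [("A", "a"), ("A", "b"), ("A", "c"), ("B", "a"), ("C", "a"), ("C", "b")]
user_activity, item_popularity = activity_and_popularity(data)
popularity_dist = count_distribution(item_popularity)
```

On a real log, `popularity_dist` is the series one would plot to check for the power-law shape.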
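In general, algorithms based solely on user behavior data are called collaborative filtering algorithms. There are many approaches: neighborhood-based methods, latent factor models, random walks on graphs…
Neighborhood-based methods include user-based collaborative filtering and item-based collaborative filtering, which later notes will cover.
- Experiment design and algorithm evaluation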
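Cross-validation (Python code for splitting the dataset is attached):
import random

def SplitData(data, M, k, seed):
    # Split (user, item) records into M folds; fold k becomes the test set.
    test = []
    train = []
    random.seed(seed)
    for user, item in data:
        # randint is inclusive at both ends, so M - 1 gives exactly M folds
        if random.randint(0, M - 1) == k:
            test.append([user, item])
        else:
            train.append([user, item])
    return train, test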
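The metric functions in these notes index the data by user (for example `train[user].keys()`), whereas SplitData returns flat lists of [user, item] pairs. One possible bridge is sketched below. The helper name `to_user_items` and the placeholder value 1 (standing in for "the user interacted with this item") are assumptions, not part of the original notes.

```python
def to_user_items(records):
    """Group [user, item] pairs into a nested dict {user: {item: 1}}."""
    user_items = {}
    for user, item in records:
        # The value 1 is only a placeholder marking that an interaction happened
        user_items.setdefault(user, {})[item] = 1
    return user_items

# Hypothetical output of a split
train_pairs = [["A", "a"], ["A", "b"], ["B", "a"]]
train = to_user_items(train_pairs)
```

For M-fold cross-validation, the split would be repeated with the same seed for each fold index k, and the resulting metrics averaged.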
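- Evaluation metrics:
Evaluate with precision and recall.
Let R(u) denote the list of N items recommended to user u, and T(u) the set of items that user u actually liked in the test set: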
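With these two symbols, precision and recall can be written in the standard way below. This is a reconstruction that matches what the code in these notes computes (hits summed over all users, divided by the total number of liked or recommended items), not a formula copied from the original.

```latex
\mathrm{Recall} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |T(u)|},
\qquad
\mathrm{Precision} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |R(u)|}
```

When every user receives exactly N recommendations, the precision denominator is N times the number of users.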
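The Python implementation is as follows:
def Recall(train, test, N):
    # GetRecommandation(user, N) is defined elsewhere; it returns (item, score) pairs.
    hit = 0
    total = 0
    for user in train.keys():
        tu = test.get(user, {})  # T(u): items the user liked in the test set
        rank = GetRecommandation(user, N)
        for item, pui in rank:
            if item in tu:  # a hit is a recommended item that is in T(u)
                hit += 1
        total += len(tu)
    return hit / (total * 1.0)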
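def Precision(train, test, N):
    # Same hit count as Recall, but divided by the number of recommended items.
    hit = 0
    total = 0
    for user in train.keys():
        tu = test.get(user, {})  # T(u): items the user liked in the test set
        rank = GetRecommandation(user, N)
        for item, pui in rank:
            if item in tu:  # a hit is a recommended item that is in T(u)
                hit += 1
        total += N
    return hit / (total * 1.0)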
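A small worked example can help check the two metrics by hand. The sketch below is self-contained: the function `precision_recall`, the stub `recommend`, and the toy data are all hypothetical, and the recommender is passed in as an argument rather than called as a global. It also counts the items actually returned instead of assuming exactly N per user.

```python
def precision_recall(train, test, N, recommend):
    """Return (precision, recall) for a recommender given as a callable."""
    hit = 0
    n_recommended = 0
    n_relevant = 0
    for user in train:
        tu = test.get(user, {})          # T(u)
        rank = recommend(user, N)        # R(u) as (item, score) pairs
        hit += sum(1 for item, _ in rank if item in tu)
        n_recommended += len(rank)
        n_relevant += len(tu)
    return hit / n_recommended, hit / n_relevant

# Hypothetical toy data
train = {"A": {"a": 1}, "B": {"b": 1}}
test = {"A": {"c": 1, "d": 1}, "B": {"c": 1}}

def recommend(user, N):
    # Stub recommender: the same fixed list for every user
    return [("c", 0.9), ("e", 0.5)][:N]

precision, recall = precision_recall(train, test, 2, recommend)
```

Here each user gets one hit ("c") out of two recommendations, so precision is 2/4, and the three liked test items yield a recall of 2/3.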
Evaluate with coverage:
Python implementation of coverage:
def Coverage(train, test, N):
    # Fraction of all training items that are recommended to at least one user.
    recommend_item = set()
    all_item = set()
    for user in train.keys():
        for item in train[user].keys():
            all_item.add(item)
        rank = GetRecommandation(user, N)
        for item, pui in rank:
            recommend_item.add(item)
    return len(recommend_item) / (len(all_item) * 1.0)
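In formula form, the coverage computation corresponds to the expression below, where I is the set of all items seen in the training data and U the set of users. The symbols I and U are introduced here for illustration and do not appear in the original notes.

```latex
\mathrm{Coverage} = \frac{\left| \bigcup_{u \in U} R(u) \right|}{|I|}
```

A value near 1 means almost every item gets recommended to someone. A low value means the recommendations concentrate on a small part of the catalog.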
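Evaluate with novelty:
Compute the average popularity of the recommended items. If the average popularity is high, novelty is low; otherwise novelty is relatively high. Python implementation:
import math

def Popularity(train, test, N):
    # Popularity of an item = number of users who interacted with it in train.
    item_popularity = dict()
    for user, items in train.items():
        for item in items.keys():
            if item not in item_popularity:
                item_popularity[item] = 0
            item_popularity[item] += 1
    ret = 0
    n = 0
    for user in train.keys():
        rank = GetRecommandation(user, N)
        for item, pui in rank:
            ret += math.log(1 + item_popularity[item])
            n += 1
    ret /= n * 1.0
    return ret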
Item popularity is long-tailed, so taking the logarithm makes the average more stable.
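The effect of the logarithm can be seen on a small, made-up list of popularity counts: one very popular item dominates the raw mean but shifts the log-scaled mean far less. The numbers and variable names below are hypothetical.

```python
import math

def raw_mean(values):
    return sum(values) / len(values)

def log_mean(values):
    # Same transform as in Popularity: log(1 + popularity)
    return sum(math.log(1 + v) for v in values) / len(values)

tail_only = [1, 1, 2, 3]           # hypothetical long-tail items
with_head = tail_only + [1000]     # add one very popular item

# How much each average grows when the head item is added
ratio_raw = raw_mean(with_head) / raw_mean(tail_only)
ratio_log = log_mean(with_head) / log_mean(tail_only)
```

With these counts the raw mean grows by more than a factor of 100, while the log-scaled mean grows by less than a factor of 3.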