Slope One 算法是由 Daniel Lemire 教授在 2005 年提出的一个 Item-Based 推荐算法。
Slope One 算法试图同时满足这样的的 5 个目标:
1. 易于实现和维护:普通工程师可以轻松解释所有的聚合数据,并且算法易于实现和测试。
2. 运行时可更新的:新增一个评分项,应该对预测结果即时产生影响。
3. 高效率的查询响应:快速的执行查询,可能需要付出更多的空间占用作为代价。
4. 对初次访问者要求少:对于一个评分项目很少的用户,也应该可以获得有效的推荐。
5. 合理的准确性:与最准确的方法相比,此方法应该是有竞争力的,准确性方面的微小增长不能以简单性和扩展性的大量牺牲为代价。
使用这个图可以简明扼要的说明一下 Slope One 算法。
1. User A 给 Item I 打分为 1;给 Item J 打分为 1.5。
2. Uesr B 给 Item I 打分为 2。
3. 问题是:User B 给 Item J 打分为多少?
4. 使用 Slope One 算法,答案是:2.5,2+(1.5-1)=2.5。
是不是非常简单?!Slope One 算法就是这么简单,而且它居然还相当有效!详细的试验分析可以看这里“Slope One Predictors for Online Rating-Based Collaborative Filtering”。
喜欢 Python 的朋友可以看这篇 Blog,“tutorial about how to implement Slope One in Python”, 非常详细的介绍了 Slope One 算法在 Python 下的实现步骤。当然了,这只是一个非常简单的实现,你可以使用 MovieLens 或者 EachMovie 的数据集进行一些简单地试验。但如果真正要把它投入到商业环境,还有许多其他的工作必须做好。
python实现
During a lunchtime conversation the other day, a coworker mentioned that he was hacking in his spare time on an entry for the Netflix Prize. This got me to thinking about collaborative filtering: why had I never seen a good description of how to do it? I suspect that people who might ordinarily have a casual interest in the subject hear that there are some statistics involved, whereupon they immediately freeze in the mathematical headlights, and turn the conversation to something else, anything else. In early 2005, a researcher named Daniel Lemire published, with Anna Maclachlan, a paper with the jazzy title of “Slope One Predictors for Online Rating-Based Collaborative Filtering“. This is an important paper, because it presents a family of really simplecollaborative filtering schemes. I mean really simple: there are no statistics involved, just a little bit of linear algebra. Unfortunately, because the Lemire/Maclachlan paper is aimed at an academic audience, it’s not trivial to tell by merely reading it how simple the technique it presents is, so what follows is my attempt to translate its ideas into my native language of Attention Deficit Programmerese. To make things more concrete, I’m going to present an implementation in less than 40 lines of Python (and I’ll try to explain any obscure bits of Python as I go). Cool, huh? Regardless of the underlying implementation, collaborative filters tend to try to solve the same broad problem using much the same data.
- You have your crowd of users.
- You have your big pile of items.
- Some of your users have been generous enough to tell you what they think of some of your items.
- You want to hawk more items to a user, and you’d prefer to make your suggestions relevant to their likely interests.
Items 1 through 3 suggest that you could use the opinions that people have recorded about shiny doodads they have seen, to give a good guess as to which doodads they haven’t seen, but might like. I am not going to explain the mathematics behind the Slope One predictor, because Lemire and Maclachlan’s paper pretty much pulls the numerical justification out of a hat. I’m not immersed in the literature of the field, so I’ll be lazy and take them at their word when they say their approach works; and hey, that’s easy code to write, so let’s get hacking! (The scheme I’m going to talk about is referred to as “Weighted Slope One” in the Lemire and Maclachlan paper.) My Slope One implementation stores its data in two 2D matrices. The X and Y axes are identical in each matrix. If the items to be rated are “cuttlefish”, “octopus”, and “squid”, we end up with two 3×3 matrices, each with the following rows and columns: The coloured cells above indicate which elements of the matrix contain useful information. I’ll describe why most of the matrix is empty later. To keep things simple, let’s represent our matrices as two levels of Python dict objects (a dict is simply a hash table, if you’re not familiar with Python). The key of each dict is an item ID, so to get the matrix entry for “squid vs cuttlefish”, we look in the first-level dict for squid, then the second-level dict for cuttlefish.
The self.diffs matrix will hold differential ratings (I’ll describe what these are soon), and the self.freqs matrix will contain the number of times we’ve computed a differential rating for each pair of items. We’ll fill in the first and second levels of the dicts once we know what our data looks like, so let’s talk about the shape of the data. We need to be able to express ratings such as “Alice rated a squid as 1, and a cuttlefish as 4″. This is easy to express with two levels of dicts; the first maps from a user to all of that user’s ratings, and the second maps items to their ratings. In other words, our user data looks like this in Python:
The time has come to load the data into our matrices; let’s call the method update.
To fill in the matrices, we must look at each rating from every user.
For each user and rating they have submitted, we will be iterating over every rating they have submitted a second time. The first thing we do is make sure that the necessary second-level dicts in each matrix exist.
The setdefault call is an unfortunately ugly Pythonic idiom that means “if there’s no value for the given key (the first parameter) in the dict, assign it the given value (the second parameter)”. Inside this loop, we iterate over the user’s ratings again, to compare each rating against every other.
Here, we ensure that there are zeroes present in the matrix for every pair of ratings. (Notice that we’re using an integer zero to count frequencies, and a floating point zero for ratings.) For every pair of ratings, we modify the appropriate cell in self.freqs by adding one to the number of times we’ve seen that pair of ratings.
And to update the self.diffs matrix, we compute the difference between the ratings for the two items.
This is why I called self.diffs the “differential rating” matrix. Every time a user rates an item, we update the appropriate cells in this matrix to reflect the accumulated difference between that rating and every other rating the user has made. Once we’re done, we scale each differential rating based on the number of times we’ve seen that pair of items.
Now let’s pull the whole thing together, so you can see how simple this updatemethod is.
That’s a measly 13 lines of code, 4 of which are due to us using dict objects to represent matrices. Nice! (By the way, notice that the update method made no use of the user IDs we passed in. This technique doesn’t care who rated what, just how an item has been rated. Collaborative filtering techniques that don’t use user information directly are called “item-based”, while those that do are called “user-based”.) It’s easy to fill out our matrices:
With a little data in place, we’d like to make a prediction. To do this, we’ll pass the ratings a user has made to the predict method; it will return predictions of ratings for items the user hasn’t rated.
We begin with two empty dicts; preds will accumulate prediction data, and freqswill accumulate counts.
We iterate over every item the user has rated, for items that have been rated by anyone (the try/except clause below lets us skip over unrated item pairs).
We make sure there are zeroes present in our dicts before we accumulate anything into them.
The accumulation step is simple.
We must filter the result a little before we return it. We return a new dict, containing weighted predictions for every item, but omitting items that the user has rated (no point in predicting those), and omitting items for which we have a cumulative frequency of zero.
And now let’s see the predict method without all of those intrusive comments, again to see how simple it is.
Our proof-of-concept Weighted Slope One predictor clocks in at 33 lines of Python. If we had a sparse 2D matrix class handy, we could trim six of those lines. That’s not bad! It wouldn’t be difficult to go from this trivial implementation to something truly efficient. From the way we construct the matrices in the update method, we can make a few potentially space saving observations.
- Some portions of each matrix are entirely empty. They don’t contain zeroes; they contain nothing, because nobody has rated those pairs of items. If Alice rates squid and cuttlefish, while Bob rates squid and octopus, there will be no entries for cuttlefish vs octopus in either matrix.
- The diagonal of the self.diffs matrix will always be either empty or zero, so there’s no need to store values on the diagonal of the self.freqsmatrix.
- Each entry in the upper right of the self.diffs matrix is the negative of the symmetric entry in the bottom left. The upper right of the self.freqsmatrix is symmetric to the bottom left (corresponding entries have identical values).
We could thus represent each matrix as a strictly lower triangular matrix. Because much of a matrix is going to be empty for real databases of items and ratings, we would save a lot by storing it in sparse form. If you want to build a web site that uses collaborative filtering, and you’re not a specialist, the Slope One scheme has a lot going for it. It is simple to implement, and the arithmetic is easy to understand (in stark contrast to techniques that use singular value decomposition, cosine correlation or Pearson’s correlation coefficient, or other gymnastic statistical approaches). It’s also easy to see that updates and predictions can both be performed in data parallel style (predictions are easy to express in mapreduce form). Better yet, it’s easy and cheap to update the Slope One data structures on the fly, instead of taking everything offline and performing a heavyweight nightly thrashing of a database. You can download a copy of my sample Weighted Slope One implementation as slope_one.py. Have fun!