I'm currently developing a website where users can search for other users based on attributes (age, height, town, education, etc.). I now want to implement some kind of rating between user profiles. The rating is calculated by its own algorithm based on the similarity between the two given profiles. For example, User A has a "match rating" of 85 with User B and 79 with User C, while B and C have a rating of 94 with each other, and so on.
The user should be able to search for certain attributes and filter the results by rating.
Since the rating differs from profile to profile and also depends on the user doing the search, I can't simply add a field to my users table and use ORDER BY. So far I have come up with two solutions:
-
My first solution was to have a nightly batch job that calculates the rating for every possible user combination and stores it in a separate table (user1, user2, rating). I can then join this table with the users table and order the result by rating. After doing some math, I figured out that this solution doesn't scale that well.
Based on the formula n * (n - 1) / 2, there are 45 possible combinations for 10 users. For 1,000 users I suddenly have to insert 499,500 rating combinations into my rating table.
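For illustration, a minimal sketch of what this first solution might look like in MySQL; the `user_rating` table, its column names, and the `users(id, town)` columns are assumptions for the example, not part of the actual schema:

```sql
-- Hypothetical pair table filled by the nightly batch job
CREATE TABLE user_rating (
    user1_id INT UNSIGNED NOT NULL,
    user2_id INT UNSIGNED NOT NULL,
    rating   TINYINT UNSIGNED NOT NULL,        -- 0..100 match rating
    PRIMARY KEY (user1_id, user2_id),
    KEY idx_user1_rating (user1_id, rating)
);

-- "Users living in New York, best matches for user 42 first"
SELECT u.*, r.rating
FROM users u
JOIN user_rating r ON r.user1_id = 42 AND r.user2_id = u.id
WHERE u.town = 'New York'
ORDER BY r.rating DESC
LIMIT 100;
```

Note that if the batch job stores each pair only once (say, with user1 < user2), the join has to look in both directions, or the job stores both (A, B) and (B, A).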
-
The second solution was to leave MySQL out of it and just calculate the rating on the fly within my application. This also doesn't scale well. Let's say the search should only return 100 results to the UI (with the highest rated on top). If I have 10,000 users and I want to search for every user living in New York sorted by rating, I have to load EVERY user living in NY into my app (let's say 3,000), apply the algorithm, and then return only the top 100 to the user. This way I have loaded 2,900 useless user objects from the DB and wasted CPU on the algorithm without ever doing anything with the results.
Any ideas on how I can design this in my MySQL DB or web app so that a user can have an individual rating with every other user, in a way that the system scales beyond a couple of thousand users?
3 Answers
#1
If you have to match every user against every other user, the algorithm is O(N^2), whatever you do.
If you can exploit some sort of 1-dimensional "metric", then you can try and associate each user with a single synthetic value. But that's awkward and could be impossible.
But what you can do is note which users have had a change in their profile (whenever any of the parameters on which the matching is based changes). At that point you can batch-recalculate the table for those users only, thus working in O(N): if you have 10,000 users and only 10 require recalculation, you have to examine 100,000 records instead of 100,000,000.
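A minimal sketch of this, assuming a hypothetical `dirty_users` table that the application (or a trigger) fills whenever a matching-relevant attribute changes, and the pair table from the question:

```sql
-- Throw away the pairs that involve a changed user ...
DELETE r
FROM user_rating r
JOIN dirty_users d
  ON d.user_id = r.user1_id OR d.user_id = r.user2_id;

-- ... hand only the pairs involving changed users back to the
-- application (or the batch job) for recalculation ...
SELECT d.user_id, u.id AS other_id
FROM dirty_users d
JOIN users u ON u.id <> d.user_id;

-- ... and clear the list once the ratings have been re-inserted.
TRUNCATE TABLE dirty_users;
```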
Other strategies would be to only run the main algorithm for records which have a greater chance of being compared: in your example, "same city". Or, when updating records (but this would require storing (user_1, user_2, ranking, last_calculated)), only recalculate those records with a high ranking, that are very old, or that have never been calculated. The lowest-ranked matches aren't likely to change so much that they float to the top in a short time.
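A hedged sketch of that second idea, assuming the pair table also carries a `last_calculated` timestamp; the thresholds (70, 30 days, batch size) are arbitrary:

```sql
-- Candidate pairs for the next recalculation run: high-ranking, stale,
-- or never calculated.
SELECT user1_id, user2_id
FROM user_rating
WHERE rating >= 70
   OR last_calculated IS NULL
   OR last_calculated < NOW() - INTERVAL 30 DAY
ORDER BY rating DESC
LIMIT 10000;
```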
UPDATE
The problem also requires O(N^2) storage space.
How do we reduce this space? I can see two approaches. One is to not put some information in the match table at all. The "match" function makes more sense the more rigid and steep it is; having ten thousand "good matches" would mean that a match means very little. So we would still need lots of recalculations when User1 changes some key data, in case it brings some of User1's "no-no" matches back into the "maybe" zone. But we would keep a smaller clique of active matches for each user.
Storage would still grow quadratically, but less steeply.
Another strategy would be to recalculate the match on demand. Then we would need to develop some method for quickly selecting which users are likely to have a good match (thus limiting the number of rows retrieved by the JOIN), and some method to quickly calculate a match; this could entail somehow rewriting the match between User1 and User2 as a very simple function of a subset of DataUser1 and DataUser2 (maybe using ancillary columns).
The challenge would be to leverage MySQL capabilities and offload some calculations to the MySQL engine.
For this purpose you might perhaps "map" some data, at input time (therefore in O(k)), to spatial information, or to strings, and employ the Levenshtein distance.
The storage for a single user would grow, but it would grow linearly, not quadratically, and MySQL SPATIAL indexes are very efficient.
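A rough sketch of the spatial idea, under the assumptions spelled out in the comments; whether the optimizer actually uses the SPATIAL index for the pre-filter should be verified with EXPLAIN:

```sql
-- Assumes an extra column profile_point POINT NOT NULL (default Cartesian
-- SRID 0) with a SPATIAL index on users; projecting age and height into
-- it is an illustrative choice, not something from the original question.
UPDATE users SET profile_point = POINT(age, height);

-- Cheap pre-filter: candidate matches for user 42 are the users whose
-- projected point falls inside a box around user 42's own point. The real
-- rating algorithm then only runs on this much smaller candidate set.
SET @me = (SELECT profile_point FROM users WHERE id = 42);

SELECT id
FROM users
WHERE MBRContains(
        ST_MakeEnvelope(POINT(ST_X(@me) - 5, ST_Y(@me) - 10),
                        POINT(ST_X(@me) + 5, ST_Y(@me) + 10)),
        profile_point);
```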
#2
If the search should only return the top 100 best matches, then why not just store those? It sounds like you would never want to search the bottom end of the results anyway, so just don't calculate them.
That way, your storage space is only O(n) rather than O(n^2), and so are the updates. If someone really wants to see matches beyond the first 100 (and you want to let them), then you have the option of running the query in real time at that point.
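A hedged sketch of this, reusing the hypothetical `user_rating` table from the question's first solution; the rating expression is a deliberately crude placeholder for the real algorithm, the point is only the `ORDER BY ... LIMIT 100` at insert time:

```sql
-- Recompute user 42's matches but keep only the best 100
-- (the old rows for user 42 would be deleted beforehand).
INSERT INTO user_rating (user1_id, user2_id, rating)
SELECT me.id,
       u.id,
       GREATEST(0, 100 - ABS(me.age - u.age) - ABS(me.height - u.height)) AS rating
FROM users me
JOIN users u ON u.id <> me.id
WHERE me.id = 42
ORDER BY rating DESC
LIMIT 100;
```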
#3
I agree with everything @Iserni says.
If you have a web app and users need to "login", then you might have an opportunity to create that user's rankings at that time and stash them into a temporary table (or rows in an existing table).
This will work in a reasonable amount of time (a few seconds) if all the data needed for the calculation fits into memory. The database engine should then be doing a full table scan and creating all the ratings.
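A minimal sketch of that idea, again with a placeholder rating expression standing in for the real algorithm; `42` is the id of the user who just logged in:

```sql
-- Built once at login; a temporary table disappears when the connection
-- closes.
CREATE TEMPORARY TABLE my_ratings AS
SELECT u.id AS other_id,
       GREATEST(0, 100 - ABS(me.age - u.age) - ABS(me.height - u.height)) AS rating
FROM users me
JOIN users u ON u.id <> me.id
WHERE me.id = 42;

ALTER TABLE my_ratings ADD PRIMARY KEY (other_id), ADD KEY (rating);

-- Every search in this session can now simply join and sort:
SELECT u.*, r.rating
FROM users u
JOIN my_ratings r ON r.other_id = u.id
WHERE u.town = 'New York'
ORDER BY r.rating DESC
LIMIT 100;
```

With connection pooling the session-bound temporary table may not survive between requests, in which case the answer's alternative of rows in a regular table keyed by the logged-in user's id is the more practical variant.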
This should work reasonably well for one user logging in. Passably for two... but it is not going to scale very well if you have, say, a dozen users logging in within one second.
Fundamentally, though, your rating does not scale well. You have to do a comparison of all users to all users to get the results. Whether this is batch (at night) or real-time (when someone has a query) doesn't change the nature of the problem. It is going to use a lot of computing resources, and multiple users making requests at the same time will be a bottleneck.