I'm trying to develop a way of taking an entity with a number of properties and searching the database for similar entities, matching as many of the properties in the correct order as possible. The idea is that it would then return a percentage indicating how similar each entity is.
The order of the properties should also be taken into account: properties at the beginning are more important than those at the end.
For example:
Item 1 - A, B, C, D, E
Item 2 - A, B, C, D, E
This would be a 100% match.
Item 1 - A, B, C, D, E
Item 2 - B, C, A, D, E
This wouldn't be a perfect match, as the properties are in a different order.
Item 1 - A, B, C, D, E
Item 2 - F, G, H, I, A
This would be a low match, as only one property is the same and it is in position 5.
This algorithm will run over many thousands of records, so it needs to be fast and efficient. Any thoughts on how I could do this in PHP/MySQL?
I was considering levenshtein, but as far as I can tell that would also look at the distance between two completely different words in terms of spelling. It doesn't appear to be ideal for this scenario, unless I'm just using it in the wrong way.
It might be possible to do this solely in MySQL, perhaps using a full-text search or something similar.
This seems like a nice solution, though it is not designed for this scenario. Perhaps binary comparison could be used in some way?
2 solutions
#1
2
What I'd do is encode the order and property value into a number. Numbers have the advantage of fast comparisons.
This is a general idea and may still need some work, but I hope it helps in some way.
Calculate a number (some form of hash) for each property and multiply it by a number representing the property's order of appearance in the item.
Say item1 has 3 properties: A, B and C.
hash(A) = 123, hash(B) = 345, hash(C) = 456
Then multiply each hash by a multiplier for its position, given that we have a known number of properties:
(hash(A) * 1,000,000) + (hash(B) * 1,000) + (hash(C) * 1) = someval
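The packing step above can be sketched as follows. This is only an illustration in Python (the arithmetic carries over to PHP directly); the function name `encode_item` and the `width` parameter are made up for the example, and it assumes every property hash fits in a fixed number of decimal digits.

```python
def encode_item(hashes, width=3):
    """Pack per-property hashes into one integer.

    The first property ends up in the most significant digits, so each
    hash is effectively multiplied by 10**(width * positions_remaining).
    `width` is an assumed fixed digit count per hash.
    """
    base = 10 ** width
    value = 0
    for h in hashes:
        if not 0 <= h < base:
            raise ValueError(f"hash {h} does not fit in {width} digits")
        # Shift what we have so far left by one slot, then add this hash.
        value = value * base + h
    return value


print(encode_item([123, 345, 456]))  # 123345456
```

Two items with the same properties in the same order produce the same integer, so an exact match becomes a single numeric comparison.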
The magnitude of the multiplier can be tweaked to reflect your data set. You'll have to choose the hash function. soundex, maybe?
The problem is now reduced to a question of uniqueness due to hash collisions, but we can be pretty sure about properties that don't match.
Also, this has the advantage that it is relatively easy to check whether a property appears in another item in a different order, by using the magnitude of the multiplier to extract the hash values from the generated number.
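A rough sketch of that extraction step, in Python for illustration. The names `decode_item` and `positions_of` are hypothetical, and the code assumes each hash occupies a fixed number of decimal digits (`width`) and that the number of properties per item (`count`) is known.

```python
def decode_item(value, count, width=3):
    """Split a packed item number back into its per-property hashes.

    Assumes `count` properties were packed, each in `width` decimal digits.
    """
    base = 10 ** width
    hashes = []
    for _ in range(count):
        # Peel off the least significant slot each time.
        value, h = divmod(value, base)
        hashes.append(h)
    hashes.reverse()  # restore first-property-first order
    return hashes


def positions_of(prop_hash, value, count, width=3):
    """Return the 0-based positions at which prop_hash occurs in the packed item."""
    return [i for i, h in enumerate(decode_item(value, count, width)) if h == prop_hash]


print(decode_item(123345456, 3))          # [123, 345, 456]
print(positions_of(1, 451, 3, width=1))   # [2]
```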
HTH.
Edit: example for checking matches
Given item1(a b c) and item2(a b c), the computed hashes of the items would be equal. This is the best-case scenario; no further computation is required.
Given item1(a b c) and item2(d e a), the computed hashes of the items are not equal. Proceed to breaking down the property hashes...
Say we have a hash table for properties a = 1, b = 2, c = 3, d = 4, e = 5, with 10^n for the multiplier. The computed hash for item1 is 123 and for item2 is 451. Break down the computed hash into its per-property values, so item1 becomes item1(1 2 3) and item2 becomes item2(4 5 1), and compare all combinations of properties, one from each item. Then compute the score.
Another way of looking at it is that you're comparing the properties one by one, except this time you're working with numbers instead of the actual string values.
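The answer does not say how the score itself is computed, so the following Python sketch is only one possible choice, not the author's formula. It works on the per-property hash lists extracted from each item, gives earlier positions a larger weight, and reduces the credit for a shared property the further it has moved. The weighting scheme and the name `similarity` are assumptions, and properties are assumed not to repeat within an item.

```python
def similarity(a, b):
    """Position-weighted similarity of two property-hash lists, in percent.

    Assumed scheme: position i in `a` has weight n - i; a property found
    in `b` at position j earns weight * (1 - |i - j| / n).
    """
    if a == b:
        return 100.0  # fast path: identical items need no further work
    n = max(len(a), len(b))
    # First position of each hash in b (properties assumed unique per item).
    pos_in_b = {}
    for j, h in enumerate(b):
        pos_in_b.setdefault(h, j)
    total = 0.0
    credit = 0.0
    for i, h in enumerate(a):
        weight = n - i
        total += weight
        j = pos_in_b.get(h)
        if j is not None:
            credit += weight * (1 - abs(i - j) / n)
    if total == 0:
        return 0.0  # `a` is empty and `b` is not
    return 100.0 * credit / total


print(round(similarity([1, 2, 3], [4, 5, 1]), 2))  # 16.67
```

With this particular weighting, the question's second example (A, B, C, D, E against B, C, A, D, E) comes out at roughly 77%, and the third (F, G, H, I, A) at roughly 7%.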
#2
1
You can draw inspiration (or flat-out algorithms) from various sequence alignment algorithms such as Smith-Waterman. Indeed, what you're describing sounds very much like sequence alignment. I am, however, uncertain whether it's even possible to do this as an SQL query.
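As a minimal sketch of that idea, here is Smith-Waterman local alignment applied to lists of properties rather than characters, written in Python for illustration. The scoring values (`match=2`, `mismatch=-1`, `gap=-1`) are arbitrary assumptions, and the plain algorithm does not give earlier positions extra weight, so it would need adapting for the question's ordering requirement.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between two sequences of properties.

    Uses a linear gap penalty and keeps only the previous row of the
    scoring matrix, since only the maximum score is needed here.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(0, diag, up, left))
        best = max(best, max(curr))
        prev = curr
    return best


item1 = ["A", "B", "C", "D", "E"]
print(smith_waterman(item1, ["A", "B", "C", "D", "E"]))  # 10
print(smith_waterman(item1, ["B", "C", "A", "D", "E"]))  # 7
print(smith_waterman(item1, ["F", "G", "H", "I", "A"]))  # 2
```

Dividing the score by `match * min(len(a), len(b))` would turn it into a rough percentage (100%, 70% and 20% for the three cases above), though that normalisation is again just one option.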