I have been asked to match free-form text against data in a database. By free-form I mean a textbox into which a user can type anything. For the most part these entries are valid. I would like to find the values in a table that resemble what was typed. Before you ask: I have no control over the textbox, nor over the people who type into it. I am looking for techniques, not specific technologies.
Things I have tried:
- Removing common words (the, of, in, etc.) from both the criteria and the list.
- The SOUNDEX function in SQL. It is very weak and not very helpful.
- The Levenshtein distance algorithm. I am pretty happy with the results, but it still needs a lot of polish.
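
For reference, the word-clearing step could look roughly like the Python sketch below. The stop-word list and the `tokenize` name are assumptions made for illustration, not the code actually in use.

```python
import re

# Assumed stop-word list; extend it to suit the data.
STOP_WORDS = {"the", "of", "in", "a", "an", "and"}

def tokenize(text):
    """Lower-case the text, split on non-alphanumerics, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

print(tokenize("The Hobbit: An Unexpected Journey"))  # ['hobbit', 'unexpected', 'journey']
print(tokenize("hobit unexpected journ"))             # ['hobit', 'unexpected', 'journ']
```

Applying the same function to both the typed text and the stored titles keeps punctuation such as `:` and `&` from getting in the way of the comparison.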
For example I have this list:
- The Hobbit: An Unexpected Journey
- The Hobbit: The Desolation of Smaug
- The Hobbit: There and Back Again
- Iron Man 3
- Despicable Me 2
- Fast & Furious 6
- Monsters University
- The Hunger Games: Catching Fire
- Man of Steel
- Gravity
- Thor: The Dark World
- The Croods
- World War Z
The user's input could be:
- hobit unexpected journ
  - The word 'hobit' is not spelled right
  - Expected result:
    - The Hobbit: An Unexpected Journey
    - The Hobbit: There and Back Again
    - The Hobbit: The Desolation of Smaug
- hunger game
  - Expected result:
    - The Hunger Games: Catching Fire
I guess what I'm asking is: what other methods can I use to compute these results? My stack is .NET 4.0 and MSSQL 2008 R2.
1 Answer
#1
I would try an algorithm like the following:
- Remove common words (the, of, in, etc.) from both the criteria and the list.
- For each criteria word, check whether it appears in the list entry:
  - If it does, assign a score for this criteria word.
  - If it does not, compute the Levenshtein distance between the criteria word and each word of the list entry you are checking against.
  - Then assign a score based on the lowest Levenshtein distance found (it is probably better to ignore any distance higher than 3 or 4).
- Once all the criteria words have been checked against the current list entry, count how many words of the entry are not covered by the criteria, and assign a negative score for each of them.
- Sum all the scores: you now have a single score for these criteria against a single entry of your list.
Repeat this for every entry in your list.
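
The steps above could be sketched in Python roughly as follows. Every weight here (`EXACT_SCORE`, `MAX_DISTANCE`, `UNMATCHED_PENALTY`), the stop-word list and the function names are assumptions chosen for illustration. The sketch also treats an entry word as "covered" when a criteria word matched it fuzzily, which is one possible reading of the penalty step.

```python
import re

# Every constant below is an assumed, illustrative value to be tuned.
STOP_WORDS = {"the", "of", "in", "a", "an", "and"}
EXACT_SCORE = 10        # reward for a criteria word found verbatim
MAX_DISTANCE = 3        # larger Levenshtein distances are ignored
UNMATCHED_PENALTY = 1   # cost of each entry word the criteria never hit

TITLES = [
    "The Hobbit: An Unexpected Journey", "The Hobbit: The Desolation of Smaug",
    "The Hobbit: There and Back Again", "Iron Man 3", "Despicable Me 2",
    "Fast & Furious 6", "Monsters University", "The Hunger Games: Catching Fire",
    "Man of Steel", "Gravity", "Thor: The Dark World", "The Croods", "World War Z",
]

def tokenize(text):
    """Lower-case, split on non-alphanumerics, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def score(criteria, entry):
    """Single score for one criteria string against one list entry."""
    c_words, e_words = tokenize(criteria), tokenize(entry)
    if not c_words or not e_words:
        return 0
    total, covered = 0, set()
    for c in c_words:
        if c in e_words:                      # exact word match
            total += EXACT_SCORE
            covered.add(c)
            continue
        best = min(e_words, key=lambda e: levenshtein(c, e))
        d = levenshtein(c, best)
        if d <= MAX_DISTANCE:                 # close enough: partial credit
            total += EXACT_SCORE - 2 * d      # assumed: nearer words score higher
            covered.add(best)
    # Negative score for every entry word the criteria did not cover.
    total -= UNMATCHED_PENALTY * sum(1 for e in e_words if e not in covered)
    return total

def rank(criteria, entries):
    """Score every entry and return those with a positive score, best first."""
    scored = [(score(criteria, e), e) for e in entries]
    return [e for s, e in sorted(scored, key=lambda p: -p[0]) if s > 0]

print(rank("hobit unexpected journ", TITLES))
print(rank("hunger game", TITLES)[:1])
```

With these assumed weights, the first query returns only the three Hobbit titles, with "The Hobbit: An Unexpected Journey" first, and the second puts "The Hunger Games: Catching Fire" on top. A short word such as "game" is within distance 3 of many short title words ("me", "man"), so an absolute cut-off lets unrelated titles through with low scores; scaling the cut-off by word length is one possible refinement.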
If the data you are actually analysing are film titles:
- You should add some modifier, such as a multiplying factor on the score of the most recent films.
- You can speed things up by keeping two lists to check against: one with the most searched/recent films, and a second with all the other titles. If you get enough hits from the first list, you can skip the check against the second.
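
Both ideas can be layered on top of any scoring function. The sketch below is only an illustration: the boost factor, the definition of "recent", the hit threshold, and the stand-in `word_overlap` scorer are all assumptions, and the films and release years are sample data.

```python
from datetime import date

RECENT_BOOST = 1.5   # assumed multiplier for recent films
RECENT_YEARS = 2     # assumed definition of "recent"
ENOUGH_HITS = 3      # assumed hit count that makes the second list unnecessary

def word_overlap(criteria, title):
    """Stand-in scorer: number of words shared by criteria and title."""
    return len(set(criteria.lower().split()) & set(title.lower().split()))

def boosted(base_score, release_year, today=None):
    """Multiply positive scores of recent films by RECENT_BOOST."""
    year = (today or date.today()).year
    if base_score > 0 and year - release_year <= RECENT_YEARS:
        return base_score * RECENT_BOOST
    return base_score

def search(criteria, popular, others, score_fn, today=None, enough_hits=ENOUGH_HITS):
    """popular and others are lists of (title, release_year) pairs."""
    def hits(films):
        scored = [(boosted(score_fn(criteria, t), y, today), t) for t, y in films]
        return sorted(((s, t) for s, t in scored if s > 0), reverse=True)

    found = hits(popular)
    if len(found) < enough_hits:          # too few hits: fall back to the full list
        found = sorted(found + hits(others), reverse=True)
    return [t for _, t in found]

popular = [("Iron Man 3", 2013), ("Man of Steel", 2013)]
others = [("Iron Man", 2008)]
print(search("iron man", popular, others, word_overlap, today=date(2013, 12, 1)))
```

Passing the scorer in as `score_fn` keeps the boost and the two-list shortcut independent of how the word matching itself is done.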