How do you implement "Did you mean?"? [duplicate]

Time: 2022-09-02 09:35:59

Possible Duplicate:
How does the Google "Did you mean?" Algorithm work?

Suppose you already have a search system on your website. How can you implement a "Did you mean: <spell_checked_word>" suggestion like Google does for some search queries?

17 Answers

#1


81  

Actually, what Google does is very much non-trivial and also, at first, counter-intuitive. They don't do anything like check against a dictionary; rather, they make use of statistics to identify "similar" queries that returned more results than your query. The exact algorithm is, of course, not publicly known.

There are several different sub-problems to solve here. As a fundamental basis for all statistics-related Natural Language Processing, there is one must-have book: Foundations of Statistical Natural Language Processing.

Concretely, to solve the problem of word/query similarity I have had good results using edit distance, a mathematical measure of string similarity that works surprisingly well. I have used Levenshtein, but other variants may be worth looking into.
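
For illustration, here is a minimal sketch of the textbook dynamic-programming Levenshtein distance in Python (not taken from any particular library):

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    # previous_row[j] holds the distance between the processed prefix of a and b[:j]
    previous_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current_row = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current_row[j - 1] + 1
            delete_cost = previous_row[j] + 1
            replace_cost = previous_row[j - 1] + (ca != cb)
            current_row.append(min(insert_cost, delete_cost, replace_cost))
        previous_row = current_row
    return previous_row[-1]

print(levenshtein("receive", "recieve"))  # 2 (a transposition counts as two edits here)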

Soundex - in my experience - is crap.

Efficiently storing and searching a large dictionary of misspelled words with sub-second retrieval is again non-trivial; your best bet is to make use of existing full-text indexing and retrieval engines (i.e. not your database's built-in one), of which Lucene is currently one of the best and has coincidentally been ported to many, many platforms.

#2


34  

Google's Dr Norvig has outlined how it works; he even gives a 20ish line Python implementation:

http://googlesystem.blogspot.com/2007/04/simplified-version-of-googles-spell.html

http://www.norvig.com/spell-correct.html
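
For flavour, here is a heavily simplified sketch in the spirit of Norvig's corrector (see his article above for the real thing; the corpus file name big.txt is just the one his article happens to use):

import re
from collections import Counter

# Assumes a plain-text training corpus; Norvig's article uses a file called big.txt.
WORDS = Counter(re.findall(r"[a-z]+", open("big.txt").read().lower()))

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correction(word):
    """Pick the most frequent known candidate; fall back to the word itself."""
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=lambda w: WORDS[w])

print(correction("speling"))  # likely "spelling", given a typical English corpus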

Dr Norvig also discusses the "did you mean" feature in this excellent talk. Dr Norvig is head of research at Google, so when asked how "did you mean" is implemented, his answer is authoritative.

So it's spell checking, presumably with a dynamic dictionary built from other searches or even actual internet phrases and such. But that's still spell checking.

SOUNDEX and other guesses don't get a look in, people!

#3


13  

Check this Wikipedia article about the Levenshtein distance. Make sure you take a good look at the "Possible improvements" section.

#4


12  

I was pleasantly surprised that someone has asked how to create a state-of-the-art spelling suggestion system for search engines. I have been working on this subject for more than a year for a search engine company, and I can point to information in the public domain on the subject.

As was mentioned in a previous post, Google (and Microsoft and Yahoo!) do not use any predefined dictionary nor do they employ hordes of linguists that ponder over the possible misspellings of queries. That would be impossible due to the scale of the problem but also because it is not clear that people could actually correctly identify when and if a query is misspelled.

Instead, there is a simple and rather effective principle that is also valid for all European languages. Get all the unique queries from your search logs, calculate the edit distance between all pairs of queries, and assume that the reference query is the one with the highest count.
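
A minimal sketch of that principle in Python; the query counts below are made up for illustration, and a real system would use far stricter thresholds and smarter candidate generation:

from collections import Counter

def edit_distance(a, b):
    # Compact Levenshtein distance (same algorithm as the sketch further up).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(cur[j - 1] + 1, prev[j] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Hypothetical query-log counts; in practice these come from your search logs.
query_counts = Counter({
    "britney spears": 9000,
    "brittany spears": 300,
    "britny spears": 120,
    "britney spears tickets": 800,
})

def did_you_mean(query, max_distance=2):
    """Suggest the most popular logged query within a small edit distance."""
    best = max(
        (q for q in query_counts if edit_distance(query, q) <= max_distance),
        key=lambda q: query_counts[q],
        default=None,
    )
    # Only suggest if the candidate is clearly more popular than the query itself.
    if best and query_counts[best] > query_counts.get(query, 0):
        return best
    return None

print(did_you_mean("britny spears"))  # -> "britney spears"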

This simple algorithm will work great for many types of queries. If you want to take it to the next level, I suggest you read the paper by Microsoft Research on that subject. You can find it here.

The paper has a great introduction, but after that you will need to be knowledgeable about concepts such as the Hidden Markov Model.

#5


6  

I would suggest looking at SOUNDEX to find similar words in your database.
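
If you want to experiment with it directly, here is a compact sketch of the classic Soundex code in Python; many databases also ship a built-in SOUNDEX() SQL function you could call instead:

def soundex(word):
    """Classic Soundex: first letter plus up to three digit codes, padded to four characters."""
    codes = {c: d for d, group in enumerate(
        ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1) for c in group}
    word = word.lower()
    if not word:
        return ""
    result = word[0].upper()
    prev = codes.get(word[0])
    for c in word[1:]:
        code = codes.get(c)
        if code and code != prev:
            result += str(code)
        # vowels reset the "previous code", so repeats across vowels are kept;
        # h and w do not, so repeats across them collapse (per the standard rule)
        if c not in "hw":
            prev = code
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # both R163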

You can also access Google's own dictionary by using the Google API spelling suggestion request.

#6


6  

You may want to look at Peter Norvig's "How to Write a Spelling Corrector" article.

#7


6  

I believe Google logs all queries and identifies when someone makes a spelling correction. This correction may then be suggested when others supply the same first query. This will work for any language, in fact any string of any characters.
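
A hedged sketch of how you might mine such correction pairs from your own logs; the session structure, similarity threshold, and query strings below are purely illustrative assumptions:

from collections import Counter
from difflib import SequenceMatcher

# Hypothetical session log: lists of consecutive queries typed by one user.
sessions = [
    ["pythn tutorial", "python tutorial"],
    ["pythn tutorial", "python tutorial", "python classes"],
    ["java generics", "java generics tutorial"],
]

corrections = Counter()
for queries in sessions:
    for first, second in zip(queries, queries[1:]):
        # A quick reformulation that is nearly identical is probably a spelling fix.
        if first != second and SequenceMatcher(None, first, second).ratio() > 0.8:
            corrections[(first, second)] += 1

def suggest(query):
    """Return the most frequently observed correction for this exact query, if any."""
    candidates = [(count, fixed) for (orig, fixed), count in corrections.items() if orig == query]
    return max(candidates)[1] if candidates else None

print(suggest("pythn tutorial"))  # -> "python tutorial"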

#9


4  

I think this depends on how big your website is. On our local intranet, which is used by about 500 members of staff, I simply look at the search phrases that returned zero results and enter that search phrase along with a new suggested search phrase into a SQL table.

I then call on that table if no search results have been returned. However, this only works if the site is relatively small, and I only do it for the most common search phrases.
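
A small sketch of that pattern using Python and an in-memory SQLite table (the table and column names are made up here for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_suggestions (bad_phrase TEXT PRIMARY KEY, suggestion TEXT)")

# Filled in by hand after reviewing the log of zero-result searches.
conn.execute("INSERT INTO search_suggestions VALUES (?, ?)", ("anual leave", "annual leave"))
conn.commit()

def did_you_mean(phrase):
    """Look up a manually curated suggestion for a phrase that returned no results."""
    row = conn.execute(
        "SELECT suggestion FROM search_suggestions WHERE bad_phrase = ?", (phrase,)
    ).fetchone()
    return row[0] if row else None

results = []  # imagine the real search returned nothing for this phrase
if not results:
    print(did_you_mean("anual leave"))  # -> "annual leave"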

You might also want to look at my answer to a similar question:

#10


2  

If you have industry-specific translations, you will likely need a thesaurus. For example, I worked in the jewelry industry and there were abbreviations in our descriptions such as kt - karat, rd - round, cwt - carat weight... Endeca (the search engine at that job) has a thesaurus that will translate from common misspellings, but it does require manual intervention.

#11


1  

I do it with Lucene's Spell Checker.

#12


0  

Soundex is good for phonetic matches, but works best with people's names (it was originally developed for census data).

Also check out full-text indexing; the syntax is different from Google logic, but it's very quick and can deal with similar language elements.

#13


0  

Soundex and "Porter stemming" (soundex is trivial, not sure about porter stemming).

Soundex和“Porter stemming”(soundex是微不足道的,不确定搬运工干预)。

#14


0  

There's something called aspell that might help: http://blog.evanweaver.com/files/doc/fauna/raspell/classes/Aspell.html

There's a Ruby gem for it, but I don't know how to talk to it from Python: http://blog.evanweaver.com/files/doc/fauna/raspell/files/README.html

Here's a quote from the Ruby implementation:

Usage

Aspell lets you check words and suggest corrections. For example:

  string = "my haert wil go on"

  string.gsub(/[\w\']+/) do |word|
    if !speller.check(word)
      # word is wrong
      puts "Possible correction for #{word}:"
      puts speller.suggest(word).first
    end
  end

This outputs:

Possible correction for haert:
heart
Possible correction for wil:
Will

#15


0  

Implementing spelling correction for search engines in an effective way is not trivial (you can't just compute the edit/Levenshtein distance to every possible word). A solution based on k-gram indexes is described in Introduction to Information Retrieval (full text available online).
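
To give the flavour of the k-gram idea, here is a toy sketch in Python with character bigrams and $ boundary markers; the book's full approach additionally verifies surviving candidates with edit distance (the vocabulary here is just an illustrative assumption):

from collections import defaultdict

def kgrams(word, k=2):
    """Character k-grams of a word padded with $ boundary markers."""
    padded = "$" + word + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

# Hypothetical vocabulary; in practice this comes from your index's dictionary.
vocabulary = ["border", "boarder", "broader", "lord", "morbid"]

index = defaultdict(set)
for word in vocabulary:
    for gram in kgrams(word):
        index[gram].add(word)

def candidates(query, min_overlap=0.5):
    """Vocabulary words sharing enough k-grams with the query, best first."""
    grams = kgrams(query)
    counts = defaultdict(int)
    for gram in grams:
        for word in index.get(gram, ()):
            counts[word] += 1
    # Jaccard-style overlap; survivors would then be re-ranked by edit distance.
    scored = [(c / len(grams | kgrams(w)), w) for w, c in counts.items()]
    return sorted((w for s, w in scored if s >= min_overlap), key=lambda w: -counts[w])

print(candidates("bord"))  # -> ['border'] with this toy vocabulary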

#16


0  

You could use n-grams for the comparison: http://en.wikipedia.org/wiki/N-gram

Using the Python ngram module: http://packages.python.org/ngram/index.html

import ngram

G2 = ngram.NGram(["iis7 configure ftp 7.5",
                  "ubunto configre 8.5",
                  "mac configure ftp"])

print("String", "\t", "Similarity")
# search() returns (matched string, similarity) pairs above the threshold
for item, similarity in G2.search("iis7 configurftp 7.5", threshold=0.1):
    print(similarity, "\t", item)

You get:

>>> 
String  Similarity
0.76    "iis7 configure ftp 7.5"    
0.24    "mac configure ftp"
0.19    "ubunto configre 8.5"   

#17


0  

Why not use Google's "did you mean" in your code? For how, see here: http://narenonit.blogspot.com/2012/08/trick-for-using-googles-did-you-mean.html
