I am working on a spell checker in C++ and I'm stuck at a certain step in the implementation.
Let's say we have a text file of correctly spelled words and an input string we would like to check for spelling mistakes. If that string is a single misspelled word, I can easily find its correct form by checking all words in the text file and choosing the one that differs from it by the fewest letters. For that type of input, I've implemented a function that calculates the Levenshtein edit distance between two strings. So far so good.
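For context, a typical implementation of such a distance function looks roughly like this (a standard dynamic-programming sketch, not necessarily the asker's exact version):

```cpp
#include <cassert>
#include <string>
#include <vector>
#include <algorithm>

// Classic dynamic-programming Levenshtein distance: d[i][j] is the edit
// distance between the first i characters of a and the first j of b.
int levenshtein(const std::string& a, const std::string& b) {
    const std::size_t m = a.size(), n = b.size();
    std::vector<std::vector<int>> d(m + 1, std::vector<int>(n + 1));
    for (std::size_t i = 0; i <= m; ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= n; ++j) d[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= m; ++i)
        for (std::size_t j = 1; j <= n; ++j)
            d[i][j] = std::min({d[i - 1][j] + 1,     // deletion
                                d[i][j - 1] + 1,     // insertion
                                d[i - 1][j - 1]      // substitution (or match)
                                    + (a[i - 1] == b[j - 1] ? 0 : 1)});
    return d[m][n];
}
```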
Now, the tough part: what if the inputted string is a combination of misspelled words? For example, "iloevcokies". Taking into account that "i", "love" and "cookies" are words that can be found in the text file, how can I use the already-implemented Levenshtein function to determine which words from the file are suitable for a correction? Also, how would I insert blanks in the correct positions?
Any idea is welcome :)
3 Answers
#1
5
Spelling correction for phrases can be done in a few ways. One way requires having an index of word bi-grams and tri-grams; these of course could be immense. Another option would be to try permutations of the word with spaces inserted, then do a lookup on each word in the resulting phrase. Take a look at the simple spell checker implementation by Peter Norvig from Google. Either way, consider using an n-gram index for better performance; there are C++ libraries available for reference.
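The "insert spaces, then look up each word" idea can be sketched as a recursive segmentation over the dictionary. This hypothetical `segment` helper only does exact lookups; a real checker would also allow a small edit distance at each step:

```cpp
#include <cassert>
#include <string>
#include <vector>
#include <unordered_set>

// Try to split s[pos..] into a sequence of dictionary words, appending the
// chosen words to `out`. Returns true if the remainder could be segmented.
bool segment(const std::string& s, std::size_t pos,
             const std::unordered_set<std::string>& dict,
             std::vector<std::string>& out) {
    if (pos == s.size()) return true;
    for (std::size_t len = 1; pos + len <= s.size(); ++len) {
        std::string word = s.substr(pos, len);
        if (dict.count(word)) {
            out.push_back(word);
            if (segment(s, pos + len, dict, out)) return true;
            out.pop_back();  // backtrack and try a longer prefix
        }
    }
    return false;
}
```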
Google and other search engines are able to do spelling correction on phrases because they have a large index of queries and associated result sets, which allows them to calculate a statistically good guess. Overall, the spelling-correction problem can become very complex with methods like context-sensitive correction and phonetic correction. Since trying permutations of possible sub-terms can become expensive, you can apply certain heuristics, but this can get out of scope quickly.
You may also consider using an existing spelling library, such as aspell.
#2
0
A starting point for an idea: one of the top hits of your L-distance for "iloevcokies" should be "cookies". If you can change your L-distance function to also track and return a min-index and max-index (i.e., this match is best starting from character 5 and going to character 10), then you could remove that substring, re-check the L-distance for the string before it and the string after it, and concatenate the results for a suggestion....
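One way to get that min-index/max-index is the standard approximate-substring variant of the Levenshtein DP: the match is allowed to start anywhere in the text for free (first row zeroed), and backpointers recover where the best match began. A sketch, with a hypothetical `best_substring_match` helper:

```cpp
#include <cassert>
#include <climits>
#include <string>
#include <vector>

struct Match { int dist; std::size_t start; std::size_t end; };

// Find the substring of `text` that best matches `word` under edit distance.
// d[0][j] = 0 lets the match start anywhere; st[i][j] carries the start index.
Match best_substring_match(const std::string& text, const std::string& word) {
    const std::size_t m = word.size(), n = text.size();
    std::vector<std::vector<int>> d(m + 1, std::vector<int>(n + 1));
    std::vector<std::vector<std::size_t>> st(m + 1,
                                            std::vector<std::size_t>(n + 1));
    for (std::size_t j = 0; j <= n; ++j) { d[0][j] = 0; st[0][j] = j; }
    for (std::size_t i = 1; i <= m; ++i) { d[i][0] = (int)i; st[i][0] = 0; }
    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            int sub = d[i-1][j-1] + (word[i-1] == text[j-1] ? 0 : 1);
            int del = d[i-1][j] + 1;  // skip a character of `word`
            int ins = d[i][j-1] + 1;  // skip a character of `text`
            d[i][j] = sub; st[i][j] = st[i-1][j-1];
            if (del < d[i][j]) { d[i][j] = del; st[i][j] = st[i-1][j]; }
            if (ins < d[i][j]) { d[i][j] = ins; st[i][j] = st[i][j-1]; }
        }
    }
    Match best{INT_MAX, 0, 0};
    for (std::size_t j = 0; j <= n; ++j)         // best end position overall
        if (d[m][j] < best.dist) best = {d[m][j], st[m][j], j};
    return best;
}
```

Removing `text.substr(best.start, best.end - best.start)` and recursing on the two remaining pieces gives the split-and-conquer scheme the answer describes.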
Just a thought, good luck....
#3
0
I will suppose that you have an existing index on which you run your Levenshtein distance (for example, a trie, but any sorted index generally works well).
You can consider the insertion of a whitespace as a regular edit operation; the twist is that you then need to get back to the root of your index for the next word.
This way you get the same index, almost the same route, approximately the same traversal, and it should not even impact your running time that much.
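A sketch of that idea, assuming a simple lowercase-only trie (the names `TrieNode` and `search` are illustrative). Whenever the traversal sits on a node that ends a word, it may jump back to the root, which corresponds to inserting a space; the usual edit operations spend a shared `budget`:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> child{};
    bool is_word = false;
};

void insert(TrieNode* root, const std::string& w) {
    TrieNode* n = root;
    for (char c : w) {
        auto& slot = n->child[c - 'a'];
        if (!slot) slot = std::make_unique<TrieNode>();
        n = slot.get();
    }
    n->is_word = true;
}

// Depth-first search over the trie, allowing up to `budget` edits.
// `acc` accumulates the corrected phrase; matches are appended to `out`.
void search(const TrieNode* root, const TrieNode* n, const std::string& s,
            std::size_t pos, int budget, std::string acc,
            std::vector<std::string>& out) {
    if (pos == s.size()) {
        if (n->is_word) out.push_back(acc);
        return;
    }
    if (n->is_word && n != root)  // word boundary: insert a space, back to root
        search(root, root, s, pos, budget, acc + " ", out);
    int c = s[pos] - 'a';
    if (c >= 0 && c < 26 && n->child[c])  // exact character match
        search(root, n->child[c].get(), s, pos + 1, budget, acc + s[pos], out);
    if (budget > 0) {
        for (int k = 0; k < 26; ++k) {    // substitute the input character
            if (k == c || !n->child[k]) continue;
            search(root, n->child[k].get(), s, pos + 1, budget - 1,
                   acc + char('a' + k), out);
        }
        search(root, n, s, pos + 1, budget - 1, acc, out);  // drop an input char
        for (int k = 0; k < 26; ++k)      // insert an extra dictionary char
            if (n->child[k])
                search(root, n->child[k].get(), s, pos, budget - 1,
                       acc + char('a' + k), out);
    }
}
```

Here the space insertion is free; charging it one edit like any other operation is a one-line change in the word-boundary branch.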