I need to search several thousand plaintext files for a set of names. I'm generating trigrams to retain context. I need to account for minor misspellings, so I'm using a Levenshtein distance calculation, function lev(). I need the final result to return the name that was hit, the filename the hit was in, and the trigram that was marked as a hit. My Python program works as expected, but very slowly. I'm looking for a faster way to do this search, preferably in Python, but my Google-fu has failed me. A genericized version of the program is below:
from sklearn.feature_extraction.text import CountVectorizer
import os

# lev() is the Levenshtein distance function mentioned above; it is assumed
# here to come from the python-Levenshtein package.
from Levenshtein import distance as lev

textfiles = []
ngrams = []
hitlist = []
path = 'path of folder of textfiles'
names = ['john james doe', 'jane jill doe']

vectorizer = CountVectorizer(input='filename', ngram_range=(3, 3),
                             strip_accents='unicode', stop_words='english',
                             token_pattern=r'[a-zA-Z\-]\w*',
                             encoding='utf-8', decode_error='replace', lowercase=True)
ngramer = vectorizer.build_analyzer()

# collect every .txt file under path
for dirpath, dirnames, filenames in os.walk(path):
    for files in filenames:
        if files.endswith('.txt'):
            textfiles.append(files)

ctFiles = len(textfiles)
ctNames = len(names)

# pre-compute the set of word trigrams for each file
for i in range(ctFiles):
    newgrams = set(ngramer(path + '/' + textfiles[i]))
    ngrams.append(newgrams)

for i in range(ctNames):
    splitname = names[i].split()
    for j in range(ctFiles):
        tempset = set()
        for k in range(len(splitname)):
            if k == 0:
                ## subset only the trigrams that "match" first name
                for trigram in ngrams[j]:
                    for word in trigram.split():
                        if lev(splitname[k], word) < 2:
                            tempset.add(trigram)
            else:
                ## search that subset for middle/last name
                if len(tempset) > 0:
                    for trigram in tempset:
                        for word in trigram.split():
                            if lev(splitname[k], word) < 2:
                                hitlist.append([names[i], textfiles[j], trigram])

print(hitlist)  ## eventually save to CSV
2 Answers
#1
0
Levenshtein is very expensive, and I wouldn't recommend using it for fuzzy matching on this many documents (unless you want to build a Levenshtein automaton to generate an index of tokens n steps away from every word in your files).
Trigram indexing should be fast and accurate on its own for words of a certain length; however, you mention names, and if that means multiple words, then chunks need to be indexed as well, and that has to be implemented on top.
If you try trigram indexing on its own and aren't satisfied with the accuracy, you can try adding a trigram chunk index, i.e. (Ban, ana, nan) stored as a tuple, in addition to Ban, ana, and nan as individual trigrams, but in a separate index. The chunk index loses accuracy even faster as word length decreases, so that should be accounted for.
The key point is cost: Levenshtein (with the usual dynamic-programming implementation) runs in O(length of query * length of word in file * number of words in the files), while token/trigram/chunk index lookups run in O(log(number of words in the files) * number of tokens/chunks/trigrams in the query).
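To make the indexing idea concrete, here is a minimal sketch of a character-trigram index built with plain Python dicts; char_trigrams, build_index and lookup are hypothetical helper names, not from any library, and the file/word data is made up. Only the small candidate set returned by lookup() would then be verified with the exact (and expensive) Levenshtein comparison.

from collections import defaultdict

def char_trigrams(word):
    # character trigrams of one lower-cased word, e.g. 'banana' -> {'ban', 'ana', 'nan'}
    word = word.lower()
    return {word[i:i + 3] for i in range(len(word) - 2)} if len(word) >= 3 else {word}

def build_index(files_to_words):
    # map each trigram to the set of (filename, word) pairs that contain it
    index = defaultdict(set)
    for fname, words in files_to_words.items():
        for w in words:
            for tri in char_trigrams(w):
                index[tri].add((fname, w))
    return index

def lookup(index, query, min_shared=1):
    # candidate (filename, word) pairs sharing at least min_shared trigrams with the query
    counts = defaultdict(int)
    for tri in char_trigrams(query):
        for hit in index.get(tri, ()):
            counts[hit] += 1
    return [hit for hit, n in counts.items() if n >= min_shared]

idx = build_index({'a.txt': ['john', 'johm', 'james', 'doe'],
                   'b.txt': ['jane', 'jill', 'dough']})
print(lookup(idx, 'john'))  # e.g. [('a.txt', 'john'), ('a.txt', 'johm')] (order may vary)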
#2
0
I am using fuzzywuzzy; it's pretty fast on my dataset (100K sentences): https://github.com/seatgeek/fuzzywuzzy
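For reference, a minimal fuzzywuzzy sketch applied to this question's setup; the trigram list and the score cutoff of 85 are made-up assumptions, not part of the answer. process.extract scores a name against a list of word trigrams, so only high-scoring trigrams need to be kept.

from fuzzywuzzy import fuzz, process

names = ['john james doe', 'jane jill doe']
# word trigrams pulled from one file, for illustration only
trigrams = ['mr john james', 'john jmaes doe', 'jane hill dow', 'the quick brown']

for name in names:
    # token_set_ratio tolerates word-order changes and small misspellings
    for trigram, score in process.extract(name, trigrams,
                                          scorer=fuzz.token_set_ratio, limit=3):
        if score >= 85:  # tunable cutoff
            print(name, trigram, score)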