python - regex matching a list of words

Time: 2022-01-12 04:39:37

I have a Python script with probably 100 or so regex lines, each line matching certain words.

The script obviously consumes up to 100% CPU each time it is run (I basically pass a sentence to it and it returns any matched words found).

I want to combine these into around 4 or 5 different "compiled" regex patterns, such as:

>>> words = ('hello', 'good-bye', 'red', 'blue')
>>> pattern = re.compile('(' + '|'.join(words) + ')', re.IGNORECASE)

How many words can I safely have in one pattern, and would it make a difference? Right now, if I run a loop over a thousand random sentences, it processes maybe 10 per second; I'm looking to drastically increase that, to something like 500 per second (if possible).
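For reference, one combined, compiled alternation along these lines might look as follows. This is a minimal sketch with a hypothetical word list; re.escape is used so metacharacters like '-' are treated literally:

```python
import re

# Hypothetical word list; re.escape guards any regex metacharacters
words = ('hello', 'good-bye', 'red', 'blue')
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, words)) + r')\b', re.IGNORECASE)

print(pattern.findall("Hello there, the sky is blue and the rose is red"))
# → ['Hello', 'blue', 'red']
```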

Also, is it possible to do the same with a list like this?

>>> words = (r'\d{4,4}\.\d{2,2}\.\d{2,2}', r'\d{2,2}\s\d{2,2}\s\d{4,4}\.')
>>> pattern = re.compile('(' + '|'.join(words) + ')', re.IGNORECASE)
>>> print(pattern.findall("Today is 2010 11 08"))

1 solution

#1


Your algorithm here is basically O(N*M*L) for each sentence (where N is the length of the sentence, M is the number of words you're looking for, and L is the length of the longest word). Using a regex won't speed this up any more than plain substring search would. The only thing it does give you is the ability to match patterns like your second example.
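The brute-force baseline that complexity bound describes can be sketched like this (naive_find is a hypothetical name; it does one substring scan of the sentence per word):

```python
def naive_find(sentence, words):
    # One "in" (substring) scan per word: O(N*M*L) overall
    return [w for w in words if w in sentence]

print(naive_find("the sky is blue today", ("hello", "red", "blue")))
# → ['blue']
```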

If you just want to find words, a trie would be a much better approach. The implementation is really easy, too:

TERMINAL = 'TERMINAL' # Marks the end of a word

def build(*words):
    # Fresh root per call (a mutable default argument would be shared across calls)
    trie = {TERMINAL: False}
    for word in words:
        pt = trie
        for ch in word:
            # Descend, creating child nodes as needed
            pt = pt.setdefault(ch, {TERMINAL: False})
        pt[TERMINAL] = True  # mark the end of a complete word
    return trie

def find(input, trie):
    results = []
    for i in range(len(input)):
        pt = trie
        for j in range(i, len(input) + 1):
            if pt[TERMINAL]:
                results.append(input[i:j])  # input[i:j] is a complete word
            if j >= len(input) or input[j] not in pt:
                break
            pt = pt[input[j]]
    return results

This returns all words in your sentence that are in the trie. Running time is O(N*L), meaning that you can add as many words as you want without slowing down the algorithm.

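A quick sanity check of this approach, reproduced here as a self-contained snippet (same build/find shape as above, with an illustrative word list):

```python
TERMINAL = 'TERMINAL'  # marks the end of a word

def build(*words):
    trie = {TERMINAL: False}
    for word in words:
        pt = trie
        for ch in word:
            pt = pt.setdefault(ch, {TERMINAL: False})
        pt[TERMINAL] = True
    return trie

def find(text, trie):
    results = []
    for i in range(len(text)):
        pt = trie
        for j in range(i, len(text) + 1):
            if pt[TERMINAL]:
                results.append(text[i:j])
            if j >= len(text) or text[j] not in pt:
                break
            pt = pt[text[j]]
    return results

trie = build('hello', 'red', 'blue')
print(find('hello, the sky is blue', trie))
# → ['hello', 'blue']
```

Note that this matches words anywhere in the string, including inside longer words; if you need whole-word matches only, you would have to check for word boundaries around each hit.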