I need to filter rows in a pandas DataFrame so that a specific string column contains at least one of a list of provided substrings. The substrings may contain unusual/regex characters. The comparison should not use regex and should be case insensitive.
For example:
lst = ['kdSj;af-!?', r'aBC+dsfa?\-', 'sdKaJg|dksaf-*']
I currently apply the mask like this:
mask = np.logical_or.reduce([df[col].str.contains(i, regex=False, case=False) for i in lst])
df = df[mask]
My DataFrame is large (~1 million rows) and lst has length 100. Is there a more efficient way? For example, if the first item in lst is found, we should not have to test any subsequent strings for that row.
2 Answers
#1
If you're sticking to pure pandas, then for both performance and practicality I think you should use regex for this task. However, you will need to escape any special characters in the substrings first, to ensure that they are matched literally (and not used as regex metacharacters).
This is easy to do using re.escape:
>>> import re
>>> esc_lst = [re.escape(s) for s in lst]
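For example, on Python 3.7+ (where re.escape escapes only regex metacharacters), the first string from the question is escaped like this:

>>> re.escape('kdSj;af-!?')
'kdSj;af\\-!\\?'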
These escaped substrings can then be joined using a regex pipe |. Each of the substrings will then be checked against a string until one matches (or all have been tested).
>>> pattern = '|'.join(esc_lst)
The masking stage then becomes a single low-level loop through the rows:
df[col].str.contains(pattern, case=False)
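Putting it together, here is a minimal end-to-end sketch (the DataFrame here is hypothetical; na=False treats missing values as non-matches instead of propagating NaN into the mask):

import re
import pandas as pd

lst = ['kdSj;af-!?', r'aBC+dsfa?\-', 'sdKaJg|dksaf-*']
pattern = '|'.join(re.escape(s) for s in lst)

df = pd.DataFrame({'col': ['xx KDSJ;AF-!? xx', 'no match here', None]})
mask = df['col'].str.contains(pattern, case=False, na=False)
df = df[mask]  # keeps only the first row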
Here's a simple setup to get a sense of performance:
import re
import pandas as pd
from random import randint, seed

seed(321)
# 100 substrings of 5 characters
lst = [''.join([chr(randint(0, 256)) for _ in range(5)]) for _ in range(100)]
# 50000 strings of 20 characters
strings = [''.join([chr(randint(0, 256)) for _ in range(20)]) for _ in range(50000)]
col = pd.Series(strings)
esc_lst = [re.escape(s) for s in lst]
pattern = '|'.join(esc_lst)
The proposed method takes about 1 second (so maybe up to 20 seconds for 1 million rows):
%timeit col.str.contains(pattern, case=False)
1 loop, best of 3: 981 ms per loop
The method in the question took approximately 5 seconds using the same input data.
It's worth noting that these times are 'worst case', in the sense that there were no matches (so all substrings were checked). If there are matches, the timing will improve.
#2
You could try using the Aho-Corasick algorithm. In the average case, it is O(n + m + p), where n is the total length of the search strings, m is the length of the searched text, and p is the number of output matches.
The Aho-Corasick algorithm is often used to find multiple patterns (needles) in an input text (the haystack).
pyahocorasick is a Python wrapper around a C implementation of the algorithm.
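For a sense of the API, here is a minimal sketch following the library's documented usage (pip install pyahocorasick):

import ahocorasick

A = ahocorasick.Automaton()
for idx, key in enumerate(['de', 'abc']):
    A.add_word(key, (idx, key))  # store a payload for each needle
A.make_automaton()               # finalize the automaton (adds failure links)

# iter() scans the haystack once, yielding (end_index, payload) per match
for end_index, (idx, key) in A.iter('xxabcxdexx'):
    print(end_index, key)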
Let's compare how fast it is versus some alternatives. Below is a benchmark showing using_ahocorasick to be over 30x faster than the original method (shown in the question) on a 50K-row DataFrame test case:
|                   | speed factor     | ms per loop |
|                   | compared to orig |             |
|-------------------+------------------+-------------|
| using_ahocorasick | 30.7x            |         140 |
| using_regex       | 2.7x             |        1580 |
| orig              | 1.0x             |        4300 |
In [89]: %timeit using_ahocorasick(col, lst)
10 loops, best of 3: 140 ms per loop
In [88]: %timeit using_regex(col, lst)
1 loop, best of 3: 1.58 s per loop
In [91]: %timeit orig(col, lst)
1 loop, best of 3: 4.3 s per loop
Here is the setup used for the benchmark. It also verifies that the output matches the result returned by orig:
import numpy as np
import random
import pandas as pd
import ahocorasick
import re
random.seed(321)
def orig(col, lst):
    mask = np.logical_or.reduce([col.str.contains(i, regex=False, case=False)
                                 for i in lst])
    return mask

def using_regex(col, lst):
    """https://*.com/a/48590850/190597 (Alex Riley)"""
    esc_lst = [re.escape(s) for s in lst]
    pattern = '|'.join(esc_lst)
    mask = col.str.contains(pattern, case=False)
    return mask

def using_ahocorasick(col, lst):
    A = ahocorasick.Automaton(ahocorasick.STORE_INTS)
    for word in lst:
        A.add_word(word.lower())
    A.make_automaton()
    col = col.str.lower()
    mask = col.apply(lambda x: bool(list(A.iter(x))))
    return mask

N = 50000
# 100 substrings of 5 characters
lst = [''.join([chr(random.randint(0, 256)) for _ in range(5)]) for _ in range(100)]
# N strings of 20 characters
strings = [''.join([chr(random.randint(0, 256)) for _ in range(20)]) for _ in range(N)]
# append a substring from lst to about 10% of the strings; this helps check that the methods agree
strings = [s + random.choice(lst) if random.randint(0, 99) < 10 else s
           for s in strings]

col = pd.Series(strings)

expected = orig(col, lst)
for name, result in [('using_regex', using_regex(col, lst)),
                     ('using_ahocorasick', using_ahocorasick(col, lst))]:
    status = 'pass' if np.allclose(expected, result) else 'fail'
    print('{}: {}'.format(name, status))
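Note that bool(list(A.iter(x))) materializes every match in a row. Since the question asks for short-circuiting, here is a hypothetical variant (not part of the benchmark above) that stops scanning a row at the first hit by pulling one item from the lazy iterator:

def using_ahocorasick_first_hit(col, lst):
    A = ahocorasick.Automaton(ahocorasick.STORE_INTS)
    for word in lst:
        A.add_word(word.lower())
    A.make_automaton()
    # next() consumes at most one match, so scanning stops at the first hit
    return col.str.lower().apply(lambda x: next(A.iter(x), None) is not None)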