This question already has an answer here:
- Pandas filtering for multiple substrings in series 2 answers
Is there any function that would be the equivalent of a combination of df.isin() and df[col].str.contains()?
For example, say I have the series s = pd.Series(['cat','hat','dog','fog','pet']) and I want to find all places where s contains any of ['og', 'at']. I would want to get everything but 'pet'.
I have a solution, but it's rather inelegant:
searchfor = ['og', 'at']
found = [s.str.contains(x) for x in searchfor]
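result = pd.DataFrame(found)  # one row per substring, one column per element of s
result.any()  # True for each element that matched at least one substring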
Is there a better way to do this?
2 answers
#1
81
One option is just to use the regex | character to try to match each of the substrings in the words in your Series s (still using str.contains).
You can construct the regex by joining the words in searchfor with |:
>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0 cat
1 hat
2 dog
3 fog
dtype: object
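If the Series may contain missing values, the boolean mask returned by str.contains can contain NaN, which cannot be used directly for indexing. A minimal sketch, assuming an illustrative Series with one None entry (not part of the original question), uses the na parameter of str.contains to treat missing values as non-matches:

```python
import pandas as pd

# Illustrative data: the question's Series plus one missing value (an assumption).
s = pd.Series(['cat', 'hat', 'dog', 'fog', 'pet', None])
searchfor = ['og', 'at']

# Join the substrings with | so that any one of them may match.
# na=False makes missing entries count as "no match" instead of NaN.
mask = s.str.contains('|'.join(searchfor), na=False)

result = s[mask]
print(result)
```

Here mask is a plain boolean Series, so it can also be reused to filter other objects that share the same index.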
As @AndyHayden noted in the comments below, take care if your substrings have special characters such as $ and ^ which you want to match literally. These characters have specific meanings in the context of regular expressions and will affect the matching.
You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:
>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']
The strings in this new list will match each character literally when used with str.contains.
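Putting the escaping and joining steps together, a minimal sketch might look like the following. The Series contents and the name pattern are illustrative assumptions, not taken from the original answer:

```python
import re
import pandas as pd

# Illustrative data: two entries contain regex metacharacters ($ and ^).
s = pd.Series(['$money', 'x^y', 'honey', 'xy'])
matches = ['$money', 'x^y']

# Escape each substring so $ and ^ are matched literally,
# then join with | so that any one of them may match.
pattern = '|'.join(re.escape(m) for m in matches)

mask = s.str.contains(pattern)
result = s[mask]
print(result)
```

Without the re.escape step, $ and ^ would be interpreted as anchors rather than literal characters, so the unescaped pattern would not find these strings.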
#2
21
You can use str.contains alone with a regex pattern using OR (|):
s[s.str.contains('og|at')]
Or you could add the series to a dataframe, then use str.contains:
df = pd.DataFrame(s)
df[s.str.contains('og|at')]
Output:
0 cat
1 hat
2 dog
3 fog
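The same idea applies when the strings live in a column of a larger DataFrame, as in the df[col].str.contains() form mentioned in the question. A small sketch, where the column names word and n are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Hypothetical DataFrame: 'word' holds the strings, 'n' is an unrelated column.
df = pd.DataFrame({
    'word': ['cat', 'hat', 'dog', 'fog', 'pet'],
    'n': [1, 2, 3, 4, 5],
})

# Build the mask from the string column, then use it to select whole rows.
filtered = df[df['word'].str.contains('og|at')]
print(filtered)
```

Only the row containing 'pet' is dropped; the other columns are carried along with the matching rows.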