Say I have a dataframe my_df with a column 'brand'. I would like to drop any rows where brand is either toyota or bmw.
I thought the following would do it:
import re
my_regex = re.compile('^(bmw$|toyota$).*$')
my_function = lambda x: my_regex.match(x.lower())
my_df[~my_df['brand'].apply(my_function)]
but I get the error:
ValueError: cannot index with vector containing NA / NaN values
Why? How can I filter my DataFrame using a regex?
1 Solution
#1
8
I think re.match returns None when there is no match, so the apply yields a mix of match objects and None instead of booleans, and that breaks the indexing. Below is an alternative solution using pandas vectorized string methods; note that pandas string methods can handle null values:
>>> import re
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'brand': ['BMW', 'FORD', np.nan, None, 'TOYOTA', 'AUDI']})
>>> df
brand
0 BMW
1 FORD
2 NaN
3 None
4 TOYOTA
5 AUDI
[6 rows x 1 columns]
>>> idx = df.brand.str.contains('^bmw$|^toyota$',
...                             flags=re.IGNORECASE, regex=True, na=False)
>>> idx
0 True
1 False
2 False
3 False
4 True
5 False
Name: brand, dtype: bool
>>> df[~idx]
brand
1 FORD
2 NaN
3 None
5 AUDI
[4 rows x 1 columns]
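If you would rather keep the apply-based approach from the question, here is a minimal sketch of how it might be repaired. The helper name is_unwanted is my own, and the sample frame is the same one used above, not the asker's real data. The idea is to skip non-string values and to wrap the result of match() in bool() so the mask is strictly True/False:

```python
import re

import numpy as np
import pandas as pd

# Same sample data as in the answer above (an assumption about the real data).
df = pd.DataFrame({'brand': ['BMW', 'FORD', np.nan, None, 'TOYOTA', 'AUDI']})

my_regex = re.compile(r'^(bmw|toyota)$', flags=re.IGNORECASE)


def is_unwanted(value):
    """Hypothetical helper: True only for strings that are exactly bmw or toyota."""
    # Missing values (NaN / None) are not strings, so treat them as "no match".
    if not isinstance(value, str):
        return False
    # match() returns a Match object or None; bool() turns that into True/False.
    return bool(my_regex.match(value))


mask = df['brand'].apply(is_unwanted)  # boolean Series, no None entries
filtered = df[~mask]                   # keep every row that did not match
print(filtered)
```

The vectorized str.contains version is shorter and is probably the better default; this variant is only meant to show why the original mask failed and what a boolean mask needs to look like.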