I have a DataFrame 'df' and a list of strings 'lst'. I want to iterate through the list and find the rows of the DataFrame that match each string in the list. The following code works fine as long as the list elements contain no parentheses. It seems the regex is not built properly, so the elements that contain parentheses are not matched.
import pandas as pd
import re

d = {'col1': ['100-(abc)', 'qwe-100-(abc)', '100-(abc)1',
              'xyz', 'xyz2', 'zzz'],
     'col2': ['100', '1001', '200', '300', '400', '500']}
df = pd.DataFrame(d)
lst = ['100-(abc)', 'xyz']
for l in lst:
    print("======================")
    pattern = re.compile(r"(" + l + ")$")
    print(df[df.col1.str.contains(pattern, regex=True)])
Result:
======================
Empty DataFrame
Columns: [col1, col2]
Index: []
======================
col1 col2
3 xyz 300
Expected result:
======================
col1 col2
0 100-(abc) 100
1 qwe-100-(abc) 1001
======================
col1 col2
3 xyz 300
3 Solutions
#1
2
You need to understand that:
Regex reserves certain characters for special use. The opening parenthesis ( and the closing parenthesis ) are among them.
If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If you want to match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign has a special meaning. The same goes for parentheses: if you want to match (abc), you have to write \(abc\).
import pandas as pd
import re

d = {'col1': ['100-(abc)', 'qwe-100-(abc)', '100-(abc)1',
              'xyz', 'xyz2', 'zzz'],
     'col2': ['100', '1001', '200', '300', '400', '500']}
df = pd.DataFrame(d)
lst = ['100-(abc)', 'xyz']
for l in lst:
    print("======================")
    if '(' in l:
        # Escape the parentheses so they are matched literally
        match = l.replace('(', r'\(').replace(')', r'\)')
        pattern = r"(" + match + ")$"
        print(df[df.col1.str.contains(pattern, regex=True)])
    else:
        pattern = r"(" + l + ")$"
        print(df[df.col1.str.contains(pattern, regex=True)])
Output:
======================
col1 col2
0 100-(abc) 100
1 qwe-100-(abc) 1001
======================
col1 col2
3 xyz 300
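As a possible alternative to replacing each parenthesis by hand, the standard library's re.escape can escape every special character in a list element at once. The following is only a sketch under that assumption: rows_ending_with and results are hypothetical names introduced here, and the first two prints just illustrate why the unescaped pattern from the question finds nothing.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "col1": ["100-(abc)", "qwe-100-(abc)", "100-(abc)1", "xyz", "xyz2", "zzz"],
    "col2": ["100", "1001", "200", "300", "400", "500"],
})
lst = ["100-(abc)", "xyz"]

# Unescaped, "(abc)" is a capturing group, so the pattern looks for the
# text "100-abc" rather than "100-(abc)".
unescaped = "(" + "100-(abc)" + ")$"
print(bool(re.search(unescaped, "100-abc")))    # True
print(bool(re.search(unescaped, "100-(abc)")))  # False


def rows_ending_with(frame, literal):
    """Return the rows whose col1 ends with `literal`, taken literally."""
    # re.escape backslash-escapes the regex metacharacters in the literal
    pattern = re.escape(literal) + "$"
    return frame[frame["col1"].str.contains(pattern, regex=True)]


results = {item: rows_ending_with(df, item) for item in lst}
for item, matched in results.items():
    print("======================")
    print(matched)
```

This keeps the loop free of special cases, since re.escape leaves strings such as 'xyz' unchanged.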
#2
1
Simply use isin:
df[df.col1.isin(lst)]
col1 col2
0 100-(abc) 100
3 xyz 300
Edit: Add a regex pattern alongside isin:
df[df.col1.isin(lst) | df.col1.str.contains(r'\d+-\(.*\)$', regex=True)]
You get:
col1 col2
0 100-(abc) 100
1 qwe-100-(abc) 1001
3 xyz 300
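If a single combined result like the one above is acceptable, one way to get it without hand-writing a pattern is to build one regex from the list itself. This is a sketch, not part of the answer above: combined and mask are names made up for illustration, and it assumes every list element should be matched literally at the end of col1.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "col1": ["100-(abc)", "qwe-100-(abc)", "100-(abc)1", "xyz", "xyz2", "zzz"],
    "col2": ["100", "1001", "200", "300", "400", "500"],
})
lst = ["100-(abc)", "xyz"]

# Escape each literal, join the alternatives with "|", and anchor the whole
# alternation at the end of the string. "(?:...)" is a non-capturing group.
combined = "(?:" + "|".join(re.escape(s) for s in lst) + ")$"
mask = df["col1"].str.contains(combined, regex=True)
print(df[mask])
```

Unlike the loop in the question, this returns all matches in one DataFrame rather than one DataFrame per list element.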
#3
0
Try this; it will work for your case.
I have edited the code. Check it; it gives the exact expected output.
import pandas as pd
import re

d = {'col1': ['100-(abc)', 'qwe-100-(abc)', '100-(abc)1',
              'xyz', 'xyz2', 'zzz'],
     'col2': ['100', '1001', '200', '300', '400', '500']}
df = pd.DataFrame(d)
#lst = ['100-(abc)', 'xyz']
lst2 = [r'\w.*[(abc)]$', r'xyz$']
for l in lst2:
    print("======================")
    pattern = re.compile(l)
    print(df[df.col1.str.contains(pattern, regex=True)])
======================
col1 col2
0 100-(abc) 100
1 qwe-100-(abc) 1001
======================
col1 col2
3 xyz 300
This is what you wanted, right?
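One caveat worth checking before relying on the pattern above: inside square brackets, [(abc)] is a character class that matches a single character out of (, a, b, c and ), not the literal text (abc). The sketch below compares it with an escaped pattern. The names loose and strict, and the sample strings 'abc' and 'data)', are made up for illustration and are not in the original data.

```python
import re

loose = re.compile(r"\w.*[(abc)]$")  # pattern from the answer above
strict = re.compile(r"\(abc\)$")     # literal "(abc)" at the end of the string

samples = ["100-(abc)", "qwe-100-(abc)", "100-(abc)1", "abc", "data)"]

loose_hits = [s for s in samples if loose.search(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(loose_hits)   # ['100-(abc)', 'qwe-100-(abc)', 'abc', 'data)']
print(strict_hits)  # ['100-(abc)', 'qwe-100-(abc)']
```

On the six values in the question both patterns select the same rows, so the difference only shows up with other data.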