Extracting specific text from a dataframe by delimiter count and position

I'm learning regular expressions and have hit a bit of a wall. I have the following dataframe:

import pandas
item_data=pandas.DataFrame({'item':['001','002','003'],
'description':['Fishing,Hooks,12-inch','Fishing,Lines','Fish Eggs']})

For each description, I want to extract everything before the second comma ",". If there is no second comma, the original description is retained.

Results should look like this:


item_data=pandas.DataFrame({'item':['001','002','003'],
'description':['Fishing,Hooks,12-inch','Fishing,Lines','Fish Eggs'],
'new_description':['Fishing,Hooks','Fishing,Lines', 'Fish Eggs']})

Any pointers would be much appreciated.


Thanks.

2 Solutions

#1

Using a regexp...


import re
re.sub(r"^([^,]*,[^,]*),.*$", r"\1", x)  # x is a single description string

The pattern reads as follows:

^ start of string
( start of capture group
[^,] any character except a comma
* zero or more times
, a comma
[^,] any character except a comma
* zero or more times
) end of capture group
, another comma
.* anything (zero or more of any character)
$ end of string

Replacing with the content of group 1 (\1) drops whatever is present after the second comma
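To check the pattern against the three descriptions from the question, a minimal sketch using only the standard `re` module might look like the following. The names `PATTERN` and `trim_after_second_comma` are made up for illustration and are not part of the original answer.

```python
import re

# Capture everything before the second comma; the rest of the string is dropped.
PATTERN = re.compile(r"^([^,]*,[^,]*),.*$")

def trim_after_second_comma(text):
    # Hypothetical helper: when there is no second comma the pattern
    # does not match, so sub() returns the text unchanged.
    return PATTERN.sub(r"\1", text)

descriptions = ['Fishing,Hooks,12-inch', 'Fishing,Lines', 'Fish Eggs']
new_descriptions = [trim_after_second_comma(d) for d in descriptions]
print(new_descriptions)  # ['Fishing,Hooks', 'Fishing,Lines', 'Fish Eggs']
```

On the dataframe itself, the same helper could presumably be applied with item_data['description'].apply(trim_after_second_comma).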


#2

new_description = [",".join(i.split(",")[:2]) for i in item_data['description']]
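The same split-and-join idea can be generalized to any delimiter and any number of leading fields. The sketch below is one way to do that; the function name `first_fields` and its parameters `count` and `sep` are assumptions introduced here for illustration.

```python
def first_fields(text, count=2, sep=","):
    # Split at most `count` times, keep the first `count` pieces,
    # and rejoin them with the same separator.
    return sep.join(text.split(sep, count)[:count])

descriptions = ['Fishing,Hooks,12-inch', 'Fishing,Lines', 'Fish Eggs']
new_description = [first_fields(d) for d in descriptions]
print(new_description)  # ['Fishing,Hooks', 'Fishing,Lines', 'Fish Eggs']
```

To store the result as a column, something like item_data['new_description'] = item_data['description'].apply(first_fields) should work.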

