How to use the regex non-capturing group format in Python

Date: 2021-05-24 12:16:06

In the following code I want to get just the digits between '-' and 'u'. I thought I could use the regular expression non-capturing group format (?: … ) to ignore everything from '-' up to the first digit, but the output always includes it. How can I use a non-capturing group to produce the correct output?


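import pandas as pd

df = pd.DataFrame(
    {'a': [1, 2, 3, 4],
     'b': ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']
    })

# Attempt: non-capturing group for the hyphen and spaces, nested inside the capturing group
df['b'].str.extract(r'((?:-[ ]*)[0-9]*)', expand=True)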


2 solutions

#1


4  

It isn't included in the inner group, but it's still included as part of the outer group. A non-capturing group doesn't necessarily mean its text isn't captured at all, just that the group itself isn't saved separately in the output. Its text is still captured as part of any enclosing capturing group.
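To make the difference concrete, here is a minimal sketch using the standard `re` module rather than pandas, so it runs without extra dependencies (`str.extract` accepts the same regex syntax). The sample string comes from the question; the variable names are only illustrative.

```python
import re

s = '31u - 68u'  # one of the sample values from the question

# Non-capturing group nested inside a capturing group:
# the outer () still records everything it spans, hyphen included.
nested = re.search(r'((?:-[ ]*)[0-9]*)', s)
print(repr(nested.group(1)))   # '- 68'

# Non-capturing prefix placed outside the capturing group:
# only the digits end up in group 1.
outside = re.search(r'(?:-[ ]*)([0-9]+)', s)
print(repr(outside.group(1)))  # '68'
```

In the second pattern the `(?: … )` wrapper is optional; writing `-[ ]*([0-9]+)` should behave the same way, since nothing outside the capturing parentheses is returned.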


Just keep them outside the () that define the capturing group:


import pandas as pd

df = pd.DataFrame(
    {'a' : [1,2,3,4], 
     'b' : ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']
    })

df['b'].str.extract(r'- ?(\d+)u', expand=True)

     0
0  428
1   68
2   58
3  318

That way you match anything that has a '-' in front (maybe followed by a space), a 'u' behind, and digits between the two.


Where,

-      # literal hyphen
\s?    # optional whitespace (the pattern above uses a literal ' ?'); use \s* if you expect more than one
(\d+)  # capture one or more digits 
u      # literal "u"
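The breakdown above can also be kept inside the pattern itself. The following is a sketch, not part of the original answer, that uses the standard `re` module with the `re.VERBOSE` flag; the names `pattern`, `values`, and `digits` are made up for illustration.

```python
import re

# re.VERBOSE ignores unescaped whitespace in the pattern and allows comments,
# so the optional space is written as \s? rather than a literal ' '.
pattern = re.compile(r"""
    -       # literal hyphen
    \s?     # optional whitespace
    (\d+)   # capture one or more digits
    u       # literal "u"
""", re.VERBOSE)

values = ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']
digits = [pattern.search(v).group(1) for v in values]
print(digits)  # ['428', '68', '58', '318']
```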

#2


3  

I think your regex is more complicated than it needs to be. What about:


df['b'].str.extract(r'-(.*)u', expand=True)

      0
0   428
1    68
2    58
3   318
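One caveat worth checking, sketched here with the standard `re` module: `.*` matches any characters, so the group also picks up the space after the hyphen, and because it is greedy it runs up to the last 'u' in the string. The second string below is hypothetical, made up only to show the greedy behaviour; it does not appear in the question.

```python
import re

loose = re.compile(r'-(.*)u')        # pattern from this answer
strict = re.compile(r'-\s*(\d+)u')   # digits-only variant for comparison

sample = '31u - 68u'        # value from the question
longer = '31u - 68u - 7u'   # hypothetical value with a second "-" and "u"

print(repr(loose.search(sample).group(1)))   # ' 68' (leading space kept)
print(repr(strict.search(sample).group(1)))  # '68'
print(repr(loose.search(longer).group(1)))   # ' 68u - 7' (greedy match)
print(repr(strict.search(longer).group(1)))  # '68'
```

For data as regular as the question's, calling `.str.strip()` on the extracted column would likely be enough to drop the leading space.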
