Applying the re.search function to a column in Python

Date: 2021-12-22 20:24:09

I have the following Python code (I want the first match of a specific number in a text field):

import numpy as np
import pandas

data = {'A': [1, 2, 3], 'B': ['bla 4044 bla', 'bla 5022 bla', 'bla 6045 bla']}
df = pandas.DataFrame(data)

def fun_subjectnr(column):
    column = str(column)
    subjectnr = re.search(r"(\b[4][0-1][0-9][0-9]\b)",column)
    subjectnr1 = re.search(r"(\b[2-3|6-8][0-9][0-9][0-5]\b)",column)
    subjectnr = np.where(subjectnr == "" and subjectnr1 != "", subjectnr1, 
    subjectnr)
    return subjectnr1

df['C'] = df['B'].apply(fun_subjectnr)

Wanted output:

 A    B                C
 1    bla 4044 bla    4044
 2    bla 5022 bla    None
 3    bla 6045 bla    6045

It doesn't seem to work. When I append [0] to the re.search call, it raises an error: subjectnr = re.search(r"(\b[4][0-1][0-9][0-9]\b)",column)[0]

Who knows what to do? Thanks in advance!

1 Answer

#1


You can do this with str.extract. You can also condense your pattern a bit, as I show below.

p = r'\b(4[0-1]\d{2}|(?:[2-3]|[6-8])\d{2}[0-5])\b'
df['C'] = df.B.str.extract(p, expand=False)

df

   A             B     C
0  1  bla 4044 bla  4044
1  2  bla 5022 bla   NaN
2  3  bla 6045 bla  6045

This should be much faster than calling apply.
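
If you would rather keep an apply-based function, here is a minimal sketch of how it could be repaired. re.search returns None when nothing matches, so indexing the result with [0] raises a TypeError on rows like 'bla 5022 bla'; the match has to be checked before it is indexed. Inside a character class, | is a literal pipe rather than an OR, so [2-3|6-8] also accepts '|'; the sketch uses [236-8] instead, which accepts the same digits as (?:[2-3]|[6-8]). The names SUBJECT_NR and first_subject_nr are made up for this illustration.

```python
import re

import pandas as pd

# Hypothetical name; same digits as the condensed pattern, written with one character class.
SUBJECT_NR = re.compile(r'\b(4[0-1]\d{2}|[236-8]\d{2}[0-5])\b')


def first_subject_nr(text):
    """Return the first matching number as a string, or None if there is no match."""
    match = SUBJECT_NR.search(str(text))
    # re.search returns None on failure, so test the match before reading the group.
    return match.group(1) if match else None


df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['bla 4044 bla', 'bla 5022 bla', 'bla 6045 bla']})
df['C'] = df['B'].apply(first_subject_nr)
```

Rows without a match end up as a missing value in column C, as in the wanted output.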


Details

\b                 # word boundary
(                  # first capture group
   4               # match digit 4
   [0-1]           # match 0 or 1
   \d{2}           # match any two digits
|
   (?:             # non-capture group (prevent ambiguity during matching)
       [2-3]       # 2 or 3
       |           # regex OR metacharacter
       [6-8]       # 6, 7, or 8
   )
   \d{2}           # any two digits
   [0-5]           # any digit between 0 and 5
)
\b
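
The annotated breakdown above can also be run as written by compiling it with re.VERBOSE, which ignores whitespace and # comments inside the pattern. This is only a sketch using the standard re module; the names verbose_pattern, samples and results are illustrative.

```python
import re

# Same pattern as above, kept in its commented multi-line form.
verbose_pattern = re.compile(r"""
    \b                 # word boundary
    (                  # first capture group
       4               # match digit 4
       [0-1]           # match 0 or 1
       \d{2}           # match any two digits
    |
       (?:             # non-capturing group around the inner alternation
           [2-3]       # 2 or 3
           |           # regex OR metacharacter
           [6-8]       # 6, 7, or 8
       )
       \d{2}           # any two digits
       [0-5]           # any digit between 0 and 5
    )
    \b
""", re.VERBOSE)

samples = ['bla 4044 bla', 'bla 5022 bla', 'bla 6045 bla']
results = []
for text in samples:
    match = verbose_pattern.search(text)
    # Store the captured number, or None when the text has no match.
    results.append(match.group(1) if match else None)
```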
