Regex to extract different parts of a string in a consistent order

Time: 2022-09-13 13:30:03

I have a list of strings

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

I want to extract:

  • date (always in yyyy-mm-dd format)
  • person (always in the form "with person"), but I don't want to keep "with"

I could do:

import re
pattern = r'.*(\d{4}-\d{2}-\d{2}).*with \b([^\b]+)\b.*'
matched = [re.match(pattern, x).groups() for x in my_strings]

but it fails because the pattern doesn't match "with Tom on 2015-06-30", where the person comes before the date.

Questions

How do I specify the regex pattern to be indifferent to the order in which date or person appear in the string?

and

How do I ensure that the groups() method returns them in the same order every time?

I expect the output to look like this:

[('2002-03-04', 'Matt'), ('2016-01-23', 'Mary'), ('2015-06-30', 'Tom')]

4 Solutions

#1


2  

If you use Python's newer regex module (the third-party package, as opposed to the built-in re), you can use conditionals to guarantee a match on both items.

I think this is closer to the standard way of doing out-of-order matching.

(?:.*?(?:(?(1)(?!))\b(\d{4}-\d\d-\d\d)\b|(?(2)(?!))with[ ](\w+))){2}

Expanded

 (?:
      .*? 
      (?:
           (?(1)(?!))
           \b 
           ( \d{4} - \d\d - \d\d )       # (1)
           \b 
        |  (?(2)(?!))
           with [ ] 
           ( \w+ )                       # (2)
      )
 ){2}
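
To see how the pattern above behaves on the question's data, here is a minimal sketch. It prefers the third-party regex module and falls back to the built-in re; that the built-in module accepts the same conditional syntax is an assumption of this sketch, not something the answer states. The names engine, PATTERN and extract are illustrative only.

```python
try:
    import regex as engine  # third-party package the answer refers to
except ImportError:
    import re as engine     # assumption: built-in re accepts this pattern too

# Each conditional (?(N)(?!)) forces its branch to fail once group N has
# already matched, so the two iterations must capture one date and one name.
PATTERN = engine.compile(
    r"(?:.*?(?:(?(1)(?!))\b(\d{4}-\d\d-\d\d)\b|(?(2)(?!))with[ ](\w+))){2}"
)

def extract(text):
    """Return (date, person), or None if either part is missing."""
    m = PATTERN.match(text)
    return m.groups() if m else None

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

results = [extract(s) for s in my_strings]
print(results)
```

Because the date is always group 1 and the name is always group 2, groups() returns them in the same order regardless of where they appear in the string.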

#2


4  

What about doing it with two separate regexes?

import re

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

# First pass: capture the date from each string.
pattern = r'.*(\d{4}-\d{2}-\d{2})'
dates = [re.match(pattern, x).groups()[0] for x in my_strings]

# Second pass: capture the name that follows "with ".
pattern = r'.*with (\w+).*'
persons = [re.match(pattern, x).groups()[0] for x in my_strings]

# zip() is lazy in Python 3, so build a list before printing.
output = list(zip(dates, persons))
print(output)
## [('2002-03-04', 'Matt'), ('2016-01-23', 'Mary'), ('2015-06-30', 'Tom')]
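
The two-pass idea raises AttributeError as soon as one string lacks a date or a name, because re.match returns None. A minimal sketch of a more tolerant variant follows, assuming a missing part should simply come back as None. The names DATE_RE, PERSON_RE and extract_pair are hypothetical and not part of the answer.

```python
import re

# Precompiled patterns; re.search scans the whole string, so no leading .* is needed.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
PERSON_RE = re.compile(r"with (\w+)")

def extract_pair(text):
    """Return (date, person); either element is None when it is absent."""
    date = DATE_RE.search(text)
    person = PERSON_RE.search(text)
    return (
        date.group(0) if date else None,
        person.group(1) if person else None,
    )

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

pairs = [extract_pair(s) for s in my_strings]
print(pairs)
```

Since the date and the name are looked up independently, their order in the input does not matter, and the returned tuple always has the date first.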

#3


2  

This should work:

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

import re

alternates = r"(?:\b(\d{4}-\d\d-\d\d)\b|with (\w+)|.)*"

for tc in my_strings:
    print(tc)
    m = re.match(alternates, tc)
    if m:
        print("\t", m.group(1))
        print("\t", m.group(2))

Output is:

$ python test.py
2002-03-04 with Matt
     2002-03-04
     Matt
Important: 2016-01-23 with Mary
     2016-01-23
     Mary
with Tom on 2015-06-30
     2015-06-30
     Tom

However, something like this is not totally intuitive. I encourage you to try using named groups if at all possible.
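
One possible reading of the named-groups suggestion is sketched below: the same alternation, with the two captures given names so the caller no longer depends on group numbers. The group names date and person, and the helper extract_named, are illustrative choices rather than anything prescribed by the answer.

```python
import re

# Same alternation as above, but each capture is named.
# The trailing "." branch consumes any character that is neither
# part of a date nor part of a "with <name>" phrase.
NAMED = re.compile(
    r"(?:\b(?P<date>\d{4}-\d\d-\d\d)\b|with (?P<person>\w+)|.)*"
)

def extract_named(text):
    """Return (date, person) looked up by group name."""
    m = NAMED.match(text)
    return m.group("date"), m.group("person")

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

named_results = [extract_named(s) for s in my_strings]
print(named_results)
```

Looking groups up by name keeps the output order fixed even if the pattern is later rearranged.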

#4


2  

Just for educational purposes, a non-regex approach could use the dateutil parser in "fuzzy" mode to extract the dates and the nltk toolkit's named entity recognition to extract the names. Complete code:

import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer
from dateutil.parser import parse


def extract_names(text):
    tokenizer = SpaceTokenizer()
    toks = tokenizer.tokenize(text)
    pos = pos_tag(toks)
    chunked_nes = ne_chunk(pos)

    return [' '.join(map(lambda x: x[0], ne.leaves())) for ne in chunked_nes if isinstance(ne, nltk.tree.Tree)]

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30"
]

for s in my_strings:
    print(parse(s, fuzzy=True))
    print(extract_names(s))

Prints:

2002-03-04 00:00:00
['Matt']
2016-01-23 00:00:00
['Mary']
2015-06-30 00:00:00
['Tom']

That's probably an over-complication though.
