Regular expression: how to match a sequence of key-value pairs at the end of a string

I am trying to match key-value pairs that appear at the end of (long) strings. The strings look like this (I replaced each "\n" with an actual line break):

my_str = "lots of blah
          key1: val1-words
          key2: val2-words
          key3: val3-words"

So I expect the matches "key1: val1-words", "key2: val2-words" and "key3: val3-words".

The set of possible key names is known.

Not all possible keys appear in every string.

At least two keys appear in every string (if that makes it easier to match).

val-words can be several words.

key-value pairs should only be matched at the end of string.

I am using Python's re module.

I was thinking

re.compile('(?:tag1|tag2|tag3):')

plus some look-ahead assertion would be a solution, but I can't get it right. How do I do this?

Thank you.

/David

Real example string:

my_str = u'ucourt métrage pour kino session volume 18\nThème: O sombres héros\nContraintes: sous titrés\nAuthor: nicoalabdou\nTags: wakatanka productions court métrage kino session humour cantat bertrand noir désir sombres héros mer medine marie trintignant femme droit des femmes nicoalabdou pute soumise\nPosted: 06 June 2009\nRating: 1.3\nVotes: 3'

EDIT:

Based on Mikel's solution I am now using the following:

import re

my_tags = [r'\S+']                      # matches all tags
my_tags = ['Tags', 'Author', 'Posted']  # or: only the selected tags
regex = re.compile(r'''
    \n                     # all key-value pairs are on separate lines
    (                      # start group to return
       (?:{0}):            # placeholder for the tags to detect; '\S+' == all
        \s                 # the space between ':' and the value
       .*                  # the value
    )                      # end group to return
    '''.format('|'.join(my_tags)), re.VERBOSE)

regex.sub('', my_str)   # returns my_str without the matching key-value lines
regex.findall(my_str)   # returns the matched key-value lines

1 Solution

#1

The negative zero-width lookahead is (?!pattern).

It's mentioned part-way down the re module documentation page.

(?!...)

Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.
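
To see the documentation's example in action, here is a minimal sketch. The variable name pattern and the two test strings are only illustrative.

```python
import re

# 'Isaac ' followed by anything except 'Asimov'
pattern = re.compile(r"Isaac (?!Asimov)")

newton = pattern.search("Isaac Newton")   # lookahead succeeds, so this matches
asimov = pattern.search("Isaac Asimov")   # lookahead fails, so this is None

# The lookahead is zero-width: the matched text is just 'Isaac '
print(repr(newton.group(0)))
print(asimov)
```

Note that the text inspected by the lookahead is not consumed, which is why only 'Isaac ' ends up in the match.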

So you could use it to match any number of words after a key, while stopping before the next key, using something like (?!\S+:)\S+.
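
As a rough illustration of that fragment on its own, the sketch below tries it at two positions of a made-up string: once at a value word and once at a key. The names value_word and text are assumptions for the example. One consequence worth keeping in mind is that a value word containing a colon (a time such as 12:30, say) would also be rejected.

```python
import re

# A "value word": a run of non-whitespace that does not look like a key
value_word = re.compile(r"(?!\S+:)\S+")

text = "val1-words key2: val2-words"

# Pattern.match(string, pos) only tries to match at the given position
at_value = value_word.match(text, 0)    # starts at 'val1-words'
at_key = value_word.match(text, 11)     # starts at 'key2:'

print(at_value.group(0))   # the value word is accepted
print(at_key)              # the key is rejected, so this is None
```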

And the complete code would look like this:

regex = re.compile(r'''
    [\S]+:                # a key (any word followed by a colon)
    (?:
    \s                    # then a space in between
        (?!\S+:)\S+       # then a value (any word not followed by a colon)
    )+                    # match multiple values if present
    ''', re.VERBOSE)

matches = regex.findall(my_str)

Which gives

['key1: val1-words', 'key2: val2-words', 'key3: val3-words']

If you print the key/values using:

for match in matches:
    print(match)

It will print:

key1: val1-words
key2: val2-words
key3: val3-words

Or using your updated example, it would print:

Thème: O sombres héros
Contraintes: sous titrés
Author: nicoalabdou
Tags: wakatanka productions court métrage kino session humour cantat bertrand noir désir sombres héros mer medine marie trintignant femme droit des femmes nicoalabdou pute soumise
Posted: 06 June 2009
Rating: 1.3
Votes: 3

You could turn each key/value pair into a dictionary using something like this:

pairs = dict([match.split(':', 1) for match in matches])

which would make it easier to look up only the keys (and values) you want.
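
A slightly fuller sketch of that idea follows. It assumes a small made-up sample string, strips the space that follows each colon so the values come out clean, and then picks out a hypothetical list of wanted keys. The names wanted and selected are not from the original post.

```python
import re

# Same pattern as above, written on one line
regex = re.compile(r"\S+:(?:\s(?!\S+:)\S+)+")

# Made-up sample in the shape described in the question
my_str = "lots of blah\nkey1: val1-words\nkey2: val2 more words\nkey3: val3-words"

pairs = {}
for match in regex.findall(my_str):
    key, value = match.split(":", 1)   # split on the first colon only
    pairs[key] = value.strip()         # drop the space after the colon

# Keep only the keys of interest; keys missing from the string are skipped
wanted = ["key1", "key3", "key9"]
selected = {k: pairs[k] for k in wanted if k in pairs}

print(pairs)
print(selected)
```

Splitting on the first colon only keeps any later colons inside the value intact.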

More info:

Python re module documentation

Python Regular Expression HOWTO

Perl Regular Expression Reference "perlreref"
