Regex: capturing optional beginning or ending groups/subgroups

Date: 2021-02-28 11:10:22

I'm trying to write a regex to find different emoticons in a string. Some of the emoticons have hats [example party hat emoticon: *<:-) ] so I'm trying to add an optional group for hats at the beginning of the expression. The problem I'm having is that when I add an optional group to the beginning or end of the expression, it starts to match empty strings. I read some of the other questions on here regarding this topic, but I'm still having trouble understanding why this is happening and what I can do to fix it. Here's what I have so far:


 r"""
 (                 
     ([{}]|K|(E-)|(\*<))?   # Optional Hat/Toupee
     [:;8B=xX#%*0]          # Eyes
     [-o]?                  # Optional Nose
     [DbP)(>{c$I3/\J&]      # Mouth/Tongue
 )"""

If I try to match :-) in a string, the regular expression parser returns:


[(':-)', '', '', '')]

Any help is greatly appreciated.


2 answers

#1



Each plain pair of parentheses adds a capturing group to your expression. To debug your regex, name the capturing groups with (?P<name>...):


import re

regexp = re.compile(r"(?P<A>(?P<B>[{}]|K|(?P<C>E-)|(?P<D>\*<))?[:;8B=xX#%*0][-o]?[DbP)(>{c$I3/J&])")

Then you have:

>>> print(regexp.match(':-)').groupdict())
{'A': ':-)', 'B': None, 'C': None, 'D': None}

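which makes sense to me: the optional hat group B did not take part in the match, so B and its subgroups C and D are None. findall reports such non-participating groups as empty strings, which is where the '' entries in your result come from.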


Note that unless you want to capture those specific parts of the emoticons, the C and D groups look unnecessary to me. And unless you want to capture the hat part separately, the B group can be made non-capturing by using (?:...) instead of (...).
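
As a rough sketch of that suggestion, here is the question's pattern with the hat alternatives wrapped in a non-capturing group, so that findall returns plain strings rather than tuples padded with empty strings. The names EMOTICON and find_emoticons are made up for illustration, and the plain J in the mouth class is an assumption about what the original \J was meant to match.

```python
import re

# Only the outer group captures; the hat uses a non-capturing (?:...) group.
# Plain J in the mouth class is an assumption about the intended character.
EMOTICON = re.compile(r"""
    (
        (?:[{}]|K|E-|\*<)?     # optional hat/toupee, not captured
        [:;8B=xX#%*0]          # eyes
        [-o]?                  # optional nose
        [DbP)(>{c$I3/J&]       # mouth/tongue
    )
""", re.VERBOSE)


def find_emoticons(text):
    """Return every emoticon found in text as a plain string."""
    return EMOTICON.findall(text)


print(find_emoticons("party *<:-) and plain :-)"))
```

With a single capturing group, findall returns a list of strings, so there are no empty placeholders for the hat.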


#2



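import re

message1 = "I'm happy today :-)"
message2 = 'Me too *<:-) :3'
message3 = 'I prefer emoticons like this: =D =) =P'

# Group 1 is the whole emoticon, group 2 is the optional hat.
# Raw string; J needs no backslash inside a character class.
regexp = re.compile(r"(([{}*<]+)?[:;8B=xX#%*0][-o]?[DbP)(>{c$I3/J&])")
emoticons1 = regexp.findall(message1)
emoticons2 = regexp.findall(message2)
emoticons3 = regexp.findall(message3)
print(emoticons1)  # [(':-)', '')]
print(emoticons2)  # [('*<:-)', '*<'), (':3', '')]
print(emoticons3)  # [('=D', ''), ('=)', ''), ('=P', '')]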

If you want only two captures per emoticon, one for the whole emoticon and one for the hat, you need only two groups.
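
One possible variation on the two-group idea, as a sketch only: use the whole match (group 0) for the emoticon and a single named group for the hat, and iterate with finditer so a missing hat shows up as None instead of ''. The hat alternatives are taken from the question; the names EMOTICON_WITH_HAT and emoticons_and_hats are hypothetical, and the plain J is an assumption.

```python
import re

# Group 0 is the whole emoticon; the only capturing group is the named hat.
# Hat alternatives come from the question; plain J is an assumption.
EMOTICON_WITH_HAT = re.compile(
    r"(?P<hat>[{}]|K|E-|\*<)?[:;8B=xX#%*0][-o]?[DbP)(>{c$I3/J&]"
)


def emoticons_and_hats(text):
    """Return (emoticon, hat) pairs; hat is None when there is no hat."""
    return [
        (m.group(0), m.group("hat"))
        for m in EMOTICON_WITH_HAT.finditer(text)
    ]


print(emoticons_and_hats("Me too *<:-) :3"))
```

Match objects keep the difference between "group did not participate" (None) and "group matched nothing", which findall flattens into an empty string.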


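Also, in the character class [DbP)(>{c$I3/\J&] you should not write \J. Inside a character class almost every character is taken literally, so to match J just write J. Python 3.6 and later reject \J as a bad escape; if you meant a literal backslash, write \\.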
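
A small check of the character-class point, as a sketch: the helper compiles and the constant names are made up for illustration, and the last line assumes a recent Python 3, where unknown letter escapes in patterns are errors.

```python
import re


def compiles(pattern):
    """Return True if re accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Inside [...] a plain J needs no backslash.
MOUTH = re.compile(r"[DbP)(>{c$I3/J&]")
print(bool(MOUTH.fullmatch("J")))

# To allow a literal backslash as a mouth, escape the backslash itself.
MOUTH_OR_BACKSLASH = re.compile(r"[DbP)(>{c$I3/\\J&]")
print(bool(MOUTH_OR_BACKSLASH.fullmatch("\\")))

# An unknown letter escape such as \J is rejected by recent Python 3.
print(compiles(r"[\J]"))
```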

