Regex to extract names from HTML

I have two pieces of code below from which I want to extract the names.

Code:

 ;"><strong>DeanSkyShadow</strong>
 ;"><strong><em>Xavier</em></strong>

The regex should extract the names DeanSkyShadow and Xavier. My current regex:

(?<=(;"><strong><em>)|(;"><strong>))[\s\S]+?(?=(</em></strong>)|(</strong>))

grabs the name correctly when there is no em tag in the code; when there is one, it also grabs the opening em tag, like this: <em>Xavier. How can I fix that?

1 solution

#1

Match anything that is not a < character. Note also that Python's re module does not support variable-width look-behind, so your version doesn't work there at all. Consume the prefix with a non-capturing group and capture only the name instead:

(?:;"><strong>(?:<em>)?)([^<]+?)(?=(?:</em>)?</strong>)

Demo:

>>> import re
>>> sample = '''\
...  ;"><strong>DeanSkyShadow</strong>
...  ;"><strong><em>Xavier</em></strong>
... '''
>>> re.findall(r'(?:;"><strong>(?:<em>)?)([^<]+?)(?=(?:</em>)?</strong>)', sample)
['DeanSkyShadow', 'Xavier']
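
As a rough sketch of the two points above, the snippet below checks whether Python accepts the pattern from the question and then wraps the working pattern in a small helper. The names lookbehind_compiles and extract_names are hypothetical, introduced only for illustration.

```python
import re

# Pattern from the question: the two look-behind alternatives differ in length.
QUESTION_PATTERN = r'(?<=(;"><strong><em>)|(;"><strong>))[\s\S]+?(?=(</em></strong>)|(</strong>))'

# Pattern from the answer: the prefix is consumed by a non-capturing group,
# and only the name itself is captured.
ANSWER_PATTERN = re.compile(r'(?:;"><strong>(?:<em>)?)([^<]+?)(?=(?:</em>)?</strong>)')


def lookbehind_compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


def extract_names(html):
    """Hypothetical helper: return every name matched by ANSWER_PATTERN."""
    return ANSWER_PATTERN.findall(html)


sample = ' ;"><strong>DeanSkyShadow</strong>\n ;"><strong><em>Xavier</em></strong>\n'
print(lookbehind_compiles(QUESTION_PATTERN))  # False: look-behind must be fixed-width
print(extract_names(sample))                  # ['DeanSkyShadow', 'Xavier']
```

Because the optional <em> is matched before the capturing group, the captured text can never start with a tag.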

The better solution is to use an HTML parser instead. I can recommend BeautifulSoup:

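from bs4 import BeautifulSoup

# htmltext is assumed to hold the HTML source as a string
soup = BeautifulSoup(htmltext, 'html.parser')

# .text returns the element's text with nested tags such as <em> stripped
for strong in soup.find_all('strong'):
    print(strong.text)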
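
If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is a different technique from the BeautifulSoup code above, not a drop-in equivalent, and the names StrongTextCollector and strong_texts are made up for this example.

```python
from html.parser import HTMLParser


class StrongTextCollector(HTMLParser):
    """Collect the text found inside each <strong> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <strong> tags currently open
        self.parts = []   # text fragments of the <strong> being read
        self.names = []   # finished results

    def handle_starttag(self, tag, attrs):
        if tag == 'strong':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'strong' and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Outermost </strong> reached: store the collected text.
                self.names.append(''.join(self.parts).strip())
                self.parts = []

    def handle_data(self, data):
        # Text inside nested tags such as <em> is kept, the tags are not.
        if self.depth:
            self.parts.append(data)


def strong_texts(html):
    """Hypothetical helper: return the text of every <strong> element."""
    collector = StrongTextCollector()
    collector.feed(html)
    collector.close()
    return collector.names


sample = ' ;"><strong>DeanSkyShadow</strong>\n ;"><strong><em>Xavier</em></strong>\n'
print(strong_texts(sample))  # ['DeanSkyShadow', 'Xavier']
```

Unlike the regex, this does not depend on the exact characters that precede the strong tag.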
