Regex to extract names from HTML

Time: 2022-09-13 11:15:15

I have two pieces of code below from which I want to extract the names.

Code:

 ;"><strong>DeanSkyShadow</strong>
 ;"><strong><em>Xavier</em></strong>

The regex should extract the names DeanSkyShadow and Xavier. My current regex:

(?<=(;"><strong><em>)|(;"><strong>))[\s\S]+?(?=(</em></strong>)|(</strong>))

grabs the names correctly if there is no em tag in the code; if there is then it also grabs the opening em tag, like this: <em>Xavier. How can I fix that?

1 Solution

#1


3  

Match anything that is not a < character. In Python's re module (used in the demo below) a look-behind must also be fixed-width, so your version does not even compile there. Use a non-capturing group for the prefix and capture only the name instead:

(?:;"><strong>(?:<em>)?)([^<]+?)(?=(?:</em>)?</strong>)

Demo:

>>> import re
>>> sample = '''\
...  ;"><strong>DeanSkyShadow</strong>
...  ;"><strong><em>Xavier</em></strong>
... '''
>>> re.findall(r'(?:;"><strong>(?:<em>)?)([^<]+?)(?=(?:</em>)?</strong>)', sample)
['DeanSkyShadow', 'Xavier']
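
To make the difference between the two patterns concrete, here is a minimal sketch that tries to compile both under Python's re module. The helper names `compile_error` and `extract_names` are hypothetical and exist only for this illustration; other regex flavors may accept the look-behind from the question.

```python
import re

# Pattern from the question: the two look-behind alternatives differ in width.
QUESTION_PATTERN = r'(?<=(;"><strong><em>)|(;"><strong>))[\s\S]+?(?=(</em></strong>)|(</strong>))'

# Pattern from this answer: the prefix is consumed by a non-capturing group
# and only the name itself is captured.
ANSWER_PATTERN = r'(?:;"><strong>(?:<em>)?)([^<]+?)(?=(?:</em>)?</strong>)'


def compile_error(pattern):
    """Hypothetical helper: return the re.error message, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


def extract_names(html):
    """Hypothetical helper: return every name captured by ANSWER_PATTERN."""
    return re.findall(ANSWER_PATTERN, html)
```

Under Python, `compile_error(QUESTION_PATTERN)` should return an error message while `compile_error(ANSWER_PATTERN)` returns None, and `extract_names` gives the same list as the demo above.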

The better solution is to use an HTML parser instead. I can recommend BeautifulSoup:

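from bs4 import BeautifulSoup

# htmltext holds the HTML source as a string
soup = BeautifulSoup(htmltext, "html.parser")

# Print the text of every <strong> element, ignoring nested tags such as <em>
for strong in soup.find_all('strong'):
    print(strong.text)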
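
If installing BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library's html.parser module. This swaps in a different tool from the one recommended above; the class `StrongTextCollector` and the function `strong_texts` are hypothetical names, and the sketch assumes the names always sit inside <strong> elements.

```python
from html.parser import HTMLParser


class StrongTextCollector(HTMLParser):
    """Hypothetical helper: collect the text inside each <strong> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # number of <strong> elements currently open
        self.names = []    # one entry per outermost <strong> element
        self._parts = []   # text pieces seen inside the current <strong>

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "strong" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Nested tags such as <em> contribute only their text.
                self.names.append("".join(self._parts).strip())
                self._parts = []

    def handle_data(self, data):
        if self.depth > 0:
            self._parts.append(data)


def strong_texts(html):
    """Return the text of every outermost <strong> element in html."""
    parser = StrongTextCollector()
    parser.feed(html)
    parser.close()
    return parser.names
```

Called on the two fragments from the question, `strong_texts` should return ['DeanSkyShadow', 'Xavier'] whether or not the <em> tag is present.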
