I have two pieces of code below from which I want to extract the names.
Code:
;"><strong>DeanSkyShadow</strong>
;"><strong><em>Xavier</em></strong>
The regex should extract the names DeanSkyShadow and Xavier. My current regex:
(?<=(;"><strong><em>)|(;"><strong>))[\s\S]+?(?=(</em></strong>)|(</strong>))
grabs the names correctly when there is no em tag in the code; when there is one, it also grabs the opening em tag, like this: <em>Xavier. How can I fix that?
1 Answer
#1
3
Match anything that is not a < character. Note also that Python's re module does not support variable-width look-behind, so your version does not work there at all. Use a non-capturing group for the prefix and a capturing group for the name instead:
(?:;"><strong>(?:<em>)?)([^<]+?)(?=(?:</em>)?</strong>)
Demo:
>>> import re
>>> sample = '''\
... ;"><strong>DeanSkyShadow</strong>
... ;"><strong><em>Xavier</em></strong>
... '''
>>> re.findall(r'(?:;"><strong>(?:<em>)?)([^<]+?)(?=(?:</em>)?</strong>)', sample)
['DeanSkyShadow', 'Xavier']
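To make the look-behind point concrete, here is a minimal sketch that checks whether Python's re module accepts the pattern from the question and then applies the capture-group version. The names ORIGINAL, FIXED, original_compiles and extract_names are made up for illustration; the behaviour described assumes the standard re module, not a third-party regex engine.

```python
import re

# Pattern from the question: the two look-behind alternatives differ in width.
ORIGINAL = r'(?<=(;"><strong><em>)|(;"><strong>))[\s\S]+?(?=(</em></strong>)|(</strong>))'

# Pattern from the answer: consume the prefix, capture only the name.
FIXED = r'(?:;"><strong>(?:<em>)?)([^<]+?)(?=(?:</em>)?</strong>)'


def original_compiles():
    """Return True if Python's re module accepts the original pattern."""
    try:
        re.compile(ORIGINAL)
    except re.error:
        # re rejects look-behinds that can match strings of different lengths
        return False
    return True


def extract_names(html):
    """Return the text captured between the opening and closing tags."""
    return re.findall(FIXED, html)
```

Because the prefix is consumed by a non-capturing group rather than asserted by a look-behind, its width no longer matters, and findall returns only the single capturing group.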
The better solution is to use an HTML parser instead. I can recommend BeautifulSoup:
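from bs4 import BeautifulSoup

# htmltext is assumed to hold the HTML source as a string
soup = BeautifulSoup(htmltext, 'html.parser')
for strong in soup.find_all('strong'):
    # .text also includes the text of nested tags such as <em>
    print(strong.text)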
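If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module instead. This is a swapped-in alternative, not the BeautifulSoup approach itself; the class StrongTextCollector and the helper strong_texts are hypothetical names, and the sketch assumes the text of interest always sits inside a strong element.

```python
from html.parser import HTMLParser


class StrongTextCollector(HTMLParser):
    """Collect the text inside each <strong> element, nested tags included."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <strong> elements currently open
        self.parts = []   # text chunks of the <strong> being read
        self.names = []   # finished results, one string per <strong>

    def handle_starttag(self, tag, attrs):
        if tag == 'strong':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'strong' and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # outermost <strong> closed: store its accumulated text
                self.names.append(''.join(self.parts))
                self.parts = []

    def handle_data(self, data):
        # keep text only while inside a <strong> element
        if self.depth:
            self.parts.append(data)


def strong_texts(html):
    """Return the text content of every top-level <strong> element."""
    parser = StrongTextCollector()
    parser.feed(html)
    parser.close()
    return parser.names
```

Calling strong_texts on the two lines from the question should give the two names without any em tags, since tags are never added to the collected text.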