When extracting the title of an HTML page, I have always used the following regex:
(?<=<title.*>)([\s\S]*)(?=</title>)
This extracts everything between the title tags in a document and ignores the tags themselves. However, when I try to use this regex in Python, it raises the following exception:
Traceback (most recent call last):
  File "test.py", line 21, in <module>
    pattern = re.compile('(?<=<title.*>)([\s\S]*)(?=</title>)')
  File "C:\Python31\lib\re.py", line 205, in compile
    return _compile(pattern, flags)
  File "C:\Python31\lib\re.py", line 273, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Python31\lib\sre_compile.py", line 495, in compile
    code = _code(p, flags)
  File "C:\Python31\lib\sre_compile.py", line 480, in _code
    _compile(code, p.data, flags)
  File "C:\Python31\lib\sre_compile.py", line 115, in _compile
    raise error("look-behind requires fixed-width pattern")
sre_constants.error: look-behind requires fixed-width pattern
The code I am using is:
pattern = re.compile('(?<=<title.*>)([\s\S]*)(?=</title>)')
m = pattern.search(f)
If I make a small adjustment, it works:
pattern = re.compile('(?<=<title>)([\s\S]*)(?=</title>)')
m = pattern.search(f)
However, this does not handle title tags that, for some reason, carry attributes or similar.
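Both behaviours described above can be reproduced in a few lines. This is only a minimal sketch and not part of the original post: the helper `compiles` and the input string `sample` are made up for illustration.

```python
import re

def compiles(pattern):
    """Return True if re.compile accepts the pattern, else False."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised for the variable-width look-behind, as in the traceback above.
        return False

variable_width = r'(?<=<title.*>)([\s\S]*)(?=</title>)'
fixed_width = r'(?<=<title>)([\s\S]*)(?=</title>)'

# Hypothetical input: a title tag that carries an attribute.
sample = '<html><head><title lang="en">Example</title></head></html>'

print(compiles(variable_width))        # the look-behind contains .*, so it is rejected
print(compiles(fixed_width))           # a literal <title> has a fixed width
print(re.search(fixed_width, sample))  # the literal <title> never occurs in sample
```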
Does anyone know a good workaround for this issue? Any tips are appreciated.
5 answers
#1
1
If you just want the contents of the title tag:
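import urllib.request
# Python 3 form of the original urllib2 call; "http://somewhere" is a placeholder URL.
html = urllib.request.urlopen("http://somewhere").read().decode("utf-8", errors="replace")
for item in html.split("</title>"):
    if "<title>" in item:
        # Skip the 7 characters of "<title>" itself.
        print(item[item.find("<title>") + 7:])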
#2
10
Toss out the idea of parsing HTML with regular expressions and use an actual HTML parsing library instead. After a quick search I found this one. It's a much safer way to extract information from an HTML file.
Remember, HTML is not a regular language so regular expressions are fundamentally the wrong tool for extracting information from it.
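As one possible illustration of the parser-based approach, here is a minimal sketch that uses `html.parser` from the Python standard library. This is not the library the answer linked to. The names `TitleParser`, `extract_title`, and the `sample` string are assumptions made for the example.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; any attributes are simply ignored.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self.in_title:
            self.parts.append(data)

def extract_title(html):
    """Return the stripped title text, or None if no title was found."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip() or None

# Hypothetical input with an attribute and a multi-line title.
sample = '<html><head><title id="t">\n  My page\n</title></head><body></body></html>'
print(extract_title(sample))
```

Because the parser reports the tag name separately from its attributes, a title such as `<title id="t">` needs no special handling.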
#3
5
Here's a famous answer on parsing HTML with regular expressions that does a great job of saying, "don't use regex to parse HTML."
#4
3
The regex for extracting the content of non-nested HTML/XML tags is actually very simple:
r = re.compile('<title[^>]*>(.*?)</title>')
However, for anything more complex, you should really use a proper HTML parser such as BeautifulSoup or the standard library's html.parser.
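A short usage sketch of the capturing-group approach follows. The flags, the `\b` after `title`, the helper name `title_from`, and the sample strings are additions for illustration, not part of the original answer. `re.DOTALL` is included because `.` does not match newlines by default, so a title spanning several lines would otherwise be missed.

```python
import re

# DOTALL lets . cross newlines; IGNORECASE also accepts <TITLE>.
# \b keeps the pattern from matching longer tag names that start with "title".
TITLE_RE = re.compile(r'<title\b[^>]*>(.*?)</title>', re.DOTALL | re.IGNORECASE)

def title_from(html):
    """Return the text of the first title element, or None if there is none."""
    m = TITLE_RE.search(html)
    return m.group(1).strip() if m else None

print(title_from('<TITLE class="x">\nHello\n</TITLE>'))
print(title_from('<p>nothing here</p>'))
```

The capturing group takes the place of the look-behind, so the opening tag can have any width.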
#5
2
What about something like:
import re
# [^>]* stops at the end of the opening tag; *? stops at the first </title>.
r = re.compile(r"(<title[^>]*>)([\s\S]*?)(</title>)")
title = r.search(page).group(2)