This question has been asked a few times on SO, but I couldn't get any of the answers to work correctly. I need to extract all the URLs in a page, both those in href links and those in plain text. I don't need the individual groups of the regex; I need a list of strings, i.e. the URLs in the page. Could someone point me to a good working example?
I'd like to do this using regexes and not BeautifulSoup, etc.
Thank you.
2 answers
#1
3
HTML is not a regular language, and thus cannot be parsed by regular expressions.
It's possible to make reasonable guesses using regular expressions, and/or to recognize a restricted subset of URIs, but that way lies madness (lengthy debugging processes, inaccurate results).
That said, if you're willing to go down that path, here is a Python adaptation of John Gruber's URL-matching regex:
import re
import string

def extract_urls(your_text):
    # Python's re module has no POSIX [:punct:] class, so build the set from string.punctuation.
    punct = re.escape(string.punctuation)
    url_re = re.compile(r'\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^' + punct + r'\s]|/)))')
    for match in url_re.finditer(your_text):
        yield match.group(0)

This can be used as follows:

>>> for uri in extract_urls('http://foo.bar/baz irc://freenode.org/bash'):
...     print(uri)
...
http://foo.bar/baz
irc://freenode.org/bash
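Since the question asks for a plain list of strings covering both href values and bare text, one possible shortcut is to scan the raw page source directly: an href value is just text in the source, so a single pass picks up both kinds. The following is only a minimal sketch under a stated assumption, namely that only http/https URLs matter. The pattern is a deliberately restricted one (not Gruber's regex), and the names `SIMPLE_URL_RE` and `list_urls` are made up for illustration. Because the pattern has no capturing groups, `findall` returns whole-match strings rather than group tuples.

```python
import re

# Restricted pattern (an assumption, not Gruber's regex): http/https only,
# stopping at whitespace, quotes, and angle brackets.
SIMPLE_URL_RE = re.compile(r'https?://[^\s"\'<>]+')

def list_urls(page_source):
    """Return the unique URLs found in page_source, in order of first appearance."""
    seen = []
    for url in SIMPLE_URL_RE.findall(page_source):
        url = url.rstrip('.,;:!?)')  # drop trailing sentence punctuation
        if url not in seen:
            seen.append(url)
    return seen

html = '<a href="http://foo.bar/baz">link</a> see also https://example.org/page.'
print(list_urls(html))
# ['http://foo.bar/baz', 'https://example.org/page']
```

This will still miss relative links and scheme-less addresses such as www.example.org, and it can be fooled by URLs that legitimately end in punctuation, so treat the result as a best-effort guess.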
#2
0
I know you can use the DOM object in PHP to parse an HTML document. I'm not familiar with Python, but this might help: http://docs.python.org/library/xml.dom.html
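For comparison, here is roughly what the parser route could look like in Python 3 using only the standard library. This sketch uses `html.parser.HTMLParser` rather than the `xml.dom` module linked above, on the assumption that real-world pages are often not well-formed XML. The class and function names (`HrefCollector`, `list_hrefs`) are hypothetical, and the approach only covers href attributes, not URLs that appear as bare text.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)

def list_hrefs(page_source):
    """Return all href values of <a> tags in page_source, in document order."""
    collector = HrefCollector()
    collector.feed(page_source)
    collector.close()
    return collector.hrefs

sample = '<p><a href="http://foo.bar/baz">one</a> <a href="/relative">two</a></p>'
print(list_hrefs(sample))
# ['http://foo.bar/baz', '/relative']
```

Unlike a regex over the raw source, this also returns relative links as they are written, so they would need to be resolved against the page's base URL before use.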