Extracting all URLs from a page with a regular expression

Time: 2022-09-13 11:11:04

This question has been asked a few times on SO, but I couldn't get any of the answers to work correctly. I need to extract all the URLs in a page, both in href links and in the plain text. I don't need the individual groups of the regex; I need a list of strings, i.e. the URLs in the page. Could someone point me to a good working example?

I'd like to do this using regexes and not BeautifulSoup, etc.

Thank you.

2 answers

#1


3  

HTML is not a regular language, and thus cannot be parsed by regular expressions.

It's possible to make reasonable guesses using regular expressions, and/or to recognize a restricted subset of URIs, but that way lies madness (lengthy debugging processes, inaccurate results).

That said, if you're willing to go that path, see John Gruber's regex for the purpose:

import re
import string

def extract_urls(your_text):
  # Python's re module has no POSIX [[:punct:]] class, so string.punctuation stands in for it.
  punct = re.escape(string.punctuation)
  url_re = re.compile(r'\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^\s%s]|/)))' % punct)
  for match in url_re.finditer(your_text):
    yield match.group(0)

This can be used as follows:

>>> for uri in extract_urls('http://foo.bar/baz irc://freenode.org/bash'):
...   print(uri)
http://foo.bar/baz
irc://freenode.org/bash
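
If all you need is a flat list of strings, and you can live with the "restricted subset of URIs" mentioned above, something along these lines may be enough. This is only a sketch under assumptions of my own: it looks for http(s) URLs only, and the names `SIMPLE_URL_RE` and `list_urls` are made up for illustration. Because the pattern has no capturing groups, `findall` returns whole matches rather than tuples of groups.

```python
import re

# Assumption: only http(s) URLs are wanted. The pattern stops at whitespace,
# quotes and angle brackets, so it works on href="..." values and on plain text.
SIMPLE_URL_RE = re.compile(r'https?://[^\s"\'<>]+')

def list_urls(page_text):
    """Return the URLs in page_text as a list of strings, in order, without duplicates."""
    seen = set()
    urls = []
    for url in SIMPLE_URL_RE.findall(page_text):
        # Drop trailing punctuation that usually belongs to the sentence, not the URL.
        url = url.rstrip('.,;:!?)')
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

page = '<a href="http://example.com/a">link</a> see also https://example.org/b.'
print(list_urls(page))
# ['http://example.com/a', 'https://example.org/b']
```

Stripping a trailing `)` will cut off URLs that legitimately end in a parenthesis, which is exactly the kind of guesswork the caveat above is about.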

#2


0  

I know you can use the DOM object in PHP to parse an HTML document. I'm not familiar with Python, but this might help: http://docs.python.org/library/xml.dom.html
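
For what it's worth, here is a rough sketch of the parser-based route in Python. It does not use xml.dom, which expects well-formed XML and tends to choke on real-world HTML; instead it assumes the standard library's `html.parser` module is acceptable. The names `HrefCollector` and `extract_hrefs` are hypothetical, and the sketch only collects href attributes of `<a>` tags, so URLs in plain text would still need a regex.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)

def extract_hrefs(html_text):
    """Return the href values found in html_text as a list of strings."""
    collector = HrefCollector()
    collector.feed(html_text)
    collector.close()
    return collector.hrefs

print(extract_hrefs('<p><a href="http://example.com/a">one</a> <a name="x">two</a></p>'))
# ['http://example.com/a']
```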
