Regex to extract URLs from href attribute in HTML with Python [duplicate]

Asked: 2022-09-13 11:10:34

Possible Duplicate:
What is the best regular expression to check if a string is a valid URL?


Considering a string as follows:


string = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://example2.com">Even More Examples</a>'

How could I, with Python, extract the URLs inside the anchor tags' href attributes? Something like:


>>> url = getURLs(string)
>>> url
['http://example.com', 'http://example2.com']

Thanks!

2 Answers

#1


165  

import re

url = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://example2.com">Even More Examples</a>'

# Use a raw string so the backslash escapes reach the regex engine intact
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', url)

>>> print(urls)
['http://example.com', 'http://example2.com']
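If you do reach for a regex on input like this, matching the `href` attribute itself, rather than trying to describe the URL grammar, tends to be more reliable. A minimal sketch, assuming double-quoted attributes as in the question's input:

```python
import re

s = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://example2.com">Even More Examples</a>'

# Capture whatever sits between the quotes of each href="..." attribute.
# This relies on the attributes being double-quoted, as in the input above.
urls = re.findall(r'href="([^"]*)"', s)
print(urls)  # ['http://example.com', 'http://example2.com']
```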

#2


35  

The best answer is...


Don't use a regex

The expression in the accepted answer misses many cases. Among other things, URLs can contain unicode characters. A fully general URL regex does exist, and after looking at it, you may conclude that you don't really want it after all. The most correct version is ten-thousand characters long.


Admittedly, if you were starting with plain, unstructured text with a bunch of URLs in it, then you might need that ten-thousand-character-long regex. But if your input is structured, use the structure. Your stated aim is to "extract the url, inside the anchor tag's href." Why use a ten-thousand-character-long regex when you can do something much simpler?


Parse the HTML instead

For many tasks, Beautiful Soup is far faster and easier to use:


>>> from bs4 import BeautifulSoup as Soup
>>> s = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://example2.com">Even More Examples</a>'
>>> html = Soup(s, 'html.parser')           # Soup(s, 'lxml') if lxml is installed
>>> [a['href'] for a in html.find_all('a')]
['http://example.com', 'http://example2.com']

If you prefer not to use external tools, you can also directly use Python's own built-in HTML parsing library. Here's a really simple subclass of HTMLParser that does exactly what you want:


from html.parser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self, output_list=None):
        super().__init__()
        # Collect hrefs into a caller-supplied list, or a fresh one
        self.output_list = [] if output_list is None else output_list

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == 'a':
            self.output_list.append(dict(attrs).get('href'))

Test:

>>> p = MyParser()
>>> p.feed(s)
>>> p.output_list
['http://example.com', 'http://example2.com']

You could even create a new method that accepts a string, calls feed, and returns output_list. This is a vastly more powerful and extensible way than regular expressions to extract information from html.

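A sketch of that convenience wrapper; the class and method names here (`LinkExtractor`, `get_urls`) are made up for illustration and are not part of HTMLParser:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.output_list = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every <a> start tag encountered
        if tag == 'a':
            self.output_list.append(dict(attrs).get('href'))

    def get_urls(self, html):
        # Feed the document and return the hrefs found
        self.feed(html)
        return self.output_list

urls = LinkExtractor().get_urls(
    '<a href="http://example.com">a</a><a href="http://example2.com">b</a>')
print(urls)  # ['http://example.com', 'http://example2.com']
```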
