What I'm trying to do here is get the headers of a given URL so I can determine the MIME type. I want to be able to see if http://somedomain/foo/
will return an HTML document or a JPEG image for example. Thus, I need to figure out how to send a HEAD request so that I can read the MIME type without having to download the content. Does anyone know of an easy way of doing this?
11 Answers
#1
100
edit: This answer works, but nowadays you should just use the requests library as mentioned by other answers below.
Use httplib.
>>> import httplib
>>> conn = httplib.HTTPConnection("www.google.com")
>>> conn.request("HEAD", "/index.html")
>>> res = conn.getresponse()
>>> print res.status, res.reason
200 OK
>>> print res.getheaders()
[('content-length', '0'), ('expires', '-1'), ('server', 'gws'), ('cache-control', 'private, max-age=0'), ('date', 'Sat, 20 Sep 2008 06:43:36 GMT'), ('content-type', 'text/html; charset=ISO-8859-1')]
There's also a getheader(name) to get a specific header.
#2
103
urllib2 can be used to perform a HEAD request. This is a little nicer than using httplib since urllib2 parses the URL for you instead of requiring you to split the URL into host name and path.
>>> import urllib2
>>> class HeadRequest(urllib2.Request):
...     def get_method(self):
...         return "HEAD"
...
>>> response = urllib2.urlopen(HeadRequest("http://google.com/index.html"))
Headers are available via response.info() as before. Interestingly, you can find the URL that you were redirected to:
>>> print response.geturl()
http://www.google.com.au/index.html
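For Python 3, the same subclassing trick works with urllib.request. A minimal sketch; the throwaway local http.server instance is not part of the original answer, it just stands in for a real host so the example is self-contained:

```python
import threading
import urllib.request
from http.server import HTTPServer, SimpleHTTPRequestHandler

class HeadRequest(urllib.request.Request):
    def get_method(self):
        return "HEAD"

# Throwaway local server (assumption: stands in for the real URL).
server = HTTPServer(("127.0.0.1", 0), SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/" % server.server_port
response = urllib.request.urlopen(HeadRequest(url))
print(response.status)                      # 200
print(response.headers.get_content_type())  # text/html
server.shutdown()
```

As in the Python 2 version, response.geturl() reports the final URL after any redirects.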
#3
53
Obligatory Requests way:
import requests
resp = requests.head("http://www.google.com")
print resp.status_code, resp.text, resp.headers
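To answer the original MIME-type question with requests, the answer is a single header lookup (note that resp.text is empty for a HEAD response). A sketch against a throwaway local server, which merely stands in for the URL you actually care about:

```python
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler

import requests

# Local stand-in for the real URL (assumption, for a self-contained demo).
server = HTTPServer(("127.0.0.1", 0), SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

resp = requests.head("http://127.0.0.1:%d/" % server.server_port)
print(resp.status_code)              # 200
print(resp.headers["Content-Type"])  # e.g. text/html; charset=utf-8
server.shutdown()
```

Note that requests.head does not follow redirects by default; pass allow_redirects=True if you want the headers of the final destination.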
#5
15
Just:
import urllib2
request = urllib2.Request('http://localhost:8080')
request.get_method = lambda : 'HEAD'
response = urllib2.urlopen(request)
response.info().gettype()
Edit: I've just come to realize there is httplib2 :D
import httplib2
h = httplib2.Http()
resp = h.request("http://www.google.com", 'HEAD')
assert resp[0]['status'] == '200'  # httplib2 returns the status as a string
assert resp[0]['content-type'].startswith('text/html')
#6
7
For completeness, here is a Python 3 answer equivalent to the accepted answer, using http.client.
It is basically the same code; the library just isn't called httplib anymore but http.client.
from http.client import HTTPConnection
conn = HTTPConnection('www.google.com')
conn.request('HEAD', '/index.html')
res = conn.getresponse()
print(res.status, res.reason)
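To tie this back to the original question, getheader pulls the MIME type out of Content-Type without downloading the body. A sketch, again with a throwaway local server standing in for a real host:

```python
import threading
from http.client import HTTPConnection
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Throwaway local server (assumption: replaces www.google.com for the demo).
server = HTTPServer(("127.0.0.1", 0), SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = HTTPConnection("127.0.0.1", server.server_port)
conn.request("HEAD", "/")
res = conn.getresponse()
content_type = res.getheader("Content-Type")  # MIME type, no body downloaded
print(res.status, content_type)
conn.close()
server.shutdown()
```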
#7
2
import httplib
import urlparse
def unshorten_url(url):
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path)
    response = h.getresponse()
    if response.status/100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    else:
        return url
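A Python 3 translation of the same function; the redirecting handler and its target URL below are invented purely to exercise it locally:

```python
import threading
from urllib.parse import urlparse
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

def unshorten_url(url):
    """Follow one hop of a redirect chain via a HEAD request."""
    parsed = urlparse(url)
    conn = HTTPConnection(parsed.netloc)
    conn.request("HEAD", parsed.path or "/")
    response = conn.getresponse()
    # 3xx means a redirect; Location carries the target URL.
    if response.status // 100 == 3 and response.getheader("Location"):
        return response.getheader("Location")
    return url

# Throwaway redirecting server (hypothetical, just to test the function).
class Redirector(BaseHTTPRequestHandler):
    def do_HEAD(self):
        self.send_response(302)
        self.send_header("Location", "http://example.com/real-target")
        self.end_headers()
    def log_message(self, *args):  # silence request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Redirector)
threading.Thread(target=server.serve_forever, daemon=True).start()

result = unshorten_url("http://127.0.0.1:%d/short" % server.server_port)
print(result)  # http://example.com/real-target
server.shutdown()
```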
#8
1
As an aside, when using httplib (at least on 2.5.2), trying to read the response of a HEAD request will block (on readline) and subsequently fail. If you do not issue a read on the response, you are unable to send another request on the same connection; you will need to open a new one, or accept a long delay between requests.
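In Python 3's http.client this is less painful: the library knows a HEAD response carries no body, so read() returns b'' immediately, and draining it is what frees the connection for the next request (http.client transparently reconnects if the server closed it). A sketch against a throwaway local server:

```python
import threading
from http.client import HTTPConnection
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Throwaway local server (assumption, for a self-contained demo).
server = HTTPServer(("127.0.0.1", 0), SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = HTTPConnection("127.0.0.1", server.server_port)
for path in ("/", "/"):
    conn.request("HEAD", path)
    res = conn.getresponse()
    body = res.read()  # drains the (empty) body so the connection can be reused
    print(res.status, len(body))
conn.close()
server.shutdown()
```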
#9
1
I have found that httplib is slightly faster than urllib2. I timed two programs, one using httplib and the other using urllib2, sending HEAD requests to 10,000 URLs. The httplib one was faster by several minutes.
httplib's totals: real 6m21.334s, user 0m2.124s, sys 0m16.372s
And urllib2's totals: real 9m1.380s, user 0m16.666s, sys 0m28.565s
Does anybody else have input on this?
#10
0
And yet another approach (similar to Pawel's answer):
import urllib2
import types
request = urllib2.Request('http://localhost:8080')
request.get_method = types.MethodType(lambda self: 'HEAD', request, request.__class__)
Just to avoid having unbound methods at instance level.
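In Python 3, types.MethodType takes just the function and the instance, so the equivalent per-instance binding looks like this (localhost:8080 kept from the answer above; no request is actually sent):

```python
import types
import urllib.request

request = urllib.request.Request("http://localhost:8080")
# Bind the lambda to this one instance; other Request objects are untouched.
request.get_method = types.MethodType(lambda self: "HEAD", request)
print(request.get_method())  # HEAD
```

Since Python 3 has no unbound methods, simply assigning a plain zero-argument lambda, as in the earlier answer, works just as well.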
#11
-4
Probably easier: use urllib or urllib2.
>>> import urllib
>>> f = urllib.urlopen('http://google.com')
>>> f.info().gettype()
'text/html'
f.info() is a dictionary-like object, so you can do f.info()['content-type'], etc.
http://docs.python.org/library/urllib.html
http://docs.python.org/library/urllib2.html
http://docs.python.org/library/httplib.html
The docs note that httplib is not normally used directly.