This shell command succeeds:
$ curl -A "Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 (compatible;)" http://fifa-infinity.com/robots.txt
and prints robots.txt. Omitting the user-agent option results in a 403 error from the server. According to robots.txt, content under http://www.fifa-infinity.com/board may be crawled. However, the following Python code fails:
import logging
import mechanize
from mechanize import Browser
ua = 'Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 (compatible;)'
br = Browser()
br.addheaders = [('User-Agent', ua)]
br.set_debug_http(True)
br.set_debug_responses(True)
logging.getLogger('mechanize').setLevel(logging.DEBUG)
br.open('http://www.fifa-infinity.com/robots.txt')
And the output on my console is:
No handlers could be found for logger "mechanize.cookies"
send: 'GET /robots.txt HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.fifa-infinity.com\r\nConnection: close\r\nUser-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 (compatible;)\r\n\r\n'
reply: 'HTTP/1.1 403 Bad Behavior\r\n'
header: Date: Wed, 13 Feb 2013 15:37:16 GMT
header: Server: Apache
header: X-Powered-By: PHP/5.2.17
header: Vary: User-Agent,Accept-Encoding
header: Connection: close
header: Transfer-Encoding: chunked
header: Content-Type: text/html
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/moshev/Projects/forumscrawler/lib/python2.7/site-packages/mechanize/_mechanize.py", line 203, in open
return self._mech_open(url, data, timeout=timeout)
File "/home/moshev/Projects/forumscrawler/lib/python2.7/site-packages/mechanize/_mechanize.py", line 255, in _mech_open
raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 403: Bad Behavior
Strangely, using curl without setting the user-agent results in "403: Forbidden" rather than "403: Bad Behavior".
Am I doing something wrong, or is this a bug in mechanize/urllib2? I don't see how simply fetching robots.txt can be "bad behaviour".
1 Solution
As verified by experiment, you need to add an Accept header specifying the acceptable content types (any value will do, as long as the "Accept" header is present). For example, the request works after changing:
br.addheaders = [('User-Agent', ua)]
to:
br.addheaders = [('User-Agent', ua), ('Accept', '*/*')]
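The debug output in the question shows that mechanize sent no Accept header, whereas curl sends "Accept: */*" by default, which would explain why the curl request got through. A minimal sketch of both the check and the fix follows; the helper names (request_header_names, build_headers, open_with_accept) are made up for illustration and are not part of mechanize, and the "Bad Behavior" status text possibly comes from a server-side filter that rejects requests lacking Accept, which is an assumption rather than something confirmed here.

```python
# Raw request copied from the "send:" line of the debug output in the question.
LOGGED_REQUEST = (
    'GET /robots.txt HTTP/1.1\r\n'
    'Accept-Encoding: identity\r\n'
    'Host: www.fifa-infinity.com\r\n'
    'Connection: close\r\n'
    'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:18.0) '
    'Gecko/20100101 Firefox/18.0 (compatible;)\r\n\r\n'
)


def request_header_names(raw_request):
    """Return the lower-cased header names of a raw HTTP request (hypothetical helper)."""
    head = raw_request.split('\r\n\r\n', 1)[0]
    header_lines = head.split('\r\n')[1:]  # drop the request line
    return [line.split(':', 1)[0].strip().lower() for line in header_lines]


def build_headers(user_agent, accept='*/*'):
    """Build a mechanize-style addheaders list that includes Accept (hypothetical helper)."""
    return [('User-Agent', user_agent), ('Accept', accept)]


def open_with_accept(url, user_agent):
    """Open url with mechanize, sending both User-Agent and Accept."""
    # Imported here so the two helpers above can be used without mechanize installed.
    import mechanize
    br = mechanize.Browser()
    br.addheaders = build_headers(user_agent)
    return br.open(url)
```

Calling request_header_names(LOGGED_REQUEST) lists accept-encoding, host, connection and user-agent, with no accept entry. Assuming the server still behaves as described, open_with_accept('http://www.fifa-infinity.com/robots.txt', ua).read() should then return the file instead of raising the 403 error.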