While crawling a website like https://www.netflix.com, I am getting: Forbidden by robots.txt: https://www.netflix.com/>
ERROR: No response downloaded for: https://www.netflix.com/
2 Answers
#1 (score: 80)
In the new version (Scrapy 1.1), released 2016-05-11, the crawler first downloads robots.txt before crawling. To change this behavior, set ROBOTSTXT_OBEY in your settings.py:
ROBOTSTXT_OBEY=False
Here are the release notes
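If disabling robots.txt checks globally is too broad, Scrapy also lets you override the setting for a single spider via custom_settings. A minimal sketch (the spider name and start URL are placeholders for illustration):

import scrapy

class NetflixSpider(scrapy.Spider):
    # Hypothetical spider name and start URL, used only for illustration
    name = "netflix"
    start_urls = ["https://www.netflix.com/"]

    # Per-spider override: ignore robots.txt for this spider only
    custom_settings = {"ROBOTSTXT_OBEY": False}

    def parse(self, response):
        # Confirm a response was actually downloaded
        self.logger.info("Got %s (%d bytes)", response.url, len(response.body))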
#2 (score: 0)
The first thing you need to ensure is that you change the user agent in your requests; otherwise the default user agent will be blocked for sure.
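As a sketch, assuming you are using Scrapy, you can set the user agent either globally in settings.py or per request via headers (the user agent strings below are only example browser strings, not required values):

# In settings.py: one user agent for all requests
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

# Or per request, e.g. in a spider's start_requests():
import scrapy

class UASpider(scrapy.Spider):
    name = "ua_example"  # hypothetical spider name

    def start_requests(self):
        yield scrapy.Request(
            "https://www.netflix.com/",
            headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/125.0"},
            callback=self.parse,
        )

    def parse(self, response):
        # Log the HTTP status so you can see whether the request got through
        self.logger.info("Status %d for %s", response.status, response.url)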