I'm trying to scrape images off a site using the default ImagePipeline scrapy has. However, I encounter this error.
我正在尝试使用默认的ImagePipeline scrapy从网站上删除图像。但是,我遇到了这个错误。
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/middleware.py", line 62, in _process_chain
return process_chain(self.methods[methodname], obj, *args)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/utils/defer.py", line 65, in process_chain
d.callback(input)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 383, in callback
self._startRunCallbacks(result)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 491, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 578, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/contrib/pipeline/media.py", line 40, in process_item
requests = arg_to_iter(self.get_media_requests(item, info))
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/contrib/pipeline/images.py", line 104, in get_media_requests
return [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])]
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 26, in __init__
self._set_url(url)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 57, in _set_url
self._set_url(url.encode(self.encoding))
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 61, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
exceptions.ValueError: Missing scheme in request url: h
My code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from extrapetite.items import ExtrapetiteItem
class ExtrapetiteCsSpider(CrawlSpider):
name = 'extrapetite_cs'
allowed_domains = ['www.extrapetite.com']
start_urls = ['http://www.extrapetite.com/']
rules = [
Rule(
LinkExtractor(allow=['/search/label/Lookbook']),
callback='parse_link',
follow=True
)
]
def parse_link(self, response):
for thing in response.xpath('//*[@id="Blog1"]/div[1]/div[position()>2]/div/div/div/div[3]/a/@href'):
request = scrapy.Request(thing.extract(), callback=self.parse_img)
yield request
def parse_img(self,response):
for thing in response.xpath('//*[@id="Blog1"]/div[1]/div/div/div[1]/div[1]/div[2]/a[position()>0]/img'):
item = ExtrapetiteItem()
item['image_urls'] = thing.xpath('@src').extract()[0]
item['url'] = response.url
item['desc'] = thing.xpath('@alt').extract()[0]
yield item
My settings:
# -*- coding: utf-8 -*-
# Scrapy settings for extrapetite project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'extrapetite'
SPIDER_MODULES = ['extrapetite.spiders']
NEWSPIDER_MODULE = 'extrapetite.spiders'
FEED_URI = 'logs/%(time)s.csv'
FEED_FORMAT = 'csv'
ITEM_PIPELINES = {
'scrapy.contrib.pipeline.images.ImagesPipeline': 1
}
IMAGES_STORE = '/Users/crescal/compsci/ggslh/extrapetite/images/'
IMAGES_EXPIRES = 90
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'extrapetite (+http://www.yourdomain.com)'
Without the ImagePipeline enabled, I am able to crawl the site and obtain the image_url, url, and desc.
如果没有启用ImagePipeline,我可以抓取网站并获取image_url,url和desc。
Also, when I try to view the link using scrapy view https://farm9.staticflickr.com/8802/16977938339_20f05dc232_o.jpg
此外,当我尝试使用scrapy视图查看链接时https://farm9.staticflickr.com/8802/16977938339_20f05dc232_o.jpg
I get this error
我收到这个错误
2015-05-08 13:31:37+0800 [default] DEBUG: Crawled (200) <GET https://farm9.staticflickr.com/8802/16977938339_20f05dc232_o.jpg> (referer: None)
2015-05-08 13:31:37+0800 [default] ERROR: Spider error processing <GET https://farm9.staticflickr.com/8802/16977938339_20f05dc232_o.jpg>
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/base.py", line 1201, in mainLoop
self.runUntilCurrent()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 383, in callback
self._startRunCallbacks(result)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 491, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 578, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/commands/fetch.py", line 47, in <lambda>
cb = lambda x: self._print_response(x, opts)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/commands/view.py", line 20, in _print_response
open_in_browser(response)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/utils/response.py", line 86, in open_in_browser
response.__class__.__name__)
exceptions.TypeError: Unsupported response type: Response
The content type of the url is image/jpeg, why is that I am unable to view this?
网址的内容类型是image / jpeg,为什么我无法查看此内容?
1 个解决方案
#1
The response is binary data. scrapy
will only show the response if it's html or txt (because binary data is mostly not so useful in the terminal). See also the source code
响应是二进制数据。 scrapy只会显示响应,如果它是html或txt(因为二进制数据在终端中大多没那么有用)。另请参阅源代码
And your first error message
还有你的第一个错误信息
Missing scheme in request url: h
results because you pass a URL to the image_urls field. You need to pass a list of URLs -- even if it is only one URL. While extract()
already yields a list you can fix this by removing the [0]
at the end of the line.
结果是因为您将URL传递给image_urls字段。您需要传递一个URL列表 - 即使它只是一个URL。虽然extract()已经产生了一个列表,你可以通过删除行尾的[0]来解决这个问题。
#1
The response is binary data. scrapy
will only show the response if it's html or txt (because binary data is mostly not so useful in the terminal). See also the source code
响应是二进制数据。 scrapy只会显示响应,如果它是html或txt(因为二进制数据在终端中大多没那么有用)。另请参阅源代码
And your first error message
还有你的第一个错误信息
Missing scheme in request url: h
results because you pass a URL to the image_urls field. You need to pass a list of URLs -- even if it is only one URL. While extract()
already yields a list you can fix this by removing the [0]
at the end of the line.
结果是因为您将URL传递给image_urls字段。您需要传递一个URL列表 - 即使它只是一个URL。虽然extract()已经产生了一个列表,你可以通过删除行尾的[0]来解决这个问题。