Keywords: Unhandled error in Deferred

Time: 2021-11-06 20:55:30

I'm writing a small crawler with Scrapy. I want to be able to pass the start_url argument to my spider, which will later enable me to run it via Celery (or something else).

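For context, this is roughly how I plan to run it; the run_spider wrapper below is just a placeholder sketch for the eventual Celery task, not code I already have working:

# From the command line the argument would be passed with -a:
#   scrapy crawl OnetSpider -a start_url=http://katalog.onet.pl/
# Programmatically (placeholder for the Celery task):
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def run_spider(start_url):
    process = CrawlerProcess(get_project_settings())
    process.crawl('OnetSpider', start_url=start_url)
    process.start()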

I hit a wall with passing arguments, and I'm getting an error:


2016-03-13 08:50:50 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
Unhandled error in Deferred:
2016-03-13 08:50:50 [twisted] CRITICAL: Unhandled error in Deferred:


Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/commands/crawl.py", line 57, in run
    self.crawler_process.crawl(spname, **opts.spargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 153, in crawl
    d = crawler.crawl(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1274, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 70, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 80, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/crawl.py", line 91, in from_crawler
    spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 50, in from_crawler
    spider = cls(*args, **kwargs)
exceptions.TypeError: __init__() takes at least 3 arguments (1 given)
2016-03-13 08:50:50 [twisted] CRITICAL:

The spider code is below:


class OnetSpider(CrawlSpider):
    name = 'OnetSpider'
    def __init__(self, ur, *args, **kwargs):
        super(OnetSpider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]

    #allowed_domains = ['katalog.onet.pl']
    #start_urls = ['http://katalog.onet.pl/']
    response_url = ""

    rules = [Rule(LinkExtractor(unique = True),
        callback="parse_items",
        follow = True)]

    def parse_start_url(self, response):
        self.response_url = response.url
        return self.parse_items(response)

    def parse_items (self, response):
        baseDomain = self.get_base_domain(self.response_url)
        for sel in response.xpath('//a'):
            l = sel.xpath('@href').extract()[0]
            t = sel.xpath('text()').extract()
            if (self.is_relative(l)) or (baseDomain.upper()
                in l.upper()):
                continue
            else:
                itm = OnetItem()
                itm['anchorTitle'] = t
                itm['link'] = self.process_url(l)
                itm['timeStamp'] = datetime.datetime.now()
                itm['isChecked'] = 0
                itm['responseCode'] = 0
                itm['redirecrURL'] = ''
                yield itm

    def is_relative(self,url):
        #checks if url is relative path or absolute
        if urlparse(url).netloc =="":
            return True
        else:
            return False


    def get_base_domain(self, url):
        #returns base url stripped from www/ftp and any ports
        base =  urlparse(url).netloc
        if base.upper().startswith("WWW."):
            base = base[4:]
        if base.upper().startswith("FTP."):
            base = base[4:]
        base = base.split(':')[0]
        return base

    def process_url(self,url):
        u = urlparse(url)
        if u.scheme == '' :
            u.scheme = 'http'
        finalURL = u.scheme + '://' + u.netloc +'/'
        return finalURL.lower()

I'm pretty sure it has something to do with passing arguments, because without the def __init__ the spider runs fine.


Any idea what the issue is?


I'm running this on my VPS Ubuntu server.


1 Answer

#1


So I managed to get the crawler working. No idea what fixed it; I just took the original spider and copied the def __init__ part from the Scrapy source file.

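If I had to guess from the traceback: Scrapy instantiates the spider with cls(*args, **kwargs), and arguments given with -a arrive as keyword arguments only, so a required positional parameter like ur in the original __init__ is never supplied and the call fails with the TypeError. A rough comparison of the two signatures (my guess, not a verified diagnosis):

# original: 'ur' is a required positional argument, but -a options
# arrive as keyword arguments, so it is never filled in
def __init__(self, ur, *args, **kwargs):
    ...

# working version: accept everything as keyword arguments and read them out of **kw
def __init__(self, *a, **kw):
    ...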

Below is the working version, just for historical reference. I tested one of the Scrapinghub examples and it worked, which got me thinking that my spider might have some small error and in the end I might just rewrite it.


Anyways - working sample:


from onet.items import OnetItem

import scrapy
import re

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
#from scrapy import log
import logging

from urlparse import urlparse
import datetime

logger = logging.getLogger('-------MICHAL------')


class WebSpider(CrawlSpider):
    name = 'WebSpider'

    def __init__(self, *a, **kw):
        super(WebSpider, self).__init__(*a, **kw)
        self._compile_rules()
        url = kw.get('url') or kw.get('domain')
        if not url.startswith('http://') and not url.startswith('https://'):
            url = 'http://%s/' % url

        self.url = url
        self.start_urls = [self.url]
        self.allowed_domains = [re.sub(r'^www\.', '', urlparse(url).hostname)]
    response_url = ""

    rules = [Rule(LinkExtractor(unique = True),
        callback="parse_items",
        follow = True)]

    def parse_start_url(self, response):
        self.response_url = response.url
        return self.parse_items(response)

    def parse_items (self, response):
        baseDomain = self.get_base_domain(self.response_url)
        for sel in response.xpath('//a'):
            l = sel.xpath('@href').extract()[0]
            t = sel.xpath('text()').extract()
            if (self.is_relative(l)) or (baseDomain.upper()
                in l.upper()):
                continue
            else:
                itm = OnetItem()
                itm['anchorTitle'] = t
                itm['link'] = self.process_url(l)
                itm['timeStamp'] = datetime.datetime.now()
                itm['isChecked'] = 0
                itm['responseCode'] = 0
                itm['redirecrURL'] = ''
                yield itm

    def is_relative(self,url):
        #checks if url is relative path or absolute
        if urlparse(url).netloc =="":
            return True
        else:
            return False
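
The remaining helpers, get_base_domain and process_url, carry over from the question's spider. One caveat: urlparse returns an immutable ParseResult, so the original assignment to u.scheme would raise AttributeError; the sketch below uses _replace instead:

    def get_base_domain(self, url):
        #returns base url stripped from www/ftp and any ports
        base = urlparse(url).netloc
        if base.upper().startswith("WWW."):
            base = base[4:]
        if base.upper().startswith("FTP."):
            base = base[4:]
        base = base.split(':')[0]
        return base

    def process_url(self, url):
        #normalizes a URL to scheme://netloc/ in lower case
        u = urlparse(url)
        if u.scheme == '':
            # ParseResult is immutable, so build a copy with _replace
            # rather than assigning to u.scheme
            u = u._replace(scheme='http')
        finalURL = u.scheme + '://' + u.netloc + '/'
        return finalURL.lower()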
