I'm writing a small crawler with Scrapy. I want to be able to pass a start_url
argument to my spider, which will later let me run it via Celery (or something else).
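For reference, Scrapy passes spider arguments to the spider's __init__ as keyword arguments via the -a flag, so the crawl that produces the traceback below is presumably being started with something like this (the URL value is only an example):

scrapy crawl OnetSpider -a start_url=http://katalog.onet.pl/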
I hit a wall with passing the arguments, and I'm getting this error:
2016-03-13 08:50:50 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
Unhandled error in Deferred:
2016-03-13 08:50:50 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 150, in _run_command
cmd.run(args, opts)
File "/usr/local/lib/python2.7/dist-packages/scrapy/commands/crawl.py", line 57, in run
self.crawler_process.crawl(spname, **opts.spargs)
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 153, in crawl
d = crawler.crawl(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1274, in unwindGenerator
return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 70, in crawl
self.spider = self._create_spider(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 80, in _create_spider
return self.spidercls.from_crawler(self, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/crawl.py", line 91, in from_crawler
spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 50, in from_crawler
spider = cls(*args, **kwargs)
exceptions.TypeError: __init__() takes at least 3 arguments (1 given)
2016-03-13 08:50:50 [twisted] CRITICAL:
The spider code is below:
from onet.items import OnetItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from urlparse import urlparse
import datetime

class OnetSpider(CrawlSpider):

    name = 'OnetSpider'

    def __init__(self, ur, *args, **kwargs):
        super(OnetSpider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]

    #allowed_domains = ['katalog.onet.pl']
    #start_urls = ['http://katalog.onet.pl/']

    response_url = ""

    rules = [Rule(LinkExtractor(unique=True),
                  callback="parse_items",
                  follow=True)]

    def parse_start_url(self, response):
        self.response_url = response.url
        return self.parse_items(response)

    def parse_items(self, response):
        baseDomain = self.get_base_domain(self.response_url)
        for sel in response.xpath('//a'):
            l = sel.xpath('@href').extract()[0]
            t = sel.xpath('text()').extract()
            if (self.is_relative(l)) or (baseDomain.upper() in l.upper()):
                continue
            else:
                itm = OnetItem()
                itm['anchorTitle'] = t
                itm['link'] = self.process_url(l)
                itm['timeStamp'] = datetime.datetime.now()
                itm['isChecked'] = 0
                itm['responseCode'] = 0
                itm['redirecrURL'] = ''
                yield itm

    def is_relative(self, url):
        # checks if the url is a relative path or absolute
        if urlparse(url).netloc == "":
            return True
        else:
            return False

    def get_base_domain(self, url):
        # returns the base domain stripped of www/ftp and any port
        base = urlparse(url).netloc
        if base.upper().startswith("WWW."):
            base = base[4:]
        if base.upper().startswith("FTP."):
            base = base[4:]
        base = base.split(':')[0]
        return base

    def process_url(self, url):
        u = urlparse(url)
        if u.scheme == '':
            u.scheme = 'http'
        finalURL = u.scheme + '://' + u.netloc + '/'
        return finalURL.lower()
I'm pretty sure it has something to do with how the arguments are passed, because without the def __init__ the spider runs fine.
Any idea what the issue is?
I'm running this on my Ubuntu VPS.
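Since the eventual goal is to kick the crawl off from Celery, a minimal sketch of that part might look like the following. Everything in it (the task name, the broker URL, the subprocess approach) is an assumption rather than part of the existing setup; shelling out to the scrapy CLI avoids running Twisted's reactor inside the Celery worker.

import subprocess

from celery import Celery

# broker URL is illustrative
app = Celery('crawler', broker='redis://localhost:6379/0')

@app.task
def run_spider(start_url):
    # run the spider in its own process so Scrapy's reactor
    # does not clash with the Celery worker process
    subprocess.check_call(
        ['scrapy', 'crawl', 'OnetSpider', '-a', 'start_url=%s' % start_url])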
1 Solution
#1
So I managed to get the crawler working. I have no idea exactly what fixed it; I just took the original spider and copied the def __init__ part from the Scrapy source file.
Below is the working version, just for historical reference. I tested one of the scrapinghub examples and it worked, which got me thinking that my spider might have some small error and that in the end I might just rewrite it.
Anyway, here is the working sample:
from onet.items import OnetItem
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
#from scrapy import log
import logging
from urlparse import urlparse
import datetime

logger = logging.getLogger('-------MICHAL------')

class WebSpider(CrawlSpider):

    name = 'WebSpider'

    def __init__(self, *a, **kw):
        super(WebSpider, self).__init__(*a, **kw)
        self._compile_rules()
        url = kw.get('url') or kw.get('domain')
        if not url.startswith('http://') and not url.startswith('https://'):
            url = 'http://%s/' % url
        self.url = url
        self.start_urls = [self.url]
        self.allowed_domains = [re.sub(r'^www\.', '', urlparse(url).hostname)]

    response_url = ""

    rules = [Rule(LinkExtractor(unique=True),
                  callback="parse_items",
                  follow=True)]

    def parse_start_url(self, response):
        self.response_url = response.url
        return self.parse_items(response)

    def parse_items(self, response):
        baseDomain = self.get_base_domain(self.response_url)
        for sel in response.xpath('//a'):
            l = sel.xpath('@href').extract()[0]
            t = sel.xpath('text()').extract()
            if (self.is_relative(l)) or (baseDomain.upper() in l.upper()):
                continue
            else:
                itm = OnetItem()
                itm['anchorTitle'] = t
                itm['link'] = self.process_url(l)
                itm['timeStamp'] = datetime.datetime.now()
                itm['isChecked'] = 0
                itm['responseCode'] = 0
                itm['redirecrURL'] = ''
                yield itm

    def is_relative(self, url):
        # checks if the url is a relative path or absolute
        if urlparse(url).netloc == "":
            return True
        else:
            return False

    # get_base_domain() and process_url() are the same helpers as in the
    # original spider above; repeated here so the class is self-contained.
    def get_base_domain(self, url):
        base = urlparse(url).netloc
        if base.upper().startswith("WWW."):
            base = base[4:]
        if base.upper().startswith("FTP."):
            base = base[4:]
        base = base.split(':')[0]
        return base

    def process_url(self, url):
        u = urlparse(url)
        if u.scheme == '':
            u.scheme = 'http'
        finalURL = u.scheme + '://' + u.netloc + '/'
        return finalURL.lower()
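For what it's worth, the traceback in the question hints at the likely cause: Scrapy creates the spider with spider = cls(*args, **kwargs), and values passed with -a arrive purely as keyword arguments, so a signature like def __init__(self, ur, ...) declares a required positional parameter that nothing ever fills in, which produces exactly this kind of TypeError. The rewritten def __init__(self, *a, **kw) has no required positional parameters and pulls the URL out of the keyword arguments instead, so the spider can be started with something like this (the domain is only an example):

scrapy crawl WebSpider -a url=katalog.onet.pl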