I'd like to continuously fetch urls to crawl from a database. So far I have succeeded in fetching urls from the database, but I'd like my spider to keep reading from it, since the table will be populated by another thread.
I have a pipeline that removes a url from the table once it has been crawled (this part works). In other words, I'd like to use my database as a queue. I have tried different approaches with no luck.
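For reference, that delete-after-crawl pipeline looks roughly like the sketch below (a minimal sketch rather than my exact code; the class name and the 'url' item field are placeholders):

import MySQLdb


class RemoveCrawledUrlPipeline(object):

    def open_spider(self, spider):
        self.db = MySQLdb.connect(
            user='myuser',
            passwd='mypassword',
            db='mydatabase',
            host='myhost',
            charset='utf8',
            use_unicode=True
        )

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        # Delete the row once its page has been crawled, so the table
        # effectively behaves like a queue.
        cursor = self.db.cursor()
        cursor.execute('DELETE FROM mytable WHERE url = %s', (item['url'],))
        self.db.commit()
        cursor.close()
        return item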
Here's my spider.py
import logging

import MySQLdb
import scrapy
from scrapy import Request, signals


class MySpider(scrapy.Spider):
    MAX_RETRY = 10
    logger = logging.getLogger(__name__)
    name = 'myspider'
    start_urls = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def __init__(self):
        db = MySQLdb.connect(
            user='myuser',
            passwd='mypassword',
            db='mydatabase',
            host='myhost',
            charset='utf8',
            use_unicode=True
        )
        self.db = db
        self.logger.info('Connection to database opened')
        super(MySpider, self).__init__()

    def spider_closed(self, spider):
        self.db.close()
        self.logger.info('Connection to database closed')

    def start_requests(self):
        cursor = self.db.cursor()
        cursor.execute('SELECT * FROM mytable WHERE nbErrors < %s', (self.MAX_RETRY,))
        rows = cursor.fetchall()
        for row in rows:
            yield Request(row[0], self.parse, meta={
                'splash': {
                    'args': {
                        'html': 1,
                        'wait': 2
                    }
                }
            }, errback=self.errback_httpbin)
        cursor.close()
Thank you very much
EDIT
Here's my new code.
import time
from datetime import datetime

from scrapy.exceptions import DontCloseSpider

# (the methods below are part of MySpider; the rest of the class is unchanged)

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.spider_closed, signals.spider_closed)
    crawler.signals.connect(spider.spider_idle, signals.spider_idle)
    return spider

def spider_idle(self, spider):
    self.logger.info('IDLE')
    time.sleep(5)
    for url in self.getUrlsToCrawl():
        self.logger.info(url[1])
        self.crawler.engine.crawl(Request(url[1], self.parse, meta={
            'splash': {
                'args': {
                    'html': 1,
                    'wait': 5
                }
            },
            'dbId': url[0]
        }, errback=self.errback_httpbin), self)
    raise DontCloseSpider

def getUrlsToCrawl(self):
    dateNowUtc = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S")
    cursor = self.db.cursor()
    cursor.execute('SELECT id, url FROM mytable WHERE nbErrors < %s AND domain = %s AND nextCrawl < %s',
                   (self.MAX_RETRY, self.domain, dateNowUtc))
    urls = cursor.fetchall()
    cursor.close()
    return urls
In my logs I can see:
INFO: IDLE
INFO: someurl
INFO: IDLE
INFO: someurl
But when I update the data in my table to fetch more or fewer urls, the output never changes. It seems that the data read from the database is never fresh, and the requests made in the spider_idle method are never crawled.
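One thing worth checking (an assumption on my part, not something I have verified yet): with MySQLdb, autocommit is off by default, and under InnoDB's default REPEATABLE READ isolation level repeated SELECTs inside the same transaction keep returning the same snapshot, so getUrlsToCrawl may never see the rows inserted by the other thread. Scrapy's dupefilter can also silently drop requests for urls it has already seen. A sketch of spider_idle adjusted for both points:

def spider_idle(self, spider):
    self.logger.info('IDLE')
    # End the current transaction so the next SELECT sees rows inserted by
    # the other thread since the previous read (assumes InnoDB with the
    # default REPEATABLE READ isolation level).
    self.db.commit()
    for url in self.getUrlsToCrawl():
        self.logger.info(url[1])
        self.crawler.engine.crawl(Request(
            url[1],
            self.parse,
            # Keep the dupefilter from dropping urls already requested in an
            # earlier pass.
            dont_filter=True,
            meta={
                'splash': {'args': {'html': 1, 'wait': 5}},
                'dbId': url[0]
            },
            errback=self.errback_httpbin), self)
    raise DontCloseSpider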
1 Answer
I would personally recommend starting a new spider every time you have to crawl something, but if you want to keep the process alive, I would recommend using the spider_idle signal:
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.spider_closed, signals.spider_closed)
    crawler.signals.connect(spider.spider_idle, signals.spider_idle)
    return spider

...

def spider_idle(self, spider):
    # read the database again and send new requests
    # make sure the requests you send here are different
    self.crawler.engine.crawl(
        Request(
            new_url,
            callback=self.parse),
        spider
    )
Here you are sending new requests before the spider actually closes.
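Note that if the table happens to be empty when the idle signal fires, no new requests get scheduled and the spider closes anyway; raising DontCloseSpider from the handler (as in the question's own code) keeps it alive so the check runs again on the next idle cycle. A minimal sketch of that variant:

from scrapy.exceptions import DontCloseSpider

def spider_idle(self, spider):
    # Re-read the database (getUrlsToCrawl as defined in the question) and
    # schedule whatever is pending.
    for row in self.getUrlsToCrawl():
        self.crawler.engine.crawl(
            Request(row[1], callback=self.parse),
            spider
        )
    # Keep the spider alive even when the table was empty this time around.
    raise DontCloseSpider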