I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found 2 sources that explain this:
http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/
http://snipplr.com/view/67006/using-scrapy-from-a-script/
I can't figure out where I should put my spider code and how to call it from the main function. Please help. This is the example code:
# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script.
#
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
#
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet.
#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings') #Must be at the top before other imports
from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue
class CrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)

    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
        p.start()
        p.join()
        return queue.get(True)

# Usage
if __name__ == "__main__":
    log.start()

    """
    This example runs spider1 and then spider2 three times.
    """
    items = list()
    crawler = CrawlerScript()
    items.append(crawler.crawl('spider1'))
    for i in range(3):
        items.append(crawler.crawl('spider2'))
    print items

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: joehillen
# date  : Oct 24, 2010
Thank you.
6 Answers
#1
43
All other answers reference Scrapy v0.x. According to the updated docs, Scrapy 1.0 demands:
import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
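If you already have a Scrapy project and want the script to pick up its settings and pipelines, a minimal sketch along the same lines (assuming the script runs inside the project so get_project_settings() can locate scrapy.cfg, and 'myspider' is a hypothetical spider name):
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Use the project's settings.py instead of an ad-hoc settings dict
process = CrawlerProcess(get_project_settings())
process.crawl('myspider')  # hypothetical spider name registered in the project
process.start()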
#2
13
Though I haven't tried it, I think the answer can be found in the Scrapy documentation. To quote directly from it:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider
spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here
From what I gather, this is a newer development in the library that renders some of the earlier approaches found online (such as the one in the question) obsolete.
#3
12
In scrapy 0.19.x you should do this:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings
spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here until the spider_closed signal was sent
Note these lines:
settings = get_project_settings()
crawler = Crawler(settings)
Without them, your spider won't use your project settings and won't save the items. It took me a while to figure out why the example in the documentation wasn't saving my items. I sent a pull request to fix the doc example.
Another way to do this is to call the command directly from your script:
from scrapy import cmdline
cmdline.execute("scrapy crawl followall".split()) #followall is the spider's name
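If you also want to pass command-line options through this call, for example an output feed, something like this should work (items.json is just an assumed output path):
from scrapy import cmdline

# Equivalent to running: scrapy crawl followall -o items.json
cmdline.execute("scrapy crawl followall -o items.json".split())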
Copied this answer from my first answer here: https://*.com/a/19060485/1402286
#4
7
When multiple crawlers need to be run inside one Python script, stopping the reactor needs to be handled with caution, because the reactor can only be stopped once and cannot be restarted.
However, while doing my project I found that using
os.system("scrapy crawl yourspider")
is the easiest. It saves me from handling all sorts of signals, especially when I have multiple spiders.
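A close variant, if you prefer not to go through the shell, is subprocess; just a sketch of the same idea:
import subprocess

# Run the spider in a separate process, equivalent to the shell command above;
# check=True raises CalledProcessError if the crawl exits with a non-zero status.
subprocess.run(["scrapy", "crawl", "yourspider"], check=True)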
If performance is a concern, you can use multiprocessing to run your spiders in parallel, something like:
import os
from multiprocessing import Pool

def _crawl(spider_name=None):
    if spider_name:
        os.system('scrapy crawl %s' % spider_name)
    return None

def run_crawler():
    spider_names = ['spider1', 'spider2', 'spider2']
    pool = Pool(processes=len(spider_names))
    pool.map(_crawl, spider_names)
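As a usage note, the pool should be created under a __main__ guard so child processes don't re-import and re-run the pool setup (required on Windows, good practice elsewhere):
if __name__ == '__main__':
    run_crawler()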
#5
-2
# -*- coding: utf-8 -*-
import sys
from scrapy.cmdline import execute
def gen_argv(s):
    sys.argv = s.split()

if __name__ == '__main__':
    gen_argv('scrapy crawl abc_spider')
    execute()
Put this code in a path from which you can run scrapy crawl abc_spider on the command line. (Tested with Scrapy==0.24.6)
#6
-2
If you just want to run a simple crawl, it's easy to do by running the command:
scrapy crawl <spider_name>
There is also an option to export your results in formats such as JSON, XML, or CSV:
scrapy crawl <spider_name> -o result.csv (or result.json, or result.xml)
You may want to try it.
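If you want the same export while running from a script instead of the shell, here is a minimal sketch, assuming an older Scrapy version where the FEED_FORMAT and FEED_URI settings are available and that the script lives inside a project:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
settings.set('FEED_FORMAT', 'json')      # or 'csv' / 'xml'
settings.set('FEED_URI', 'result.json')  # output file for the exported items

process = CrawlerProcess(settings)
process.crawl('your_spider')  # hypothetical spider name
process.start()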