How do I use the fields_to_export attribute of BaseItemExporter to order my Scrapy CSV data?

Time: 2020-12-22 07:22:40

I have made a simple Scrapy spider that I use from the command line to export my data to CSV format, but the order of the data seems random. How can I order the CSV fields in my output?

I use the following command line to get CSV data:

scrapy crawl somewhere -o items.csv -t csv

According to this Scrapy documentation, I should be able to use the fields_to_export attribute of the BaseItemExporter class to control the order. But I have no idea how to use it, as I have not found any simple example to follow.

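For reference, the attribute lives on the exporter instance itself. A minimal standalone sketch (the field names 'name', 'price', and 'url' are placeholders):

from scrapy.exporters import CsvItemExporter  # scrapy.contrib.exporter in older Scrapy versions

# fields_to_export controls both which fields are written and the column order
with open('items.csv', 'wb') as f:
  exporter = CsvItemExporter(f)
  exporter.fields_to_export = ['name', 'price', 'url']  # placeholder field names
  exporter.start_exporting()
  exporter.export_item({'url': 'http://www.example.com', 'name': 'demo', 'price': '9.99'})
  exporter.finish_exporting()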

Please note: This question is very similar to THIS one. However, that question is over 2 years old, doesn't address the many recent changes to Scrapy, and doesn't provide a satisfactory answer, as it requires hacking one or both of:

to address some previous issues that seem to have already been resolved...

Many thanks in advance.

2 Solutions

#1 (23 votes)

To use such an exporter, you need to create your own Item pipeline that processes your spider's output. Assuming a simple case where you want all spider output in one file, this is the pipeline to use (pipelines.py):

from scrapy import signals
from scrapy.exporters import CsvItemExporter  # scrapy.contrib.exporter in older Scrapy versions

class CSVPipeline(object):

  def __init__(self):
    self.files = {}

  @classmethod
  def from_crawler(cls, crawler):
    pipeline = cls()
    crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
    crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
    return pipeline

  def spider_opened(self, spider):
    # Open one CSV file per spider and wire it to a CsvItemExporter
    file = open('%s_items.csv' % spider.name, 'w+b')
    self.files[spider] = file
    self.exporter = CsvItemExporter(file)
    # fields_to_export sets which fields are written and their column order.
    # The names below are placeholders - use your own item's field names.
    self.exporter.fields_to_export = ['field_1', 'field_2', 'field_3']
    self.exporter.start_exporting()

  def spider_closed(self, spider):
    self.exporter.finish_exporting()
    file = self.files.pop(spider)
    file.close()

  def process_item(self, item, spider):
    self.exporter.export_item(item)
    return item

Of course, you need to remember to add this pipeline to your configuration file (settings.py):

ITEM_PIPELINES = {'myproject.pipelines.CSVPipeline': 300}
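
For completeness, a minimal sketch of a matching item definition (items.py); the field names here are placeholders and must line up with whatever you put in fields_to_export:

# items.py - hypothetical item whose fields match fields_to_export above
import scrapy

class MyItem(scrapy.Item):
  field_1 = scrapy.Field()
  field_2 = scrapy.Field()
  field_3 = scrapy.Field()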

#2 (7 votes)

You can now specify settings in the spider itself. https://doc.scrapy.org/en/latest/topics/settings.html#settings-per-spider

To set the field order for exported feeds, set FEED_EXPORT_FIELDS. https://doc.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields

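If you prefer not to set this per spider, the same setting can also go in the project-wide settings.py; a sketch using the same four fields as the spider below:

# settings.py - project-wide alternative to per-spider custom_settings
FEED_EXPORT_FIELDS = ["page", "page_ix", "text", "url"]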

The spider below dumps all links on a website (written against Scrapy 1.4.0):

import scrapy
from scrapy.http import HtmlResponse

class DumplinksSpider(scrapy.Spider):
  name = 'dumplinks'
  allowed_domains = ['www.example.com']
  start_urls = ['http://www.example.com/']
  custom_settings = {
    # specifies exported fields and order
    'FEED_EXPORT_FIELDS': ["page", "page_ix", "text", "url"],
  }

  def parse(self, response):
    # Skip non-HTML responses (e.g. images or PDFs)
    if not isinstance(response, HtmlResponse):
      return

    a_selectors = response.xpath('//a')
    for i, a_selector in enumerate(a_selectors):
      text = a_selector.xpath('normalize-space(text())').extract_first()
      url = a_selector.xpath('@href').extract_first()
      yield {
        'page_ix': i + 1,
        'page': response.url,
        'text': text,
        'url': url,
      }
      if url:  # guard against <a> tags without an href
        yield response.follow(url, callback=self.parse)  # see allowed_domains

Run with this command:

scrapy crawl dumplinks --loglevel=INFO -o links.csv

Fields in links.csv are ordered as specified by FEED_EXPORT_FIELDS.
