I have made a simple Scrapy spider that I use from the command line to export my data into the CSV format, but the order of the data seems random. How can I order the CSV fields in my output?
I use the following command line to get CSV data:
scrapy crawl somewhere -o items.csv -t csv
According to this Scrapy documentation, I should be able to use the fields_to_export attribute of the BaseItemExporter class to control the order. But I am clueless about how to use this, as I have not found any simple example to follow.
Please Note: This question is very similar to THIS one. However, that question is over 2 years old, doesn't address the many recent changes to Scrapy, and doesn't provide a satisfactory answer, as it requires hacking one or both of:
- contrib/exporter/__init__.py
- contrib/feedexport.py
to address some previous issues that seem to have already been resolved...
Many thanks in advance.
2 Answers
#1
23
To use such an exporter, you need to create your own Item pipeline that will process your spider output. Assuming that you have a simple case and you want to have all spider output in one file, this is the pipeline you should use (pipelines.py):
from scrapy import signals
from scrapy.contrib.exporter import CsvItemExporter  # in Scrapy 1.0+ this lives at scrapy.exporters

class CSVPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # one CSV file per spider, opened in binary mode for the exporter
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        # list the names of the fields to export - the order given here is the column order
        self.exporter.fields_to_export = ['field_one', 'field_two', 'field_three']
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Of course, you need to remember to add this pipeline to your configuration file (settings.py):
ITEM_PIPELINES = {'myproject.pipelines.CSVPipeline': 300 }
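For reference, here is a minimal sketch of a matching item definition; the class name and field names are hypothetical placeholders, and the names you put in fields_to_export must match the fields declared on your item:

# items.py - hypothetical item whose fields match fields_to_export above
import scrapy

class MyItem(scrapy.Item):
    field_one = scrapy.Field()
    field_two = scrapy.Field()
    field_three = scrapy.Field()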
#2
7
You can now specify settings in the spider itself. https://doc.scrapy.org/en/latest/topics/settings.html#settings-per-spider
To set the field order for exported feeds, set FEED_EXPORT_FIELDS. https://doc.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields
The spider below dumps all links on a website (written against Scrapy 1.4.0):
import scrapy
from scrapy.http import HtmlResponse

class DumplinksSpider(scrapy.Spider):
    name = 'dumplinks'
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com/']
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["page", "page_ix", "text", "url"],
    }

    def parse(self, response):
        if not isinstance(response, HtmlResponse):
            return
        a_selectors = response.xpath('//a')
        for i, a_selector in enumerate(a_selectors):
            text = a_selector.xpath('normalize-space(text())').extract_first()
            url = a_selector.xpath('@href').extract_first()
            yield {
                'page_ix': i + 1,
                'page': response.url,
                'text': text,
                'url': url,
            }
            yield response.follow(url, callback=self.parse)  # see allowed_domains
Run with this command:
scrapy crawl dumplinks --loglevel=INFO -o links.csv
Fields in links.csv are ordered as specified by FEED_EXPORT_FIELDS.
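If you want the same column order for every spider in the project rather than per spider, FEED_EXPORT_FIELDS can also go in settings.py instead of custom_settings; a minimal sketch, assuming the same field names as the example above:

# settings.py - project-wide export field order (field names taken from the example above)
FEED_EXPORT_FIELDS = ["page", "page_ix", "text", "url"]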