[Python Notes] Saving Scrapy items to Excel from a pipeline with openpyxl

Date: 2022-03-18 14:27:59

Without further ado, let's start with the Scrapy item pipeline.

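Until now I stored scraped data in a database or in JSON and had to convert it to Excel whenever I needed it, which was a hassle. Today I looked around, found the openpyxl library, and am jotting down some notes here.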

from openpyxl import Workbook
# Create a workbook; one sheet is created along with it
wb = Workbook()
# Get the sheet that was created; it is named test1 below
ws = wb.active
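(Note: active returns the currently active Worksheet object, not a list. It is looked up by index in the workbook's internal list of sheets, as the openpyxl source shows:)
    @property
    def active(self):
        """Get the currently active sheet or None

        :type: :class:`openpyxl.worksheet.worksheet.Worksheet`
        """
        try:
            return self._sheets[self._active_sheet_index]
        except IndexError:
            pass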
ws.title = 'test1'
# Append a row of data
ws.append([...])
# Save the workbook as test1.xlsx in the current directory
wb.save('test1.xlsx')
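To check that the steps above really produce a readable file, here is a minimal round-trip sketch: it writes a header and a few rows, then reads them back with load_workbook. The function names write_rows and read_rows, the temporary path, and the sample data are all made up for illustration; only Workbook, active, append, save, load_workbook and iter_rows come from openpyxl.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook


def write_rows(path, header, rows):
    """Write a header row and data rows to a new .xlsx file at `path`."""
    wb = Workbook()      # a new workbook starts with one sheet
    ws = wb.active       # the active Worksheet object
    ws.title = 'test1'
    ws.append(header)    # each append() adds one row at the bottom
    for row in rows:
        ws.append(row)
    wb.save(path)


def read_rows(path):
    """Read every row of the 'test1' sheet back as a list of lists."""
    wb = load_workbook(path)
    ws = wb['test1']
    return [list(r) for r in ws.iter_rows(values_only=True)]


# Sample data, made up for this sketch
path = os.path.join(tempfile.mkdtemp(), 'test1.xlsx')
write_rows(path, ['date', 'title'],
           [['2022-03-18', 'hello'], ['2022-03-19', 'world']])
rows = read_rows(path)
```

Reading the file back is only there to confirm the write; in the pipeline itself, saving is enough.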

Project code

import openpyxl


class CdcspiderExcelPipeline(object):
    '''
    save the items to an Excel file with openpyxl
    '''

    def __init__(self):
        '''
        initialize the object
        '''
        self.spider = None
        self.count = 0

    def log(self, l):
        '''
        write a message to the spider's log
        '''
        msg = '========== CdcspiderExcelPipeline == %s' % l
        if self.spider is not None:
            # spider.logger -> return logging.LoggerAdapter(logger, {'spider': self})
            self.spider.logger.info(msg)

    def open_spider(self, spider):
        '''
        create the workbook and write the header row
        '''
        # keep a reference to the spider so that log() has a logger to use
        self.spider = spider
        self.wb = openpyxl.Workbook()
        self.ws = self.wb.active
        # header row: article date, article title, url, article author
        self.ws.append(['文章日期', '文章标题', 'url', '文章作者'])

    def process_item(self, item, spider):
        '''
        append every item as one row
        '''
        self.count += 1
        self.log('process %s, %s:' % (spider.name, self.count))

        line = [item['article_time'], item['title'], item['url'], item['author']]
        self.ws.append(line)
        return item

    def close_spider(self, spider):
        '''
        save the rows to an Excel file
        '''
        print('ExcelPipeline info: items size: %s' % self.count)
        # _generate_filename is not shown in this post; presumably a helper from the project
        file_name = _generate_filename(spider, file_format='xlsx')
        self.wb.save(file_name)

The result is as follows:
[Image: the resulting Excel file]
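Scrapy also provides item exporters for persisting or exporting items, but in my own experience they are less convenient than a third-party library. Of course, that may just be down to my limited skill, haha.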