First, without further ado, here is the Scrapy item pipeline.
Previously I stored scraped data in a database or as JSON, and had to convert it to Excel format whenever I wanted to use it, which was a hassle. So today I looked around, discovered the openpyxl library, and am writing up a short note on it.
```python
from openpyxl import Workbook

# Create a workbook; a default sheet is created along with it
wb = Workbook()

# Get the active sheet and name it test1
# (note: active returns a Worksheet object, not a list)
ws = wb.active
ws.title = 'test1'

# Insert a row of data
ws.append([...])

# Save the workbook; the file test1.xlsx lands in the current directory
wb.save('test1.xlsx')
```

For reference, `Workbook.active` is a property defined in openpyxl's source as:

```python
@property
def active(self):
    """Get the currently active sheet or None

    :type: :class:`openpyxl.worksheet.worksheet.Worksheet`
    """
    try:
        return self._sheets[self._active_sheet_index]
    except IndexError:
        pass
```
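To see that `append()` and `save()` behave as described, here is a minimal round-trip sketch: write one row, then read the file back with `load_workbook`. The file name `demo.xlsx` and the row values are just illustrative.

```python
from openpyxl import Workbook, load_workbook

# Write a workbook with a single row
wb = Workbook()
ws = wb.active                      # a Worksheet object, not a list
ws.title = 'test1'
ws.append(['2018-01-01', 'Sample title', 'http://example.com', 'author'])
wb.save('demo.xlsx')

# Read it back and collect the first row's cell values
row = [cell.value for cell in load_workbook('demo.xlsx')['test1'][1]]
print(row)
```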
Project code:
```python
import openpyxl


class CdcspiderExcelPipeline(object):
    '''
    Save scraped items to an Excel file with openpyxl
    '''

    def __init__(self):
        '''
        initialize the object
        '''
        self.spider = None
        self.count = 0

    def log(self, l):
        '''
        log with a pipeline prefix
        '''
        msg = '========== CdcspiderExcelPipeline == %s' % l
        if self.spider is not None:
            # spider.logger -> returns logging.LoggerAdapter(logger, {'spider': self})
            self.spider.logger.info(msg)

    def open_spider(self, spider):
        '''
        create the workbook and write the header row
        '''
        self.spider = spider
        self.wb = openpyxl.Workbook()
        self.ws = self.wb.active
        # header: article date, article title, url, article author
        self.ws.append(['文章日期', '文章标题', 'url', '文章作者'])

    def process_item(self, item, spider):
        '''
        append each item as one row
        '''
        self.count += 1
        self.log('process %s, %s:' % (spider.name, self.count))
        line = [item['article_time'], item['title'], item['url'], item['author']]
        self.ws.append(line)
        return item

    def close_spider(self, spider):
        '''
        save the rows to the excel file
        '''
        print('ExcelPipeline info: items size: %s' % self.count)
        # _generate_filename is a helper defined elsewhere in the project
        file_name = _generate_filename(spider, file_format='xlsx')
        self.wb.save(file_name)
```
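For Scrapy to actually invoke this pipeline, it has to be registered in the project's `settings.py`. The module path `cdcspider.pipelines` below is an assumption inferred from the class name; adjust it to your own project layout.

```python
# settings.py -- hypothetical module path, adjust to your project
ITEM_PIPELINES = {
    'cdcspider.pipelines.CdcspiderExcelPipeline': 300,
}
```

The integer value (here 300) is the pipeline's order; lower numbers run earlier when several pipelines are enabled.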
The result is as follows:
In addition, Scrapy itself provides Item Exporters for persisting or exporting items, but I personally found them less convenient than this third-party library; of course, that may just reflect my own limited skill.