Hi, I am very new to Python and Scrapy. This is my first spider and I can't solve a problem that looks pretty basic.

I have set the crawler up to do two things:
1- Find all pagination URLs, visit them, and get some data from each page.
2- Get all links listed on the results pages, visit them, and crawl each location's data.

I decide which parser handles each item through rules with callbacks, and I created two classes inside items.py, one for each parser.

The second rule is processed perfectly, but the first one is not being processed and I can't find where the error is.

This is the error message I get in the terminal while running the crawler:
2014-11-24 02:30:39-0200 [apontador] ERROR: Error processing {'city': u'BR-SP-S\xe3o Paulo',
'coordinates': {'lat': u'-23.56588', 'lng': u'-46.64777'},
'current_url': 'http://www.apontador.com.br/local/search.html?q=supermercado&loc_z=S%C3%A3o+Paulo%2C+SP&loc=S%C3%A3o+Paulo%2C+SP&loc_y=S%C3%A3o+Paulo%2C+SP',
'datetime': datetime.datetime(2014, 11, 24, 2, 30, 39, 703972),
'depth': 0,
'domain': 'apontador.com.br',
'link_cat': 'ls',
'loc_cat': u'supermercado',
'session_id': -1,
'site_name': u'Apontador',
'state': u'BR-SP'}
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 62, in _process_chain
return process_chain(self.methods[methodname], obj, *args)
File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 65, in process_chain
d.callback(input)
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 382, in callback
self._startRunCallbacks(result)
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/locman/scrapy/locman/pipelines.py", line 37, in process_item
'neighborhood': item['neighborhood'],
File "/usr/local/lib/python2.7/dist-packages/scrapy/item.py", line 50, in __getitem__
return self._values[key]
exceptions.KeyError: 'neighborhood'
Looking at the error message, it seems clear that Scrapy is trying to process all the fields defined in items.py, instead of respecting the item class selected by each callback.

If you look at the items.py file, there are two classes: 1- apontadorlsItem, 2- apontadordsItem.

The class apontadordsItem has the key 'neighborhood', but the class apontadorlsItem does not. I created these two classes to support two different callback parser functions, one for each XPath rule, because two types of pages are being crawled and each carries a different set of information. The rules are working fine, as I can see in the log files, and the crawler is working; the problem is in processing/saving the items.

How can I tell the pipeline to use a different item-matching rule depending on which items.py class the crawler used?

Please help, I'm stuck.
Spider file - spiders/apontador.py
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from datetime import datetime
from tld import get_tld
from locman.items import apontadorlsItem
from locman.items import apontadordsItem


class apontador(CrawlSpider):
    name = 'apontador'
    session_id = -1
    start_urls = ["http://www.apontador.com.br/local/search.html?q=supermercado&loc_z=S%C3%A3o+Paulo%2C+SP&loc=S%C3%A3o+Paulo%2C+SP&loc_y=S%C3%A3o+Paulo%2C+SP"]

    rules = (
        # Rule for LS - Link Source - search results pages
        Rule(SgmlLinkExtractor(allow=("", ), restrict_xpaths=("//nav[@class='pagination']")),
             callback='parse_items_ls', follow=True),
        # Rule for DS - Data Source - location data pages
        Rule(SgmlLinkExtractor(allow=("", ), restrict_xpaths=(
            "//article[@class='poi card highlight']",
            "//li[@class='similar-place sponsored']",
            "//div[@class='recomendations']",
            "//ul[@class='similar-places-list']",
            "//article[@class='poi card']")),
             callback='parse_items_ds',
             follow=True),
    )

    def __init__(self, session_id=-1, *args, **kwargs):
        super(apontador, self).__init__(*args, **kwargs)
        self.session_id = session_id

    def parse_start_url(self, response):
        self.response_url = response.url
        return self.parse_items_ls(response)

    # Callback for item type LS
    def parse_items_ls(self, response):
        self.response_url = response.url
        sel = Selector(response)
        item_ls = apontadorlsItem()
        item_ls["session_id"] = self.session_id
        item_ls["depth"] = response.meta["depth"]
        item_ls["current_url"] = response.url
        # Get the site name from the metadata
        meta_site = sel.xpath("//meta[@property='og:site_name']/@content").extract()
        item_ls["site_name"] = u''.join(meta_site)
        # Get latitude and longitude from the metadata
        meta_latitude = sel.xpath("//meta[@name='apontador:latitude']/@content").extract()
        latitude = ''.join(meta_latitude)
        meta_longitude = sel.xpath("//meta[@name='apontador:longitude']/@content").extract()
        longitude = ''.join(meta_longitude)
        # Store the coordinates as a dict
        coordinates = {"lng": longitude, "lat": latitude}
        item_ls["coordinates"] = coordinates
        # Get the keywords meta tag, split it into a list and read the fields from it
        meta_keywords_ls = sel.xpath("//meta[@name='keywords']/@content").extract()
        meta_keywords_ls_str = u''.join(meta_keywords_ls)
        meta_keywords_ls_list = meta_keywords_ls_str.split(", ")
        meta_state = meta_keywords_ls_list[6]
        meta_city = meta_keywords_ls_list[5]
        meta_loc_cat = meta_keywords_ls_list[4]
        item_ls["state"] = u"BR-" + meta_state
        item_ls["city"] = u"BR-" + meta_state + "-" + meta_city
        item_ls["loc_cat"] = meta_loc_cat
        # Get the domain name using the tld module
        domain = get_tld(response.url)
        item_ls["domain"] = domain
        # Current datetime
        item_ls["datetime"] = datetime.now()
        # Link category
        item_ls["link_cat"] = "ls"
        yield item_ls

    # Callback for item type DS
    def parse_items_ds(self, response):
        self.response_url = response.url
        sel = Selector(response)
        item_ds = apontadordsItem()
        item_ds["session_id"] = self.session_id
        item_ds["depth"] = response.meta["depth"]
        item_ds["current_url"] = response.url
        # Get the site name from the metadata
        meta_site = sel.xpath("//meta[@property='og:site_name']/@content").extract()
        item_ds["site_name"] = u''.join(meta_site)
        # Get the location name from the metadata
        meta_loc_name = sel.xpath("//meta[@property='og:title']/@content").extract()
        item_ds["loc_name"] = u''.join(meta_loc_name)
        # Get the location source id from the metadata
        meta_loc_source_id = sel.xpath("//meta[@name='apontador:place-id']/@content").extract()
        item_ds["loc_source_id"] = ''.join(meta_loc_source_id)
        # Get the location street address from the metadata
        meta_loc_address = sel.xpath("//meta[@property='business:contact_data:street_address']/@content").extract()
        meta_loc_address_str = u''.join(meta_loc_address)
        meta_loc_address_list = meta_loc_address_str.split(", ")
        meta_loc_address_number = meta_loc_address_list[1]
        meta_loc_address_street = meta_loc_address_list[0]
        item_ds["loc_street"] = meta_loc_address_street
        item_ds["loc_number"] = meta_loc_address_number
        # Get latitude and longitude from the metadata
        meta_latitude = sel.xpath("//meta[@property='place:location:latitude']/@content").extract()
        latitude = ''.join(meta_latitude)
        meta_longitude = sel.xpath("//meta[@property='place:location:longitude']/@content").extract()
        longitude = ''.join(meta_longitude)
        coordinates = {"lng": longitude, "lat": latitude}
        item_ds["coordinates"] = coordinates
        # Get neighborhood, loc_cat and loc_cat_sub from the keywords meta tag,
        # split it into a list and populate the fields from it
        meta_keywords_ds = sel.xpath("//meta[@name='keywords']/@content").extract()
        meta_keywords_ds_str = u''.join(meta_keywords_ds)
        meta_keywords_ds_list = meta_keywords_ds_str.split(", ")
        meta_loc_cat = meta_keywords_ds_list[9]
        meta_loc_cat_sub = meta_keywords_ds_list[8]
        meta_neighborhood = meta_keywords_ds_list[5]
        item_ds["loc_cat"] = meta_loc_cat
        item_ds["loc_cat_sub"] = meta_loc_cat_sub
        item_ds["neighborhood"] = meta_neighborhood
        # Region information
        meta_statec = sel.xpath("//meta[@property='business:contact_data:region']/@content").extract()
        meta_state = u''.join(meta_statec)
        item_ds["state"] = u"BR-" + meta_state
        meta_cityc = sel.xpath("//meta[@property='business:contact_data:locality']/@content").extract()
        meta_city = u''.join(meta_cityc)
        item_ds["city"] = u"BR-" + meta_state + "-" + meta_city
        meta_postal_code = sel.xpath("//meta[@property='business:contact_data:postal_code']/@content").extract()
        item_ds["loc_postal_code"] = ''.join(meta_postal_code)
        # Get the domain name using the tld module
        domain = get_tld(response.url)
        item_ds["domain"] = domain
        # Current datetime
        item_ds["datetime"] = datetime.now()
        item_ds["link_cat"] = "ds"
        yield item_ds
Items file - items.py
from scrapy.item import Item, Field


class apontadorlsItem(Item):
    datetime = Field()
    session_id = Field()
    depth = Field()
    link_cat = Field()
    site_name = Field()
    domain = Field()
    current_url = Field()
    city = Field()
    state = Field()
    loc_cat = Field()
    coordinates = Field()


class apontadordsItem(Item):
    datetime = Field()
    session_id = Field()
    depth = Field()
    link_cat = Field()
    site_name = Field()
    domain = Field()
    current_url = Field()
    state = Field()
    city = Field()
    neighborhood = Field()
    loc_name = Field()
    loc_street = Field()
    loc_number = Field()
    loc_postal_code = Field()
    loc_source_id = Field()
    loc_cat = Field()
    loc_cat_sub = Field()
    coordinates = Field()
Pipelines file - pipelines.py
from scrapy.exceptions import DropItem
from scrapy_mongodb import MongoDBPipeline


class apontadorpipe(MongoDBPipeline):

    def process_item(self, item, spider):
        if self.config['buffer']:
            self.current_item += 1
            item = dict(item)
            self.item_buffer.append(item)
            if self.current_item == self.config['buffer']:
                self.current_item = 0
                return self.insert_item(self.item_buffer, spider)
            else:
                return item

        matching_item = self.collection.find_one(
            {'datetime': item['datetime'],
             'session_id': item['session_id'],
             'depth': item['depth'],
             'link_cat': item['link_cat'],
             'site_name': item['site_name'],
             'domain': item['domain'],
             'current_url': item['current_url'],
             'state': item['state'],
             'city': item['city'],
             'neighborhood': item['neighborhood'],
             'loc_name': item['loc_name'],
             'loc_street': item['loc_street'],
             'loc_number': item['loc_number'],
             'loc_postal_code': item['loc_postal_code'],
             'loc_cat': item['loc_cat'],
             'loc_cat_sub': item['loc_cat_sub'],
             'loc_source_id': item['loc_source_id'],
             'coordinates': item['coordinates']}
        )
        if matching_item is not None:
            raise DropItem(
                "Duplicate found for %s" % item['current_url']
            )
        else:
            return self.insert_item(item, spider)
Settings file - settings.py
BOT_NAME = 'locman'
SPIDER_MODULES = 'locman.spiders'
NEWSPIDER_MODULE = 'locman.spiders'
DEPTH_LIMIT = 10000
DEFAULT_ITEM_CLASS = 'locman.items.apontador'
ITEM_PIPELINES = {
    'locman.pipelines.apontadorpipe': 100
}
# 'scrapy_mongodb.MongoDBPipeline' connection
MONGODB_URI = 'connection string'
MONGODB_DATABASE = ''
MONGODB_COLLECTION = ''
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'locman.ua.rotate_useragent.RotateUserAgentMiddleware': 400
}
2 Answers
#1
It looks like that item does not have the key "neighborhood". Make sure of the following:

- you have not misspelled "neighborhood"
- "neighborhood" is defined in the item class
- item['neighborhood'] is initialized in the spider

Also make sure the item has the key "neighborhood" before it is read in File "/locman/scrapy/locman/pipelines.py", line 37, in process_item:
if item.get('neighborhood', None):
This returns None if the item has no key "neighborhood"; you can also supply a default value instead of None, like this:
if item.get('neighborhood', 'default_value')
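For example, a minimal sketch of the duplicate check built only from the keys an item actually carries (assuming the same scrapy_mongodb MongoDBPipeline base used in the question, buffer handling omitted), so an apontadorlsItem without 'neighborhood' no longer raises a KeyError:

from scrapy.exceptions import DropItem
from scrapy_mongodb import MongoDBPipeline


class apontadorpipe(MongoDBPipeline):

    def process_item(self, item, spider):
        # dict(item) contains only the fields this particular item defines,
        # so the query never references a missing key.
        query = dict(item)
        matching_item = self.collection.find_one(query)
        if matching_item is not None:
            raise DropItem("Duplicate found for %s" % item['current_url'])
        return self.insert_item(item, spider)

This way both item classes can share one pipeline, because the Mongo query is built from whatever fields the item happens to have.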
#2
Thanks a lot for the help! I found a nice workaround for my problem and it is exactly what I needed!

In pipelines.py I imported the two classes from items.py and defined two different functions, with a separate dict for each. This way I can have different duplicate-record handling and a different write process to the database for each item class!

The new code for pipelines.py:
from scrapy.exceptions import DropItem
from scrapy_mongodb import MongoDBPipeline
from locman.items import apontadorlsItem
from locman.items import apontadordsItem


class apontadorpipe(MongoDBPipeline):

    def process_item_ds(self, item, spider):
        if self.config['buffer']:
            self.current_item += 1
            item = dict(item)
            self.item_buffer.append(item)
            if self.current_item == self.config['buffer']:
                self.current_item = 0
                return self.insert_item(self.item_buffer, spider)
            else:
                return item

        if isinstance(item, apontadordsItem):
            matching_item = self.collection.find_one(
                {'datetime': item['datetime'],
                 'session_id': item['session_id'],
                 'link_cat': item['link_cat'],
                 'site_name': item['site_name'].encode('utf-8'),
                 'domain': item['domain'],
                 'current_url': item['current_url'],
                 'state': item['state'],
                 'city': item['city'].encode('utf-8'),
                 'neighborhood': item['neighborhood'].encode('utf-8'),
                 'loc_name': item['loc_name'].encode('utf-8'),
                 'loc_street': item['loc_street'].encode('utf-8'),
                 'loc_number': item['loc_number'],
                 'loc_postal_code': item['loc_postal_code'],
                 'loc_cat': item['loc_cat'],
                 'loc_cat_sub': item['loc_cat_sub'],
                 'loc_source_id': item['loc_source_id'],
                 'loc_phone': item['loc_phone'],
                 'address': item['address'].encode('utf-8'),
                 'coordinates': item['coordinates']}
            )
            if matching_item is not None:
                raise DropItem(
                    "Duplicate found for %s, %s" %
                    (item['current_url'], item['loc_source_id'])
                )
            else:
                return self.insert_item(item, spider)

    def process_item_ls(self, item, spider):
        if self.config['buffer']:
            self.current_item += 1
            item = dict(item)
            self.item_buffer.append(item)
            if self.current_item == self.config['buffer']:
                self.current_item = 0
                return self.insert_item(self.item_buffer, spider)
            else:
                return item

        if isinstance(item, apontadorlsItem):
            matching_item = self.collection.find_one(
                {'datetime': item['datetime'],
                 'session_id': item['session_id'],
                 'link_cat': item['link_cat'],
                 'site_name': item['site_name'].encode('utf-8'),
                 'domain': item['domain'],
                 'current_url': item['current_url'],
                 'state': item['state'],
                 'city': item['city'].encode('utf-8'),
                 'loc_cat': item['loc_cat'].encode('utf-8'),
                 'coordinates': item['coordinates']}
            )
            if matching_item is not None:
                raise DropItem(
                    "Duplicate found for %s" % item['current_url']
                )
            else:
                return self.insert_item(item, spider)
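One thing worth noting: Scrapy only ever calls a pipeline's process_item method, so the two handlers above still need to be reached from it. A minimal sketch of that dispatch, reusing the names from this answer (everything else stays as written):

from scrapy_mongodb import MongoDBPipeline
from locman.items import apontadorlsItem, apontadordsItem


class apontadorpipe(MongoDBPipeline):

    def process_item(self, item, spider):
        # Scrapy invokes process_item for every item the spiders yield;
        # route each one to the handler matching its item class.
        if isinstance(item, apontadordsItem):
            return self.process_item_ds(item, spider)
        if isinstance(item, apontadorlsItem):
            return self.process_item_ls(item, spider)
        return item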