BACKGROUND
I'm very new to Scrapy and web scraping in general. I'm attempting to access a target web page, fill out the form there, submit it, and scrape data from the returned page into items. After completing those steps, I want to go back to the target page, fill out the form with different information, scrape the new data that is returned, and append this data to those same items.
WHAT I HAVE
The following code fills out the target form, scrapes the returned page for info, and places that info into items.
import scrapy
from AirScraper.items import AirscraperItem

class airSpider(scrapy.Spider):
    name = "airSpider"
    start_urls = ["https://book.jetblue.com"]
    origin = "MCO"
    dest = "BOS"
    dateDep = "2015-05-13"
    dateRet = "2015-05-15"

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formname="searchForm",
            formdata={'origin': self.origin,
                      'destination': self.dest,
                      'departureDate': self.dateDep,
                      'returnDate': self.dateRet},
            callback=self.after_search
        )

    def after_search(self, response):
        flights = response.xpath('//*[contains(@class, "flight-row no-mint")]')
        for sel in flights:
            item = AirscraperItem()
            # scrape data about the target flight into item
            yield item
WHAT I NEED
Once I've scraped data from the first form request, I need to then return to the original form page, fill it out with similar data, and then scrape its results as well. I'm just unsure how to go about telling the spider to return to that first page and perform a different set of actions.
1 Answer
#1
As it turns out, this is actually really simple.
In the parse method, simply replace the single return with the following code:
def parse(self, response):
    # First search, using the original parameters
    yield scrapy.FormRequest.from_response(
        response,
        formname="searchForm",
        formdata={'origin': self.origin,
                  'destination': self.dest,
                  'departureDate': self.dateDep,
                  'returnDate': self.dateRet},
        callback=self.after_search
    )
    # Second search, using the new set of parameters
    yield scrapy.FormRequest.from_response(
        response,
        formname="searchForm",
        formdata={'origin': self.NEWorigin,
                  'destination': self.NEWdest,
                  'departureDate': self.NEWdateDep,
                  'returnDate': self.NEWdateRet},
        callback=self.after_search_2
    )
This makes the spider perform both the first and the second search, using whatever new information you've defined for the second request.
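If you specifically need the second search's data appended to the same items (rather than yielded as separate items), one common pattern is to chain the requests instead of issuing both from parse: attach the partially filled items to the second FormRequest via its meta dict, then merge inside the second callback. A minimal sketch, where merge_flight_data is a hypothetical helper (not part of Scrapy) and the field names are made up for illustration:

```python
# In Scrapy, the second request would carry the partial items along, e.g.:
#
#     yield scrapy.FormRequest.from_response(
#         response,
#         formname="searchForm",
#         formdata={...second search data...},
#         meta={'items': partial_items},
#         callback=self.after_search_2,
#     )
#
# and after_search_2 would read them back via response.meta['items'].
# The merge step itself is plain Python:

def merge_flight_data(outbound, inbound):
    """Combine one record's fields from two searches into a single record."""
    merged = dict(outbound)
    # Prefix the second search's fields so they don't clobber the first's.
    merged.update({'return_' + key: value for key, value in inbound.items()})
    return merged

first = {'flight': 'B6 1234', 'price': '119.00'}
second = {'flight': 'B6 5678', 'price': '104.00'}
print(merge_flight_data(first, second))
```

Whether to prefer this over the two parallel requests above depends on your pipeline: parallel requests are simpler, but chaining is the usual way to build one item from multiple pages.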