I want to build a general scraper that can crawl and scrape all data from any type of website, including AJAX websites. I have searched the internet extensively but could not find any proper link that explains how Scrapy and Splash together can scrape AJAX websites (including pagination, form data, and clicking a button before the page is displayed). Every link I have found says that JavaScript websites can be rendered using Splash, but there's no good tutorial/explanation of how to use Splash to render JS websites. Please don't give me solutions that involve driving a full browser (I want to do everything programmatically; headless browser suggestions are welcome, but I want to use Splash).
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FlipSpider(CrawlSpider):
    name = "flip"
    allowed_domains = ["www.amazon.com"]
    start_urls = ['https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=mobile']

    rules = (Rule(LinkExtractor(), callback='lol', follow=True),)

    def parse_start_url(self, response):
        # Re-issue the start URL through Splash's render.html endpoint
        yield scrapy.Request(response.url, self.lol,
                             meta={'splash': {'endpoint': 'render.html', 'args': {'wait': 5, 'iframes': 1}}})

    def lol(self, response):
        """
        Some code
        """
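For the meta-based Splash request above to work, the scrapy-splash middleware has to be wired up in settings.py. A minimal sketch, taken from the scrapy-splash README; the SPLASH_URL assumes a Splash instance running locally (e.g. via Docker) on port 8050:

# settings.py -- minimal scrapy-splash wiring
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'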
2 Answers
#1
You can emulate behaviors, like a click or a scroll, by writing a JavaScript function and telling Splash to execute that script when it renders your page.
A little example:
You define a JavaScript function that selects an element in the page and then clicks on it:
(source: splash doc)
_script = """
-- Get button element dimensions with JavaScript and perform a mouse click.
function main(splash)
    assert(splash:go(splash.args.url))
    local get_dimensions = splash:jsfunc([[
        function () {
            var rect = document.getElementById('button').getClientRects()[0];
            return {"x": rect.left, "y": rect.top}
        }
    ]])
    splash:set_viewport_full()
    splash:wait(0.1)
    local dimensions = get_dimensions()
    splash:mouse_click(dimensions.x, dimensions.y)
    -- Wait a split second to allow the event to propagate.
    splash:wait(0.1)
    return splash:html()
end
"""
Then, when you make the request, you modify the endpoint, setting it to "execute", and you add "lua_source": _script to the args.
Example:
from scrapy_splash import SplashRequest

def parse(self, response):
    yield SplashRequest(response.url, self.parse_elem,
                        endpoint="execute",
                        args={"lua_source": _script})
You will find all the information about Splash scripting here.
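As an aside, and only a sketch: when pixel-accurate mouse events are not needed and the target element has a known id (an assumption here), splash:runjs can fire the click handler directly, which avoids computing coordinates:

# Sketch: trigger the element's click handler via runjs instead of
# simulating a mouse event. Assumes the target has id="button".
_click_script = """
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(1)
    splash:runjs("document.getElementById('button').click()")
    -- Give any AJAX triggered by the click time to render.
    splash:wait(1)
    return splash:html()
end
"""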
#2
The problem with Splash and pagination is the following:
I wasn't able to produce a Lua script that delivers a new webpage (after a click on the pagination link) in the format of a Response object, rather than pure HTML.
So, my solution is the following: click the link, extract the newly generated URL, and direct the crawler to that new URL.
So, on the page that has the pagination link, I execute
yield SplashRequest(url=response.url, callback=self.get_url, endpoint="execute", args={'lua_source': script})
with the following Lua script:
def parse_categories(self, response):
    script = """
    function main(splash)
        assert(splash:go(splash.args.url))
        splash:wait(1)
        -- Click the first pagination link, then return the resulting URL
        splash:runjs('document.querySelectorAll(".next-page")[0].click()')
        splash:wait(1)
        return splash:url()
    end
    """
and the get_url function:
def get_url(self, response):
    # The Lua script returned only the new URL, so the response body is that URL
    yield SplashRequest(url=response.body_as_unicode(), callback=self.parse_categories)
This way I was able to loop my queries.
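Put together, the loop looks roughly like this (a sketch: the ".next-page" selector comes from the snippet above, the item extraction is left as a placeholder, and response.text is the modern equivalent of body_as_unicode()):

from scrapy_splash import SplashRequest

# Lua script from above: click the pagination link, then return the new URL.
script = """
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(1)
    splash:runjs('document.querySelectorAll(".next-page")[0].click()')
    splash:wait(1)
    return splash:url()
end
"""

def parse_categories(self, response):
    # ... extract items from the current page here ...
    yield SplashRequest(url=response.url, callback=self.get_url,
                        endpoint="execute", args={'lua_source': script})

def get_url(self, response):
    # The script returned only the new URL, so the response body is that URL.
    yield SplashRequest(url=response.text, callback=self.parse_categories)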
In the same way, if you don't expect a new URL, your Lua script can just produce the pure HTML, which you then have to work through with regex (which is bad), but this is the best I was able to do.
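That said, regex is avoidable: if the script returns splash:html(), the returned HTML string can be wrapped in a Scrapy Selector so normal CSS/XPath extraction keeps working. A sketch (the h2 selector is just a placeholder):

from scrapy.selector import Selector

def parse_rendered_html(self, response):
    # response.text holds the HTML string the Lua script returned;
    # wrapping it in a Selector restores normal CSS/XPath extraction.
    sel = Selector(text=response.text)
    for title in sel.css('h2::text').getall():
        yield {'title': title}

(With scrapy-splash's default response handling you can often call response.css on the rendered response directly as well.)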