Web Scraping That Requires User Interaction

Time: 2021-10-03 12:59:34

I'm trying to scrape a site, https://ibotta.com/rebates, that requires you to scroll down and, when you hit the bottom, loads more items. It's a finite number of items, so I know it won't scroll forever, but is there any method of doing this without having to interact with a browser object?

I'm trying to accomplish this in VB / VBA, but any language would do. Right now I've prototyped it in MS Access just to get a feel for how the site reacts; I can do it with the browser control loaded, but it's clunky. Preferably I'd like something I can just make an HTTP call to.

On a side note, are there any good web scraping tutorials out there that I should be looking at?

1 solution

#1

At first sight, the XHRs I examined in Chrome (Developer tools - Network tab) show that all the necessary data is located in 2 files: retailers.json (15.7 kB) and offers.json (299 kB). While you are scrolling down the page, no additional data is actually downloaded, so I concluded that the scripts on the page simply fetch data from those already downloaded files and add items to the page. I checked the parameters and headers of the XHRs and created the simple VBS below, which downloads the files:

strZipCode = "11590" ' your zip code here
strPathRetailers = "C:\retailers.json" ' retailers output file path
strPathOffers = "C:\offers.json" ' offers output file path

' make XHR to retrieve initial page with X-App-Token and X-NewRelic-ID
strURL = "https://ibotta.com/rebates"
XmlHttpRequest "GET", strURL, "", "", "", strResp

' extract X-NewRelic-ID eg 'loader_config={xpid:"VQAHUlVUGwcJUlBWBQg="}'
arrTmp = Split(strResp, "loader_config={xpid:""", 2)
strTmp = arrTmp(1)
arrTmp = Split(strTmp, """}", 2)
strNewRelicID = arrTmp(0)

' extract X-App-Token eg '<meta name="ibotta-t" content="nce0dc967myuho7wco:1458857196:91bf12dcd5442cf6b2100c962c656a510738150a">'
arrTmp = Split(strResp, "<meta name=""ibotta-t"" content=""", 2)
strTmp = arrTmp(1)
arrTmp = Split(strTmp, """>", 2)
strAppToken = arrTmp(0)

' put headers to array
arrHeaders = Array( _
    Array("Accept", "application/json, text/javascript"), _
    Array("Accept-Encoding", "deflate"), _
    Array("Accept-Language", "en-US,en;q=0.5"), _
    Array("Connection", "keep-alive"), _
    Array("Host", "ibotta.com"), _
    Array("If-Modified-Since", "Thu, 1 Jan 1970 10:00:00 GMT"), _
    Array("Referer", "https://ibotta.com/rebates"), _
    Array("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101 Firefox/38.0"), _
    Array("X-App-Token", strAppToken), _
    Array("X-App-Version", "3.6:webapp"), _
    Array("X-NewRelic-ID", strNewRelicID), _
    Array("X-Requested-With", "XMLHttpRequest") _
)

' make XHR to retrieve retailers
strURL = "https://ibotta.com/web_v1/retailers.json?zip=" & strZipCode
XmlHttpRequest "GET", strURL, arrHeaders, "", "", strResp
' save retailers to file
WriteTextFile strResp, strPathRetailers, -1

' make XHR to retrieve offers
strURL = "https://ibotta.com/web_v1/offers.json"
XmlHttpRequest "GET", strURL, arrHeaders, "", "", strResp
' save offers to file
WriteTextFile strResp, strPathOffers, -1

Sub XmlHttpRequest(strMethod, strURL, arrSetHeaders, strFormData, strRespHeaders, strRespText)
    Dim arrHeader
    With CreateObject("Msxml2.ServerXMLHTTP")
        .SetOption 2, 13056 ' SXH_SERVER_CERT_IGNORE_ALL_SERVER_ERRORS
        .Open strMethod, strURL, False
        If IsArray(arrSetHeaders) Then
            For Each arrHeader In arrSetHeaders
                .SetRequestHeader arrHeader(0), arrHeader(1)
            Next
        End If
        .Send strFormData
        strRespHeaders = .GetAllResponseHeaders
        strRespText = .ResponseText
    End With
End Sub

Sub WriteTextFile(strContent, strPath, lngFormat)
    ' lngFormat -2 - System default, -1 - Unicode, 0 - ASCII
    With CreateObject("Scripting.FileSystemObject").OpenTextFile(strPath, 2, True, lngFormat)
        .Write (strContent)
        .Close
    End With
End Sub

You can save this code to a text file with a .vbs extension and run it.

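For example, assuming the script were saved as ibotta_scrape.vbs (a hypothetical file name), it could be run from a command prompt with the console script host:

cscript //nologo ibotta_scrape.vbs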

At the moment I can see 857 offers in total and 220 retailers for zip code 11590 (checked with JSON viewers, like the one built into Chrome, or via a web service). If you want to process only the offers for zip code 11590, you have to get the list of retailer ids and keep only the offers that belong to retailers from that list (a sketch of this filtering is shown after the screenshots below).

Here is a screenshot of the retailers; each of them has an id (outlined in red):

[screenshot: retailers.json entries with the id field outlined]

And here is a screenshot of the offers; each of them belongs to several retailers listed in retailer_ids (also outlined in red):

[screenshot: offers.json entries with the retailer_ids field outlined]

Further processing depends on what you need. You can parse the JSON string into an object and work with it, or convert the JSON string to a Recordset and filter it.

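As a rough illustration of the first approach, here is a minimal sketch (not part of the original answer) that parses the two downloaded files and counts the offers belonging to the zip code's retailers. It assumes retailers.json is a top-level JSON array of objects with an id field and offers.json is a top-level array of objects with a retailer_ids array (field names taken from the screenshots; the real nesting may differ), and it relies on the 32-bit MSScriptControl.ScriptControl, so on 64-bit Windows it has to be run with the 32-bit script host (%windir%\SysWOW64\cscript.exe).

' minimal sketch: filter offers by the retailers returned for the zip code
' assumptions: retailers.json / offers.json are top-level JSON arrays (adjust if nested),
' items expose "id" / "retailer_ids" as shown in the screenshots,
' and MSScriptControl.ScriptControl (32-bit only) is available

strRetailers = ReadTextFile("C:\retailers.json", -1) ' same paths / Unicode format as written above
strOffers = ReadTextFile("C:\offers.json", -1)

' use the JScript engine to parse the JSON and to walk JScript arrays from VBScript
Set objSC = CreateObject("MSScriptControl.ScriptControl")
objSC.Language = "JScript"
objSC.AddCode "function parse(s){return eval('(' + s + ')');}"
objSC.AddCode "function count(a){return a.length;}"
objSC.AddCode "function item(a, i){return a[i];}"

Set arrRetailers = objSC.Run("parse", strRetailers)
Set arrOffers = objSC.Run("parse", strOffers)

' collect the retailer ids into a dictionary for fast lookup
Set dicIds = CreateObject("Scripting.Dictionary")
For i = 0 To objSC.Run("count", arrRetailers) - 1
    dicIds(CStr(objSC.Run("item", arrRetailers, i).id)) = True
Next

' count the offers that reference at least one retailer from the zip code list
lngMatches = 0
For i = 0 To objSC.Run("count", arrOffers) - 1
    Set arrIds = objSC.Run("item", arrOffers, i).retailer_ids
    For j = 0 To objSC.Run("count", arrIds) - 1
        If dicIds.Exists(CStr(objSC.Run("item", arrIds, j))) Then
            lngMatches = lngMatches + 1
            Exit For
        End If
    Next
Next
WScript.Echo lngMatches & " offers match the retailers for this zip code"

Function ReadTextFile(strPath, lngFormat)
    ' lngFormat -2 - System default, -1 - Unicode, 0 - ASCII
    With CreateObject("Scripting.FileSystemObject").OpenTextFile(strPath, 1, False, lngFormat)
        ReadTextFile = .ReadAll
        .Close
    End With
End Function

From there you could collect whatever fields you need from each matching offer instead of just counting them, or build a Recordset from them for filtering in Access.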
