VBA Excel: extracting new web page data after clicking "Submit"

Date: 2022-11-19 21:47:58

I'm trying to pull some information from a website that provides oil well data by API number (the API number is a unique identifier for every well in the US).

Website: http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1

API example: 1708300502

The issue is that when I get to the second page, IE.document.getElementsByTagName("body")(0).innerText still returns data from the initial page. How do I fetch the updated page data?

The ultimate goal is to get to the second page, click on "30570" via IE.document.getElementsByTagName("a")(0).Click, and then read the final third page. I just cannot figure out how to read the updated page :(

Option Explicit

Sub sonris_WellData()
   Dim IE As InternetExplorer
   Set IE = CreateObject("InternetExplorer.Application")
   IE.Visible = True

   Dim i As Integer

   'Open SONRIS website
   Application.StatusBar = "Opening Website"
   IE.navigate "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1"
   Do While IE.readyState <> 4: DoEvents: Loop
   Application.Wait Now() + TimeValue("00:00:01")
   Application.StatusBar = False

   IE.document.forms(0).p_apinum.Value = "1708300502" 'plug-in API
   IE.document.forms(0).submit

   ' Wait until the next page opens
   Application.StatusBar = "Opening Website"
   Do While IE.readyState <> 4: DoEvents: Loop
   Application.Wait Now() + TimeValue("00:00:01")
   Application.StatusBar = False

   ' Read the page - this is where the issue occurs, MsgBox keeps returning text from the very 1st page
   MsgBox IE.document.getElementsByTagName("body")(0).innerText

   IE.Quit
End Sub

2 answers

#1


This seems to work. Rather than DoEvents, use the WinAPI Sleep function. I also added a call to Sleep after the form submit.

More and more often we see sites that are served dynamically by JavaScript and the like. In these cases the browser may report READYSTATE_COMPLETE, or not Busy, even though the page has not yet rendered the "new" results.

Option Explicit
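' 64-bit Office (VBA7) requires PtrSafe on API declarations
#If VBA7 Then
Public Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#Else
Public Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#End If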
Sub sonris_WellData()
   Dim IE As Object 'InternetExplorer
   Set IE = CreateObject("InternetExplorer.Application")
   IE.Visible = True

   Dim i As Integer

   'Open SONRIS website
   Application.StatusBar = "Opening Website"
   IE.navigate "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1"
   Do While IE.readyState <> 4
       Sleep 1000
   Loop

   Application.StatusBar = False

   IE.document.forms(0).p_apinum.Value = "1708300502" 'plug-in API
   IE.document.forms(0).submit

   Sleep 1000

   ' Wait until the next page opens
   Application.StatusBar = "Opening Website"
   Do While IE.readyState <> 4
    Sleep 1000
   Loop

   Application.StatusBar = False

   ' Read the page - with the extra Sleep after the submit, this should now show text from the 2nd page
   MsgBox IE.document.getElementsByTagName("body")(0).innerText

   IE.Quit
End Sub

You may want to experiment with a slightly longer Sleep after the .submit.

Alternatively, I noticed that the URL changes after you submit, so you could also try changing the second waiting loop to:

Do While IE.LocationURL = "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1"
    Sleep 1000
Loop

This should make Excel wait until the URL has changed.
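
One caveat with a bare loop like that is that it never ends if the navigation fails. Below is a minimal sketch of a wait helper with a timeout that combines the URL check with the readyState and Busy checks. WaitForNavigation and its parameters are made-up names for illustration, and the 30-second default is an arbitrary assumption:

```vba
Option Explicit

#If VBA7 Then
Private Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#Else
Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#End If

' Hypothetical helper: waits until the browser has left oldUrl and reports
' that loading is complete, or until timeoutSeconds have elapsed.
' Returns True if the new page is ready, False on timeout.
Public Function WaitForNavigation(ByVal IE As Object, ByVal oldUrl As String, _
                                  Optional ByVal timeoutSeconds As Long = 30) As Boolean
    Dim deadline As Date
    deadline = DateAdd("s", timeoutSeconds, Now)

    Do While Now < deadline
        If IE.LocationURL <> oldUrl And IE.readyState = 4 And Not IE.Busy Then
            WaitForNavigation = True
            Exit Function
        End If
        Sleep 250   ' poll four times per second instead of blocking for a full second
    Loop
End Function
```

To use it, store IE.LocationURL in a variable before calling .submit, then call WaitForNavigation(IE, thatVariable) and only read IE.document if it returns True.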

Alternatively, you may have better luck using an XMLHTTPRequest (there are many examples here on SO and elsewhere on the internet). This lets you send a request just as the browser would, without actually using a web browser. You can then parse the returned text as HTML or XML. I would use the Microsoft XML, v6.0 library reference for this.
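
A rough sketch of that approach is shown here, late-bound so that no library reference is needed. It assumes the form posts a single p_apinum field to cart_con_wellapi2 (the endpoint and field name used in the other answer on this page; not verified against the live site). FetchWellPage is a made-up name:

```vba
Option Explicit

' Hypothetical helper: POSTs the API number without opening a browser
' and returns the raw HTML of the response ("" if the status is not 200).
Public Function FetchWellPage(ByVal apiNumber As String) As String
    ' Assumed endpoint, taken from the other answer on this page
    Const TARGET_URL As String = "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi2"
    Dim http As Object

    Set http = CreateObject("MSXML2.XMLHTTP.6.0")
    With http
        .Open "POST", TARGET_URL, False   ' False = synchronous, so no wait loop is needed
        .setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
        .send "p_apinum=" & apiNumber     ' assumed form field name
        If .Status = 200 Then FetchWellPage = .responseText
    End With
End Function
```

Because the call is synchronous, the response text is complete as soon as .send returns, which avoids the stale-page problem entirely. For example, Debug.Print Left$(FetchWellPage("1708300502"), 500) would show the start of the returned HTML.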

#2


POST requests:

① Entering the Well API number

I examined the web page while making the selections you mention. I inspected the web traffic using Fiddler and noticed that the initial request, sent when you submit the API number, is a POST request.

② POST request:

The POST body has the following parameter:

p_apinum is the key and the associated value is the original Well API number.

Using this info, I formulated the POST request directly, thus avoiding your first landing page.


③ Pressing the hyperlink:

Next, I looked at the element you wanted to press (the "30570" link). Its associated HTML contains a relative hyperlink.

I use a helper function, GetNextURL(page.body.innerHTML), to parse the page HTML and extract this relative link; prefixing it with the base URL gives the absolute path.
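
To make the string handling easier to follow, here is a step-by-step sketch of the same idea. ExtractFirstHref is a hypothetical name, and the sample markup is made up for illustration rather than copied from the site:

```vba
Option Explicit

' Hypothetical, spelled-out version of the one-line GetNextURL helper.
' Returns the value of the first href attribute found, or "" if there is none.
Public Function ExtractFirstHref(ByVal html As String) As String
    Dim afterHref As String, rawValue As String

    If InStr(1, html, "href=", vbTextCompare) = 0 Then Exit Function

    afterHref = Split(html, "href=", 2, vbTextCompare)(1)  ' everything after the first href=
    rawValue = Split(afterHref, ">", 2)(0)                   ' up to the end of the opening tag
    rawValue = Replace$(rawValue, Chr$(34), vbNullString)   ' remove the double quotes
    ExtractFirstHref = Replace$(rawValue, "about:", vbNullString) ' remove any about: text
End Function

Public Sub DemoExtractFirstHref()
    ' Made-up markup, not taken from the real page
    Const SAMPLE As String = "<td><a href=""about:next_page?p_id=30570"">30570</a></td>"
    Debug.Print ExtractFirstHref(SAMPLE)   ' prints next_page?p_id=30570
End Sub
```

Like the original helper, this only looks at the first href on the page and assumes the anchor has no attributes after href, so it is fragile if the page layout changes.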


④ Making a new request:

I reuse my HTTP request function GetPage to send a second request, with an empty body, and grab all the tables from the returned HTML document via page.getElementsByTagName("table").


⑤ Writing the tables to the Excel worksheet:

I loop over all the tables on the page, using the helper AddHeaders to write out the table headers and WriteTables to write the current table to the sheet.


Example page content:

[Image: example page content]


Example code output:

[Image: example code output]


VBA:

Option Explicit
Public Sub GetWellInfo()
    Dim ws As Worksheet, page As HTMLDocument, targetTable As HTMLTable, apiNumbers(), currNumber As Long
    Const PARAM1 As String = "p_apinum"
    Const BASESTRING As String = "http://sonlite.dnr.state.la.us/sundown/cart_prod/"
    apiNumbers = Array(1708300502, 1708300503)

    Application.ScreenUpdating = False
    Set ws = ThisWorkbook.Worksheets("Sheet1")

    With ws
        .Cells.ClearContents
        For currNumber = LBound(apiNumbers) To UBound(apiNumbers)
            Set page = GetPage(BASESTRING & "cart_con_wellapi2", apiNumbers(currNumber), PARAM1)
            Set page = GetPage(BASESTRING & GetNextURL(page.body.innerHTML))
            Dim allTables As Object
            Set allTables = page.getElementsByTagName("table")

            For Each targetTable In allTables
                AddHeaders targetTable, GetLastRow(ws, 1) + 2, ws
                WriteTables targetTable, GetLastRow(ws, 1), ws
            Next targetTable

        Next currNumber
    End With
    Application.ScreenUpdating = True
End Sub

Public Function GetPage(ByVal url As String, Optional ByVal apiNumber As Long, Optional ByVal paramN As String = vbNullString) As HTMLDocument
    Dim objHTTP As Object, html As New HTMLDocument

    Set objHTTP = CreateObject("WinHttp.WinHttpRequest.5.1")

    Dim sBody As String
    If Not paramN = vbNullString Then sBody = paramN & "=" & apiNumber
    With objHTTP
        .SetTimeouts 10000, 10000, 10000, 10000
        .Open "POST", url, False
        .setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
        .setRequestHeader "Content-type", "application/x-www-form-urlencoded"
        On Error Resume Next
        .send (sBody)
        If Err.Number = 0 Then
            If .Status = 200 Then
                html.body.innerHTML = .responseText
                Set GetPage = html
            Else
                Debug.Print "HTTP " & .Status & " " & .statusText
                Exit Function
            End If
        Else
            Debug.Print "Error " & Err.Number & " " & Err.Source & " " & Err.Description
            Exit Function
        End If
        On Error GoTo 0
    End With

End Function

Public Function GetNextURL(ByVal inputString As String) As String
    GetNextURL = Replace$(Replace$(Split(Split(inputString, "href=")(1), ">")(0), Chr$(34), vbNullString), "about:", vbNullString)
End Function

Public Sub AddHeaders(ByVal hTable As Object, ByVal startRow As Long, ByVal ws As Worksheet)
    Dim headers As Object, header As Object, columnCounter As Long
    Set headers = hTable.getElementsByTagName("th")
    For Each header In headers
        columnCounter = columnCounter + 1
        ws.Cells(startRow, columnCounter) = header.innerText
    Next header
End Sub

Public Sub WriteTables(ByVal hTable As HTMLTable, Optional ByVal startRow As Long = 1, Optional ByRef ws As Worksheet)
    If ws Is Nothing Then Set ws = ActiveSheet

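    Dim tRow As Object, tCell As Object, tr As Object, td As Object, r As Long, c As Long
    r = startRow: c = 1 'start in column 1; Cells(r, 0) would raise an error
    With ws 'write to the sheet that was passed in, not whichever sheet is active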
        Set tRow = hTable.getElementsByTagName("tr")
        For Each tr In tRow
            Set tCell = tr.getElementsByTagName("td")
            For Each td In tCell
                .Cells(r, c).Value = td.innerText
                c = c + 1
            Next td
            r = r + 1:  c = 1
        Next tr
    End With
End Sub

Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
    With ws
        GetLastRow = .Cells(.Rows.Count, columnNumber).End(xlUp).Row
    End With
End Function

References:

VBE > Tools > References > Microsoft HTML Object Library.
