I'm trying to pull some information from a website that provides oil well data by API number (the API number is a unique identifier for every well in the US).
Website: http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1
API example: 1708300502
The issue is that when I get to the 2nd page, IE.document.getElementsByTagName("body")(0).innerText still returns data from the initial page. How do I fetch the updated page data?
The ultimate goal is to get to the 2nd page, click on "30570" via IE.document.getElementsByTagName("a")(0).Click, and then read the final 3rd page. I just cannot figure out how to read the updated page :(
Option Explicit

Sub sonris_WellData()
    Dim IE As InternetExplorer
    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = True
    Dim i As Integer

    'Open SONRIS website
    Application.StatusBar = "Opening Website"
    IE.navigate "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1"
    Do While IE.readyState <> 4: DoEvents: Loop
    Application.Wait Now() + TimeValue("00:00:01")
    Application.StatusBar = False

    IE.document.forms(0).p_apinum.Value = "1708300502" 'plug-in API
    IE.document.forms(0).submit

    ' Wait until the next page opens
    Application.StatusBar = "Opening Website"
    Do While IE.readyState <> 4: DoEvents: Loop
    Application.Wait Now() + TimeValue("00:00:01")
    Application.StatusBar = False

    ' Read the page - this is where the issue occurs, MsgBox keeps returning text from the very 1st page
    MsgBox IE.document.getElementsByTagName("body")(0).innerText
    IE.Quit
End Sub
2 Solutions
#1
1
This seems to be working. Rather than DoEvents, use the WinAPI Sleep function. I also added a call to the Sleep function after the form submit.
More and more often we are seeing sites that are dynamically served by some JavaScript or similar. In these cases the browser may report READYSTATE_COMPLETE or not Busy, but the page has not yet rendered the "new" results.
Option Explicit

' PtrSafe is required for Declare statements in 64-bit Office (VBA7)
#If VBA7 Then
    Public Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#Else
    Public Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#End If

Sub sonris_WellData()
    Dim IE As Object 'InternetExplorer
    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = True
    Dim i As Integer

    'Open SONRIS website
    Application.StatusBar = "Opening Website"
    IE.navigate "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1"
    Do While IE.readyState <> 4
        Sleep 1000
    Loop
    Application.StatusBar = False

    IE.document.forms(0).p_apinum.Value = "1708300502" 'plug-in API
    IE.document.forms(0).submit
    Sleep 1000  ' give the browser time to start the new navigation

    ' Wait until the next page opens
    Application.StatusBar = "Opening Website"
    Do While IE.readyState <> 4
        Sleep 1000
    Loop
    Application.StatusBar = False

    ' Read the page (in the original code this still returned text from the very 1st page)
    MsgBox IE.document.getElementsByTagName("body")(0).innerText
    IE.Quit
End Sub
You may want to experiment with a slightly longer Sleep after the .submit.
Alternatively, I notice that after you submit, the URL changes, so you could also try changing the second waiting loop to:
Do While IE.LocationURL = "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1"
    Sleep 1000
Loop
This should make Excel wait until the URL has changed.
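One weakness of a bare URL loop is that it never ends if the navigation fails. Below is a minimal sketch of the same idea with a timeout added. WaitForUrlChange is a hypothetical helper name, not part of IE or Excel, and the 30-second default and 250 ms polling interval are arbitrary assumptions. The Sleep declaration is repeated so that the snippet is self-contained; drop it if the snippet goes into a module that already declares Sleep.

```vba
Option Explicit

#If VBA7 Then
    Private Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#Else
    Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#End If

' Hypothetical helper: waits until IE has left oldUrl and finished loading,
' or until timeoutSeconds have elapsed. Returns True if the URL changed in time.
Public Function WaitForUrlChange(ByVal IE As Object, ByVal oldUrl As String, _
                                 Optional ByVal timeoutSeconds As Long = 30) As Boolean
    Dim deadline As Date
    deadline = DateAdd("s", timeoutSeconds, Now)

    Do While IE.LocationURL = oldUrl Or IE.Busy Or IE.readyState <> 4
        If Now > deadline Then Exit Function   ' give up; the result stays False
        Sleep 250
    Loop
    WaitForUrlChange = True
End Function
```

It could be called right after the .submit, for example: If Not WaitForUrlChange(IE, "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1") Then MsgBox "Timed out waiting for the results page".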
Alternatively, you may have better luck using an XMLHTTPRequest (there are many examples of this here on SO and elsewhere on the internet). This allows you to send a request just like the browser does, without actually using a web browser. Then you can simply parse the returned text as HTML or XML. I would use the Microsoft XML, v6.0 library reference for this.
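As a rough sketch of what that could look like, the snippet below posts the API number with a late-bound MSXML 6.0 request and prints the start of the response. BuildFormBody, PostForm and DemoPostForm are hypothetical names. The target URL (cart_con_wellapi2) and the p_apinum field name are taken from the other answer in this thread and are assumptions as far as this answer is concerned.

```vba
Option Explicit

' Builds a single key=value pair for an application/x-www-form-urlencoded body.
' No URL-encoding is applied, which is only safe for plain alphanumeric values.
Public Function BuildFormBody(ByVal key As String, ByVal value As String) As String
    BuildFormBody = key & "=" & value
End Function

' Sends a synchronous POST and returns the response text,
' or an empty string if the status is anything other than 200.
Public Function PostForm(ByVal url As String, ByVal body As String) As String
    Dim http As Object
    Set http = CreateObject("MSXML2.XMLHTTP.6.0")   ' late-bound Microsoft XML, v6.0
    With http
        .Open "POST", url, False
        .setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
        .send body
        If .Status = 200 Then PostForm = .responseText
    End With
End Function

Public Sub DemoPostForm()
    Dim html As String
    ' Assumed endpoint and field name, see the note above
    html = PostForm("http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi2", _
                    BuildFormBody("p_apinum", "1708300502"))
    Debug.Print Left$(html, 500)
End Sub
```

Because the request is synchronous, there is no readyState polling at all: when .send returns, the response text is already complete.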
#2
0
POST requests:
① Entering the Well API number
I examined the web page while making the selections you mention. I inspected the web traffic using Fiddler and noticed that the initial request, made when you submit the API number, is handled by a POST request.
② POST request:
The POST body has a single parameter: p_apinum is the key and the associated value is the original Well API number.
Using this info I formulated a POST request directly, thus avoiding your first landing page.
③ Pressing the hyperlink:
Next, I looked at the element you wanted to press (the "30570" link).
Looking at the associated HTML, it has an associated relative hyperlink.
I use a helper function to parse the page HTML to get this relative link and construct the absolute path: GetNextURL(page.body.innerHTML).
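To make the parsing step concrete, here is a small worked example of the same Split/Replace chain on a made-up fragment. ExtractFirstHref is a hypothetical name used to avoid clashing with GetNextURL, and the sample href is invented, since the real markup of the results page is not reproduced here. The chain takes everything after the first href=, cuts at the next ">", strips the double quotes, and then strips an "about:" prefix, which MSHTML tends to add when a relative link is read back from a document that has no base URL.

```vba
Option Explicit

' Same parsing logic as the GetNextURL helper, under a different (hypothetical) name
Public Function ExtractFirstHref(ByVal inputString As String) As String
    ExtractFirstHref = Replace$(Replace$(Split(Split(inputString, "href=")(1), ">")(0), _
                                         Chr$(34), vbNullString), "about:", vbNullString)
End Function

Public Sub DemoExtractFirstHref()
    Dim sampleHtml As String
    ' Made-up markup for illustration only
    sampleHtml = "<td><a href=""about:next_page?id=30570"">30570</a></td>"
    Debug.Print ExtractFirstHref(sampleHtml)   ' prints next_page?id=30570
End Sub
```

Note that this only ever returns the first href in the string and raises a "subscript out of range" error when the page contains no link at all, so it fits pages like this one where the link of interest comes first.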
④ Making a new request:
I re-use my HTTP request function GetPage to send a second request, with an empty body, and grab all the tables from the returned HTML document via page.getElementsByTagName("table").
⑤ Writing the tables to the Excel worksheet:
I loop over all the tables on the page, using the helper AddHeaders to write out the table headers and WriteTables to write the current table to the sheet.
VBA:
Option Explicit

Public Sub GetWellInfo()
    Dim ws As Worksheet, page As HTMLDocument, targetTable As HTMLTable
    Dim apiNumbers(), currNumber As Long, allTables As Object
    Const PARAM1 As String = "p_apinum"
    Const BASESTRING As String = "http://sonlite.dnr.state.la.us/sundown/cart_prod/"

    apiNumbers = Array(1708300502, 1708300503)
    Application.ScreenUpdating = False
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    ws.Cells.ClearContents

    For currNumber = LBound(apiNumbers) To UBound(apiNumbers)
        ' First request: POST the API number directly, skipping the landing page
        Set page = GetPage(BASESTRING & "cart_con_wellapi2", apiNumbers(currNumber), PARAM1)
        ' Second request: follow the relative link found in the results page
        If Not page Is Nothing Then Set page = GetPage(BASESTRING & GetNextURL(page.body.innerHTML))
        ' GetPage returns Nothing on failure, so skip this API number in that case
        If Not page Is Nothing Then
            Set allTables = page.getElementsByTagName("table")
            For Each targetTable In allTables
                AddHeaders targetTable, GetLastRow(ws, 1) + 2, ws
                WriteTables targetTable, GetLastRow(ws, 1), ws
            Next targetTable
        End If
    Next currNumber
    Application.ScreenUpdating = True
End Sub

Public Function GetPage(ByVal url As String, Optional ByVal apiNumber As Long, Optional ByVal paramN As String = vbNullString) As HTMLDocument
    Dim objHTTP As Object, html As New HTMLDocument, sBody As String
    Set objHTTP = CreateObject("WinHttp.WinHttpRequest.5.1")
    ' Build the form body only when a parameter name was supplied
    If Not paramN = vbNullString Then sBody = paramN & "=" & apiNumber
    With objHTTP
        .SetTimeouts 10000, 10000, 10000, 10000
        .Open "POST", url, False
        .setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
        .setRequestHeader "Content-type", "application/x-www-form-urlencoded"
        On Error Resume Next
        .send (sBody)
        If Err.Number = 0 Then
            If .Status = 200 Then
                html.body.innerHTML = .responseText
                Set GetPage = html
            Else
                Debug.Print "HTTP " & .Status & " " & .statusText
            End If
        Else
            Debug.Print "Error " & Err.Number & " " & Err.Source & " " & Err.Description
        End If
        On Error GoTo 0
    End With
End Function

' Returns the first href found in the HTML, without quotes or the "about:" prefix
Public Function GetNextURL(ByVal inputString As String) As String
    GetNextURL = Replace$(Replace$(Split(Split(inputString, "href=")(1), ">")(0), Chr$(34), vbNullString), "about:", vbNullString)
End Function

Public Sub AddHeaders(ByVal hTable As Object, ByVal startRow As Long, ByVal ws As Worksheet)
    Dim headers As Object, header As Object, columnCounter As Long
    Set headers = hTable.getElementsByTagName("th")
    For Each header In headers
        columnCounter = columnCounter + 1
        ws.Cells(startRow, columnCounter) = header.innerText
    Next header
End Sub

Public Sub WriteTables(ByVal hTable As HTMLTable, Optional ByVal startRow As Long = 1, Optional ByRef ws As Worksheet)
    If ws Is Nothing Then Set ws = ActiveSheet
    Dim tRow As Object, tCell As Object, tr As Object, td As Object, r As Long, c As Long
    r = startRow: c = 1   ' Cells is 1-based, so the column counter must start at 1
    With ws               ' write to the sheet that was passed in, not whichever is active
        Set tRow = hTable.getElementsByTagName("tr")
        For Each tr In tRow
            Set tCell = tr.getElementsByTagName("td")
            For Each td In tCell
                .Cells(r, c).Value = td.innerText
                c = c + 1
            Next td
            r = r + 1: c = 1
        Next tr
    End With
End Sub

Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
    With ws
        GetLastRow = .Cells(.Rows.Count, columnNumber).End(xlUp).Row
    End With
End Function
References:
VBE > Tools > References > Microsoft HTML Object Library.