Selenium PhantomJS webdriver cannot get ajax content

Time: 2022-02-20 02:50:10

I am trying to scrape a page that loads most of its content via ajax.

For example, I am trying to grab all li nodes with a data-section attribute from this webpage. The initial response html contains six of the nodes I need, but most of the rest are loaded via an ajax request that returns html containing the remaining li nodes.

So I switched from using requests to using selenium with the PhantomJS driver, as it's supposed to be xhr friendly, but I am still not getting the extra ajax-loaded content.

Runnable:

from selenium import webdriver
from lxml import html

# url: address of the page being scraped (not reproduced here)
br = webdriver.PhantomJS()
br.get(url)
tree = html.fromstring(br.page_source)
print(tree.xpath('//li[@data-section]/a/text()'))

In brief, the code above does not get the html that is injected into the webpage via xhr. How can I make it do so? If that is not possible, what are my other headless options?

1 solution

#1


8  

The linked page prominently displays a loading spinner (.archive_loading_bar) which vanishes as soon as the data is loaded. You can use an explicit wait with the expected condition of invisibility_of_element_located.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from lxml import html

driver = webdriver.PhantomJS()
driver.get(url)  # url: address of the page being scraped
# Block for up to 10 seconds until the loading spinner is no longer visible
wait = WebDriverWait(driver, 10)
wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, '.archive_loading_bar')))
# The ajax-loaded li nodes are now part of the page source
tree = html.fromstring(driver.page_source)

This is adapted from this answer. It waits until the data is loaded, for up to 10 seconds; if the spinner is still visible after that, wait.until raises a TimeoutException.
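
To show what an explicit wait does under the hood, here is a minimal sketch of the polling idea in plain Python 3. It is not Selenium's implementation, only an illustration of the same pattern: call a condition repeatedly until it holds or a timeout expires. The names wait_until and FakeSpinner are hypothetical and exist only for this example; FakeSpinner stands in for the .archive_loading_bar element so the snippet runs without a browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Raises TimeoutError if `timeout` seconds pass first.
    (Hypothetical helper; Selenium's WebDriverWait plays this role.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.2f seconds" % timeout)
        time.sleep(poll_interval)


class FakeSpinner:
    """Hypothetical stand-in for the loading bar: hidden after N checks."""

    def __init__(self, checks_until_hidden):
        self.remaining = checks_until_hidden

    def is_hidden(self):
        # Each check brings the simulated ajax request closer to done.
        self.remaining -= 1
        return self.remaining <= 0


spinner = FakeSpinner(checks_until_hidden=3)
wait_until(spinner.is_hidden, timeout=2.0, poll_interval=0.01)
print("spinner gone, safe to read the page source")
```

In the real script the condition is EC.invisibility_of_element_located(...) and the page source is read only after the wait returns, which is why the ajax-loaded nodes are present by then.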