I am trying to scrape a page that loads most of its content via AJAX.
I am trying to grab all li nodes with a data-section attribute from this webpage, for example. The response HTML contains six of the nodes that I need, but most of the rest are loaded via an AJAX request, which returns HTML containing the remaining li nodes.
So I switched from requests to Selenium with the PhantomJS driver, as it is supposed to be XHR-friendly, but I am still not getting the extra AJAX-loaded content.
Runnable:
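from selenium import webdriver
from lxml import html

# url is the address of the page to scrape (not given in the post)
br = webdriver.PhantomJS()
br.get(url)
# page_source is read right away, without waiting for the AJAX content
tree = html.fromstring(br.page_source)
print(tree.xpath('//li[@data-section]/a/text()'))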
In brief, the code above does not pick up the HTML that is injected into the page via XHR. How can I make it do so? If that is not possible, what are my other headless options?
1 Solution
#1
8
The linked page prominently displays a loading spinner (.archive_loading_bar), which vanishes as soon as the data is loaded. You can use an explicit wait with the expected condition invisibility_of_element_located.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from lxml import html

# url is the address of the page to scrape (not given in the post)
driver = webdriver.PhantomJS()
driver.get(url)
# Block until the loading spinner is no longer visible, for at most 10 seconds
wait = WebDriverWait(driver, 10)
wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, '.archive_loading_bar')))
# The page source now includes the AJAX-loaded li nodes
tree = html.fromstring(driver.page_source)
This is adapted from this answer. It waits until the data has loaded, for up to 10 seconds; if the spinner is still visible after that, the wait raises a TimeoutException.
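To make it clearer what an explicit wait does, here is a minimal sketch of the polling idea behind it, using only the standard library. It is not Selenium's actual implementation: wait_until, spinner_gone, and the simulated 0.2 second delay are hypothetical names and values chosen for illustration. The idea is simply to call a condition repeatedly until it returns something truthy or a deadline passes.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without the condition being met.
    (Hypothetical helper, not part of Selenium.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)


# Simulated page state: the "spinner" disappears after about 0.2 seconds.
start = time.monotonic()


def spinner_gone():
    return time.monotonic() - start > 0.2


# Returns as soon as spinner_gone() is truthy, well before the 2 second limit.
wait_until(spinner_gone, timeout=2.0, poll_interval=0.05)
```

In the Selenium code above, the condition would be the invisibility check on .archive_loading_bar, and reading driver.page_source only after the wait returns is what makes the AJAX-loaded nodes show up.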