Searching for elements generated by JavaScript using Python

I am trying to access the text of an element whose content is generated by JavaScript, for example, getting the number of Twitter shares from this site.

I've tried using urllib and PyQt to obtain the page's HTML. However, since the content is generated by JavaScript, it is not present in the response returned by urllib/PyQt. I am currently using Selenium for this task, but it takes longer than I would like.

Is it possible to get access to this data without opening the page in a browser?

This question has been asked before, but the answers I found are either C#-specific or link to a solution that has since gone dead.

2 Answers

#1

Working example:

import json
import urllib.parse  # Python 3; on Python 2 use "import urllib" instead

import requests

url = "https://daphnecaruanagalizia.com/2017/10/crook-schembri-court-today-pleading-not-crook/"

# URL-encode the page address so it can be passed as a query parameter
encoded = urllib.parse.quote_plus(url)
# encoded = urllib.quote_plus(url)  # for Python 2, replace the previous line with this one
j = requests.get('https://count-server.sharethis.com/v2.0/get_counts?url=%s' % encoded).text
obj = json.loads(j)
# Total Twitter count = clicks + shares
print(obj['clicks']['twitter'] + obj['shares']['twitter'])

# => 5008

Explanation:

If you inspect the web page, you can see that it sends a request to this URL:

https://count-server.sharethis.com/v2.0/get_counts?url=https%3A%2F%2Fdaphnecaruanagalizia.com%2F2017%2F10%2Fcrook-schembri-court-today-pleading-not-crook%2F&cb=stButtons.processCB&wd=true

If you paste it into your browser, you will see the raw count data. Playing with the URL a bit, you will find that removing the extra parameters (cb and wd) gives you clean JSON.
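The parameter stripping described above can be sketched with the standard library. This is only a minimal sketch: it assumes that cb and wd are the "extra" parameters (they are the only ones besides url in the captured request), and strip_extra_params is a hypothetical helper name, not something from the page or from ShareThis.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def strip_extra_params(request_url, keep=("url",)):
    """Return request_url with every query parameter not in `keep` removed."""
    parts = urlsplit(request_url)
    # parse_qsl decodes the values; urlencode re-encodes them with quote_plus
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in keep]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# The request captured from the page, as quoted in the answer
captured = (
    "https://count-server.sharethis.com/v2.0/get_counts"
    "?url=https%3A%2F%2Fdaphnecaruanagalizia.com%2F2017%2F10%2F"
    "crook-schembri-court-today-pleading-not-crook%2F"
    "&cb=stButtons.processCB&wd=true"
)
print(strip_extra_params(captured))
```

The printed URL keeps only the url parameter, which is the form the working example requests.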

So, as you can see, you only need to replace the url parameter of the request with the URL of the page whose Twitter counts you want.
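To reuse this for other pages, the two steps can be wrapped in small functions. The following is a rough sketch: build_count_url and twitter_total are hypothetical names, the response layout (clicks and shares keys, each with a twitter entry) is assumed from the working example, and the sample numbers are made up for an offline demonstration.

```python
import urllib.parse

# Endpoint taken from the answer above
COUNT_ENDPOINT = "https://count-server.sharethis.com/v2.0/get_counts"


def build_count_url(page_url):
    """Build the count request URL for page_url (hypothetical helper)."""
    return COUNT_ENDPOINT + "?url=" + urllib.parse.quote_plus(page_url)


def twitter_total(counts):
    """Sum Twitter clicks and shares from a parsed response.

    Assumes the 'clicks'/'shares' layout used in the answer;
    missing keys are treated as 0.
    """
    clicks = counts.get("clicks", {}).get("twitter", 0)
    shares = counts.get("shares", {}).get("twitter", 0)
    return clicks + shares


# Offline demonstration with made-up numbers (not real data)
sample = {"clicks": {"twitter": 3}, "shares": {"twitter": 4}}
print(build_count_url("https://example.com/some post/"))
print(twitter_total(sample))
```

For a live request, something like twitter_total(requests.get(build_count_url(page)).json()) should work, provided the endpoint still responds in the format shown in the answer.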

#2

After launching a Selenium web browser, you could do something like the following, passing driver.page_source to the BeautifulSoup library (unfortunately I cannot test this at work because of the firewall):

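from bs4 import BeautifulSoup

# driver is an already-launched Selenium WebDriver with the page loaded
soup = BeautifulSoup(driver.page_source, 'html.parser')
# The count sits in a nested span inside the Twitter button's span
shares = soup.find('span', {'class': 'st_twitter_hcount'}).find('span', {'class': 'stBubble_hcount'})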
