Python: Getting text from HTML using BeautifulSoup

Date: 2021-07-11 23:25:23

I am trying to extract the ranking text from a page like this example link: the Kaggle user ranked no. 1. It is clearer in an image:

[image]

I am using the following code:

import requests
from bs4 import BeautifulSoup


def get_single_item_data(item_url):
    sourceCode = requests.get(item_url)
    plainText = sourceCode.text
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(plainText, "html.parser")
    for item_name in soup.findAll('h4', {'data-bind': "text: rankingText"}):
        print(item_name.string)


item_url = 'https://www.kaggle.com/titericz'
get_single_item_data(item_url)

The result is None. The problem is that soup.findAll('h4',{'data-bind':"text: rankingText"}) outputs:

[<h4 data-bind="text: rankingText"></h4>]

but when I inspect the page's HTML in the browser, the element looks like this:

<h4 data-bind="text: rankingText">1st</h4>. It can be seen in the image:

[image]

It's clear that the text is missing. How can I get around that?

Edit: Printing the soup variable in the terminal, I can see that this value exists: [image]

So there should be a way to access it through soup.

Edit 2: I tried, unsuccessfully, to use the most voted answer from this Stack Overflow question. There could be a solution along those lines.

4 solutions

#1


4  

If you aren't going to try browser automation through Selenium as @Ali suggested, you will have to parse the JavaScript that contains the desired information. You can do this in different ways. Here is working code that locates the script with a regular expression pattern, extracts the profile object, loads it with json into a Python dictionary, and prints out the desired ranking:

import re
import json

from bs4 import BeautifulSoup
import requests


response = requests.get("https://www.kaggle.com/titericz")
soup = BeautifulSoup(response.content, "html.parser")

pattern = re.compile(r"profile: ({.*}),", re.MULTILINE | re.DOTALL)
script = soup.find("script", string=pattern)

profile_text = pattern.search(script.text).group(1)
profile = json.loads(profile_text)

print(profile["ranking"], profile["rankingText"])

Prints:

1 1st

#2


3  

The data is bound using JavaScript, as the "data-bind" attribute suggests.

However, if you download the page with e.g. wget, you'll see that the rankingText value is actually there inside this script element on initial load:

<script type="text/javascript">
profile: {
...
   "ranking": 96,
   "rankingText": "96th",
   "highestRanking": 3,
   "highestRankingText": "3rd",
...

So you could use that instead.
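
As a rough sketch of that idea, the snippet below pulls the object after `profile:` out of a page's text and parses it with the standard library only. The `SAMPLE_HTML` string and the `extract_profile` helper are hypothetical stand-ins written for this illustration; the real page's markup may differ, and the approach assumes the embedded object is valid JSON.

```python
import json
import re

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<h4 data-bind="text: rankingText"></h4>
<script type="text/javascript">
var model = {
    profile: {
        "displayName": "example",
        "ranking": 96,
        "rankingText": "96th",
        "highestRanking": 3,
        "highestRankingText": "3rd"
    },
    other: {"unrelated": true}
};
</script>
"""


def extract_profile(html):
    """Return the object that follows 'profile:' as a dict, or None."""
    # The lookahead leaves the match ending right before the opening brace.
    match = re.search(r"profile:\s*(?=\{)", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it, so the closing brace need not be guessed.
    profile, _end = json.JSONDecoder().raw_decode(html, match.end())
    return profile


profile = extract_profile(SAMPLE_HTML)
print(profile["ranking"], profile["rankingText"])
```

Letting the JSON decoder find the end of the object avoids relying on a greedy regular expression to stop at the right closing brace.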

#3


0  

I have solved your problem using regex on the plain text:

import re

import requests


def get_single_item_data(item_url):
    sourceCode = requests.get(item_url)
    plainText = sourceCode.text
    # Match the first occurrence of: ranking": <digits>
    pattern = re.compile(r'ranking": [0-9]+')
    name = pattern.search(plainText)
    # The match looks like 'ranking": 1'; the number is the second token
    ranking = name.group().split()[1]
    print(ranking)


item_url = 'https://www.kaggle.com/titericz'
get_single_item_data(item_url)

This returns only the rank number, but I think it will help you, since from what I can see rankingText just adds 'st', 'th', etc. to the right of the number.
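
If the ranking text really is just the number with an English ordinal suffix (an assumption; the site's own formatting rules are not shown here), it could be rebuilt from the number with a small helper. The `ordinal` function below is a hypothetical illustration, not something taken from the page:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 (also 111, 212, ...) take 'th' despite ending in 1, 2, 3.
    if 11 <= n % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


for rank in (1, 2, 3, 11, 96):
    print(rank, ordinal(rank))
```

The special case for 11 to 13 is the part that is easy to miss when appending suffixes by hand.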

#4


-1  

This could be because the data is filled in dynamically.

Some JavaScript code fills this tag after the page loads. Thus, if you fetch the HTML using requests, the tag is not filled yet:

<h4 data-bind="text: rankingText"></h4>

Please take a look at Selenium WebDriver. Using this driver, you can fetch the complete page and have its JavaScript run as normal.
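
A minimal sketch of that approach, assuming Selenium 4 with Chrome and a matching driver available locally, might look like the following. The `fetch_ranking_text` name and the 10-second timeout are choices made for this illustration, and the selector assumes the page still renders the `h4` element shown above:

```python
def fetch_ranking_text(url, timeout=10):
    """Load url in a real browser and return the rendered ranking text."""
    # Imported here so the module loads even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    selector = 'h4[data-bind="text: rankingText"]'
    driver = webdriver.Chrome()  # assumes Chrome and its driver are available
    try:
        driver.get(url)
        # Wait until the page's script has filled the element with text.
        WebDriverWait(driver, timeout).until(
            lambda d: d.find_element(By.CSS_SELECTOR, selector).text.strip()
        )
        return driver.find_element(By.CSS_SELECTOR, selector).text
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Calling `fetch_ranking_text('https://www.kaggle.com/titericz')` would then open a browser window, wait for the binding to run, and return the text of the element. This is considerably slower than parsing the embedded script, so it is mainly worth it when the data is not present in the initial HTML at all.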
