Scraping data using Python?

Date: 2022-11-25 10:11:33

I'm trying to scrape the left side of this news site (= SENESTE NYT): https://www.dr.dk/nyheder/

But the data doesn't seem to be anywhere: it is neither in the HTML nor in any related API/JSON response. Is it some kind of push data?

Using the Network tab in Chrome's developer tools, I found this API, but it doesn't contain the news items on the left side: https://www.dr.dk/tjenester/newsapp-content/teasers?reqoffset=0&reqlimit=100

Can anyone help me? How do I scrape "SENESTE NYT"?

1 answer

#1


0  

I first loaded the page with Selenium and then parsed it with BeautifulSoup.

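from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://www.dr.dk/nyheder"

# Let a real browser run the page's JavaScript, then grab the rendered HTML
driver = webdriver.Chrome()
try:
    driver.get(url)
    page_source = driver.page_source
finally:
    driver.quit()  # always close the browser, even if loading fails

soup = BeautifulSoup(page_source, "lxml")
# The "SENESTE NYT" column is rendered inside this container
div = soup.find("div", {"class": "timeline-container"})
headlines = div.find_all("h3") if div else []

print(headlines)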

And it seems to find the headlines:

[<h3>Puigdemont: Debatterede spørgsmål af interesse for hele Europa</h3>,
 <h3>Afblæser tsunami-varsel for Hawaii</h3>,
 <h3>56.000 flygter fra vulkan i udbrud </h3>,
 <h3>Pence: USA offentliggør snart plan for ambassadeflytning </h3>,
 <h3>Østjysk motorvej genåbnet </h3>]
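
The list above still contains tag objects, some with trailing spaces. If you only need the headline strings, something along these lines might work. It is a minimal sketch: `sample_html` is a hypothetical stand-in for `driver.page_source`, `headline_texts` is a helper name made up for illustration, and it assumes the headlines are still `<h3>` elements inside a `timeline-container` div.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source, trimmed to the relevant part
sample_html = """
<div class="timeline-container">
  <h3>Afblæser tsunami-varsel for Hawaii</h3>
  <h3>56.000 flygter fra vulkan i udbrud </h3>
  <h3>Østjysk motorvej genåbnet </h3>
</div>
"""


def headline_texts(html):
    """Return the stripped text of every <h3> inside the timeline container."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"class": "timeline-container"})
    if container is None:  # container missing, e.g. page not fully rendered
        return []
    return [h3.get_text(strip=True) for h3 in container.find_all("h3")]


titles = headline_texts(sample_html)
print(titles)
```

With the real page you would pass `page_source` instead of `sample_html`.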

Not sure if this is what you wanted.

-----EDITED----

A more efficient approach would be to send the request directly with some custom headers (I have since confirmed that this does not work).

import requests

headers = {
    "Accept": "*/*",
    "Host": "www.dr.dk",
    "Referer": "https://www.dr.dk/nyheder",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
}

r = requests.get(
    url="https://www.dr.dk/tjenester/newsapp-content/teasers?reqoffset=0&reqlimit=100",
    headers=headers,
)

r.json()
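
Since this request did not give the expected result, it may help to check what actually comes back before calling `r.json()`. The following is only a rough sketch: `parse_json_or_none` and `FakeResponse` are hypothetical names introduced here, and the fake object exists solely so the example runs without network access. The helper relies only on attributes that a `requests` response provides (`status_code`, `headers`, `json()`).

```python
def parse_json_or_none(response):
    """Return the decoded JSON body, or None if the response is not usable JSON.

    `response` is expected to behave like requests.Response
    (status_code, headers, json()).
    """
    if response.status_code != 200:
        return None
    content_type = response.headers.get("Content-Type", "")
    if "json" not in content_type.lower():
        return None
    try:
        return response.json()
    except ValueError:  # requests raises a ValueError subclass on invalid JSON
        return None


class FakeResponse:
    """Hypothetical stand-in for a response, used only for this demonstration."""

    def __init__(self, status_code, content_type, payload):
        self.status_code = status_code
        self.headers = {"Content-Type": content_type}
        self._payload = payload

    def json(self):
        if self._payload is None:
            raise ValueError("no JSON body")
        return self._payload


ok = parse_json_or_none(FakeResponse(200, "application/json", {"items": []}))
html_page = parse_json_or_none(FakeResponse(200, "text/html", None))
blocked = parse_json_or_none(FakeResponse(403, "application/json", {"error": "x"}))
print(ok, html_page, blocked)
```

With the real request you would call `parse_json_or_none(r)` and treat `None` as a sign that the endpoint did not return JSON for this request.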
