Scraping a JS-driven webpage with Python 3.x on a Windows computer

This is my first post here, so I hope you'll be kind enough to point out my mistakes if I've broken any of this website's rules.

First off, I'm quite "self-taught" in both English and Python, so I apologize in advance for any language mistakes.

So, as I said, I'm learning Python, and I was trying to write a script that scrapes a webpage for a particular element (a link), follows that link to the next page, and so on. In my various attempts, I sometimes stumbled on a webpage whose interesting link is generated by a script (most certainly JavaScript), so the page retrieved by requests.get(url) doesn't contain the link I'm interested in (although I can see it in my web browser when inspecting the page or viewing the source code).
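
To make the goal concrete, here is a minimal sketch of the link-following loop, using only the standard library so it runs without extra installs. The id "next", the example.test URLs, and the names NextLinkFinder and follow_chain are all assumptions made up for illustration. In a real script, fetch would be something like `lambda u: requests.get(u).text`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Record the href of the first <a> tag whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and self.href is None and attrs.get("id") == self.target_id:
            self.href = attrs.get("href")


def follow_chain(start_url, fetch, target_id="next", max_hops=10):
    """Follow the chain of 'next' links and return the URLs visited, in order.

    fetch is any callable that takes a URL and returns the page's HTML text.
    """
    visited = []
    url = start_url
    while url and url not in visited and len(visited) < max_hops:
        visited.append(url)
        finder = NextLinkFinder(target_id)
        finder.feed(fetch(url))
        # Resolve relative links against the current page; stop if none found.
        url = urljoin(url, finder.href) if finder.href else None
    return visited


# Hypothetical pages standing in for real HTTP responses.
PAGES = {
    "http://example.test/1": '<a id="next" href="/2">next</a>',
    "http://example.test/2": '<a id="next" href="/3">next</a>',
    "http://example.test/3": "<p>No static link here; a script would add it.</p>",
}

print(follow_chain("http://example.test/1", PAGES.__getitem__))
```

The loop stops on the third page because the static HTML has no matching link, which is exactly where my real script gets stuck.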

I KNOW there is the Selenium solution, but I was wondering if there was ANOTHER way. I found several, but none that I actually got to work. I tried dryscrape, which, as I found out, isn't supported on Windows computers.

Any hint on which direction I should take my research in? Again, I'm hoping for a solution that works on Windows computers without using Selenium.

EDIT: Oh, seeing as the answers already suggested it, I probably should have mentioned that my code already uses requests and BeautifulSoup. The problem is that neither deals with JavaScript that modifies the page directly in the client. When I scrape the webpage in question with BeautifulSoup, many tags (including the one I'm interested in) don't appear anywhere in the page. It appears JavaScript injects some markup when the page is loaded in the browser. In any case, there is no occurrence of the link I'm after in the webpage I point requests.get at, nor in the requests.get(url).text I'm searching with BS4.
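
A small sketch of the symptom, in case it helps: the SAMPLE page below is made up, and the standard library's html.parser stands in for BeautifulSoup so the snippet runs without extra installs. HrefCollector and static_hrefs are hypothetical names. The link that the script would create never shows up as an `<a>` tag in the parsed markup, even though its URL happens to appear inside the script text.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values of <a> tags that exist in the markup itself."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def static_hrefs(html):
    """Return the links a non-JavaScript parser can see in the given HTML."""
    parser = HrefCollector()
    parser.feed(html)
    return parser.hrefs


# Made-up page: one static link, plus a script that would add another one
# in a browser. Nothing here executes the script.
SAMPLE = """
<html><body>
  <a href="/about">About</a>
  <div id="nav"></div>
  <script>
    var a = document.createElement("a");
    a.href = "/page/2";
    document.getElementById("nav").appendChild(a);
  </script>
</body></html>
"""

hrefs = static_hrefs(SAMPLE)
print(hrefs)                # only the static link is found
print("/page/2" in hrefs)   # the injected link is not an element yet
print("/page/2" in SAMPLE)  # though its URL sits inside the script source
```

In this made-up case the URL is at least recoverable from the script source as plain text; on the real page I'm dealing with, I can't assume that.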

Thanks folks :)

2 answers

#1

There are already full solutions out there, like Scrapy.

Instead of going that route, I'd recommend you give libraries like lxml and requests a shot.

#2

I would suggest you try Beautiful Soup.
