I'm trying to develop a simple web scraper. I want to extract the plain text without the HTML markup. I've achieved this goal, but I've seen that on some pages where JavaScript is loaded I don't get good results.
For example, if some JavaScript code adds some text, I can't see it, because when I call
response = urllib2.urlopen(request)
I get the original text without the added content (because the JavaScript is executed on the client).
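Roughly, what I'm doing now looks like this (the URL is just a placeholder, and BeautifulSoup here is only an assumption for stripping the markup):

import urllib2
from bs4 import BeautifulSoup  # assumed here, just to strip the markup

request = urllib2.Request('http://example.com/some-page')  # placeholder URL
response = urllib2.urlopen(request)
html = response.read()
text = BeautifulSoup(html).get_text()  # plain text, but without any JS-added content
print(text)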
So, I'm looking for some ideas to solve this problem.
11 Answers
#1
155
EDIT 30/Dec/2017: This answer appears in top results of Google searches, so I decided to update it. The old answer is still at the end.
dryscrape isn't maintained anymore, and the library the dryscrape developers recommend is Python 2 only. I have found using Selenium's Python library with PhantomJS as a web driver fast enough and easy enough to get the work done.
Once you have installed PhantomJS, make sure the phantomjs binary is available in the current path:
phantomjs --version
# result:
2.1.1
Example
To give an example, I created a sample page with the following HTML code (link):
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
    document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script>
</body>
</html>
Without JavaScript it says: No javascript support
and with JavaScript: Yay! Supports javascript
Scraping without JS support:
import requests
from bs4 import BeautifulSoup
response = requests.get(my_url)
soup = BeautifulSoup(response.text, 'html.parser')
soup.find(id="intro-text")
# Result:
<p id="intro-text">No javascript support</p>
Scraping with JS support:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get(my_url)
p_element = driver.find_element_by_id(id_='intro-text')
print(p_element.text)
# result:
'Yay! Supports javascript'
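Note that newer Selenium releases no longer support PhantomJS, so as an alternative sketch (assuming Selenium 4 with Chrome and a matching chromedriver available) the same check can be done with headless Chrome:

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')          # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get(my_url)
print(driver.find_element(By.ID, 'intro-text').text)
# expected result, as above: 'Yay! Supports javascript'
driver.quit()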
You can also use the Python library dryscrape to scrape JavaScript-driven websites.
Scraping with JS support:
import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response, 'html.parser')
soup.find(id="intro-text")
# Result:
<p id="intro-text">Yay! Supports javascript</p>
#2
34
Maybe Selenium can do it.
from selenium import webdriver
import time
driver = webdriver.Firefox()
driver.get(url)
time.sleep(5)
htmlSource = driver.page_source
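From there, the rendered source can be handed to a parser; a small follow-up sketch, assuming BeautifulSoup is installed:

from bs4 import BeautifulSoup

soup = BeautifulSoup(htmlSource, 'html.parser')
print(soup.get_text())  # the text as rendered after the JavaScript ran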
#3
13
This also seems to be a good solution, taken from a great blog post.
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html

# Take this class for granted. Just use the result of the rendering.
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://pycoders.com/archive/'
r = Render(url)
result = r.frame.toHtml()
# This step is important: convert the QString to ASCII for lxml to process.
# The following returns an lxml element tree.
archive_links = html.fromstring(str(result.toAscii()))
print archive_links
# The following returns a list containing the URLs.
raw_links = archive_links.xpath('//div[@class="campaign"]/a/@href')
print raw_links
#4
12
It sounds like the data you're really looking for can be accessed via a secondary URL called by some JavaScript on the primary page.
While you could try running JavaScript on the server to handle this, a simpler approach might be to load up the page using Firefox and use a tool like Charles or Firebug to identify exactly what that secondary URL is. Then you can just query that URL directly for the data you are interested in.
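For illustration, a hedged sketch of that second step; the endpoint below is made up, and in practice you would copy it from the browser's network tab:

import requests

# hypothetical JSON endpoint discovered while watching the page's network traffic
api_url = 'http://example.com/api/items?page=1'
resp = requests.get(api_url, headers={'User-Agent': 'Mozilla/5.0'})
data = resp.json()  # many such endpoints return JSON directly
print(data)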
#5
12
We are not getting the correct results because any JavaScript-generated content needs to be rendered in the DOM. When we fetch an HTML page, we fetch the initial DOM, unmodified by JavaScript.
Therefore we need to render the JavaScript content before we crawl the page.
As Selenium is already mentioned many times in this thread (and its occasional slowness has been noted as well), I will list two other possible solutions.
Solution 1: This is a very nice tutorial on how to use Scrapy to crawl JavaScript-generated content, and we are going to follow just that.
What we will need:
- Docker installed on our machine. This is a plus over the other solutions so far, as it utilizes an OS-independent platform.
- Install Splash following the instructions listed for our corresponding OS. Quoting from the Splash documentation:
Splash is a javascript rendering service. It’s a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and QT5.
Essentially, we are going to use Splash to render JavaScript-generated content.
- Run the Splash server (a quick check that the container answers is shown after these steps):
sudo docker run -p 8050:8050 scrapinghub/splash
- Install the scrapy-splash plugin:
pip install scrapy-splash
- Assuming that we already have a Scrapy project created (if not, let's make one), we will follow the guide and update settings.py. Go to your Scrapy project's settings.py and set these middlewares:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
The URL of the Splash server (if you're using Windows or OSX, this should be the URL of the Docker machine: How to get a Docker container's IP address from the host?):
SPLASH_URL = 'http://localhost:8050'
And finally you need to set these values too:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
- Finally, we can use a SplashRequest:
In a normal spider you have Request objects which you can use to open URLs. If the page you want to open contains JS-generated data, you have to use SplashRequest (or SplashFormRequest) to render the page. Here's a simple example:
import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = "jsscraper"
    start_urls = ["http://quotes.toscrape.com/js/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='render.html'
            )

    def parse(self, response):
        for q in response.css("div.quote"):
            quote = QuoteItem()  # QuoteItem is the Item class defined in the tutorial's items.py
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            yield quote
SplashRequest renders the URL as HTML and returns the response, which you can use in the callback (parse) method.
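As promised in the setup steps above, a quick way to confirm the Splash container answers before wiring it into the spider is to hit its render.html endpoint (a sketch; any public URL works as the test target):

import requests

# 8050 is the default Splash port from the docker run command above
r = requests.get('http://localhost:8050/render.html',
                 params={'url': 'http://example.com', 'wait': 0.5})
print(r.status_code)  # 200 means Splash fetched and rendered the page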
Solution 2: Let's call this experimental at the moment (May 2018)...
This solution is for Python version 3.6 only (at the moment).
Do you know the requests module (well, who doesn't)?
Now it has a web-crawling little sibling: requests-HTML:
This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.
- Install requests-html:
pipenv install requests-html
- Make a request to the page's URL:
from requests_html import HTMLSession

session = HTMLSession()
r = session.get(a_page_url)
- Render the response to get the JavaScript-generated bits:
r.html.render()
Finally, the module seems to offer scraping capabilities.
Alternatively, we can try the well-documented way of using BeautifulSoup with the r.html object we just rendered.
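Putting those steps together (the URL is a placeholder, and the BeautifulSoup part is just one option for the parsing):

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
r = session.get('http://example.com')  # placeholder URL
r.html.render()                        # executes the page's JavaScript (downloads Chromium on first use)
soup = BeautifulSoup(r.html.html, 'html.parser')
print(soup.get_text())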
#6
8
If you have ever used the Requests module for Python before, I recently found out that the developer created a new module called Requests-HTML, which now also has the ability to render JavaScript.
You can also visit https://html.python-requests.org/ to learn more about this module, or if you're only interested in rendering JavaScript, you can visit https://html.python-requests.org/?#javascript-support to learn directly how to use the module to render JavaScript with Python.
Essentially, once you correctly install the Requests-HTML module, the following example, which is shown on the above link, shows how you can use this module to scrape a website and render the JavaScript contained within it:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://python-requests.org/')
r.html.render()
r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>' #This is the result.
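If you only need a specific element's rendered text, the same object also exposes a find() helper; a short sketch (the CSS selector here is hypothetical):

# hypothetical selector; use whatever matches the element you are after
element = r.html.find('#some-id', first=True)
if element is not None:
    print(element.text)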
I recently learnt about this from a YouTube video. Click Here! to watch the YouTube video, which demonstrates how the module works.
#7
6
Selenium is the best for scraping JS and Ajax content.
Check this article https://likegeeks.com/python-web-scraping/
$ pip install selenium
Then download Chrome webdriver.
from selenium import webdriver
browser = webdriver.Chrome()
browser.get("https://www.python.org/")
nav = browser.find_element_by_id("mainnav")
print(nav.text)
Easy, right?
#8
5
You can also execute JavaScript using the webdriver.
from selenium import webdriver
driver = webdriver.Firefox()
driver.get(url)
driver.execute_script('document.title')
or store the value in a variable
result = driver.execute_script('var text = document.title; return text;')
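And since the question is about extracting text, the same mechanism can return the fully rendered text of the page (a sketch):

# returns the page's visible text after the JavaScript has run
rendered_text = driver.execute_script('return document.body.innerText;')
print(rendered_text)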
#9
4
You'll want to use urllib, requests, BeautifulSoup, and the Selenium web driver in your script for different parts of the page (to name a few).
Sometimes you'll get what you need with just one of these modules.
Sometimes you'll need two, three, or all of these modules.
Sometimes you'll need to switch off the JS in your browser.
Sometimes you'll need header info in your script.
No two websites can be scraped the same way, and no website can be scraped the same way forever without modifying your crawler, usually after a few months. But they can all be scraped! Where there's a will, there's a way for sure.
If you need scraped data continuously into the future, just scrape everything you need and store it in .dat files with pickle.
Just keep searching for how to do what you need with these modules, and keep copying and pasting your errors into Google.
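As a rough illustration of the pickle suggestion (the file name and the data are made up):

import pickle

scraped = {'url': 'http://example.com', 'text': 'whatever you extracted'}  # made-up sample record

# store it for later
with open('scraped.dat', 'wb') as f:
    pickle.dump(scraped, f)

# and read it back when you need it again
with open('scraped.dat', 'rb') as f:
    restored = pickle.load(f)
print(restored)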
#10
2
A mix of BeautifulSoup and Selenium works very well for me.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup as bs

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    # waits up to 10 seconds until the element is located; other wait conditions
    # such as visibility_of_element_located or text_to_be_present_in_element are available
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement")))
    html = driver.page_source
    soup = bs(html, "lxml")
    dynamic_text = soup.find_all("p", {"class": "class_name"})  # or other attributes, optional
except TimeoutException:
    print("Couldn't locate element")
P.S. You can find more wait conditions here
#11
2
I personally prefer using Scrapy and Selenium and dockerizing both in separate containers. This way you can install both with minimal hassle and crawl modern websites, which almost all contain JavaScript in one form or another. Here's an example:
Use scrapy startproject to create your scraper and write your spider; the skeleton can be as simple as this:
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['https://somewhere.com']

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0])

    def parse(self, response):
        # do stuff with results, scrape items etc.
        # for now we're just checking that everything worked
        print(response.body)
The real magic happens in middlewares.py. Override two methods of the downloader middleware, __init__ and process_request, in the following way:
# import some additional modules that we need
import os
from copy import deepcopy
from time import sleep

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver

class SampleProjectDownloaderMiddleware(object):

    def __init__(self):
        SELENIUM_LOCATION = os.environ.get('SELENIUM_LOCATION', 'NOT_HERE')
        SELENIUM_URL = f'http://{SELENIUM_LOCATION}:4444/wd/hub'
        chrome_options = webdriver.ChromeOptions()
        # chrome_options.add_experimental_option("mobileEmulation", mobile_emulation)
        self.driver = webdriver.Remote(command_executor=SELENIUM_URL,
                                       desired_capabilities=chrome_options.to_capabilities())

    def process_request(self, request, spider):
        self.driver.get(request.url)

        # sleep a bit so the page has time to load
        # or monitor items on page to continue as soon as page ready
        sleep(4)

        # if you need to manipulate the page content like clicking and scrolling, you do it here
        # self.driver.find_element_by_css_selector('.my-class').click()

        # you only need the now properly and completely rendered html from your page to get results
        body = deepcopy(self.driver.page_source)

        # copy the current url in case of redirects
        url = deepcopy(self.driver.current_url)

        return HtmlResponse(url, body=body, encoding='utf-8', request=request)
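One thing worth adding to this middleware (a sketch, not part of the snippet above) is shutting the Selenium driver down when the spider closes; this is also what the otherwise unused signals import can be used for:

    # extra methods for the SampleProjectDownloaderMiddleware class above
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # quit the driver when the spider finishes
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider):
        self.driver.quit()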
Don't forget to enable this middleware by uncommenting the next lines in the settings.py file:
DOWNLOADER_MIDDLEWARES = {
    'sample_project.middlewares.SampleProjectDownloaderMiddleware': 543,
}
Next, the dockerization. Create your Dockerfile from a lightweight image (I'm using Python Alpine here), copy your project directory to it, and install the requirements:
# Use an official Python runtime as a parent image
FROM python:3.6-alpine
# install some packages necessary to scrapy and then curl because it's handy for debugging
RUN apk --update add linux-headers libffi-dev openssl-dev build-base libxslt-dev libxml2-dev curl python-dev
WORKDIR /my_scraper
ADD requirements.txt /my_scraper/
RUN pip install -r requirements.txt
ADD . /scrapers
And finally, bring it all together in docker-compose.yaml:
version: '2'
services:
  selenium:
    image: selenium/standalone-chrome
    ports:
      - "4444:4444"
    shm_size: 1G

  my_scraper:
    build: .
    depends_on:
      - "selenium"
    environment:
      - SELENIUM_LOCATION=samplecrawler_selenium_1
    volumes:
      - .:/my_scraper
    # use this command to keep the container running
    command: tail -f /dev/null
Run docker-compose up -d. If you're doing this for the first time, it will take a while to fetch the latest selenium/standalone-chrome image and to build your scraper image as well.
Once it's done, you can check that your containers are running with docker ps, and also check that the name of the selenium container matches the environment variable we passed to our scraper container (here, it was SELENIUM_LOCATION=samplecrawler_selenium_1).
Enter your scraper container with docker exec -ti YOUR_CONTAINER_NAME sh (for me the command was docker exec -ti samplecrawler_my_scraper_1 sh), cd into the right directory, and run your scraper with scrapy crawl my_spider.
The entire thing is on my GitHub page and you can get it from here.