I have been trying this simple task for hours. No available libraries seem to help and no questions here seem to tackle this scenario.
It's fairly simple:
- I have an entire page's markup as a string.
- I need to use CSS selectors to point to the elements I need to scrape the data from.
- I DO NOT want to create actual HTML DOM elements. Only scrape the data from them. The page might contain image, audio, video and other elements that I don't want to create.
- It needs to be able to deal with markup errors and HTML5-style tagging. Currently, trying to parse it as XML throws an "Invalid XML" exception.
- It needs to happen in the browser. So, no NodeJS modules.
In Java I've been able to do exactly this using jsoup, but there doesn't seem to be an equivalent library for JavaScript running in a browser.
Thanks for your time.
2 Solutions
#1
0
@JaromandaX's suggestion was correct. One way to do this is to use a DOMParser object. It lets you create the elements in a separate document and then use .querySelector or .querySelectorAll on them, without loading any external resources or running any scripts.
This is what worked for me:
var parser = new DOMParser();
var doc = parser.parseFromString(markup, "text/html");
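From there, the parsed document can be queried like any other. A minimal sketch, assuming the page source is passed in as a string; the helper name scrapeText and the selector are hypothetical, chosen only for illustration:

```javascript
// Hypothetical helper: returns the trimmed text of every element matching `selector`.
function scrapeText(markup, selector) {
  // Parse into a separate document; nothing is attached to the live page.
  var doc = new DOMParser().parseFromString(markup, "text/html");
  // querySelectorAll returns a NodeList, so copy it into an array before mapping.
  return Array.from(doc.querySelectorAll(selector)).map(function (el) {
    return el.textContent.trim();
  });
}

// Example call (browser only, since DOMParser is a browser API):
// scrapeText("<ul><li>One<li>Two</ul>", "li");  // returns ["One", "Two"]
```

The example markup deliberately leaves the li tags unclosed: with "text/html" the parser recovers from that kind of error instead of throwing, which is what the XML mode could not do.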
#2
0
You can use PHP's Goutte or Python's BeautifulSoup4 library, both of which accept CSS selectors (Goutte also accepts XPath expressions), whichever you are comfortable with.
Here are some simple examples to get started.
PHP Goutte:
require_once 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
// $url is the address of the page to scrape
$resp = $client->request('GET', $url);
foreach ($resp->filter('your css selector here') as $li) {
    // your logic here
}
Python BeautifulSoup example:
import random
import time

import requests
from bs4 import BeautifulSoup

timeout_time = 30  # seconds to wait for each response

# User-Agent headers to rotate between requests
header = [{"User-Agent": "Mozilla/5.0 (Windows NT 5.1; rv:14.0) Gecko/20100101 Firefox/14.0.1"},
          {"User-Agent": "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14"},
          {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201"},
          {"User-Agent": "Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25"}]

def tryAgain(passed_url):
    # Fetch the page; if the first request fails, keep retrying until one succeeds
    try:
        page = requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
        return page
    except Exception:
        while True:
            print("Trying again the URL:")
            print(passed_url)
            try:
                page = requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
                print("-------------------------------------")
                print("---- URL was successfully scraped ---")
                print("-------------------------------------")
                return page
            except Exception:
                # Wait 20 seconds before the next attempt
                time.sleep(20)

main_url = "your URL here"
main_page_html = tryAgain(main_url)
main_page_soup = BeautifulSoup(main_page_html, "html.parser")
for a in main_page_soup.select('css selector here'):
    # Print the text of the first nested match inside each selected element
    print(a.select('your css selector here')[0].text)