I need to parse a large number of HTML pages on the server side.
We all agree that regexp is not the way to go here.
It seems to me that JavaScript is the native way of parsing an HTML page, but that assumption relies on the server-side code having all the DOM abilities JavaScript has inside a browser.
Does Node.js have that ability built in?
Is there a better approach to this problem, parsing HTML on the server side?
6 answers
#1
70
You can use the npm modules jsdom and htmlparser to create and parse a DOM in Node.js.
Other options include:
- BeautifulSoup for Python
- You can convert your HTML to XHTML and use XSLT
- HTMLAgilityPack for .NET
- CsQuery for .NET (my new favorite)
- The SpiderMonkey and Rhino JS engines have native E4X support. This may be useful, but only if you convert your HTML to XHTML.
Out of all these options, I prefer the Node.js option, because it uses the standard W3C DOM accessor methods and I can reuse code on both the client and the server. I wish BeautifulSoup's methods were more similar to the W3C DOM, and I think converting your HTML to XHTML just to write XSLT is plain sadistic.
#2
57
Use Cheerio. It isn't as strict as jsdom and is optimized for scraping. As a bonus, it uses the jQuery selectors you already know.
❤ Familiar syntax: Cheerio implements a subset of core jQuery. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.
ϟ Blazingly fast: Cheerio works with a very simple, consistent DOM model. As a result parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that cheerio is about 8x faster than JSDOM.
❁ Insanely flexible: Cheerio wraps around @FB55's forgiving htmlparser. Cheerio can parse nearly any HTML or XML document.
#3
7
Use htmlparser2. It's much faster and pretty straightforward. Consult this usage example:
https://www.npmjs.org/package/htmlparser2#usage
And the live demo here:
#5
1
jsdom is too strict for any real screen scraping, whereas BeautifulSoup doesn't choke on bad markup.
node-soupselect is a port of Python's BeautifulSoup to Node.js, and it works beautifully.
#6
0
In .NET, there's the HTML Agility Pack, which is an extremely solid HTML parsing library.