Reading an HTML file into a DOM tree using Java

Date: 2023-01-25 21:54:05

Is there a parser/library which is able to read an HTML document into a DOM tree using Java? I'd like to use the standard DOM/Xpath API that Java provides.

Most libraries seem to have custom APIs for this task. Furthermore, converting HTML to an XML DOM seems to be unsupported by most of the available parsers.

Any ideas or experience with a good HTML DOM parser?

5 Solutions

#1


6  

JTidy, either by processing the stream to XHTML and then re-parsing it with your favourite DOM implementation, or by using parseDOM if the limited DOM implementation that gives you is enough.

Alternatively, NekoHTML.

#2


3  

Since HTML files are generally problematic, you'll need to first clean them up using a parser/scanner. I've used JTidy but never happily. NekoHTML works okay, but any of these tools are always just making a best guess of what is intended. You're effectively asking to let a program alter a document's markup until it conforms to a schema. That will likely cause structural (markup), style or content loss. It's unavoidable, and you won't really know what's missing unless you manually scan via a browser (and then you have to trust the browser too).

It really depends on your purpose — if you have thousands of ugly documents with tons of extraneous (non-HTML) markup, then a manual process is probably unreasonable. If your goal is accuracy on a few important documents, then manually fixing them is a reasonable proposition.

One approach is the manual process of repeatedly passing the source through a well-formed and/or validating parser, in an edit cycle using the error messages to eventually fix the broken markup. This does require some understanding of XML, but that's not a bad education to undertake.
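
One way to drive that edit cycle is sketched below, assuming the markup is held in a String. The class name MarkupChecker and the method findProblems are made up for illustration; only the JAXP and SAX calls are standard. Because a well-formedness error is fatal, the parser stops at the first one, so each run reports the next problem to fix.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper: collects the parser's complaints instead of printing them.
public class MarkupChecker {

    /** Returns one message per reported problem; an empty list means well-formed. */
    public static List<String> findProblems(String source)
            throws ParserConfigurationException, IOException {
        final List<String> problems = new ArrayList<>();
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        db.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { problems.add(describe(e)); }
            public void error(SAXParseException e) { problems.add(describe(e)); }
            public void fatalError(SAXParseException e) throws SAXException {
                problems.add(describe(e));
                throw e; // parsing cannot continue after a fatal error
            }
        });
        try {
            db.parse(new InputSource(new StringReader(source)));
        } catch (SAXException e) {
            // Already recorded by the error handler above.
        }
        return problems;
    }

    // Line and column point at where the parser gave up, which is the place to start editing.
    private static String describe(SAXParseException e) {
        return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                + ": " + e.getMessage();
    }
}
```

Calling findProblems after each manual fix and stopping when the list comes back empty gives the loop described above.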

With Java 5 the necessary XML features — called the JAXP API — are now built into Java itself; you don't need any external libraries.

You first obtain an instance of a DocumentBuilderFactory, set its features, create a DocumentBuilder (parser), then call its parse() method with an InputSource. InputSource has a number of possible constructors, with a StringReader used in the following example:

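import java.io.StringReader;
import javax.xml.parsers.*;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
// ...

// 'source' holds the (already well-formed) markup as a String
static Document parse(String source) throws Exception {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setValidating(false);                        // no DTD validation
    dbf.setNamespaceAware(true);
    dbf.setIgnoringComments(false);                  // keep comment nodes
    dbf.setIgnoringElementContentWhitespace(false);  // keep whitespace-only text nodes
    dbf.setExpandEntityReferences(false);
    DocumentBuilder db = dbf.newDocumentBuilder();
    return db.parse(new InputSource(new StringReader(source)));
}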

This returns a DOM Document. If you don't mind using external libraries there are also the JDOM and XOM APIs, and while these have some advantages over the SAX and DOM APIs in JAXP, they do require third-party libraries to be added. The DOM can be somewhat cumbersome, but after so many years of using it I don't really mind any longer.
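
Since the question asks for the standard XPath API as well, here is a minimal sketch of querying such a Document with javax.xml.xpath. The class XPathOverDom and the method linkTargets are hypothetical names, and the input is assumed to be well-formed markup without a default namespace, so that plain names like a match.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class: parse well-formed markup, then query it with XPath.
public class XPathOverDom {

    /** Returns the href value of every a element, in document order. */
    public static List<String> linkTargets(String xhtml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        // Namespace awareness is left at its default (off); the input is
        // assumed to carry no default namespace.
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);

        List<String> targets = new ArrayList<>();
        for (int i = 0; i < hrefs.getLength(); i++) {
            targets.add(hrefs.item(i).getNodeValue());
        }
        return targets;
    }
}
```

If the document does declare the XHTML namespace and the parser is namespace-aware, the expression needs a prefix bound through a NamespaceContext set on the XPath object.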

#3


2  

Here is a link that might be useful. It's a list of open source HTML parsers in Java: Open Source HTML Parsers in Java

#4


1  

TagSoup can do what you want.

#5


-1  

Apache's Xerces2 parser should do what you want.
