Using CDATA in an XML file to parse HTML data

Time: 2022-12-01 13:15:37

I have an XML file whose content includes malformed HTML. Since an XML parser cannot handle HTML tags like <br>, I have used CDATA for saving and parsing.

I have used documentBuilder.setCoalescing(true); while parsing, to recover the data in <![CDATA[<br>test<br>data<br>]]> without the CDATA markers.

But in the output, < and > are replaced by &lt; and &gt; respectively.

I am expecting this string in the result:

<br>test<br>data<br>

in the parsed string.

How can I do this? Any ideas? Thanks in advance!

UPDATE: I have two follow-up questions.

1. Is there any way to turn malformed HTML (e.g. <br>) into parsable XML (e.g. <br/>) via code? If so, will it handle &nbsp; as well?

2. Is there any way to convert HTML text to plain text in Java (e.g. <div>test&nbsp;text</div> to test text)?

4 solutions

#1


2  

Coalescing is an operation where the contents of CDATA sections (nodes) are converted to text nodes and merged with the contents of adjacent text nodes. Converting CDATA sections to text nodes in itself imposes the restriction that the resulting text nodes be composed of valid XML characters. This preserves the original document formatting; in other words, the structure of the nodes in the original document does not change.

The resulting behavior is that, of the five predefined entities (&lt;, &gt;, &amp;, &quot; and &apos;), the first three (<, > and &) are written in their escaped form, since leaving them unaltered would change the document structure.

In short, you cannot get what you want just by extracting values from the DOM. You will need to decode the values into the form you want after parsing the document. Apache Commons Lang has a utility class, StringEscapeUtils, that provides the method you need.

#2


2  

Coalescing means that the parser will convert CDATA nodes to Text nodes. When the document is serialized to XML, of course the text content (HTML) must be escaped. If you want to do something with the HTML you must first extract it as text--then you can render it in a browser, or whatever.
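
To make the difference between extracting and serializing concrete, here is a minimal sketch that uses only the JDK's built-in parser and transformer. The class name CoalescingDemo, the helper names, and the sample input are assumptions made for illustration; they are not from the question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CoalescingDemo {
    // Hypothetical sample document: HTML wrapped in a CDATA section
    static final String XML = "<root><![CDATA[<br>test<br>data<br>]]></root>";

    // Parse with coalescing enabled, so CDATA becomes ordinary text
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setCoalescing(true);
        DocumentBuilder db = factory.newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(xml)));
    }

    // Extracting the text from the DOM yields the raw characters
    static String extractText(String xml) throws Exception {
        return parse(xml).getDocumentElement().getTextContent();
    }

    // Serializing the DOM back to XML must escape markup characters in text
    static String serialize(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(parse(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractText(XML)); // prints <br>test<br>data<br>
        System.out.println(serialize(XML));   // the same text, with < escaped
    }
}
```

If the escaped form shows up in your result, it is worth checking whether the string came from a serializer rather than from getTextContent().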

UPDATE:

1) You can use JTidy, http://jtidy.sourceforge.net/index.html, to parse the HTML content and produce XML or XHTML. Something like this:

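import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

// factory is a javax.xml.parsers.DocumentBuilderFactory created elsewhere
DocumentBuilder db = factory.newDocumentBuilder();
Document doc = db.parse(..); // parse your input document

// Obtain the HTML content; it may be buried deeper down
// or scattered around in different places
String text = doc.getDocumentElement().getTextContent();

// Parse with JTidy to convert from HTML to XHTML
Tidy tidy = new Tidy();
tidy.setXHTML(true);

// Passing null as the second argument skips writing the tidied markup out
Document htmlDoc = tidy.parseDOM(new StringReader(text), null);

// Pretty-print the resulting XHTML DOM to standard output
Transformer t = TransformerFactory.newInstance().newTransformer();
t.setOutputProperty(OutputKeys.INDENT, "yes");
t.transform(new DOMSource(htmlDoc), new StreamResult(System.out));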

2) Yes. Once you have the parsed htmlDoc (above), you can traverse it or apply XPath or whatever to extract the text pieces you want. Just remember that &nbsp; will be unescaped to '\u00A0'. So if you want really plain text, you should perhaps do

String s = text.replace('\u00A0', ' ');
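
As a rough end-to-end illustration of that last step, the sketch below extracts the text of an already well-formed fragment with the JDK parser and then replaces the non-breaking space. It assumes the HTML has already been tidied into XHTML, and it writes the non-breaking space as &#160; because a plain XML parser does not know the &nbsp; entity without a DTD. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class PlainTextDemo {
    // Returns the text content of a well-formed fragment, with
    // non-breaking spaces turned into ordinary spaces
    static String toPlainText(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        // getTextContent() concatenates all descendant text and drops the tags
        String text = doc.getDocumentElement().getTextContent();
        return text.replace('\u00A0', ' ');
    }

    public static void main(String[] args) throws Exception {
        // &#160; is the numeric character reference for a non-breaking space
        System.out.println(toPlainText("<div>test&#160;text</div>")); // prints test text
    }
}
```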

#3


1  

If you are simply troubled by ill-formed XML, you might consider the tidy tool which can turn your HTML into well-formed XML.

In general, you'll need an XML parser that lets you access the raw content of the CDATA marked sections and then put that raw data to whatever use you have in mind.
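
One way to get at that raw content with the JDK's DOM parser is to leave coalescing off (the default) and look for CDATA section nodes. This is only a sketch: the class name, the helper name, and the sample input are made up for illustration, and it inspects only the direct children of the root element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CDATASection;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CdataDemo {
    // Collects the raw content of CDATA sections directly under the root
    static List<String> cdataSections(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setCoalescing(false); // default: CDATA stays a separate node
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        List<String> result = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() == Node.CDATA_SECTION_NODE) {
                // getData() returns the characters between <![CDATA[ and ]]>
                result.add(((CDATASection) n).getData());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>intro <![CDATA[<br>test<br>data<br>]]> outro</root>";
        System.out.println(cdataSections(xml)); // prints [<br>test<br>data<br>]
    }
}
```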

#4


0  

@Billu: You can have a look at the Apache Commons Lang library: org.apache.commons.lang.StringEscapeUtils. This class has escapeXml()/escapeHtml() and unescapeXml()/unescapeHtml() methods. For example, for your first problem about converting &lt; and &gt;, you can use unescapeHtml(your-data).

You may not even need to store or pass the data in a CDATA section; you can just use escapeXml(data) at the sending/storing end and unescapeXml(data) at the receiving/retrieval end.
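
To show what such an escape/unescape round trip does, here is a hand-written stand-in that covers only the five predefined XML entities and needs nothing beyond the JDK. It is not the Commons Lang implementation: the class XmlEscapeDemo and its two methods are hypothetical helpers that merely mirror the library's method names, and numeric character references are not handled.

```java
public class XmlEscapeDemo {
    // Replace the five characters that have predefined XML entities
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Reverse of escapeXml for the same five entities.
    // &amp; is handled last so that "&amp;lt;" decodes to "&lt;", not "<".
    static String unescapeXml(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String html = "<br>test<br>data<br>";
        String stored = escapeXml(html);          // safe to place in element text
        System.out.println(stored);               // prints &lt;br&gt;test&lt;br&gt;data&lt;br&gt;
        System.out.println(unescapeXml(stored));  // prints <br>test<br>data<br>
    }
}
```

In real code the library methods are the better choice, since they also deal with cases this sketch ignores.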

For more information, here is the link: StringEscapeUtils

Please let me know if the above information helped you.
