How to parse XHTML files that are not 100% valid?

I have XHTML files whose source is not completely valid: it does not follow the DTD the documents are supposed to conform to.

For example, in some places it uses &ldquo; for a quotation mark, or &rsquo; for an apostrophe. This causes exceptions in my C# code.

Is there a method, or a link to a resource, that I can use to get around this?

3 Answers

#1

If the file is otherwise well-formed, you can define the character entities in your own DTD.
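One way this could look in C# is to hand the parser an internal DTD subset that declares the missing entities. This is only a sketch under stated assumptions: the class name `XhtmlEntityLoader` and the list of entities are made up for illustration, and it assumes the files carry no DOCTYPE of their own that would conflict with the supplied one.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class XhtmlEntityLoader
{
    // Assumed entity list: extend it with whatever names the files actually use.
    private const string InternalSubset =
        "<!ENTITY ldquo \"&#8220;\">" +
        "<!ENTITY rdquo \"&#8221;\">" +
        "<!ENTITY lsquo \"&#8216;\">" +
        "<!ENTITY rsquo \"&#8217;\">";

    public static XDocument Load(string xhtml)
    {
        var settings = new XmlReaderSettings
        {
            // The default is Prohibit, which rejects any DTD information.
            DtdProcessing = DtdProcessing.Parse,
            // Do not fetch external DTDs over the network.
            XmlResolver = null
        };

        // Supplies the entity declarations as if they were written in the
        // document's own <!DOCTYPE html [ ... ]> internal subset.
        var context = new XmlParserContext(
            null, null, "html", null, null, InternalSubset, "", "", XmlSpace.None);

        using (var text = new StringReader(xhtml))
        using (var reader = XmlReader.Create(text, settings, context))
        {
            return XDocument.Load(reader);
        }
    }
}
```

With this in place, a call such as `XhtmlEntityLoader.Load(File.ReadAllText(path))` should expand the declared entities instead of throwing on them; any entity that is not declared in the subset will still raise an `XmlException`.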

If the file is ill-formed, the Html Agility Pack from CodePlex will parse it.
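A minimal sketch of that route, assuming the third-party `HtmlAgilityPack` package is referenced (it is also distributed through NuGet). The wrapper class `LenientXhtmlLoader` is hypothetical; the point is only that the lenient HTML parser reads the markup and then writes it back out as XML that a strict parser can load.

```csharp
using System.IO;
using HtmlAgilityPack; // third-party package, not part of the .NET base library

// Hypothetical helper, not part of Html Agility Pack itself.
public static class LenientXhtmlLoader
{
    public static string ToXml(string markup)
    {
        var doc = new HtmlDocument();

        // Ask the library to serialize the parsed tree as XML.
        doc.OptionOutputAsXml = true;

        // The HTML parser tolerates unknown entities and unclosed tags.
        doc.LoadHtml(markup);

        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```

The returned string can then be passed to `XDocument.Parse` or `XmlDocument.LoadXml`. It is worth inspecting the output on a few real files first, since the library may restructure badly broken markup in ways you do not expect.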

#2

You could parse the document as HTML instead, since both end up in a DOM and HTML parsers scoff at these pansy quotation mark problems. Going along with unknown's HTML Tidy idea, you could then serialize the DOM back into a valid XHTML file. (This is essentially the same as using HTML Tidy, which presumably uses an HTML parser anyway, except that you'd do it programmatically from C#.)

#3

Well, by the nature of XML it needs to be well-formed, otherwise it won't render at all. I'd first see what kinds of errors W3C's validator reports: http://validator.w3.org/

Also consider using HTML Tidy, which can be configured to fix XML as well.

We use Hpricot to fix our XML, but then again we are building Rails apps. Not sure about C#.
