Parsing an XHTML document with undefined entities

Date: 2021-03-06 20:09:27

When coding in Python, if I have to load an XHTML document that uses an undefined entity, I create a parser and update its entity dict (e.g. for nbsp):

import xml.etree.ElementTree as ET
parser = ET.XMLParser()
parser.entity['nbsp'] = ' '
tree = ET.parse(opener.open(url), parser=parser)
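
For anyone who wants to try the Python approach without a network connection, here is a minimal self-contained sketch. The sample document, the `parse_xhtml` helper and the `XHTML_NS` constant are made up for illustration and are not part of the question. The sample deliberately carries a DOCTYPE with a public identifier: as far as I know, expat only passes unknown entities on to the `entity` dict when the document references an external DTD, and otherwise reports the undefined entity as an error straight away.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not from the question). The DOCTYPE is
# assumed to be present, as it usually is in real XHTML pages.
SAMPLE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p>a&nbsp;b</p></body></html>'
)
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def parse_xhtml(text, entities):
    """Parse text, replacing the named entities with the given strings."""
    parser = ET.XMLParser()
    # Register each entity name (without '&' and ';') with its replacement.
    parser.entity.update(entities)
    return ET.fromstring(text, parser=parser)


root = parse_xhtml(SAMPLE, {"nbsp": "\u00a0"})
paragraph = root.find(f"{XHTML_NS}body/{XHTML_NS}p")
print(repr(paragraph.text))
```

Mapping nbsp to U+00A0 rather than a plain space keeps the non-breaking semantics of the original entity.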

With VB.NET I tried to parse the XHTML document as a LINQ XDocument:

Dim x As XDocument = XDocument.Load(url)

which raised an XmlException:

Reference to undeclared entity 'nbsp'

Googling around, I couldn't find any example of how to update the entity table, or any other simple way to parse an XHTML document with an undefined entity.

How can I solve this apparently simple problem?

3 answers

#1


2  

Entity resolution is done by the underlying parser, which here is a standard XmlReader (or XmlTextReader).

Officially, you're supposed to declare entities in DTDs (see Oleg's answer in "Problem with XHTML entities"), or load DTDs dynamically into your documents. There are some examples here on SO, such as "How do I resolve entities when loading into an XDocument?"

You can also create a hacky XmlTextReader-derived class that looks entities up in a dictionary and returns Text nodes in their place, as in the following sample code:

using (XmlTextReaderWithEntities reader = new XmlTextReaderWithEntities(MyXmlFile))
{
    reader.AddEntity("nbsp", "\u00A0");
    XDocument xdoc = XDocument.Load(reader);
}

...

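// Requires: using System.Collections.Generic; using System.Xml;
// (plus using System.Xml.Linq; for XDocument in the usage snippet)
public class XmlTextReaderWithEntities : XmlTextReader
{
    // Replacement text waiting to be returned as a Text node, or null if none is pending.
    private string _nextEntity;
    // Maps entity names (without '&' and ';') to their replacement text.
    private Dictionary<string, string> _entities = new Dictionary<string, string>();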

    // NOTE: override other constructors for completeness
    public XmlTextReaderWithEntities(string path)
        : base(path)
    {
    }

    public void AddEntity(string entity, string value)
    {
        _entities[entity] = value;
    }

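    public override bool Read()
    {
        // A replacement stored by ResolveEntity() is still pending: report
        // success without advancing, so the caller sees it as a Text node.
        if (_nextEntity != null)
            return true;

        return base.Read();
    }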

    public override XmlNodeType NodeType
    {
        get
        {
            if (_nextEntity != null)
                return XmlNodeType.Text;

            return base.NodeType;
        }
    }

    public override string Value
    {
        get
        {
            if (_nextEntity != null)
            {
                string value = _nextEntity;
                _nextEntity = null;
                return value;
            }
            return base.Value;
        }
    }

    public override void ResolveEntity()
    {
        // if not found, return the string as is
        if (!_entities.TryGetValue(LocalName, out _nextEntity))
        {
            _nextEntity = "&" + LocalName + ";";
        }
        // NOTE: we don't use base here. Depends on the scenario
    }
}

This approach works in simple scenarios, but you may need to override other members for completeness.

PS: sorry it's in C#; you'll have to adapt it to VB.NET :)

#2


1  

I haven't done this myself, but you could create an XmlParserContext object with the required entity declarations as its internalSubset. Pass that context to the XmlTextReader constructor and create the XDocument by loading from the reader. MSDN already has a simple-looking VB example snippet that uses a pre-defined entity.
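
The internal-subset idea can also be tried out in Python, the language of the question's first snippet. This is only an analogue, not the .NET XmlParserContext API: the sketch prepends a DOCTYPE whose internal subset declares the entity, which assumes the document has no DOCTYPE and no XML declaration of its own. The `with_internal_subset` helper and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML without a DOCTYPE or XML declaration (an assumption).
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p>a&nbsp;b</p></body></html>'
)


def with_internal_subset(text, root_name, entities):
    """Prepend a DOCTYPE whose internal subset declares the given entities.

    `entities` maps entity names to code points; each one is emitted as a
    numeric character reference, so the value needs no escaping.
    """
    declarations = "".join(
        f'<!ENTITY {name} "&#{code_point};">'
        for name, code_point in entities.items()
    )
    return f"<!DOCTYPE {root_name} [{declarations}]>{text}"


root = ET.fromstring(with_internal_subset(SAMPLE, "html", {"nbsp": 160}))
paragraph = root.find(".//{http://www.w3.org/1999/xhtml}p")
print(repr(paragraph.text))
```

Because the entity is declared inside the document itself, the default parser expands it and no custom entity table is needed.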

#3


0  

In this case I suppose you're talking about a page on the web, so you could use Html Agility Pack, which may meet your needs.

I use its XPath support, element access and other features. It is very useful for searching within an HTML page, etc.

You can find the documentation here: htmlagilitypack
