I am producing an XML document from Python, using the xml.dom package to build it. We want the output to contain the character reference &#x03c6;, which is a φ. However, when we put that string in a text node and call toxml() on it, we get &amp;#x03c6;. Our current workaround is to call saxutils.unescape() on the result of toxml(), but this is not ideal because we will have to parse the XML twice.
Is there some way to get the dom package to recognize "&#x03c6;" as an XML character?
1 solution
#1
I think you need to use a Unicode string with \u03c6 in it, because the .data field of a text node is supposed (as far as I understand) to be "parsed" data, not including XML entities (whence the &amp; when made back into XML). If you want to ensure that, on output, non-ASCII characters are expressed as entities, you could do:
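import codecs

def ent_replace(exc):
    # Called by the codec with the error describing the unencodable span.
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        s = []
        for c in exc.object[exc.start:exc.end]:
            # Emit a hexadecimal character reference, e.g. &#x03c6;
            s.append(u'&#x%4.4x;' % ord(c))
        # Return the replacement text and the position to resume from.
        return (''.join(s), exc.end)
    else:
        # Exception instances have no __name__; use the class name.
        raise TypeError("can't handle %s" % type(exc).__name__)

codecs.register_error('ent_replace', ent_replace)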
and use x.toxml().encode('ascii', 'ent_replace').
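To make the difference concrete, here is a minimal end-to-end sketch, assuming Python 3 and xml.dom.minidom as the DOM implementation (the question only says xml.dom). The element name `root` and the variable names are made up for illustration. It first shows the escaping from the question, then the Unicode-string approach with the handler above, and finally the built-in 'xmlcharrefreplace' error handler, which may be enough if decimal references are acceptable.

```python
import codecs
from xml.dom.minidom import Document


def ent_replace(exc):
    """Replace unencodable characters with hexadecimal character references."""
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        refs = ["&#x%4.4x;" % ord(c) for c in exc.object[exc.start:exc.end]]
        return ("".join(refs), exc.end)
    raise TypeError("can't handle %s" % type(exc).__name__)


codecs.register_error("ent_replace", ent_replace)

doc = Document()
root = doc.createElement("root")  # hypothetical element name
doc.appendChild(root)

# Putting the reference in as literal text: the ampersand is escaped on output.
root.appendChild(doc.createTextNode("&#x03c6;"))
escaped = root.toxml()

# Putting the actual character in the text node instead.
root.removeChild(root.firstChild)
root.appendChild(doc.createTextNode("\u03c6"))

# Custom handler: hexadecimal references.
as_hex = root.toxml().encode("ascii", "ent_replace")

# Built-in handler: decimal references, no registration needed.
as_decimal = root.toxml().encode("ascii", "xmlcharrefreplace")

print(escaped)
print(as_hex)
print(as_decimal)
```

Both encoded results are bytes, and an XML parser should read either form of the reference back as the same φ character, so the choice between hex and decimal is mostly cosmetic.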