So, I am asking as a last resort, as I am completely out of ideas.
I have a Windows ASP.NET ASMX web services app that returns a serialized Person object with a name, address, email, etc.
But some attributes in the XML are encoded very weirdly, for instance &#x1A;. (I don't know where the encoding takes place; I assume in the serialization process.)
Googling those characters, I see that it is "Windows-1252" encoding.
The problem occurs during parsing of the XML: I get a parse error of "invalid unicode character" at the position of the 1252 encoding.
How can I successfully parse it? What solutions do you suggest?
1 Answer
The parser is correct; whatever produced the serialisation is wrong. As with every C0 control character other than tab, line feed and carriage return, it is invalid (actually, worse than that: not well-formed) to put a U+001A SUBSTITUTE into an XML 1.0 file(*), even if encoded as a character reference such as &#x1A;.
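To see the rejection first-hand, here is a minimal sketch in Python (an assumption on my part: the question is about ASP.NET, but the well-formedness rule is the same for any conforming parser). The `<Person>` payload and the `try_parse` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: a Name element containing a U+001A character reference.
bad_xml = "<Person><Name>Smith&#x1A;</Name></Person>"


def try_parse(text):
    """Return (True, root element) on success or (False, error message) on failure."""
    try:
        return True, ET.fromstring(text)
    except ET.ParseError as exc:
        return False, str(exc)


ok, result = try_parse(bad_xml)
print(ok, result)
```

The parser refuses the whole document, not just the offending element, which is what "not well-formed" means in practice.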
No conforming XML parser will read this, nor should it. Whilst you could put some horrific hack in to try to filter out &#x1A; sequences before passing them to the parser, such crude hacks wouldn't work for the general case. The serialiser should be fixed to stop producing them.
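For illustration only, this is roughly what such a crude pre-filter could look like, sketched in Python rather than .NET. The `strip_sub_refs` name and the sample documents are hypothetical, and the second half shows one way the hack goes wrong.

```python
import re
import xml.etree.ElementTree as ET

# Matches &#x1A; and &#26; (with optional leading zeros): references to U+001A only.
SUB_REF = re.compile(r"&#(?:x0*1[Aa]|0*26);")


def strip_sub_refs(xml_text):
    """Crude pre-filter: delete every character reference to U+001A."""
    return SUB_REF.sub("", xml_text)


raw = "<Person><Name>Smith&#x1A;</Name></Person>"
root = ET.fromstring(strip_sub_refs(raw))
print(root.find("Name").text)

# The same filter silently alters literal text inside a CDATA section,
# where "&#x1A;" is six ordinary characters and not a reference at all.
cdata = "<Note><![CDATA[write &#x1A; to escape]]></Note>"
print(strip_sub_refs(cdata))
```

A regex cannot tell a real reference from the same six characters in CDATA or a comment, and it only covers the one code point, so treat this as a stopgap at best.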
Actually, I have no idea how the character (often used to mark end-of-file in ancient horrible operating systems) would get into the dataset used by an ASP.NET app, but it doesn't seem to play any valid role in a name, address or e-mail. Perhaps what you really need to look at is cleaning your data.
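Cleaning the data on the producing side might look something like the sketch below. It is written in Python as an illustration, not as the ASMX serialiser's actual behaviour; the `clean_field` helper and the `person` record are assumptions, and the character class is taken from the Char production in the XML 1.0 specification.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 Char production: tab, LF, CR,
# U+0020-U+D7FF, U+E000-U+FFFD and U+10000-U+10FFFF are the allowed set.
INVALID_XML10 = re.compile(
    r"[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_field(value):
    """Remove characters that cannot appear in an XML 1.0 document."""
    return INVALID_XML10.sub("", value)


# Hypothetical record with a stray U+001A at the end of the name.
person = {"Name": "Smith\x1a", "Email": "smith@example.com"}

root = ET.Element("Person")
for field, value in person.items():
    ET.SubElement(root, field).text = clean_field(value)

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Stripping at the source means every consumer receives well-formed XML, and it is a good moment to find out where the stray characters came from in the first place.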
(*: It would be legal if encoded as a character reference in an XML 1.1 document. If you absolutely must round-trip control characters through XML, you will have to use XML 1.1, though that may lead to compatibility issues with older XML parsers. You still can't use the U+0000 NULL character, so you're never going to be completely binary-safe.)
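The difference between the two versions can be summarised as a small check, sketched here in Python. The `char_ref_allowed` function is a hypothetical helper; the ranges are the Char productions of the XML 1.0 and XML 1.1 specifications, which a character reference must match.

```python
def char_ref_allowed(cp, version="1.0"):
    """True if code point cp may be written as a character reference."""
    in_upper_ranges = (0xE000 <= cp <= 0xFFFD) or (0x10000 <= cp <= 0x10FFFF)
    if version == "1.0":
        # XML 1.0: only tab, LF and CR are allowed below U+0020.
        return cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF or in_upper_ranges
    if version == "1.1":
        # XML 1.1: everything from U+0001 up is referenceable; U+0000 never is.
        return 0x1 <= cp <= 0xD7FF or in_upper_ranges
    raise ValueError("unknown XML version: " + version)


for cp in (0x00, 0x1A, 0x41):
    print(f"U+{cp:04X}", char_ref_allowed(cp, "1.0"), char_ref_allowed(cp, "1.1"))
```

Note that this only describes what the specifications permit; whether a given parser accepts XML 1.1 at all is a separate question.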