I have been tasked with writing a reader for a file format with the following specification:
- First section is plain xml with metadata (utf-8);
- Last section is a stream of 16bit values (binary);
- These two sections are separated by one byte with value 29 (group separator in the ASCII table).
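For testing a reader, it can help to generate a small file in this layout first. The following is only a sketch: the class name `SampleFile` and its `Write` method are made up for illustration, and little-endian byte order for the 16-bit values is an assumption, since the format description above does not state the byte order.

```csharp
using System;
using System.IO;
using System.Text;

static class SampleFile
{
    public const byte Separator = 29; // ASCII group separator (0x1D)

    // Writes the UTF-8 XML header, one separator byte, then the 16-bit values.
    // ASSUMPTION: values are stored little-endian; the spec does not say.
    public static void Write(string path, string xml, short[] values)
    {
        using (var fs = File.Create(path))
        using (var writer = new BinaryWriter(fs))
        {
            writer.Write(Encoding.UTF8.GetBytes(xml)); // raw bytes, no BOM, no length prefix
            writer.Write(Separator);
            foreach (short v in values)
            {
                writer.Write(v); // BinaryWriter writes Int16 in little-endian order
            }
        }
    }
}
```

Calling `SampleFile.Write(path, "<a/>", new short[] { 1, 258 })` should produce a 9-byte file: four bytes of XML, the separator, and two 2-byte values.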
I see two ways to read the xml part of the file. The first one is to build a string byte by byte until I find the separator.
The other is to use some library that would parse the xml and automatically detect the end of well-formed xml.
The question is: is there any .NET library that would stop automatically after the last closing tag in the XML?
(or, can anyone suggest a saner way to read this kind of file format?)
UPDATE: Following the answer from Peter Duniho, with slight modifications, I ended up with this (it works, though not thoroughly unit-tested yet).
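    // requires: using System.IO; using System.Text; using System.Xml;
    int position = 0;
    MemoryStream ms;
    using (FileStream fs = File.OpenRead("file.xml"))
    using (ms = new MemoryStream())
    {
        int current;
        // ReadByte returns -1 at end of stream, so test for that rather than "> 0"
        while ((current = fs.ReadByte()) != -1)
        {
            position++;
            if (current == 29)
                break;
            ms.WriteByte((byte)current);
        }
    }
    // MemoryStream.ToArray still works after the stream has been disposed
    var xmlheader = new XmlDocument();
    xmlheader.LoadXml(Encoding.UTF8.GetString(ms.ToArray()));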
2 Answers
#1
2
Given the information you've provided, simply searching for the byte with value 29 should work, because the XML is UTF-8 and a byte of value 29 should appear only if the character with code point 29 is present in the file. Now, I guess it could be present, but it would be surprising since that's in the control character range of the ASCII values.
From the XML 1.0 spec:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
While the comment implies 29 would be a valid codepoint in an XML file (since it is itself a valid Unicode character), I consider the actual grammar normative. I.e. it specifically excludes characters below codepoint 32 except tab, newline, and carriage return, so 29 is not a valid XML character (just as Jon Skeet said).
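To make the grammar concrete, here is a small sketch that transcribes the `Char` production into a predicate. The names `XmlChars` and `IsXml10Char` are hypothetical and not part of any library. It is also worth noting that in UTF-8 every byte below 0x80 encodes exactly the ASCII character of the same value, so a byte 29 can never turn up as part of a multi-byte sequence.

```csharp
using System;

static class XmlChars
{
    // Mirrors the XML 1.0 "Char" production quoted above, one range per line.
    public static bool IsXml10Char(int codePoint)
    {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }
}
```

With this, `XmlChars.IsXml10Char(29)` returns `false`, while tab (9) and space (0x20) return `true`.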
That said, without a complete specification of the input, I can't rule out the possibility. So if you really want to be on the safe side, you'd have to go ahead and parse the XML, hoping to find a proper closing tag for the root element. Then you can search for the byte 29 (since there might be whitespace after the closing tag), to identify where the binary data starts.
(Note: asking for a library is "off-topic". But you might be able to use XmlReader to do this, since it operates on an iterative basis; i.e. you can terminate its operation after you hit the final closing tag, and before it starts complaining about finding invalid XML. This would depend, however, on any buffering that XmlReader might do; if it buffers additional data past the closing tag, then the position of the underlying stream would be past the 29 byte, making it harder to find. Frankly, just searching for the 29 byte seems like the way to go.)
You could search the header for the 29 byte like this (warning: browser code...uncompiled, untested):
    MemoryStream xmlStream = new MemoryStream();
    using (FileStream stream = File.OpenRead(path))
    {
        int offset = 0, bytesRead = 0;
        // arbitrary size...whatever you think is reasonable would be fine
        byte[] buffer = new byte[1024];
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            bool found = false;
            for (int i = 0; i < bytesRead; i++)
            {
                if (buffer[i] == 29)
                {
                    offset += i;
                    found = true;
                    // copy the i bytes that precede the separator
                    // (a count of "i - 1" would drop the last XML byte)
                    xmlStream.Write(buffer, 0, i);
                    break;
                }
            }
            if (found)
            {
                break;
            }
            offset += bytesRead;
            xmlStream.Write(buffer, 0, bytesRead);
        }
        // bytesRead is 0 only if the loop ran to end of file without a match
        if (bytesRead > 0)
        {
            // found byte 29 at offset "offset"
            xmlStream.Position = 0;
            // pass "xmlStream" object to your preferred XML-parsing API to
            // parse the XML, or just return it or "xmlStream.ToArray()" as
            // appropriate to the caller to let the caller deal with it.
        }
        else
        {
            // byte 29 not found!
        }
    }
EDIT:
I've updated the above code example to write to a MemoryStream object, so that once you've found the byte 29 value, you've got a stream all ready to go for XML parsing. Of course, I'm sure you could have added that yourself if you really needed to. In any case, obviously you would modify the code, with or without that feature, to suit your needs.
(There is the obvious hazard in writing to the MemoryStream as you search: if you don't ever find the byte 29 value, you'll wind up with a complete copy of the entire file in memory, which you'd suggested you might prefer to avoid. But given that that's the error scenario, that might be okay.)
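If that hazard matters, one way to bound it is to give up once the header grows past some limit. This is a minimal sketch, not a tested implementation: `HeaderReader`, `ReadHeader`, and the `maxHeaderBytes` parameter are names invented here, and returning `null` for "not found" is just one possible convention.

```csharp
using System;
using System.IO;

static class HeaderReader
{
    // Copies bytes up to (not including) the first byte 29 into a MemoryStream.
    // Returns null if the input ends, or the header would exceed maxHeaderBytes,
    // before a separator is seen.
    public static MemoryStream ReadHeader(Stream input, int maxHeaderBytes)
    {
        var header = new MemoryStream();
        int current;
        while ((current = input.ReadByte()) != -1)
        {
            if (current == 29)
            {
                header.Position = 0;
                return header; // input is now positioned just after the separator
            }
            if (header.Length >= maxHeaderBytes)
            {
                break; // give up rather than buffer the whole file
            }
            header.WriteByte((byte)current);
        }
        header.Dispose();
        return null;
    }
}
```

On success the input stream is left positioned at the first byte of the binary section, so the caller can keep reading from it directly.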
#2
2
While the "read to the closing tag" sounds appealing, you'd need to have a parser which didn't end up buffering all the data.
I would read all the data into a byte[], then search for the separator there - then you can split the binary data into two, and parse each part appropriately. I would do that entirely working in binary, with no strings involved - you can create a MemoryStream for each section using new MemoryStream(byte[], int, int) and then pass that to an XML parser and whatever your final section parser is. That way you don't need to worry about handling UTF-8, or detecting if a later version of the XML doesn't use UTF-8, etc.
So something like:
    byte[] allData = File.ReadAllBytes(filename);
    int separatorIndex = Array.IndexOf(allData, (byte) 29);
    if (separatorIndex == -1)
    {
        // throw an exception or whatever
        throw new InvalidDataException("Separator byte 29 not found.");
    }
    // both streams are views over allData; nothing is copied
    var xmlStream = new MemoryStream(allData, 0, separatorIndex);
    var lastPartStream = new MemoryStream(
        allData, separatorIndex + 1, allData.Length - separatorIndex - 1);
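From there, the two streams could be consumed roughly as follows. This is only a sketch under assumptions: `SectionParsers` and its methods are hypothetical names, the values are treated as signed and little-endian (the question does not say which), and `ReadValues` relies on the stream being seekable, which holds for a MemoryStream.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Linq;

static class SectionParsers
{
    // Hands the raw bytes to the XML parser, which detects the encoding itself.
    public static XDocument ParseHeader(Stream xmlStream)
    {
        return XDocument.Load(xmlStream);
    }

    // Reads 16-bit values until fewer than two bytes remain.
    // ASSUMPTION: signed, little-endian values (BinaryReader is always little-endian).
    public static List<short> ReadValues(Stream dataStream)
    {
        var values = new List<short>();
        using (var reader = new BinaryReader(dataStream))
        {
            while (dataStream.Length - dataStream.Position >= 2)
            {
                values.Add(reader.ReadInt16());
            }
        }
        return values;
    }
}
```

With the variables from the snippet above, that would be `SectionParsers.ParseHeader(xmlStream)` and `SectionParsers.ReadValues(lastPartStream)`. If the data turns out to be big-endian, the two bytes of each value would need to be swapped after reading.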