Encoding binary data within XML: is there anything better than base64?

Date: 2022-11-11 20:21:06

I want to encode and decode binary data within an XML file (with Python, but whatever). I have to face the fact that XML tag content cannot contain arbitrary characters; the only allowed ones are described in the XML specs:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Which means that the disallowed ones are:

  • The 29 Unicode control characters (0x00 - 0x1F), i.e. (000xxxxx), except 0x09, 0x0A and 0x0D
  • The Unicode surrogate code points used by UTF-16 for characters beyond 2 bytes (U+D800 - U+DFFF), i.e. (11011xxx)
  • The special Unicode noncharacters (0xFFFE - 0xFFFF), i.e. (11111111 1111111x)
  • <, >, & in entity content, according to this post

1 byte can encode 256 possible values. With these restrictions, the first byte is limited to 256-29-8-1-3 = 215 possibilities.

Of that first byte's 215 possibilities, base64 only uses 64. Base64 generates 33% overhead (6 bits become 1 byte once encoded with base64).

So my question is simple: Is there an algorithm more efficient than base64 to encode binary data within XML? If not, where should we start to create it? (libraries, etc.)

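For reference, here is what the base64 baseline being compared against looks like in Python. This is just a minimal sketch using the standard library; the element names are purely illustrative:

    import base64
    import xml.etree.ElementTree as ET

    payload = bytes(range(256))  # arbitrary binary data

    # Encode: base64 turns every 3 bytes into 4 ASCII characters (~33% overhead).
    root = ET.Element("root")
    ET.SubElement(root, "payload").text = base64.b64encode(payload).decode("ascii")
    xml_bytes = ET.tostring(root, encoding="utf-8")

    # Decode: parse the XML back and reverse the base64 step.
    assert base64.b64decode(ET.fromstring(xml_bytes).find("payload").text) == payload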

NB: Don't answer this post with "You shouldn't use XML to encode binary data because...". Just don't. At best, you could argue why not to use all 215 possibilities because of poor support in some XML parsers.

NB2: I'm not speaking about the second byte, but there are certainly some considerations we can develop regarding the number of possibilities and the fact that it should start with 10xxxxxx to respect the UTF-8 standard when we use the supplementary Unicode planes (and what if not?).

3 Answers

#1


6  

Thank you Aya for the Ascii85 link, there are very good ideas there.

I developed them below for our case.


UTF-8 character possibilities:


For 1-byte characters (0xxxxxxx): 96 possibilities per byte

  • + UTF-8 ASCII chars 0xxxxxxx = +2^7
  • - UTF-8 control chars 000xxxxx = -2^5
  • + XML-allowed UTF-8 control chars (0x09, 0x0A, 0x0D) = +3
  • - XML entity disallowed chars (<, >, &) = -3

EDIT: This is for the XML 1.0 specs. The XML 1.1 specs allow the use of control chars, except 0x00...

For 2-byte characters (110xxxxx 10xxxxxx): 1920 possibilities per 2 bytes

  • + UTF-8 2-byte chars 110xxxxx 10xxxxxx = +2^11
  • - UTF-8 illegal non-canonical chars (1100000x 10xxxxxx) = -2^7

For 3-byte characters (1110xxxx 10xxxxxx 10xxxxxx): 61440 possibilities per 3 bytes

  • + UTF-8 3-byte chars 1110xxxx 10xxxxxx 10xxxxxx = +2^16
  • - UTF-8 illegal non-canonical chars (11100000 100xxxxx 10xxxxxx) = -2^11
  • - Unicode reserved UTF-16 codepoints (11101101 101xxxxx 10xxxxxx) = -2^11

And I won't make the calculation for 4-byte characters, that's pointless: the number of possibilities would be negligible and there are too many illegal UTF-8 characters in this range.
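
As a quick sanity check, these per-length counts can be reproduced with a few lines of Python. This is only a sketch that mirrors the counting rules in the bullets above, not the full XML 1.0 character table:

    # Reproduce the per-character-length counts from the bullets above.
    one_byte = 2**7 - 2**5 + 3 - 3       # ASCII, minus control chars, plus TAB/LF/CR, minus < > &
    two_byte = 2**11 - 2**7              # 110xxxxx 10xxxxxx, minus the non-canonical 1100000x forms
    three_byte = 2**16 - 2**11 - 2**11   # 1110xxxx ..., minus non-canonical forms and UTF-16 surrogates
    print(one_byte, two_byte, three_byte)   # 96 1920 61440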


The coding possibilities in, let's say, a 3-byte space


So let's see what combinations we can do in a 3-byte (24-bit) space:

  • 0xxxxxxx 0xxxxxxx 0xxxxxxx : That's 96*96*96 = 884736 possibilities
  • 0xxxxxxx 110xxxxx 10xxxxxx : That's 96*1920 = 184320 possibilities
  • 110xxxxx 10xxxxxx 0xxxxxxx : That's 1920*96 = 184320 possibilities
  • 1110xxxx 10xxxxxx 10xxxxxx : That's 61440 possibilities

There would be other possibilities (like a 3-byte char ending or starting inside the space), but like 4-byte chars, they would be difficult to evaluate (for me) and are probably negligible.

Total number of possibilities:

  • A 3-byte space has 2^24 = 16777216 possibilities.
  • The UTF-8-COMPATIBLE possibilities in that space are 884736 + 2*184320 + 61440 = 1314816.

How much overhead does that mean?

  • 24-bit space usable bits: Log2(16777216) = 24 (of course! that's just for the mathematical understanding)
  • This space's UTF-8 useful bits: Log2(1314816) = 20.32 useful bits.
  • That means we need 24 bits of space to encode 20.32 bits of useful information, i.e. the minimum theoretical overhead is 18%. Way better than Base64's 33% overhead and Ascii85's 25% overhead! (See the sketch below.)
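
A small sketch of that arithmetic (the Base64 and Ascii85 figures are simply the usual 4-characters-per-3-bytes and 5-characters-per-4-bytes expansion ratios):

    import math

    utf8_possibilities = 884736 + 2 * 184320 + 61440        # = 1314816
    useful_bits = math.log2(utf8_possibilities)              # ~20.32 useful bits per 24-bit space
    print(f"minimum overhead: {(24 - useful_bits) / useful_bits:.1%}")   # ~18.1%
    print(f"base64 overhead:  {(4 * 8 - 3 * 8) / (3 * 8):.1%}")          # ~33.3%
    print(f"ascii85 overhead: {(5 * 8 - 4 * 8) / (4 * 8):.1%}")          # 25.0%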

EDIT: This is for the XML 1.0 specs. With XML 1.1 (not widely supported...), the theoretical overhead is 12.55%. I managed to make a binary-safe algorithm with 14.7% overhead for XML 1.1.


How to get near this 18% overhead?


The bad news is that we can't easily get that 18% overhead without using a big "dictionary" (i.e. long encoding sets). But it's easy to get 20%, and quite easy, though less practical, to get 19%.

Good coding length candidates:

  • 6 bits can encode 5 bits with 20% overhead (2^(6*0.84) > 2^5)
  • 12 bits can encode 10 bits with 20% overhead (2^(12*0.84) > 2^10)
  • 24 bits can encode 20 bits with 20% overhead (2^(24*0.84) > 2^20)
  • 25 bits can encode 21 bits with 19% overhead (2^(25*0.84) > 2^21)

NB: 0.84 is the average "usefulness" of a space bit (20.32/24); the sketch below checks these four candidates.
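
Checking the four candidates in Python (a sketch reusing the 1314816 figure computed above):

    import math

    useful_per_bit = math.log2(1314816) / 24     # ~0.847 useful bits per space bit
    for space, payload in [(6, 5), (12, 10), (24, 20), (25, 21)]:
        print(f"{space} -> {payload} bits: fits={space * useful_per_bit >= payload}, "
              f"overhead={(space - payload) / payload:.0%}")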


How to build our encoding algorithm?


We need to build a "dictionary" that maps the "space possibilities" (random sequences of bits whose length is 5, 10, 20 or 21 bits, depending on the coding length chosen for the algorithm - just pick one) onto UTF-8-compatible sequences (UTF-8 bit sequences whose length is 6, 12, 24 or 25 bits accordingly).

The easiest starting point would be encoding 20-bit sequences into 24-bit UTF-8-compatible sequences: that's exactly the example taken above to calculate the possibilities, and it is 3 UTF-8 bytes long (so we won't have to worry about unterminated UTF-8 chars).
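
One brute-force way to realize such a dictionary is sketched below. It is not the hand-crafted algorithm proposed further down, just an illustration under the assumption that we may trade memory for simplicity: enumerate enough XML-1.0-safe strings whose UTF-8 encoding is exactly 3 bytes long, and index them by a 20-bit value. CR is deliberately left out because XML parsers normalize line endings (CR/CRLF become LF):

    # Codebook of 2**20 XML-safe strings, each exactly 3 bytes long in UTF-8.
    ONE_BYTE = ["\t", "\n"] + [chr(c) for c in range(0x20, 0x80) if chr(c) not in "<>&"]
    TWO_BYTE = [chr(c) for c in range(0x80, 0x800)]     # all 2-byte UTF-8 characters

    codebook = []
    for a in ONE_BYTE:                                   # 1+1+1 byte combinations
        for b in ONE_BYTE:
            codebook.extend(a + b + c for c in ONE_BYTE)
    for a in ONE_BYTE:                                   # 1+2 and 2+1 byte combinations
        codebook.extend(a + b for b in TWO_BYTE)
        codebook.extend(b + a for b in TWO_BYTE)
    codebook = codebook[:2**20]                          # 95^3 + 2*95*1920 > 2^20, so this is enough
    index = {s: i for i, s in enumerate(codebook)}

    def encode20(value: int) -> str:     # 0 <= value < 2**20  ->  3 UTF-8 bytes
        return codebook[value]

    def decode20(chunk: str) -> int:
        return index[chunk]

Each encoded chunk carries 20 payload bits in 24 bits of UTF-8, i.e. the 20% overhead discussed above; a real implementation would also have to split the input into 20-bit groups and handle the final partial group.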

Note that we MUST USE the 2-byte (or longer) UTF-8 character encoding space to reach the 20% overhead. With only 1-byte-long UTF-8 characters we can only reach 25% overhead with RADIX-24. However, the 3-byte-long UTF-8 characters are not needed to reach the 20% overhead.

That's the next challenge for this question. Who wants to play? :)


A proposed algorithm, which I'll name BaseUTF-8 for XML


20 binary bits to encode: ABCDEFGHIJKLMNOPQRST

Resulting UTF-8 string named "encoded": 24 bits long

Mathematical encoding algorithm (not based on any known programming language):

If GH != 00 && NO != 00:
    encoded = 01ABCDEF 0GHIJKLM 0NOPQRST # 20 bits to encode, 21 space bits with restrictions (1-byte UTF-8 char not starting by 000xxxxx ie ASCII control chars)

If ABCD != 0000:
    If GH == 00 && NO == 00: # 16 bits to encode
        encoded = 0010ABCD 01EFIJKL 01MPQRST    
    Else If GH == 00:  # 18 bits to encode, 18 space bits with restrictions (1-byte  UTF-8 ASCII control char, 2-bytes UTF-8 char noncanonical)
        encoded = 0NOFIJKL 110ABCDE 10MPQRST
    Else If NO == 00:  # 18 bits to encode
        encoded = 110ABCDE 10MPQRST 0GHFIJKL

If ABCD == 0000: # 16 bits to encode
    encoded = 0011EFGH 01IJKLMN 01OPQRST

On "encoded" variable apply:
    convert < (0x3C) to Line Feed (0x0A)
    convert > (0x3E) to Carriage Return (0x0D)
    convert & (0x26) to TAB (0x09)

And that's how you get only 20% overhead.

This algorithm doesn't yet provide a way to manage string termination when the length of the string to encode is not a multiple of 20 bits. The decoding algorithm also has to be provided, but that's quite easy (just don't forget to throw exceptions to enforce the uniqueness of decoding).

#2


2  

I have developed the concept in C.

The project is on GitHub and is finally called BaseXML: https://github.com/kriswebdev/BaseXML

It has a 20% overhead, which is good for a binary-safe version.

I had a hard time making it work with Expat, which is the behind-the-scenes XML parser of Python (and which DOESN'T SUPPORT XML 1.1!). So you'll find the binary-safe BaseXML1.0 version for XML 1.0.
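
As a quick illustration of that XML 1.0 constraint (a sketch using only the standard library, not the BaseXML API itself): Python's built-in parser, backed by Expat, accepts TAB, LF and CR in content but rejects any other raw control character, which is why the encoding has to stay within the XML 1.0 character set:

    import xml.etree.ElementTree as ET

    ET.fromstring(b"<blob>\x09\x0a\x0d</blob>")     # TAB, LF and CR parse fine
    try:
        ET.fromstring(b"<blob>\x01</blob>")         # any other control character is rejected
    except ET.ParseError as err:
        print("rejected as expected:", err)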

I may release the "for XML 1.1" version later if requested (it is also binary safe and has a 14.7% overhead). It's ready and working, but useless with Python's built-in XML parsers, so I don't want to confuse people with too many versions (yet).

#3


1  

It's worse than that: you don't actually have 215 different byte values you can use. The resulting binary data has to be valid in whatever encoding the XML is represented in (which is almost certainly UTF-8), which means that many, many byte sequences are forbidden. 0xc2 followed by 0x41 would be just one random example. XML is text (a sequence of Unicode characters), not binary data. When transmitted, it is encoded using some encoding (which is almost always UTF-8). If you try to treat it as binary data, then you are, in my opinion, asking for way more trouble than it's worth dealing with.

If you still want to do this...

XML is text. So let's not try to encode your binary data as binary data. That will not lead to an easy or obvious way to shoehorn it into an XML document. Instead, let's try encoding your binary data as text!

Let's try one very simple encoding:

  • Group your binary data into blocks of 20 bits
  • Encode each group of 20 bits as the Unicode character U+10000 plus the numeric value of the 20 bits (see the sketch below).
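
A minimal Python sketch of this scheme (the handling of the final, possibly partial 20-bit group is glossed over by zero-padding):

    def encode_to_astral(data: bytes) -> str:
        # Pack the bytes into one big integer, zero-pad to a multiple of 20 bits,
        # then map each 20-bit group onto one character in planes 1-16.
        nbits = len(data) * 8
        pad = (-nbits) % 20
        bits = int.from_bytes(data, "big") << pad
        groups = (nbits + pad) // 20
        return "".join(chr(0x10000 + ((bits >> (20 * (groups - 1 - i))) & 0xFFFFF))
                       for i in range(groups))

    print(encode_to_astral(b"\x00\x01\x02").encode("utf-8"))   # 4 UTF-8 bytes per character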

This will mean you exclusively use characters from planes 1 through 16. All of the restricted characters are in plane 0 (the BMP), so you are safe here.

When you then encode this XML document as UTF-8 for transmission, each of these characters will require 4 bytes to encode. So you consume 32 bits for every 20 bits of original data, which is 60% overhead compared to pure binary encoding of the original data. This is worse than base64's 33%, which makes it a terrible idea.

This encoding scheme is slightly wasteful because it makes no use of BMP characters. Can we use BMP characters to make it better? Not trivially. 20 is the largest size we can use for the groups (log2(0x10FFFF) ~ 20.09). We could remap our scheme to use as many BMP characters as possible, because these take less space to encode with UTF-8, but not only would this complicate the encoding a lot (the forbidden characters are scattered, so we have several cases to handle), it can only lead to an improvement for about 6.25% of bit patterns (the fraction of Unicode characters that are in the BMP), and for the majority of that 6.25%, we'd save only one byte. For random data, the overhead decreases from 60% to around 55%. The result would still be much worse than base64 except for some very contrived data. Note that the overhead is data-dependent though. For 0.2% of bit patterns, you will actually get compression instead of overhead (60% compression for 0.012% of patterns and 20% compression for 0.18% of patterns). But these fractions are really low. It's just not worth it.

To put this another way: if you want to encode anything using 4-byte UTF-8 sequences, you need to use 32 bits per sequence (of course), but 11 of those bits are fixed and unchangeable (the bits must fit the pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, and there are only 21 xs in there). That overhead of 60% is built into UTF-8, so if you want to use this as the basis of any encoding that improves upon the overhead of base64, you are starting from behind!

I hope this convinces you that you can't improve on the density of base64 using any scheme of this type.
