Split a Unicode string into 300-byte chunks without destroying characters

Date: 2022-05-28 18:40:06

I want to split u"an arbitrary unicode string" into chunks of say 300 bytes without destroying any characters. The strings will be written to a socket that expects utf8 using unicode_string.encode("utf8"). I don't want to destroy any characters. How would I do this?

5 Answers

#1 (10 votes)

UTF-8 is designed for this.

def split_utf8(s, n):
    """Split a UTF-8 encoded byte string s into chunks of at most n bytes."""
    while len(s) > n:
        # Candidate split point; back up over continuation bytes (10xxxxxx)
        # so we never cut a multi-byte character in half.
        k = n
        while (ord(s[k]) & 0xc0) == 0x80:
            k -= 1
        yield s[:k]
        s = s[k:]
    yield s

Not tested, but the idea is simple: pick a candidate split point, then backtrack until you reach the beginning of a character.

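A quick sanity check one might run (a hypothetical test, not part of the original answer; it assumes Python 2 byte strings, as above):

data = u"Юникод тест".encode("utf8")
chunks = list(split_utf8(data, 5))
assert b"".join(chunks) == data           # no bytes lost
assert all(len(c) <= 5 for c in chunks)   # size bound respected
for c in chunks:
    c.decode("utf8")                      # every chunk is valid UTF-8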
However, if a user might ever want to see an individual chunk, you may want to split on grapheme cluster boundaries instead. This is significantly more complicated, but not intractable. For example, in "é" (written as "e" plus a combining acute accent), you might not want to split apart the "e" and the "´". Or you might not care, as long as they get stuck together again in the end.

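A minimal sketch of grapheme-aware splitting, assuming the third-party regex module (not in the standard library), whose \X pattern matches one grapheme cluster:

import regex  # third-party: pip install regex

def split_utf8_graphemes(s, n):
    """Split unicode string s into pieces whose UTF-8 encoding is at
    most n bytes, breaking only on grapheme cluster boundaries."""
    chunk, chunk_len = [], 0
    for g in regex.findall(r'\X', s):
        g_len = len(g.encode('utf8'))
        # Flush the current chunk if adding this cluster would overflow it.
        # A single cluster longer than n bytes is still yielded whole.
        if chunk and chunk_len + g_len > n:
            yield u''.join(chunk)
            chunk, chunk_len = [], 0
        chunk.append(g)
        chunk_len += g_len
    if chunk:
        yield u''.join(chunk)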

#2 (5 votes)

UTF-8 has a special property: all continuation bytes are in the range 0x80 to 0xBF (their bit pattern starts with 10). So just make sure you don't split right before one.

Something along the lines of:

def split_utf8(s, n):
    """Split a UTF-8 byte string after at most n bytes.

    Returns a (head, rest) pair; rest is None when no split was needed.
    """
    if len(s) <= n:
        return s, None
    # Back up while s[n] is a continuation byte (0x80-0xBF).
    while 0x80 <= ord(s[n]) < 0xc0:
        n -= 1
    return s[0:n], s[n:]

should do the trick.

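To split a whole string into chunks, one could drive it with a small loop along these lines (a hypothetical helper, not part of the answer):

def iter_chunks(s, n):
    # Repeatedly peel off a head of at most n bytes until nothing is left.
    while s:
        head, s = split_utf8(s, n)
        yield head

for chunk in iter_chunks(u"an arbitrary unicode string".encode("utf8"), 300):
    print(chunk)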

#3 (2 votes)

Tested.

def split_utf8(s, n):
    """Split a UTF-8 byte string s into chunks of at most n bytes."""
    assert n >= 4  # a UTF-8 character is at most 4 bytes long
    start = 0
    lens = len(s)
    while start < lens:
        if lens - start <= n:
            yield s[start:]
            return  # StopIteration
        end = start + n
        # Back up past continuation bytes so no character is split.
        while '\x80' <= s[end] <= '\xBF':
            end -= 1
        assert end > start
        yield s[start:end]
        start = end

#4 (0 votes)

If you can ensure that the UTF-8 representation of your characters is at most 2 bytes long, then it should be safe to split the unicode string into chunks of 150 characters (this holds for most European text). But UTF-8 is a variable-width encoding, so in the general case you might split the unicode string into single characters, convert each character to UTF-8, and fill your buffer until you reach the maximum chunk size. This may be inefficient and a problem if high throughput is a must. A sketch of that approach follows.

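A minimal sketch of that character-by-character approach (hypothetical helper, yielding UTF-8 byte chunks of at most max_bytes):

def split_by_buffer(u, max_bytes):
    # Encode one character at a time and flush the buffer whenever the
    # next character would push it past max_bytes. A single character
    # wider than max_bytes is still emitted, in a chunk of its own.
    buf = b""
    for ch in u:
        encoded = ch.encode("utf8")
        if buf and len(buf) + len(encoded) > max_bytes:
            yield buf
            buf = b""
        buf += encoded
    if buf:
        yield buf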

#5 (-2 votes)

Use a Unicode encoding which by design has a fixed length for each character, for example UTF-32:

>>> u_32 = u'Юникод'.encode('utf-32')
>>> u_32
'\xff\xfe\x00\x00.\x04\x00\x00=\x04\x00\x008\x04\x00\x00:\x04\x00\x00>\x04\x00\x004\x04\x00\x00'
>>> len(u_32)
28
>>> len(u_32)%4
0
>>>

After encoding you can send chunks of any size without destroying characters, as long as the size is a multiple of 4 bytes. (Note that the 'utf-32' codec prepends a 4-byte BOM, \xff\xfe\x00\x00, which is why the 6-character string above encodes to 28 bytes.)
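
A sketch of chunking on 4-byte boundaries, using the BOM-free 'utf-32-le' variant (an assumed adaptation, not part of the original answer):

def split_utf32(u, n):
    # Round n down to a multiple of 4 so chunks align on character
    # boundaries; the receiver must decode each chunk as 'utf-32-le'.
    n -= n % 4
    assert n >= 4
    data = u.encode('utf-32-le')  # no BOM, exactly 4 bytes per character
    for i in range(0, len(data), n):
        yield data[i:i + n]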
