How to shorten a QString so that its UTF-8 encoding is no longer than a given length?

Posted: 2022-01-20 18:59:31

I am trying to create an efficient algorithm for shortening a QString so that, when converted to UTF-8, it is no longer than a given length and is still valid UTF-8.

  • Input
    • QString text - a string that may contain any characters; no maximum length is specified
    • int limit - the maximum length, in bytes, of the output encoded in UTF-8
  • Output
    • QByteArray output - the original text in UTF-8, truncated so that it is no longer than limit
  • example1:
    • text = "How are you?"
    • limit = 5
    • output = "How a"
  • example2:
    • text = "Как дела?"
    • limit = 5
    • output = "Ка"
      • d0 9a d0 b0 - including "к" would already exceed the limit, and including only its first byte d0 would result in an invalid UTF-8 string.

We first started with the following code, but it may cut a UTF-8 character in the middle, which is not acceptable:

QByteArray output = text.toUtf8().left(limit);
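
To make the failure on example2 concrete, here is a minimal Qt-free sketch that uses std::string as a stand-in for QByteArray. The helper name cutSplitsCharacter is hypothetical, and the sketch assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: reports whether a plain byte-wise cut that keeps the
// first `limit` bytes would end in the middle of a multi-byte sequence.
// Assumes `utf8` holds well-formed UTF-8.
bool cutSplitsCharacter(const std::string& utf8, std::size_t limit)
{
    if (limit >= utf8.size())
        return false;  // nothing is cut off

    // utf8[limit] is the first byte that would be dropped. If it is a
    // continuation byte (bit pattern 10xxxxxx), its lead byte stays behind
    // in the kept part and the result is not valid UTF-8.
    const unsigned char firstDropped = static_cast<unsigned char>(utf8[limit]);
    return (firstDropped & 0xC0) == 0x80;
}
```

For "Как дела?" (d0 9a d0 b0 d0 ba ...), a cut at 5 keeps the lead byte d0 of "к" without its continuation byte ba, so the helper reports true. A cut at 4 falls between two characters and is clean.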

Then we developed a correct algorithm that works, but it is quite ugly and inefficient, because every loop iteration re-encodes the whole string:

QString tmp = text;
while (tmp.toUtf8().size() > limit)
    tmp.chop(1);
QByteArray output = tmp.toUtf8();

  • Is there a better way to do this?
    • If yes, please share the code.
    • If not, why?

1 answer

#1


The following approach should be optimal unless you want to write your own UTF-8 conversion routine. It relies on the fact that continuation bytes in UTF-8 sequences are in the range 0x80-0xBF, that is, they have the bit pattern 10xxxxxx. Starting at the limit and going backward, it looks for the first byte that is not a continuation byte; the string can be split safely just before that byte.

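QByteArray output = text.toUtf8();
if (output.size() > limit) {
    int truncateAt = 0;
    // output.at(i) is the first byte that a cut at position i would drop.
    for (int i = limit; i > 0; i--) {
        // The cut is safe if that byte is not a continuation byte (10xxxxxx).
        if ((output.at(i) & 0xC0) != 0x80) {
            truncateAt = i;
            break;
        }
    }
    output.truncate(truncateAt);
}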

Since UTF-8 byte sequences are at most 4 bytes long, it should take no more than 4 loop iterations to find the correct position.
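
The same backward scan can be tried without Qt. The following is only a sketch under assumptions: std::string stands in for QByteArray, the function name truncateUtf8 is made up for illustration, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Truncates a UTF-8 byte string to at most `limit` bytes without splitting
// a multi-byte sequence. Assumes `utf8` holds well-formed UTF-8.
std::string truncateUtf8(std::string utf8, std::size_t limit)
{
    if (utf8.size() <= limit)
        return utf8;  // already short enough

    std::size_t cut = limit;
    // utf8[cut] is the first byte that would be dropped. While it is a
    // continuation byte (10xxxxxx), the cut lies inside a sequence, so
    // step back toward the lead byte of that sequence.
    while (cut > 0 && (static_cast<unsigned char>(utf8[cut]) & 0xC0) == 0x80)
        --cut;

    utf8.resize(cut);
    return utf8;
}
```

With the inputs from the question, "How are you?" with limit 5 gives "How a", and "Как дела?" with limit 5 gives the four bytes d0 9a d0 b0 ("Ка").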

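
If the input is very long and the limit is small, converting the whole string first does more work than needed. As a rough illustration of what "your own UTF-8 conversion routine" could look like, the sketch below only counts how many UTF-8 bytes each UTF-16 code unit would need. It is Qt-free: std::u16string stands in for QString, the function name utf16PrefixFittingUtf8 is hypothetical, and counting an unpaired surrogate as 3 bytes is an assumption meant as a conservative upper bound.

```cpp
#include <cstddef>
#include <string>

// Returns how many UTF-16 code units of `text` can be kept so that their
// UTF-8 encoding needs at most `limit` bytes. A surrogate pair is kept or
// dropped as a whole.
std::size_t utf16PrefixFittingUtf8(const std::u16string& text, std::size_t limit)
{
    std::size_t bytes = 0;  // UTF-8 bytes needed for the prefix so far
    std::size_t i = 0;      // number of UTF-16 code units accepted so far

    while (i < text.size()) {
        const char16_t c = text[i];
        std::size_t units = 1;
        std::size_t cost = 3;  // default: U+0800..U+FFFF (assumption: an
                               // unpaired surrogate is also counted as 3)
        if (c < 0x80) {
            cost = 1;          // ASCII
        } else if (c < 0x800) {
            cost = 2;          // U+0080..U+07FF
        } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < text.size()
                   && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF) {
            cost = 4;          // valid surrogate pair, U+10000 and above
            units = 2;
        }

        if (bytes + cost > limit)
            break;             // the next character would not fit
        bytes += cost;
        i += units;
    }
    return i;
}
```

In Qt the returned count could then be used as text.left(n).toUtf8(), so that only the kept prefix is actually encoded. Whether this is worth the extra code depends on how long the inputs really are.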