I am trying to create an efficient algorithm for shortening a QString so that, when converted to UTF-8, it is no longer than a given length and is still valid UTF-8.
- Input
  - QString text - a string that may contain any characters; no maximum length is specified
  - int limit - the maximum length, in bytes, of the UTF-8 encoded output
- Output
  - QByteArray output - the original text encoded as UTF-8, truncated to at most limit bytes
- example1:
  - text = "How are you?"
  - limit = 5
  - output = "How a"
- example2:
  - text = "Как дела?"
  - limit = 5
  - output = "Ка"
    - d0 9a d0 b0 - including "к" would already exceed the limit, and including only its first byte d0 would result in an invalid UTF-8 string.
First we started with the following code, but it may cut a multi-byte UTF-8 character in the middle, which is not acceptable:
QByteArray output = text.toUtf8().left(limit);
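Then we developed a correct algorithm that works, but it is ugly and inefficient, because it re-encodes the whole string after every removed character:
QString tmp = text;
// Drop one UTF-16 code unit at a time until the encoded form fits.
while (tmp.toUtf8().size() > limit)
    tmp.chop(1);
QByteArray output = tmp.toUtf8();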
- Is there a better way to do this?
  - If yes, please share the code.
  - If not, why?
1 solution
#1
The following approach should be optimal unless you want to write your own UTF-8 conversion routine. It relies on the fact that continuation bytes in UTF-8 sequences are in the range 0x80-0xBF. Going backward from the limit, it looks for the first byte that is not a continuation byte; the string can be split safely just before that byte.
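QByteArray output = text.toUtf8();
if (output.size() > limit) {
    int truncateAt = 0;
    // Walk backward from the limit until a byte that is not a
    // continuation byte (10xxxxxx) is found; cutting before it is safe.
    for (int i = limit; i > 0; i--) {
        if ((output[i] & 0xC0) != 0x80) {
            truncateAt = i;
            break;
        }
    }
    output.truncate(truncateAt);
}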
Since UTF-8 byte sequences aren't longer than 4 bytes, it shouldn't take more than 4 loop iterations to find the correct position.
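For readers who want to try the idea without Qt, here is a minimal standalone sketch of the same backward scan using std::string. The function name truncateUtf8 and its signature are hypothetical, not part of the original answer or of any library, and the sketch assumes the input is already valid UTF-8 (as the bytes returned by text.toUtf8() would be).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (name and signature are illustrative only).
// Returns the longest prefix of `utf8` that is at most `limit` bytes long
// and does not end in the middle of a multi-byte sequence.
// Assumption: `utf8` holds valid UTF-8.
std::string truncateUtf8(const std::string& utf8, std::size_t limit)
{
    if (utf8.size() <= limit)
        return utf8;  // already short enough, nothing to cut

    std::size_t cut = limit;
    // Step back while the byte at the cut position is a continuation byte
    // (bit pattern 10xxxxxx); cutting there would split a character.
    while (cut > 0 && (static_cast<unsigned char>(utf8[cut]) & 0xC0) == 0x80)
        --cut;

    return utf8.substr(0, cut);
}
```

With the examples from the question, "How are you?" with a limit of 5 gives "How a", and the bytes of "Как" (d0 9a d0 b0 d0 ba) with a limit of 5 give d0 9a d0 b0. The cast to unsigned char only keeps the bit test independent of whether plain char is signed on the platform.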