How to count characters in a unicode string in C

Let's say I have a string:

char theString[] = "你们好āa";

Given that my encoding is UTF-8, this string is 12 bytes long (the three hanzi characters are three bytes each, the Latin character with the macron is two bytes, and the 'a' is one byte):

strlen(theString) == 12

How can I count the number of characters? How can I do the equivalent of subscripting so that:

theString[3] == "好"

How can I slice and concatenate such strings?

10 solutions

#1

You only count the bytes whose top two bits are not set to 10 (i.e., everything less than 0x80 or greater than 0xbf).

That's because all the characters with the top two bits set to 10 are UTF-8 continuation bytes.

See here for a description of the encoding and how strlen can work on a UTF-8 string.

For slicing and dicing UTF-8 strings, you basically have to follow the same rules. Any byte starting with a 0 bit or a 11 sequence is the start of a UTF-8 code point; all others are continuation bytes.

Your best bet, if you don't want to use a third-party library, is to simply provide functions along the lines of:

utf8left (char *destbuff, char *srcbuff, size_t sz);
utf8mid  (char *destbuff, char *srcbuff, size_t pos, size_t sz);
utf8rest (char *destbuff, char *srcbuff, size_t pos);

to get, respectively:

the left sz UTF-8 bytes of a string.
the sz UTF-8 bytes of a string, starting at pos.
the rest of the UTF-8 bytes of a string, starting at pos.

This will be a decent building block to be able to manipulate the strings sufficiently for your purposes.
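
As a rough illustration of those three building blocks, here is a minimal sketch. It makes several assumptions that the answer does not state: sz and pos are counted in code points rather than bytes, destbuff is at least strlen(srcbuff) + 1 bytes long, the input is valid UTF-8, the functions return nothing, and srcbuff is taken as const. The helper utf8_skip is a hypothetical name introduced only for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a pointer to the start of code point n
   (zero-based), or to the terminating NUL if the string is shorter. */
static const char *utf8_skip(const char *s, size_t n)
{
    while (*s) {
        if (n == 0)
            return s;
        /* step over the lead byte, then over its continuation bytes */
        ++s;
        while (((unsigned char)*s & 0xC0) == 0x80)
            ++s;
        --n;
    }
    return s;
}

/* Copies sz code points, starting at code point pos, into destbuff. */
void utf8mid(char *destbuff, const char *srcbuff, size_t pos, size_t sz)
{
    const char *from = utf8_skip(srcbuff, pos);
    const char *to = utf8_skip(from, sz);
    memcpy(destbuff, from, (size_t)(to - from));
    destbuff[to - from] = '\0';
}

/* Copies the first sz code points into destbuff. */
void utf8left(char *destbuff, const char *srcbuff, size_t sz)
{
    utf8mid(destbuff, srcbuff, 0, sz);
}

/* Copies everything from code point pos to the end into destbuff. */
void utf8rest(char *destbuff, const char *srcbuff, size_t pos)
{
    strcpy(destbuff, utf8_skip(srcbuff, pos));
}
```

With the string from the question, utf8mid(buf, "你们好āa", 2, 1) would leave "好" in buf. All three functions only ever cut at lead bytes, so the output stays valid UTF-8 as long as the input was.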

#2

The easiest way is to use a library like ICU.

#3

Try this for size:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// returns the number of utf8 code points in the buffer at s
size_t utf8len(char *s)
{
    size_t len = 0;
    for (; *s; ++s) if ((*s & 0xC0) != 0x80) ++len;
    return len;
}

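// returns a pointer to the beginning of the pos'th utf8 codepoint
// in the buffer at s, or to the terminating NUL if pos equals the
// number of codepoints (so a slice can end at the end of the string)
char *utf8index(char *s, size_t pos)
{
    ++pos;
    for (; *s; ++s) {
        if ((*s & 0xC0) != 0x80) --pos;
        if (pos == 0) return s;
    }
    // pos == 1 here means exactly one more codepoint start was needed
    return pos == 1 ? s : NULL;
}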

// converts codepoint indexes start and end to byte offsets in the buffer at s
void utf8slice(char *s, ssize_t *start, ssize_t *end)
{
    char *p = utf8index(s, *start);
    *start = p ? p - s : -1;
    p = utf8index(s, *end);
    *end = p ? p - s : -1;
}

// appends the utf8 string at src to dest
char *utf8cat(char *dest, char *src)
{
    return strcat(dest, src);
}

// test program
int main(int argc, char **argv)
{
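    // slurp all of stdin to p, with length len, leaving room for a NUL
    char *p = NULL;
    size_t len = 0;
    while (true) {
        char *q = realloc(p, len + 0x10000 + 1);
        if (q == NULL) {
            perror("realloc");
            abort();
        }
        p = q;
        ssize_t cnt = read(STDIN_FILENO, p + len, 0x10000);
        if (cnt == -1) {
            perror("read");
            abort();
        } else if (cnt == 0) {
            break;
        } else {
            len += cnt;
        }
    }
    p[len] = '\0'; // utf8len and utf8index stop at the terminating NUL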

    // do some demo operations
    printf("utf8len=%zu\n", utf8len(p));
    ssize_t start = 2, end = 3;
    utf8slice(p, &start, &end);
    // the precision argument of %.*s must be an int
    printf("utf8slice[2:3]=%.*s\n", (int)(end - start), p + start);
    start = 3; end = 4;
    utf8slice(p, &start, &end);
    printf("utf8slice[3:4]=%.*s\n", (int)(end - start), p + start);
    return 0;
}

Sample run:

matt@stanley:~/Desktop$ echo -n 你们好āa | ./utf8ops 
utf8len=5
utf8slice[2:3]=好
utf8slice[3:4]=ā

Note that your example has an off-by-one error: counting from zero, theString[2] == "好"

#4

Depending on your notion of "character", this question can get more or less involved.

First off, you should transform your byte string into a string of unicode codepoints. You can do this with iconv() or ICU, though if this is the only thing you do, iconv() is a lot easier, and it's part of POSIX.

Your string of unicode codepoints could be something like a null-terminated uint32_t[], or if you have C11, an array of char32_t. The size of that array (i.e. its number of elements, not its size in bytes) is the number of codepoints (plus the terminator), and that should give you a very good start.
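
The conversion described above might look roughly like the sketch below. It assumes the platform's iconv() accepts the encoding names "UTF-8" and "UTF-32LE" (glibc and GNU libiconv do) and that its second parameter is char **, as POSIX specifies. The function name utf8_to_codepoints is made up for this example.

```c
#include <iconv.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Decodes a NUL-terminated UTF-8 string into a newly allocated,
   0-terminated array of code points and stores the count in *count.
   Returns NULL on allocation failure or invalid input. */
uint32_t *utf8_to_codepoints(const char *utf8, size_t *count)
{
    size_t in_left = strlen(utf8);
    size_t out_size = in_left * 4;  /* at most one code point per input byte */
    unsigned char *raw = malloc(out_size + 4);
    uint32_t *cps = malloc((in_left + 1) * sizeof *cps);
    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");
    if (raw == NULL || cps == NULL || cd == (iconv_t)-1) {
        free(raw);
        free(cps);
        if (cd != (iconv_t)-1)
            iconv_close(cd);
        return NULL;
    }

    char *in = (char *)utf8;        /* iconv() does not modify the input bytes */
    char *out = (char *)raw;
    size_t out_left = out_size;
    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        free(raw);
        free(cps);
        return NULL;
    }

    /* assemble each code point from little-endian bytes, so the result
       does not depend on the host byte order */
    size_t n = (out_size - out_left) / 4;
    for (size_t i = 0; i < n; i++)
        cps[i] = (uint32_t)raw[4 * i]
               | (uint32_t)raw[4 * i + 1] << 8
               | (uint32_t)raw[4 * i + 2] << 16
               | (uint32_t)raw[4 * i + 3] << 24;
    cps[n] = 0;
    free(raw);
    *count = n;
    return cps;
}
```

Once the string is in this form, subscripting works as the question wanted: element 2 of the array for "你们好āa" is U+597D, the code point for 好. The caller frees the returned array.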

However, the notion of a "printable character" is fairly complex, and you may prefer to count graphemes rather than codepoints - for instance, an a with an accent ^ can be expressed as two unicode codepoints, or as a combined legacy codepoint â - both are valid, and both are required by the unicode standard to be treated equally. There is a process called "normalization" which turns your string into a definite version, but there are many graphemes which are not expressible as a single codepoint, and in general there is no way around a proper library that understands this and counts graphemes for you.
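
To make the difference between codepoints and graphemes concrete, the small sketch below counts codepoints in the two spellings of â mentioned above. The function and variable names are invented for the example, and the counting rule is the usual one of skipping UTF-8 continuation bytes.

```c
#include <stddef.h>

/* Counts code points by skipping UTF-8 continuation bytes (10xxxxxx). */
size_t count_code_points(const char *s)
{
    size_t n = 0;
    for (; *s; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    return n;
}

/* U+00E2, the precomposed "a with circumflex" */
const char precomposed[] = "\xC3\xA2";

/* U+0061 "a" followed by U+0302 COMBINING CIRCUMFLEX ACCENT */
const char decomposed[] = "a\xCC\x82";
```

Both strings display as the same single grapheme, yet count_code_points gives 1 for precomposed and 2 for decomposed. Counting codepoints is therefore only an approximation of what a reader would call a character.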

That said, it's up to you to decide how complex your scripts are and how thoroughly you want to treat them. Transforming into unicode codepoints is a must, everything beyond that is at your discretion.

Don't hesitate to ask questions about ICU if you decide that you need it, but feel free to explore the vastly simpler iconv() first.

#5

In the real world, theString[3]=foo; is not a meaningful operation. Why would you ever want to replace a character at a particular position in the string with a different character? There's certainly no natural-language-text processing task for which this operation is meaningful.

Counting characters is also unlikely to be meaningful. How many characters (for your idea of "character") are there in "á"? How about "á"? Now how about "གི"? If you need this information for implementing some sort of text editing, you're going to have to deal with these hard questions, or just use an existing library/gui toolkit. I would recommend the latter unless you're an expert on world scripts and languages and think you can do better.

For all other purposes, strlen tells you exactly the piece of information that's actually useful: how much storage space a string takes. This is what's needed for combining and separating strings. If all you want to do is combine strings or separate them at a particular delimiter, snprintf (or strcat if you insist...) and strstr are all you need.
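
As a sketch of that byte-oriented approach, the hypothetical function below splits a UTF-8 string at a delimiter with strstr and reassembles it with snprintf, without ever counting characters. The function name and its parameters are assumptions made for the example.

```c
#include <stdio.h>
#include <string.h>

/* Splits src at the first occurrence of delim and writes the two halves
   to out, joined by sep. Returns 0 on success, -1 if delim is missing
   or out is too small. */
int swap_separator(char *out, size_t out_size, const char *src,
                   const char *delim, const char *sep)
{
    const char *hit = strstr(src, delim);
    if (hit == NULL)
        return -1;
    int head_len = (int)(hit - src);         /* byte length of the left part */
    const char *tail = hit + strlen(delim);  /* first byte after the delimiter */
    int n = snprintf(out, out_size, "%.*s%s%s", head_len, src, sep, tail);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

This works on UTF-8 without special handling because a valid UTF-8 delimiter can only match at a code point boundary in valid UTF-8 text, so the split never lands in the middle of a multi-byte sequence.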

If you want to perform higher-level natural-language-text operations, like capitalization, line breaking, etc. or even higher-level operations like pluralization, tense changes, etc. then you'll need either a library like ICU or respectively something much higher-level and linguistically-capable (and specific to the language(s) you're working with).

Again, most programs do not have any use for this sort of thing and just need to assemble and parse text without any considerations to natural language.

#6

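#include <stddef.h>
// counts every byte that is not a UTF-8 continuation byte (10xxxxxx)
size_t utf8_count(const char *s)
{
    size_t i = 0, j = 0;
    while (s[i]) {
        if ((s[i] & 0xC0) != 0x80)
            j++;
        i++;
    }
    return (j);
}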

This will count characters in a UTF-8 String... (Found in this article: Even faster UTF-8 character counting)

However I'm still stumped on slicing and concatenating?!?

#7

In general we should use a different data type for unicode characters.

For example, you can use the wide char data type:

wchar_t theString[] = L"你们好āa";

Note the L prefix, which tells the compiler that the string literal is composed of wide chars.

The length of that string can be calculated using the wcslen function, which behaves like strlen.
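
A minimal sketch of that suggestion follows. The names wide_string and wide_length are invented for the example, and the literal uses universal character names so that it does not depend on the encoding of the source file.

```c
#include <stddef.h>
#include <wchar.h>

/* The question's string: U+4F60 U+4EEC U+597D U+0101 followed by 'a'. */
const wchar_t wide_string[] = L"\u4f60\u4eec\u597d\u0101" L"a";

/* Returns the number of wchar_t elements before the terminator. */
size_t wide_length(void)
{
    return wcslen(wide_string);
}
```

Here wide_length() gives 5 and wide_string[2] is 好, which is the subscripting the question asked for. One caveat: where wchar_t is 16 bits wide (Windows, for instance), characters outside the Basic Multilingual Plane occupy two elements, so wcslen is not always a code point count.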

#8

One thing that's not clear from the above answers is why it's not simple. Each character is encoded in one way or another - it doesn't have to be UTF-8, for example - and each character may have multiple encodings, with varying ways to handle combining of accents, etc. The rules are really complicated, and vary by encoding (e.g., utf-8 vs. utf-16).

This question has enormous security concerns, so it is imperative that this be done correctly. Use an OS-supplied library or a well-known third-party library to manipulate unicode strings; don't roll your own.

#9

I wrote a similar implementation years ago, but I no longer have the code with me.

For each unicode character, the first byte tells you how many bytes follow it to make up that character. Based on the first byte, you can determine the length of each unicode character.
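
Since the original code is not available, here is one way the first-byte rule could be sketched for UTF-8. The function names are hypothetical. The sketch checks that continuation bytes are present, but it does not reject overlong encodings or lead bytes that the UTF-8 specification forbids (0xC0, 0xC1, 0xF5 and above).

```c
#include <stddef.h>

/* Returns the total byte length of the sequence that starts with lead,
   or 0 if lead cannot start a sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* continuation or invalid byte */
}

/* Counts code points by hopping from lead byte to lead byte.
   Returns (size_t)-1 on a bad lead byte or a truncated sequence. */
size_t utf8_count_by_lead(const char *s)
{
    size_t n = 0;
    while (*s) {
        size_t len = utf8_seq_len((unsigned char)*s);
        if (len == 0)
            return (size_t)-1;
        /* every following byte must be a continuation byte; a NUL here
           fails the check, so the loop never reads past the terminator */
        for (size_t i = 1; i < len; i++)
            if (((unsigned char)s[i] & 0xC0) != 0x80)
                return (size_t)-1;
        s += len;
        ++n;
    }
    return n;
}
```

For the string in the question, the lead bytes give lengths 3, 3, 3, 2 and 1, which add up to the 12 bytes reported by strlen and to 5 code points.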

I think it's a good UTF8 library.

#10

A sequence of code points constitutes a single syllable / letter / character in many non-Western-European languages (e.g., all Indic languages).

So, when you are counting the length or finding a substring (there are definitely use cases for finding substrings, say in a hangman game), you need to advance syllable by syllable, not code point by code point.

So the definition of the character/syllable, and where you actually break the string into "chunks of syllables", depends upon the nature of the language you are dealing with. For example, the pattern of the syllables in many Indic languages (Hindi, Telugu, Kannada, Malayalam, Nepali, Tamil, Punjabi, etc.) can be any of the following:

V  (Vowel in their primary form appearing at the beginning of the word)
C (consonant)
C + V (consonant + vowel in their secondary form)
C + C + V
C + C + C + V

You need to parse the string and look for the above patterns to break the string and to find the substrings.

I do not think it is possible to have a general-purpose method that can magically break strings in the above fashion for any unicode string (or sequence of code points), as the pattern that works for one language may not be applicable to another.

I guess there may be some methods / libraries that can take some definition / configuration parameters as input to break unicode strings into such syllable chunks. Not sure though! I'd appreciate it if someone could share how they solved this problem using any commercially available or open source methods.
