Is it actually possible to store and process individual UTF-8 characters in C? If so, how?

I've written a program in C that breaks words down into syllables, segments and letters. It's working well with ASCII characters but I want to make versions that work for the IPA and Arabic too.

I'm having massive problems saving and performing functions on individual characters. My editor and console are both set up to UTF-8 and can display Arabic text fine if I save it as a char*, but when I try to print wchars they display random punctuation marks.

My program needs to be able to recognise an individual UTF-8 character in order to work. For example, for the word 'though' it stores 't' as syllable[1]segment[1]letter[1], h as syllable[1]segment[1]letter[2] etc. I want to be able to do the same for non-ASCII characters.

I've spent basically the whole day researching unicode and trying out different methods and I can't get any of them to let me store an Arabic character as a character.

I'm not sure if I've just made some stupid syntax errors along the way, if I've completely misunderstood the whole concept, or if it actually just isn't possible to do what I want in C and I should just give up and try another language...

I would massively, massively, massively appreciate any help you can offer! I'm pretty new to programming, but unicode is completely instrumental to my work so I want to work out how to do it from the beginning.

My understanding of how unicode works (in case that's where I'm going wrong):

I type some text into my editor. My editor encodes it according to the encoding I have set. So if I set it to UTF-8 it will encode the Arabic letter ب with the 2-byte sequence 0xd8 0xa8, which represents the code point U+0628.

I compile it, breaking down 0xd8 0xa8 into the binary 11011000 10101000.

I run it on the command prompt. The command prompt interprets the text according to the encoding I have set, so if I set it to UTF-8 it should interpret 11011000 10101000 as the code point U+0628. Unicode algorithms also tell it which version of U+0628 to display to me, as the character has different shapes depending on where it is in the word. As the character is alone it will show me the standalone version ب

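As a rough check of the byte arithmetic described above, the two-byte case can be sketched in C. The function name `decode_two_byte_utf8` is made up for this illustration. It only handles the 110xxxxx 10xxxxxx form and does not reject overlong encodings.

```c
#include <stdint.h>

/* Hypothetical helper: decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx)
   into its code point. Returns 0xFFFFFFFF if the bytes do not have that form.
   Overlong encodings (lead bytes 0xC0 and 0xC1) are not rejected here. */
uint32_t decode_two_byte_utf8(unsigned char lead, unsigned char cont)
{
    if ((lead & 0xE0) != 0xC0 || (cont & 0xC0) != 0x80)
        return 0xFFFFFFFFu;
    /* 5 payload bits come from the lead byte, 6 from the continuation byte */
    return ((uint32_t)(lead & 0x1F) << 6) | (uint32_t)(cont & 0x3F);
}
```

Calling `decode_two_byte_utf8(0xD8, 0xA8)` yields 0x0628, the code point of ب.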

My understanding of the ways I can process unicode in C:

Option A - Use single bytes encoded as UTF-8 (http://www.nubaria.com/en/blog/?p=289)

Use single bytes encoded as UTF-8. Leave all my datatypes as chars and char arrays and only type ASCII characters in my code. If I absolutely have to hard code a unicode character enter it as an array in the format:

    const char kChineseSampleText[] = "\xe4\xb8\xad\xe6\x96\x87";

My problems with this:

I need to manipulate individual characters

Having to type Arabic characters as code points is going to render my code completely unreadable and slow me down immensely.

Option B - Use wchar and friends (http://icu-project.org/docs/papers/unicode_wchar_t.html)

Swap using chars for wchars, which hold 2 to 4 bytes depending on the compiler. String functions like strlen will not work as they are expecting characters to be one byte, but there are w functions like wprintf I can use instead.

My problem with this:

I can’t get wchars to print Arabic characters at all! I can get them to print English letters fine, but Arabic characters just pull through as random punctuation marks.

I've tried inputting the unicode code point as well as the actual Arabic character, and I've tried printing them both to the console and to a UTF-8 encoded text file. I get the same result, even though both the console and the text file display Arabic text if it is entered as a char*. I've included my code at the end.

(It’s worth saying here that I am aware that a lot of people think wchars are bad because they aren’t very portable and because they take up extra space for ASCII characters. But at this stage, neither of those things are really a worry for me - I’m just writing the program to run on my own computer and the program will only be processing short strings.)

Option C - Use external libraries

I've read in various comments that external libraries are the way to go so I've tried:

C programming library

http://www.cprogramming.com/tutorial/unicode.html suggests replacing all chars with unsigned long integers and using special functions for iterating through strings etc. The site even provides a sample library to download.

My problem:

While I can set the character to be an unsigned long integer I can’t print it out, because the printf and wprintf functions don’t work, and neither does the library provided on the website (I think maybe the library was designed for Linux? Some of the datatypes are invalid and amending them didn't work either)

ICU library

My problem:

I downloaded the ICU library, but when I was looking into how to use it I saw that functionality such as the characterIterator is not available for use in C (http://userguide.icu-project.org/strings). Being able to iterate through characters is completely fundamental to what I need to do, so I don't think the library will work for me.

My code

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>
#include <string.h>


int main ()
{
wchar_t unicode = L'\xd8ac';
wchar_t arabic = L'ب';
wchar_t number = 0x062c;


FILE* f;
f = fopen("unitest.txt","w");
char* string = "ايه الاخبار";


//printf - works 

printf("printf - literal arabic character is \"م\"\n");
fprintf(f,"printf - literal arabic character is \"م\"\n");

printf("printf - char* string is \"%s\"\n",string);
fprintf(f,"printf - char* string is \"%s\"\n",string);


//wprintf  - english - works

wprintf(L"wprintf - literal english char is \"%C\"\n\n", L't');
fwprintf(f,L"wprintf - literal english char is \"%C\"\n\n", L't');

//wprintf - arabic - doesnt work

wprintf(L"wprintf - unicode wchar_t is \"%C\"\n", unicode);
fwprintf(f,L"wprintf - unicode wchar_t is \"%C\"\n", unicode);

wprintf(L"wprintf - unicode number wchar_t is \"%C\"\n", number);
fwprintf(f,L"wprintf - unicode number wchar_t is \"%C\"\n", number);

wprintf(L"wprintf - arabic wchar_t is \"%C\"\n", arabic);
fwprintf(f,L"wprintf - arabic wchar_t is \"%C\"\n", arabic);


wprintf(L"wprintf - literal arabic character is \"%C\"\n",L'ت');
fwprintf(f,L"wprintf - literal arabic character is \"%C\"\n",L'ت');


wprintf(L"wprintf - literal arabic character in string is \"م\"\n\n");
fwprintf(f,L"wprintf - literal arabic character in string is \"م\"\n\n");

fclose(f);

return 0;
}

Output file

printf - literal arabic character is "م"
printf - char* string is "ايه الاخبار"
wprintf - literal english char is "t"

wprintf - unicode wchar_t is "�"
wprintf - unicode number wchar_t is ","
wprintf - arabic wchar_t is "("
wprintf - literal arabic character is "*"
wprintf - literal arabic character in string is ""

I'm using Windows 10, Notepad++ and MinGW.

Edit: This got marked as a duplicate of Light C Unicode Library, but I don't think it really answers my question. I've downloaded the library and had a look at it, and you can call me stupid if you like, but I'm really new to programming and I don't understand most of the code in the library, so it's hard for me to work out how I can use it to achieve what I want. I searched the library for a print function and couldn't find one...

I just want to save a UTF-8 character and then print it out again! Do I really need to install an entire library to do that? I would just really appreciate someone taking pity on me and telling me in baby terms how I can do it... People keep saying I should use uint_32 or something instead of wchar - but how do I then print those datatypes? Can I do it with wprintf?!

3 answers

#1

C and UTF-8 are still getting to know each other. In other words, IMO, C support for UTF-8 is scant.

Is it ... possible to store and process individual UTF-8 characters ...?

The first step is to make certain "ايه الاخبار" is a UTF-8 encoded string. Since C11, C supports this explicitly with the u8 prefix: u8"ايه الاخبار".

A UTF-8 string is a sequence of char. Each Unicode character is represented by 1 to 4 char. Holding any Unicode code point as a single value needs at least 21 bits. Yet OP does not need to convert a portion of string[] into a Unicode code point so much as segment that string on UTF-8 boundaries. These boundaries are readily found by looking for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.

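As a small illustration of the continuation-byte idea, here is a minimal sketch that counts the characters (code points) in a UTF-8 string. The name `utf8_count` is hypothetical and not part of any library, and the sketch assumes the input is already valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a null-terminated UTF-8 string
   by counting every byte that is NOT a continuation byte (10xxxxxx).
   Assumes the input is valid UTF-8. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For a pure ASCII string this gives the same result as strlen. For Arabic text each letter counts once even though it takes two bytes.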

The following copies one Unicode character at a time, still encoded as UTF-8, into a short string with a terminating null character. Then that short string is printed.

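#include <stdio.h>

int main(void) {
  const char *string = u8"ايه الاخبار";
  for (const char *s = string; *s; ) {
    char u[5];               /* up to 4 UTF-8 bytes plus the terminating null */
    char *p = u;
    *p++ = *s++;             /* lead byte, or a single ASCII byte */
    /* copy up to three continuation bytes (bit pattern 10xxxxxx) */
    if ((*s & 0xC0) == 0x80) *p++ = *s++;
    if ((*s & 0xC0) == 0x80) *p++ = *s++;
    if ((*s & 0xC0) == 0x80) *p++ = *s++;
    *p = 0;
    printf("<%s>\n", u);
  }
  return 0;
}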

With the output viewed with a UTF8 aware screen:

<ا>
<ي>
<ه>
< >
<ا>
<ل>
<ا>
<خ>
<ب>
<ا>
<ر>
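
The same segmentation can also store each character rather than print it straight away, which is closer to the letter[1], letter[2] layout described in the question. The following is only a sketch: `utf8_split` and `UTF8_MAX_BYTES` are names invented for this example, and it assumes valid UTF-8 input.

```c
#include <stddef.h>

#define UTF8_MAX_BYTES 4

/* Hypothetical helper: split a UTF-8 string into separate null-terminated
   one-character strings, so that letters[i] can be printed with "%s".
   Assumes valid UTF-8. Returns the number of characters stored (at most max). */
size_t utf8_split(const char *s, char letters[][UTF8_MAX_BYTES + 1], size_t max)
{
    size_t count = 0;
    while (*s && count < max) {
        size_t len = 0;
        letters[count][len++] = *s++;             /* lead byte or ASCII byte */
        while (((unsigned char)*s & 0xC0) == 0x80 && len < UTF8_MAX_BYTES)
            letters[count][len++] = *s++;         /* continuation bytes */
        letters[count][len] = '\0';
        count++;
    }
    return count;
}
```

A caller might declare `char letters[32][UTF8_MAX_BYTES + 1];`, call `utf8_split(word, letters, 32)`, and then pass any `letters[i]` to printf with `%s`. Storing each character as a tiny string avoids wchar_t entirely.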

#2

An example with utf8proc library to iterate is:

#include <utf8proc.h>
#include <stdio.h>

int main(void) {
  utf8proc_uint8_t const string[] = u8"ايه الاخبار";
  utf8proc_ssize_t size = sizeof string / sizeof *string - 1;
  utf8proc_int32_t data;
  utf8proc_ssize_t n;

  utf8proc_uint8_t const *pstring = string;
  while ((n = utf8proc_iterate(pstring, size, &data)) > 0) {
    printf("<%.*s>\n", (int)n, pstring);
    pstring += n;
    size -= n;
  }
}

This is probably not the best way to use this library, but I have opened an issue on GitHub asking for some examples, because I find it hard to understand how the library is meant to be used.

#3

You need to very clearly understand the difference between a Unicode code point and UTF-8. UTF-8 is a variable-length byte encoding of Unicode code points. The lower end, values 0-127, is stored as a single byte. That's the main point of UTF-8, and makes it backwards compatible with ASCII.

When bit 7 is set, for values over 127, a variable-length code of two bytes or more is used. The leading byte always has the bit pattern 11xxxxxx, and each byte that follows it has the pattern 10xxxxxx.

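To make the bit patterns concrete, here is a minimal sketch that reads the sequence length straight from the high bits of the first byte, as an alternative to a lookup table. The name `utf8_sequence_length` is an assumption for this example.

```c
/* Hypothetical helper: length in bytes of the UTF-8 sequence that starts with
   the given byte, judged from its high bits alone:
     0xxxxxxx -> 1, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4.
   Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte. */
int utf8_sequence_length(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}
```

Returning 0 for a continuation byte lets a caller detect that it has landed in the middle of a character.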

Here's code to get the skip (the number of bytes used by a character), and also to read a code point and to write one.

static const unsigned int offsetsFromUTF8[6] = 
{
    0x00000000UL, 0x00003080UL, 0x000E2080UL,
    0x03C82080UL, 0xFA082080UL, 0x82082080UL
};

static const unsigned char trailingBytesForUTF8[256] = {
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};



int bbx_utf8_skip(const char *utf8)
{
  return trailingBytesForUTF8[(unsigned char) *utf8] + 1;
}

int bbx_utf8_getch(const char *utf8)
{
    int ch;
    int nb;

    nb = trailingBytesForUTF8[(unsigned char)*utf8];
    ch = 0;
    switch (nb) 
    {
            /* these fall through deliberately */
        case 3: ch += (unsigned char)*utf8++; ch <<= 6;
        case 2: ch += (unsigned char)*utf8++; ch <<= 6;
        case 1: ch += (unsigned char)*utf8++; ch <<= 6;
        case 0: ch += (unsigned char)*utf8++;
    }
    ch -= offsetsFromUTF8[nb];

    return ch;
}

int bbx_utf8_putch(char *out, int ch)
{
  char *dest = out;
  if (ch < 0x80) 
  {
     *dest++ = (char)ch;
  }
  else if (ch < 0x800) 
  {
    *dest++ = (ch>>6) | 0xC0;
    *dest++ = (ch & 0x3F) | 0x80;
  }
  else if (ch < 0x10000) 
  {
     *dest++ = (ch>>12) | 0xE0;
     *dest++ = ((ch>>6) & 0x3F) | 0x80;
     *dest++ = (ch & 0x3F) | 0x80;
  }
  else if (ch < 0x110000) 
  {
     *dest++ = (ch>>18) | 0xF0;
     *dest++ = ((ch>>12) & 0x3F) | 0x80;
     *dest++ = ((ch>>6) & 0x3F) | 0x80;
     *dest++ = (ch & 0x3F) | 0x80;
  }
  else
    return 0;
  return dest - out;
}

Using these functions or similar, you convert between code points and UTF-8 and back.

Windows currently uses UTF-16 for its APIs. To a first approximation, UTF-16 is the code points in 16-bit format (code points above U+FFFF take two 16-bit units, a surrogate pair). So when writing a UTF-8 based program, you need to convert the UTF-8 to UTF-16 (using wide chars) immediately before calling Windows output functions.

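The code point to UTF-16 step can be sketched in portable C as follows. This is an illustration only: `utf16_encode` is a made-up name, and the sketch covers the encoding arithmetic, not any Windows API call.

```c
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16 code units.
   Returns the number of units written to out (1 or 2),
   or 0 if cp cannot be encoded. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogates are reserved */
    if (cp < 0x10000) {                           /* fits in one 16-bit unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp > 0x10FFFF) return 0;                  /* beyond the Unicode range */
    cp -= 0x10000;                                /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));   /* low surrogate */
    return 2;
}
```

Arabic letters such as U+0628 fit in a single unit. Only code points above U+FFFF need the surrogate pair.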

On Windows, support for UTF-8 via printf() is patchy. Passing a UTF-8 encoded string to printf() is unlikely to do what you want unless the console is set up for UTF-8.
