How to work with 'æ', 'ø' and 'å' in C

Time: 2021-07-27 20:13:39

I have written a program in C that can either replace or remove all vowels in a string. In addition, I would like it to work for these characters: 'æ', 'ø', 'å'.

I have tried to use strstr(), but I didn't manage to implement it without replacing every char on a line containing 'æ', 'ø' or 'å'. I have also read about wchar, but that only seems to complicate everything.

The program is working with this array of chars:

char vowels[6] = {'a', 'e', 'i', 'o', 'u', 'y'};

I tried with this array:

char vowels[9] = {'a', 'e', 'i', 'o', 'u', 'y', 'æ', 'ø', 'å'};

but it gives these warnings:

warning: multi-character character constant [-Wmultichar]

warning: overflow in implicit constant conversion [-Woverflow]

and if I want to replace each vowel with 'a' it replaces 'å' with "�a".

I have also tried with the UTF-8 hex values of 'æ', 'ø' and 'å':

char extended[3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"};

but it gives this error:

excess elements in char array initializer

Is there a way to make this work without making it too complicated?

2 answers

#1


4  

There are two approaches to getting that character to be usable. The first is code pages, which would allow you to use extended ASCII characters (values 128-255), but the code page is system and locale dependent, so it's a bad idea in general.

The better alternative is to use Unicode. The typical approach with Unicode is to use wide character literals, as in this example from another post:

wchar_t str[] = L"αγρω";

The key problem with your code is that you're trying to compare ASCII with UTF-8, and that comparison fails for any character outside the ASCII range. The solution is simple: convert all your literals, as well as your strings, to their wide-character equivalents. You need to work with one common encoding rather than mixing encodings, unless you have conversion functions to help out.
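As a rough illustration of the wide-character approach, here is a minimal sketch. The function name `replace_vowels_wide` and the `VOWELS` table are made up for this example, and the vowels are written as `\u` escapes so the source file itself stays plain ASCII. It assumes the text is already in a `wchar_t` buffer and that every character involved fits in one `wchar_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical vowel table: ASCII vowels plus U+00E6, U+00F8 and U+00E5. */
static const wchar_t VOWELS[] = L"aeiouy\u00e6\u00f8\u00e5";

/* Replace every vowel in s with repl, in place.
   Returns the number of characters that were replaced. */
size_t replace_vowels_wide(wchar_t *s, wchar_t repl) {
    assert(s != NULL);
    size_t replaced = 0;
    for (; *s != L'\0'; ++s) {
        if (wcschr(VOWELS, *s) != NULL) {
            *s = repl;
            ++replaced;
        }
    }
    return replaced;
}
```

Input that arrives as UTF-8 bytes would first have to be converted, for example with `setlocale(LC_ALL, "")` followed by `mbstowcs()`, and that step depends on the locale the program runs under.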

#2


4  

Learn about UTF-8 (including its relationship to Unicode) and use a UTF-8 library: libunistring, utfcpp, GLib from GTK, ICU ...

You need to understand which character encoding you are using.

I strongly recommend UTF-8 in all cases (which is the default on most Linux systems and nearly all the Internet and web servers; read locale(7) & utf8(7)). Read utf8everywhere....

I don't recommend wchar_t, whose width, range and signedness are implementation-specific (you can't be sure that every Unicode code point fits in a wchar_t; on Windows wchar_t is 16 bits, so it does not). Also, converting UTF-8 input to Unicode/UCS4 can be time-consuming, more so than handling UTF-8 directly...

Do understand that in UTF-8 a character can be encoded in several bytes. For example ê (French lower-case e with circumflex) is encoded in two bytes, 0xc3 0xaa, and ы (Russian lower-case yery) is encoded in two bytes, 0xd1 0x8b. Both are considered vowels, but neither fits in one char (which is an 8-bit byte on your machine and mine).
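To make the byte-versus-character distinction concrete, here is a small sketch. The helper name `utf8_codepoints` is invented for this example, and it assumes its input is already valid UTF-8. It counts characters by skipping continuation bytes, which always have the bit pattern 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points in a valid UTF-8 string.
   Every byte that is not a continuation byte (10xxxxxx) starts a new one. */
size_t utf8_codepoints(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

With this helper, `strlen("bl\xc3\xa5")` is 4 while `utf8_codepoints("bl\xc3\xa5")` is 3, which is why a `char` array of single-quoted vowels cannot hold 'å'.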

The notion of a vowel is complicated (e.g. what are the vowels in Russian, Arabic, Japanese, Hebrew, Cherokee, Hindi, ...?), so there might be no simple solution to your problem (especially since Unicode has combining characters).

Are you sure that æ and œ are letters or vowels? (FWIW, å, œ and æ are each classified as a lowercase letter in Unicode.) I was taught in French elementary school that they are ligatures (and French dictionaries don't treat them as letters, so œuf, which means egg, is listed where oeuf would be). But I am not an expert on this. See strcoll(3).

On Linux, since UTF-8 is the default encoding (and it is increasingly hard to get any other one on recent distributions), I don't recommend using wchar_t. Use UTF-8 char strings instead, with functions that handle multi-byte encoded UTF-8, for example (using the GLib UTF-8 and Unicode functions):

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <glib.h>

unsigned count_norvegian_lowercase_vowels(const char *s) {
  assert(s != NULL);
  // s should be a not-too-big string
  // (its `strlen` should be less than UINT_MAX)
  // s is assumed to be UTF-8 encoded, and should be valid UTF-8:
  if (!g_utf8_validate(s, -1, NULL)) {
    fprintf(stderr, "invalid UTF-8 string %s\n", s);
    exit(EXIT_FAILURE);
  }
  unsigned count = 0;
  // walk the string one code point (not one byte) at a time
  for (const char *pc = s; *pc != '\0'; pc = g_utf8_next_char(pc)) {
    gunichar u = g_utf8_get_char(pc);
    // comments from OP make me believe these are the only Norwegian vowels.
    if (u == 'a' || u == 'e' || u == 'i' || u == 'o' || u == 'u' || u == 'y'
        || u == (gunichar)0xe6 // æ U+00E6 LATIN SMALL LETTER AE
        || u == (gunichar)0xf8 // ø U+00F8 LATIN SMALL LETTER O WITH STROKE
        || u == (gunichar)0xe5 // å U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
        /* notice that for me ы & ê are also vowels but œ is a ligature ... */
    )
      count++;
  }
  return count;
}

I'm not sure the name of my function is correct, but you told me in the comments that Norwegian (which I don't know) has no more vowel characters than the ones my function counts.

I deliberately did not put UTF-8 characters in string literals or wide char literals (only in comments). There are other, obsolete character encodings (read about EBCDIC or KOI8), and you might want to cross-compile the code.
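For readers who would rather avoid a library, the same idea can be sketched in plain C by treating each vowel as a short byte string instead of a single char. Everything here (`VOWELS`, `vowel_len`, `replace_vowels_utf8`) is a hypothetical name for illustration, the multi-byte vowels are written as `\x` escapes so the source stays ASCII, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* An array of strings, not of chars: the last three entries are two bytes each. */
static const char *const VOWELS[] = {
    "a", "e", "i", "o", "u", "y",
    "\xc3\xa6", /* U+00E6 LATIN SMALL LETTER AE */
    "\xc3\xb8", /* U+00F8 LATIN SMALL LETTER O WITH STROKE */
    "\xc3\xa5", /* U+00E5 LATIN SMALL LETTER A WITH RING ABOVE */
};

/* Byte length of the vowel that starts at s, or 0 if no vowel starts there. */
static size_t vowel_len(const char *s) {
    for (size_t i = 0; i < sizeof VOWELS / sizeof VOWELS[0]; ++i) {
        size_t len = strlen(VOWELS[i]);
        if (strncmp(s, VOWELS[i], len) == 0)
            return len;
    }
    return 0;
}

/* Copy src into dst, writing the single byte repl in place of each vowel.
   dst must have room for at least strlen(src) + 1 bytes. */
void replace_vowels_utf8(char *dst, const char *src, char repl) {
    assert(dst != NULL && src != NULL);
    while (*src != '\0') {
        size_t len = vowel_len(src);
        if (len > 0) {
            *dst++ = repl;   /* one output byte per vowel */
            src += len;      /* skip the whole vowel, one or two bytes */
        } else {
            *dst++ = *src++; /* copy any other byte unchanged */
        }
    }
    *dst = '\0';
}
```

Removing vowels instead of replacing them would only need the `*dst++ = repl;` line dropped. Writing to a separate output buffer sidesteps the problem that a two-byte vowel and a one-byte replacement have different lengths.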
