I have written a C program that can either replace or remove all vowels from a string. I would also like it to work for these characters: 'æ', 'ø', 'å'.
I tried to use strstr(), but I could not get it to work without replacing every char on a line that contains 'æ', 'ø' or 'å'. I have also read about wchar, but that only seems to complicate everything.
The program works with this array of chars:
char vowels[6] = {'a', 'e', 'i', 'o', 'u', 'y'};
I tried this array instead:
char vowels[9] = {'a', 'e', 'i', 'o', 'u', 'y', 'æ', 'ø', 'å'};
but it gives these warnings:
warning: multi-character character constant [-Wmultichar]
warning: overflow in implicit constant conversion [-Woverflow]
and if I want to replace each vowel with 'a' it replaces 'å' with "�a".
I have also tried the UTF-8 hex values of 'æ', 'ø' and 'å':
char extended[3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"};
but it gives this error:
excess elements in char array initializer
Is there a way to make this work without making it too complicated?
2 Solutions
#1
4
There are two approaches to making those characters usable. The first is code pages, which would allow you to use extended ASCII characters (values 128-255), but the code page is system and locale dependent, so it's a bad idea in general.
The better alternative is to use Unicode. The typical case with Unicode is to use wide character literals, as in this post:
wchar_t str[] = L"αγρω";
The key problem with your code is that you're comparing single-byte ASCII chars against multi-byte UTF-8 sequences, so a comparison can never match a whole character. The solution is simple: convert all your literals, as well as your strings, to their wide-character equivalents. You need to work with one common encoding rather than mixing encodings, unless you have conversion functions to help out.
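As a rough illustration of the wide-character approach described above, here is a minimal sketch. The names `WIDE_VOWELS` and `replace_wide_vowels` are hypothetical, and the three extra letters are written as universal character names so the result does not depend on the encoding of the source file. It assumes the text is already held as `wchar_t`; input read from a terminal or file would first have to be converted (for example with `setlocale` and `mbstowcs`), which is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Lowercase vowels, including the three Norwegian letters
   (ae U+00E6, o-with-stroke U+00F8, a-with-ring U+00E5). */
static const wchar_t WIDE_VOWELS[] = L"aeiouy\u00e6\u00f8\u00e5";

/* Replace every vowel in the wide string `s` with `replacement`, in place.
   Returns the number of characters that were replaced. */
size_t replace_wide_vowels(wchar_t *s, wchar_t replacement)
{
    assert(s != NULL);
    size_t replaced = 0;
    for (; *s != L'\0'; ++s) {
        /* Each wchar_t holds one whole character here, so a plain
           lookup in the vowel list is enough. */
        if (wcschr(WIDE_VOWELS, *s) != NULL) {
            *s = replacement;
            ++replaced;
        }
    }
    return replaced;
}
```

Because every character occupies exactly one array element, replacing 'å' with 'a' cannot leave a stray byte behind, which is what happens when a two-byte UTF-8 sequence is only half overwritten.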
#2
4
Learn about UTF-8 (including its relationship to Unicode) and use a UTF-8 library: libunistring, utfcpp, Glib from GTK, ICU ....
You need to understand which character encoding you are using.
I strongly recommend UTF-8 in all cases (it is the default on most Linux systems and on nearly all of the Internet and web servers; read locale(7) & utf8(7)). Read utf8everywhere....
I don't recommend wchar_t, whose width, range and signedness are implementation-specific (you can't be sure that every Unicode code point fits in a wchar_t; on Windows it is 16 bits wide and holds UTF-16 code units, so it does not). Also, converting UTF-8 input to Unicode/UCS4 can be more time-consuming than handling UTF-8 directly...
Do understand that in UTF-8 a character can be encoded in several bytes. For example ê (French accented e circonflexe, lower-case) is encoded in the two bytes 0xc3, 0xaa, and ы (Russian yery, lower-case) is encoded in the two bytes 0xd1, 0x8b. Both are considered vowels, but neither fits in one char (which is an 8-bit byte on your machine and mine).
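To make the byte-versus-character distinction concrete, here is a small sketch in plain C with no library. The function names `utf8_sequence_length` and `utf8_code_point_count` are made up for this illustration, and the second function assumes its input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes in the UTF-8 sequence that starts with `lead`,
   or 0 if `lead` is a continuation byte or cannot start a sequence. */
size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;                  /* 0xxxxxxx: plain ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2; /* 110xxxxx */
    if (lead >= 0xE0 && lead <= 0xEF) return 3; /* 1110xxxx */
    if (lead >= 0xF0 && lead <= 0xF4) return 4; /* 11110xxx */
    return 0;                                   /* 10xxxxxx or invalid */
}

/* Count code points (not bytes) in a string assumed to be valid UTF-8,
   by counting every byte that is not a continuation byte. */
size_t utf8_code_point_count(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the two examples in the text, the lead bytes 0xc3 and 0xd1 both report a length of 2, so a single `char` comparison can only ever see half of such a character.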
The notion of vowel is complicated (e.g. what are the vowels in Russian, Arabic, Japanese, Hebrew, Cherokee, Hindi, ....), so there might be no simple solution to your problem (especially since Unicode has combining characters).
Are you quite sure that æ and œ are letters or vowels? (FWIW, å & œ & æ are each classified as a lowercase letter in Unicode.) I was taught in French elementary school that they are ligatures (and French dictionaries don't treat them as separate letters, so œuf, which means egg, is listed in a dictionary at the place of oeuf). But I am not an expert on this. See strcoll(3).
On Linux, since UTF-8 is the default encoding (and it is increasingly hard to get any other one on recent distributions), I don't recommend using wchar_t; use UTF-8 char strings instead (that is, functions handling multi-byte encoded UTF-8), for example (using Glib UTF8 & Unicode functions):
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <glib.h>

unsigned count_norwegian_lowercase_vowels(const char *s) {
    assert(s != NULL);
    // s should be a not-too-big string
    // (its `strlen` should be less than UINT_MAX)
    // s is assumed to be UTF-8 encoded, and should be valid UTF-8:
    if (!g_utf8_validate(s, -1, NULL)) {
        fprintf(stderr, "invalid UTF-8 string %s\n", s);
        exit(EXIT_FAILURE);
    }
    unsigned count = 0;
    // g_utf8_next_char steps over one whole (possibly multi-byte) character
    for (const char *pc = s; *pc != '\0'; pc = g_utf8_next_char(pc)) {
        gunichar u = g_utf8_get_char(pc);
        // comments from OP make me believe these are the only Norwegian vowels.
        if (u == 'a' || u == 'e' || u == 'i' || u == 'o' || u == 'u' || u == 'y'
            || u == (gunichar)0xe6 // æ U+00E6 LATIN SMALL LETTER AE
            || u == (gunichar)0xf8 // ø U+00F8 LATIN SMALL LETTER O WITH STROKE
            || u == (gunichar)0xe5 // å U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
            /* notice that for me ы & ê are also vowels but œ is a ligature ... */
           )
            count++;
    }
    return count;
}
I'm not sure the name of my function is correct, but you told me in comments that Norwegian (which I don't know) has no more vowel characters than those my function counts.
I deliberately did not put UTF-8 in literal strings or wide char literals (only in comments). There are other, obsolete character encodings (read about EBCDIC or KOI8), and you might want to cross-compile the code.
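For the original task of replacing or removing vowels, a library-free variant is also possible. The following is only a sketch under stated assumptions: the input is valid UTF-8, the replacement is a single ASCII character, and the names `VOWELS`, `vowel_length_at` and `replace_utf8_vowels` are invented for this example. It stores each vowel as a short byte string, which is also why the attempted `char extended[3]` initializer is rejected: an array of strings needs the type `const char *[]`, not `char[]`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Each vowel is a short UTF-8 byte string, not a single char.
   The non-ASCII ones are spelled as hex escapes to keep the source ASCII. */
static const char *const VOWELS[] = {
    "a", "e", "i", "o", "u", "y",
    "\xc3\xa6", /* ae,            U+00E6 */
    "\xc3\xb8", /* o with stroke, U+00F8 */
    "\xc3\xa5"  /* a with ring,   U+00E5 */
};
#define VOWEL_COUNT (sizeof VOWELS / sizeof VOWELS[0])

/* If a vowel starts at `s`, return its length in bytes, otherwise 0. */
static size_t vowel_length_at(const char *s)
{
    for (size_t i = 0; i < VOWEL_COUNT; ++i) {
        size_t len = strlen(VOWELS[i]);
        if (strncmp(s, VOWELS[i], len) == 0)
            return len;
    }
    return 0;
}

/* Copy `src` to `dst`, writing `replacement` in place of each vowel.
   Pass '\0' as `replacement` to remove vowels instead.
   `replacement` is assumed to be ASCII, so the output never grows:
   `dst` needs room for strlen(src) + 1 bytes. */
void replace_utf8_vowels(const char *src, char *dst, char replacement)
{
    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        size_t len = vowel_length_at(src);
        if (len == 0) {
            *dst++ = *src++;   /* not a vowel: copy one byte unchanged */
        } else {
            if (replacement != '\0')
                *dst++ = replacement;
            src += len;        /* skip every byte of the vowel */
        }
    }
    *dst = '\0';
}
```

Matching byte strings is safe here because a UTF-8 lead byte such as 0xc3 never occurs in the middle of another character, so a two-byte vowel cannot be found at a misaligned position. Skipping both bytes of the vowel is what prevents the stray leftover byte described in the question.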