I'm writing a small program for guessing the capitals of countries. Some of the capitals have accents, cedillas, etc.
Since I have to compare the capital with the text the user guessed, and I don't want an accent to mess up the comparison, I went searching the internet for a way to do that.
I came across countless solutions for other programming languages, but only a couple of results for C.
None of them actually worked for me. I did, however, come to the conclusion that I'd have to use the wchar.h library to deal with those annoying characters.
I wrote this small piece of code (which replaces É with E) just to test the method, and contrary to everything I've read and understood, it doesn't work: even printing the wide-character string doesn't show the diacritic characters. If it worked, I'm sure I could apply it to the capitals program, so I'd appreciate it if someone could tell me what's wrong.
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

const wchar_t CAPITAL_ACCUTE_E = L'\u00C9';

int main(void)
{
    wchar_t wbuff[128];

    setlocale(LC_ALL, "");  /* use the environment's locale for wide I/O */
    fputws(L"Say something: ", stdout);
    if (fgetws(wbuff, 128, stdin) == NULL)
        return 1;           /* no input could be read */

    size_t len = wcslen(wbuff);
    for (size_t n = 0; n < len; n++)
        if (wbuff[n] == CAPITAL_ACCUTE_E)
            wbuff[n] = L'E';

    wprintf(L"%ls\n", wbuff);
    return 0;
}
1 Solution
#1
An issue you overlooked is that É can be represented as
- É - LATIN CAPITAL LETTER E WITH ACUTE, codepoint U+00C9 (c3 89 in UTF-8), or
- É - LATIN CAPITAL LETTER E followed by COMBINING ACUTE ACCENT, codepoints U+0045 U+0301 (45 cc 81 in UTF-8)
You need to account for this. One way is to map both strings to NFD (Normalization Form D, canonical decomposition). After that, you can strip away the decomposed combining characters and be left with the E, which you can then compare with strcmp as usual.
Assuming you've got UTF-8 encoded input, here is how you could do it with utf8proc:
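#include <stdlib.h>
#include <utf8proc.h>

/* input: a NUL-terminated UTF-8 string */
utf8proc_uint8_t *output = NULL;
utf8proc_ssize_t len = utf8proc_map((const utf8proc_uint8_t *)input, 0, &output,
    UTF8PROC_NULLTERM | UTF8PROC_STABLE |
    UTF8PROC_STRIPMARK | UTF8PROC_DECOMPOSE |
    UTF8PROC_CASEFOLD
);
/* len < 0 signals an error; otherwise free(output) once you are done with it */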
This would turn all of É (precomposed), É (decomposed) and E into a plain e (lower-case because of UTF8PROC_CASEFOLD).