How to avoid diacritic/accent sensitivity issues

I'm writing a small program for guessing the capitals of countries. Some of the capitals have accents, cedillas, etc.

Since I have to compare the capital with the text the user guessed, and I don't want an accent to mess up the comparison, I went digging around the internet for a way to accomplish that.

I came across countless solutions for other programming languages, but only a couple of results about C.

None of them actually worked for me. I did, however, come to the conclusion that I'd have to use wchar.h to deal with those annoying characters.

I wrote this small piece of code (which replaces É with E) just to test the approach and, contrary to everything I've read and understood, it doesn't work; even printing the wide-character string doesn't show the accented characters. If it worked, I'm sure I could apply it to the capitals program, so I'd appreciate it if someone could tell me what's wrong.

#include<stdio.h>
#include<locale.h>
#include<wchar.h>

const wchar_t CAPITAL_ACCUTE_E = L'\u00C9';

int main()
{
    wchar_t wbuff[128];
    setlocale(LC_ALL,"");
    fputws(L"Say something: ", stdout);
    fgetws(wbuff, 128, stdin);
    int n;
    int len = wcslen(wbuff);
    for(n=0;n<len;n++)
        if(wbuff[n] == CAPITAL_ACCUTE_E)
            wbuff[n] = L'E';
    wprintf(L"%ls\n", wbuff);
    return 0;
}

1 Answer

#1

One issue you overlooked is that É can be represented either as

É - LATIN CAPITAL LETTER E WITH ACUTE, codepoint U+00C9 (c3 89 in UTF-8), or
É - LATIN CAPITAL LETTER E followed by COMBINING ACUTE ACCENT, codepoints U+0045 U+0301 (45 cc 81 in UTF-8)

You need to account for this. This can be done by mapping both strings to NFD (Normalization Form D, canonical decomposition). After that, you can strip the combining characters so that only the base letter E remains, and then compare the results with strcmp as usual.

Assuming you have UTF-8 encoded input, here is how you could do it with utf8proc:

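#include <stdlib.h>     /* free() */
#include <utf8proc.h>

/* input is assumed to be a NUL-terminated UTF-8 string */
utf8proc_uint8_t *output = NULL;
utf8proc_ssize_t len = utf8proc_map((const utf8proc_uint8_t *)input, 0, &output,
                           UTF8PROC_NULLTERM | UTF8PROC_STABLE |
                           UTF8PROC_STRIPMARK | UTF8PROC_DECOMPOSE |
                           UTF8PROC_CASEFOLD
                          );
/* len < 0 signals an error; otherwise free(output) once you are done */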

This would turn all of É, É and E into a plain e.
