How to avoid diacritic/accent-sensitivity issues

Time: 2022-03-08 20:13:10

I'm writing a tiny program for guessing the capitals of countries. Some of the capitals contain accents, cedillas, etc.

Since I have to compare the capital with the text the user guessed, and I don't want an accent to break the comparison, I searched the internet for a way to do that.

I came across countless solutions for other programming languages, but only a couple of results about C.

None of them actually worked for me. I did, however, come to the conclusion that I'd have to use the wchar.h header to deal with those annoying characters.

I wrote this tiny bit of code (which replaces É with E) just to test the approach, and contrary to everything I have read and understood, it doesn't work: even printing the wide-character string doesn't show the diacritic characters. If it worked, I'm sure I could apply it to the capitals program, so I'd appreciate it if someone could tell me what's wrong.

#include<stdio.h>
#include<locale.h>
#include<wchar.h>

const wchar_t CAPITAL_ACCUTE_E = L'\u00C9';

int main()
{
    wchar_t wbuff[128];
    setlocale(LC_ALL,"");
    fputws(L"Say something: ", stdout);
    fgetws(wbuff, 128, stdin);
    int n;
    int len = wcslen(wbuff);
    for(n=0;n<len;n++)
        if(wbuff[n] == CAPITAL_ACCUTE_E)
            wbuff[n] = L'E';
    wprintf(L"%ls\n", wbuff);
    return 0;
}

1 solution

#1


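An issue you overlooked is that É can be represented as either the single precomposed code point U+00C9 (LATIN CAPITAL LETTER E WITH ACUTE) or the base letter U+0045 (LATIN CAPITAL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT).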

You need to account for this. This can be done by mapping both strings to NFD (Normalization Form D, canonical decomposition). After that, you can strip away the combining characters produced by the decomposition and be left with the E, which you can then compare with strcmp as usual.
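
To make the problem concrete, here is a minimal sketch in plain C, assuming UTF-8 strings. It spells out the two byte sequences for É and shows only the "strip" half of the recipe: the helper strip_combining_marks is a hypothetical name, it handles only the Combining Diacritical Marks block (U+0300 to U+036F), and it does not decompose precomposed characters, which is the part that needs Unicode tables from a library.

```c
#include <stddef.h>
#include <string.h>

/* Precomposed É: U+00C9, encoded in UTF-8 as the bytes C3 89. */
const char E_ACUTE_PRECOMPOSED[] = "\xC3\x89";
/* Decomposed É: 'E' (U+0045) followed by U+0301, encoded as 45 CC 81. */
const char E_ACUTE_DECOMPOSED[] = "E\xCC\x81";

/*
 * Hypothetical helper (not part of any library): removes, in place, every
 * code point in the range U+0300..U+036F from a valid NUL-terminated UTF-8
 * string. In UTF-8 those marks are the two-byte sequences CC 80..CC BF and
 * CD 80..CD AF. Returns the new length in bytes.
 */
size_t strip_combining_marks(char *s)
{
    unsigned char *src = (unsigned char *)s;
    unsigned char *dst = (unsigned char *)s;

    while (*src != '\0') {
        int is_mark =
            (src[0] == 0xCC && src[1] >= 0x80 && src[1] <= 0xBF) ||
            (src[0] == 0xCD && src[1] >= 0x80 && src[1] <= 0xAF);
        if (is_mark) {
            src += 2;           /* skip the whole two-byte mark */
        } else {
            *dst++ = *src++;    /* copy every other byte unchanged */
        }
    }
    *dst = '\0';
    return (size_t)(dst - (unsigned char *)s);
}
```

The two constants look identical when printed but differ under strcmp. After strip_combining_marks, the decomposed form becomes a plain "E", while the precomposed form is left untouched, which is why both strings have to be decomposed first.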

Assuming you've got UTF-8 encoded input, here is how you could do it with utf8proc:

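#include <stdlib.h>
#include <utf8proc.h>

/* input is assumed to be a NUL-terminated UTF-8 string */
utf8proc_uint8_t *output;
utf8proc_ssize_t len = utf8proc_map((const utf8proc_uint8_t *)input, 0, &output,
                           UTF8PROC_NULLTERM | UTF8PROC_STABLE |
                           UTF8PROC_STRIPMARK | UTF8PROC_DECOMPOSE |
                           UTF8PROC_CASEFOLD
                          );
/* len < 0 signals an error; otherwise output is malloc'd, so free(output) when done */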

This would turn É in either representation, as well as a plain E, into a plain e (UTF8PROC_CASEFOLD also folds the case).
