How to remove accents and tildes from a C++ std::string

I have a problem with a C++ string that contains several words in Spanish, which means many of them have accents and tildes. I want to replace those letters with their unaccented counterparts. For example, I want to replace "había" with "habia". I tried to do this directly with the replace method of the string class, but I could not get it to work.

I'm using this code:

for (it= dictionary.begin(); it != dictionary.end(); it++)
{
    strMine=(it->first);
    found=toReplace.find_first_of(strMine);
    while (found!=std::string::npos)
    {
        strAux=(it->second);
        toReplace.erase(found,strMine.length());
        toReplace.insert(found,strAux);
        found=toReplace.find_first_of(strMine,found+1);
    }
}

Where dictionary is a map like this (with more entries):

dictionary.insert ( std::pair<std::string,std::string>("á","a") );
dictionary.insert ( std::pair<std::string,std::string>("é","e") );
dictionary.insert ( std::pair<std::string,std::string>("í","i") );
dictionary.insert ( std::pair<std::string,std::string>("ó","o") );
dictionary.insert ( std::pair<std::string,std::string>("ú","u") );
dictionary.insert ( std::pair<std::string,std::string>("ñ","n") );

and the toReplace string is:

std::string toReplace="á-é-í-ó-ú-ñ-á-é-í-ó-ú-ñ";

I must obviously be missing something, but I can't figure it out. Is there any library I can use?

Thanks,

12 solutions

#1

First, this is a really bad idea: you’re mangling somebody’s language by removing letters. Although the extra dots in words like “naïve” seem superfluous to people who only speak English, there are literally thousands of writing systems in the world in which such distinctions are very important. Writing software to mutilate someone’s speech puts you squarely on the wrong side of the tension between using computers as means to broaden the realm of human expression vs. tools of oppression.

What is the reason you’re trying to do this? Is something further down the line choking on the accents? Many people would love to help you solve that.

That said, libicu can do this for you. Open the transform demo; copy and paste your Spanish text into the “Input” box; enter

NFD; [:M:] remove; NFC

as “Compound 1” and click transform.

(With help from slide 9 of Unicode Transforms in ICU. Slides 29-30 show how to use the API.)

#2

I disagree with the currently "approved" answer. The question makes perfect sense when you are indexing text. Like case-insensitive search, accent-insensitive search is a good idea. "naïve" matches "Naïve" matches "naive" matches "NAİVE" (you do know that an uppercase i is İ in Turkish? That's why you ignore accents)

Now, the best algorithm is hinted at in the approved answer: use NFKD (compatibility decomposition) to decompose accented letters into the base letter and a separate accent, and then remove all accents.

There is little point in recomposing afterwards, though. You have removed most of the sequences that would change, and the others are for all intents and purposes identical anyway. What's the difference between æ in NFKC and æ in NFKD?
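
The "remove all accents" step can be sketched in plain C++ for one narrow case. This is only an illustration under stated assumptions: the input is UTF-8 and has already been decomposed by a Unicode library, and only the Combining Diacritical Marks block (U+0300 to U+036F) is removed. The helper name strip_combining_marks is hypothetical, and the sketch does not perform the decomposition itself.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: removes combining diacritical marks (U+0300..U+036F)
// from UTF-8 text that is ALREADY decomposed. It does not decompose anything.
std::string strip_combining_marks(const std::string& utf8)
{
    std::string out;
    out.reserve(utf8.size());
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (i + 1 < utf8.size()) {
            const unsigned char next = static_cast<unsigned char>(utf8[i + 1]);
            // U+0300..U+033F encode as 0xCC 0x80..0xBF,
            // U+0340..U+036F encode as 0xCD 0x80..0xAF.
            const bool is_mark =
                (lead == 0xCC && next >= 0x80 && next <= 0xBF) ||
                (lead == 0xCD && next >= 0x80 && next <= 0xAF);
            if (is_mark) {
                ++i;  // skip the second byte of the mark as well
                continue;
            }
        }
        out.push_back(utf8[i]);  // keep every other byte unchanged
    }
    return out;
}
```

For instance, the bytes 'i', 0xCC, 0x81 (an "i" followed by a combining acute accent) come out as a plain "i". Precomposed letters such as the two-byte "á" pass through untouched, which is why the decomposition step has to happen first.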

#3

I definitely think you should look into the root of the problem. That is, look for a solution that will allow you to support characters encoded in Unicode or for the user's locale.

That being said, your problem is that you're dealing with multi-byte characters: in UTF-8, each accented letter occupies more than one char in a std::string. There is std::wstring, but I'm not sure I'd use that. For one thing, wide characters aren't meant to handle variable-width encodings. This hole goes deep, so I'll leave it at that.
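
To make the point concrete, here is a small sketch, assuming the text is UTF-8. The bytes are written out explicitly so the example does not depend on the encoding of the source file; the names a_acute, e_acute, first_of and whole are made up for illustration.

```cpp
#include <cstddef>
#include <string>

// UTF-8 encodings written as explicit bytes.
const std::string a_acute = "\xC3\xA1";  // "á" (U+00E1), two bytes
const std::string e_acute = "\xC3\xA9";  // "é" (U+00E9), two bytes

// find_first_of treats `needle` as a SET of chars, so a match on any
// single byte of `needle` counts.
std::size_t first_of(const std::string& text, const std::string& needle)
{
    return text.find_first_of(needle);
}

// find looks for `needle` as a whole substring.
std::size_t whole(const std::string& text, const std::string& needle)
{
    return text.find(needle);
}
```

With these definitions, a_acute.size() is 2, first_of(e_acute, a_acute) is 0 because both letters share the lead byte 0xC3, and whole(e_acute, a_acute) is std::string::npos. That shared byte is why a loop built on find_first_of ends up replacing the wrong characters.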

Now, as for the rest of your code, it is error-prone because you mix the looping logic with the translation logic. Thus, at least two kinds of bugs can occur: translation bugs and looping bugs. Do use the STL; it can help you a lot with the looping part.

The following is a rough solution for replacing characters in a string.

main.cpp:

#include <iostream>
#include <string>
#include <iterator>
#include <algorithm>
#include "translate_characters.h"

using namespace std;

int main()
{
    string text;
    cin.unsetf(ios::skipws);
    transform(istream_iterator<char>(cin), istream_iterator<char>(),
              inserter(text, text.end()), translate_characters());
    cout << text << endl;
    return 0;
}

translate_characters.h:

#ifndef TRANSLATE_CHARACTERS_H
#define TRANSLATE_CHARACTERS_H

#include <map>

// Function object that maps one char to its replacement.
// No base class is needed: std::unary_function was removed in C++17.
class translate_characters {
public:
    translate_characters();
    char operator()(const char c);

private:
    std::map<char, char> characters_map;
};

#endif // TRANSLATE_CHARACTERS_H

translate_characters.cpp:

#include "translate_characters.h"

using namespace std;

translate_characters::translate_characters()
{
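    // Placeholder entry. A char holds a single byte, so this table can only
    // map characters that fit in one byte in the encoding you are using.
    characters_map.insert(make_pair('e', 'a'));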
}

char translate_characters::operator()(const char c)
{
    map<char, char>::const_iterator translation_pos(characters_map.find(c));
    if( translation_pos == characters_map.end() )
        return c;
    return translation_pos->second;
}

#4

I'm surprised some people say you shouldn't deaccentuate characters. Having accents on characters in filenames can get you into a lot of problems when using programs manifestly written by programmers who didn't allow for this.

#5

I'm totally 100% in favour of using Unicode and not losing important information such as accents, but sometimes you need to do something like this. It's best not to second-guess people's reasons for wanting a particular function. In my case, I'm looking to do this for the purposes of searching for "similar" texts (which often means texts written - incorrectly - without accents).

Someone will always have a valid reason.

#6

You might want to check out the boost (http://www.boost.org/) library.

It has a regex library, which you could use. In addition, it has a library of string-manipulation functions, including replace.

#7

I'm using Unix (I forgot to mention that), but when I run tr like this:

$ tr áéíóú aeiou
á-é-í-ó-ú
ue-uo-uu-uu-uu

it does not work as expected. I think it has to do with Unicode and the string class.

#8

The thing is that I am developing an application for university that is due in 5 days. It's a program that will index the text inside the tag in HTML pages (I also can't use Apache Lucene to create the index). However, I won't be indexing all the words: I must remove all stopwords, apply stemming, and convert all the text to lowercase. At our teacher's request, we must also eliminate accents and tildes from the words. I hope this makes things a little clearer.

Saludos,

#9

Try using std::wstring instead of std::string. UTF-16 should work (as opposed to ASCII).
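
A minimal sketch of what this could look like, assuming the text has already been converted to wide characters and uses precomposed letters (a letter followed by a separate combining accent is not handled). The function name remove_accents_wide and the table contents are illustrative only. Note that wchar_t is 16 bits on Windows and typically 32 bits on Linux, so this works one code unit at a time.

```cpp
#include <map>
#include <string>

// Hypothetical helper: replaces precomposed accented letters in a wide
// string, one wchar_t at a time. Only the letters listed here are handled.
std::wstring remove_accents_wide(std::wstring text)
{
    static const std::map<wchar_t, wchar_t> table = {
        {L'\u00E1', L'a'},  // á
        {L'\u00E9', L'e'},  // é
        {L'\u00ED', L'i'},  // í
        {L'\u00F3', L'o'},  // ó
        {L'\u00FA', L'u'},  // ú
        {L'\u00F1', L'n'},  // ñ
    };
    for (wchar_t& ch : text) {
        const auto it = table.find(ch);
        if (it != table.end()) {
            ch = it->second;  // swap in the unaccented letter
        }
    }
    return text;
}
```

Because every listed letter fits in a single wchar_t, a per-character lookup is enough here, which is the simplification this answer is after.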

#10

If you can (if you're running Unix), I suggest using the tr facility for this: it's custom-built for this purpose. Remember, no code == no buggy code. :-)

Edit: Sorry, you're right, tr doesn't seem to work. How about sed? It's a pretty stupid script I've written, but it works for me.

#!/bin/sed -f
s/á/a/g;
s/é/e/g;
s/í/i/g;
s/ó/o/g;
s/ú/u/g;
s/ñ/n/g;

#11

I could not link the ICU libraries, but I still think it's the best solution. As I need this program to be functional as soon as possible, I made a little program (that I have to improve) and I'm going to use that. Thank you all for the suggestions and answers.

Here's the code I'm gonna use:

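// For each (accented, plain) pair, replace every occurrence in toReplace.
for (it = dictionary.begin(); it != dictionary.end(); ++it)
{
    strMine = it->first;   // accented sequence, e.g. "á" (2 bytes in UTF-8)
    strAux = it->second;   // plain replacement, e.g. "a"
    found = toReplace.find(strMine);  // find() matches the whole substring
    while (found != std::string::npos)
    {
        // Use the key's real length rather than a hard-coded 2.
        toReplace.replace(found, strMine.length(), strAux);
        found = toReplace.find(strMine, found + strAux.length());
    }
}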
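
For reference, the same idea can be wrapped in a self-contained function. This is a sketch, not the exact program from this post: the name replace_all is hypothetical, and the keys are assumed to be UTF-8 byte sequences. The key difference from the code in the question is find, which matches the whole key, instead of find_first_of, which matches any single byte of it.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: replaces every occurrence of each dictionary key
// in `text` with the corresponding value and returns the result.
std::string replace_all(std::string text,
                        const std::map<std::string, std::string>& dictionary)
{
    for (const auto& entry : dictionary) {
        const std::string& from = entry.first;   // e.g. "\xC3\xA1" ("á")
        const std::string& to = entry.second;    // e.g. "a"
        if (from.empty()) {
            continue;  // an empty key would match at every position
        }
        std::size_t found = text.find(from);
        while (found != std::string::npos) {
            text.replace(found, from.length(), to);
            // Resume after the replacement so it is never rescanned.
            found = text.find(from, found + to.length());
        }
    }
    return text;
}
```

Using from.length() instead of a fixed 2 keeps the function correct if a key ever has a different byte length, for example a three-byte UTF-8 character.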

I will change it next time I have to turn my program in for correction (in about 6 weeks).

#12

    /// <summary>
    /// 
    /// Replace any accent and foreign character by their ASCII equivalent.
    /// In other words, convert a string to an ASCII-compliant string.
    ///
    /// This also gets rid of special hidden characters, like EOF, NUL, TAB and other '\0', except \n\r
    /// 
    /// Tests with accents and foreign characters:
    /// Before: "äæǽaeöœoeüueÄAeÜUeÖOeÀÁÂÃÄÅǺĀĂĄǍΑΆẢẠẦẪẨẬẰẮẴẲẶАAàáâãåǻāăąǎªαάảạầấẫẩậằắẵẳặаaБBбbÇĆĈĊČCçćĉċčcДDдdÐĎĐΔDjðďđδdjÈÉÊËĒĔĖĘĚΕΈẼẺẸỀẾỄỂỆЕЭEèéêëēĕėęěέεẽẻẹềếễểệеэeФFфfĜĞĠĢΓГҐGĝğġģγгґgĤĦHĥħhÌÍÎÏĨĪĬǏĮİΗΉΊΙΪỈỊИЫIìíîïĩīĭǐįıηήίιϊỉịиыїiĴJĵjĶΚКKķκкkĹĻĽĿŁΛЛLĺļľŀłλлlМMмmÑŃŅŇΝНNñńņňŉνнnÒÓÔÕŌŎǑŐƠØǾΟΌΩΏỎỌỒỐỖỔỘỜỚỠỞỢОOòóôõōŏǒőơøǿºοόωώỏọồốỗổộờớỡởợоoПPпpŔŖŘΡРRŕŗřρрrŚŜŞȘŠΣСSśŝşșšſσςсsȚŢŤŦτТTțţťŧтtÙÚÛŨŪŬŮŰŲƯǓǕǗǙǛŨỦỤỪỨỮỬỰУUùúûũūŭůűųưǔǖǘǚǜυύϋủụừứữửựуuÝŸŶΥΎΫỲỸỶỴЙYýÿŷỳỹỷỵйyВVвvŴWŵwŹŻŽΖЗZźżžζзzÆǼAEßssĲIJĳijŒOEƒf'ξksπpβvμmψpsЁYoёyoЄYeєyeЇYiЖZhжzhХKhхkhЦTsцtsЧChчchШShшshЩShchщshchЪъЬьЮYuюyuЯYaяya"
    /// After:  "aaeooeuueAAeUUeOOeAAAAAAAAAAAAAAAAAAAAAAAaaaaaaaaaaaaaaaaaaaaaaaBbCCCCCCccccccDdDDjddjEEEEEEEEEEEEEEEEEEeeeeeeeeeeeeeeeeeeFfGGGGGgggggHHhhIIIIIIIIIIIIIiiiiiiiiiiiiJJjjKKkkLLLLllllMmNNNNNnnnnnOOOOOOOOOOOOOOOOOOOOOOooooooooooooooooooooooPpRRRRrrrrSSSSSSssssssTTTTttttUUUUUUUUUUUUUUUUUUUUUUUUuuuuuuuuuuuuuuuuuuuuuuuYYYYYYYYyyyyyyyyVvWWwwZZZZzzzzAEssIJijOEf'kspvmpsYoyoYeyeYiZhzhKhkhTstsChchShshShchshchYuyuYaya"
    /// 
    /// Tests with invalid 'special hidden characters':
    /// Before: "\0\0\000\0000Bj��rk�\'\"\\\0\a\b\f\n\r\t\v\u0020���oacu\'\\\'te�"
    /// After:  "00000Bjrk'\"\\\n\r oacu'\\'te"
    /// 
    /// </summary>
    private string Normalize(string StringToClean)
    {
        string normalizedString = StringToClean.Normalize(NormalizationForm.FormD);
        StringBuilder Buffer = new StringBuilder(StringToClean.Length);

        for (int i = 0; i < normalizedString.Length; i++)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(normalizedString[i]) != UnicodeCategory.NonSpacingMark)
            {
                Buffer.Append(normalizedString[i]);
            }
        }

        string PreAsciiCompliant = Buffer.ToString().Normalize(NormalizationForm.FormC);
        StringBuilder AsciiComplient = new StringBuilder(PreAsciiCompliant.Length);

        foreach (char character in PreAsciiCompliant)
        {
            //Reject all special characters except \n\r (Carriage-Return and Line-Feed). 
            //Get rid of special hidden character, like EOF, NUL, TAB and other '\0'
            if (((int)character >= 32 && (int)character < 127) || ((int)character == 10 || (int)character == 13)) 
            {
                AsciiComplient.Append(character);
            }
        }
        return AsciiComplient.ToString().Trim(); // Remove spaces at start and end of string if any
    }
