Unicode normalization (form C) in R: converting all accented characters to their single-character form?

In Unicode, accented letters can be represented in two ways: as a single precomposed character, or as the bare letter followed by a combining accent. For example, é (U+00E9) and é (U+0065 U+0301) are usually displayed in the same way.

R renders the following (version 3.0.2, Mac OS 10.7.5):

> "\u00e9"
[1] "é"
> "\u0065\u0301"
[1] "é"

However, as expected, the two strings do not compare equal:

> "\u00e9" == "\u0065\u0301"
[1] FALSE
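
To see why the comparison fails, the code points behind each string can be inspected with base R. This is only a minimal sketch; the variable names `composed` and `decomposed` are made up for illustration:

```r
# Illustrative names, not part of the original post
composed   <- "\u00e9"        # single precomposed code point
decomposed <- "\u0065\u0301"  # base letter followed by a combining acute accent

utf8ToInt(composed)    # 233
utf8ToInt(decomposed)  # 101 769
```

The two strings look the same on screen but hold different code point sequences, which is what `==` compares.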

Is there a function in R that converts such two-code-point letters into their single-character form? Here, for instance, it would collapse "\u0065\u0301" into "\u00e9".

That would be extremely handy for processing large quantities of strings. Plus, the one-character forms can easily be converted to other encodings via iconv -- at least for the usual Latin1 characters -- and are better handled by plot.

Thanks a lot in advance.

1 Answer

#1

Ok, it appears that a package has been developed to enhance and simplify the string manipulation toolbox in R (finally!). It is called stringi and looks very promising. Its documentation is very well written, and in particular I find the pages about encodings and locales much more enlightening than some of the standard R documentation on the subject.

It provides the Unicode normalization functions I was looking for (form C here):

> stri_trans_nfc('\u00e9') == stri_trans_nfc('\u0065\u0301')
[1] TRUE
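
stri_trans_nfc is vectorized, so a whole character vector can be normalized in one call, which suits the "large quantities of strings" case from the question. A sketch, assuming stringi is installed; the sample vector `x` is made up for illustration:

```r
library(stringi)

# Made-up sample: decomposed, precomposed, and plain ASCII spellings
x <- c("caf\u0065\u0301", "caf\u00e9", "cafe")

x_nfc <- stri_trans_nfc(x)  # compose every element to form C

stri_trans_isnfc(x)         # which inputs were already in form C
x_nfc[1] == x_nfc[2]        # the two accented spellings now compare equal

# Once composed, the strings can be passed to iconv as the question suggests
latin1 <- iconv(x_nfc, from = "UTF-8", to = "latin1")
```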

It also contains a smart comparison function which integrates these normalization questions and lessens the pain of having to think about them:

> stri_compare('\u00e9', '\u0065\u0301')
[1] 0
# i.e. equal ;
# otherwise it returns 1 or -1, i.e. greater or lesser, in the alphabetic order.
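
If only a TRUE/FALSE answer is needed rather than an ordering, stringi also offers stri_cmp_equiv, which, as far as I understand its documentation, tests canonical equivalence directly. A sketch, assuming stringi is installed; `a` and `b` are illustrative names:

```r
library(stringi)

a <- "\u00e9"        # precomposed
b <- "\u0065\u0301"  # decomposed

a == b                # base R compares the raw code point sequences
stri_cmp_equiv(a, b)  # canonical equivalence, no explicit normalization step
```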

Thanks to the developers, Marek Gągolewski and Bartek Tartanus, and to Kurt Hornik for the info!
