R: Corpus garbles UTF-8-encoded text

Date: 2022-03-01 10:37:33

I am simply trying to create a corpus from Russian, UTF-8 encoded text. The problem is, the Corpus method from the tm package is not encoding the strings correctly.

Here is a reproducible example of my problem:

Load in the Russian text:

> data <- c("Renault Logan, 2005","Складское помещение, 345 м²",
          "Су-шеф","3-к квартира, 64 м², 3/5 эт.","Samsung galaxy S4 mini GT-I9190 (чёрный)")

Create a VectorSource:

> vs <- VectorSource(data)
> vs # outputs correctly

Then, create the corpus:

> corp <- Corpus(vs)
> inspect(corp) # output is not encoded properly

The output that I get is:

> inspect(corp)
<<VCorpus (documents: 5, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
Renault Logan, 2005

[[2]]
<<PlainTextDocument (metadata: 7)>>
Ñêëàäñêîå ïîìåùåíèå, 345 ì<U+00B2>

[[3]]
<<PlainTextDocument (metadata: 7)>>
Ñó-øåô

[[4]]
<<PlainTextDocument (metadata: 7)>>
3-ê êâàðòèðà, 64 ì<U+00B2>, 3/5 ýò.

[[5]]
<<PlainTextDocument (metadata: 7)>>
Samsung galaxy S4 mini GT-I9190 (÷¸ðíûé)

Why does it output incorrectly? There doesn't seem to be any option to set the encoding on the Corpus method. Is there a way to set it after the fact? I have tried this:

> corp <- tm_map(corp, enc2utf8)
Error in FUN(X[[1L]], ...) : argument is not a character vector

But, it errors as shown.

3 Answers

#1


6  

Well, there seems to be good news and bad news.

The good news is that the data appears to be fine even if it doesn't display correctly with inspect(). Try looking at

content(corp[[2]])
# [1] "Складское помещение, 345 м²"

The reason it looks funny in inspect() is that the authors changed the way the print.PlainTextDocument function works. It formerly would cat() the value to the screen. Now, however, they feed the data through writeLines(). This function uses the system locale to format the characters/bytes in the document (you can check it with Sys.getlocale()). It turns out Linux and OS X have a proper "UTF-8" encoding, but Windows uses language-specific code pages, so characters that aren't in the code page get escaped or translated to funny characters. This means it should work just fine on a Mac, but not on a PC.
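
A quick way to convince yourself the data itself is intact (a sketch; the exact Encoding() result depends on how the strings were entered) is to look at the declared encoding and the raw bytes rather than at the printed output:

```r
x <- "Складское помещение"

Encoding(x)                  # how R has tagged the string ("UTF-8" or "unknown")
nchar(x)                     # counts characters, not bytes, when the text is valid
charToRaw(substr(x, 1, 1))   # a Cyrillic letter is a two-byte sequence in UTF-8
Sys.getlocale("LC_CTYPE")    # the locale writeLines() formats output with
```

If the byte sequences look right, only the display is broken, not the corpus.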

Try going a step further and building a DocumentTermMatrix:

dtm <- DocumentTermMatrix(corp)
Terms(dtm)

Hopefully you will see (as I do) the words correctly displayed.

If you like, this article about writing UTF-8 files on Windows has more information about this OS-specific issue. I see no easy way to get writeLines() to output UTF-8 to stdout() on Windows. I'm not sure why the package maintainers changed the print method, but one might ask, or submit a feature request to change it back.
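
For writing UTF-8 to a file (as opposed to stdout), a minimal base-R sketch is to open the connection with an explicit encoding, which bypasses the code-page translation:

```r
# Open the connection with encoding = "UTF-8" so the bytes written are
# UTF-8 regardless of the Windows code page in effect.
path <- tempfile(fileext = ".txt")
con  <- file(path, open = "w", encoding = "UTF-8")
writeLines("Складское помещение, 345 м²", con)
close(con)

# Declaring the encoding again when reading round-trips the text.
readLines(path, encoding = "UTF-8")
```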

#2


3  

I'm surprised the answer has not been posted yet. Don't bother messing with the locale. I'm using tm package version 0.6.0 and it works absolutely fine, provided you add the following little piece of magic:

Encoding(data)  <- "UTF-8"

Well, here is the reproducible code:

data <- c("Renault Logan, 2005","Складское помещение, 345 м²","Су-шеф","3-к квартира, 64 м², 3/5 эт.","Samsung galaxy S4 mini GT-I9190 (чёрный)")

Encoding(data)
# [1] "unknown" "unknown" "unknown" "unknown" "unknown"

Encoding(data) <- "UTF-8"
Encoding(data)
# [1] "unknown" "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"
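
Putting the pieces together, the whole workflow would look something like this (a sketch, assuming tm >= 0.6.0 is installed):

```r
library(tm)

data <- c("Renault Logan, 2005", "Складское помещение, 345 м²",
          "Су-шеф", "3-к квартира, 64 м², 3/5 эт.",
          "Samsung galaxy S4 mini GT-I9190 (чёрный)")

# Declare the encoding before handing the vector to tm.
Encoding(data) <- "UTF-8"

corp <- Corpus(VectorSource(data))
content(corp[[2]])   # the Cyrillic text comes back intact
```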

Just put it in a text file saved with UTF-8 encoding, then source it normally in R. But do not use source.with.encoding(..., encoding = "UTF-8"); it will throw an error.

I forgot where I learned this trick, but I picked it up somewhere along the way this past week, while surfing the Web trying to learn how to process UTF-8 text in R. Things were a lot cleaner in Python (just convert everything to Unicode!). R's approach is much less straightforward for me, and it did not help that the documentation is sparse and confusing.

#3


1  

I had a problem with German UTF-8 encoding while importing texts. For me, the following one-liner helped:

Sys.setlocale("LC_ALL", "de_DE.UTF-8")

Try running the same with the Russian locale:

Sys.setlocale("LC_ALL", "ru_RU.UTF-8")

Of course, that goes after library(tm) and before creating a corpus.
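
So the full sequence would be something like this (a sketch; locale names vary by OS — "ru_RU.UTF-8" is the usual Linux/macOS form, and Sys.setlocale() returns "" with a warning if the requested locale is not installed):

```r
library(tm)

# Switch to a Russian UTF-8 locale before building the corpus.
# On Windows the locale name differs (Windows uses its own naming scheme).
Sys.setlocale("LC_ALL", "ru_RU.UTF-8")

corp <- Corpus(VectorSource(c("Су-шеф", "Складское помещение, 345 м²")))
```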
