The following, when copied and pasted directly into R works fine:
> character_test <- function() print("R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示...")
> character_test()
[1] "R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示..."
However, if I create a file called character_test.R containing exactly the same code and save it in UTF-8 encoding (so as to retain the Chinese characters), I get the following error when I source() it in R:
> source(file="C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8")
Error in source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "utf-8") :
C:\Users\Tony\Desktop\character_test.R:3:0: unexpected end of input
1: character.test <- function() print("R
2:
^
In addition: Warning message:
In source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8") :
invalid input found on input connection 'C:\Users\Tony\Desktop\character_test.R'
Any help in solving this, and in understanding what is going on here, would be much appreciated.
> sessionInfo() # Windows 7 Pro x64
R version 2.12.1 (2010-12-16)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
loaded via a namespace (and not attached):
[1] tools_2.12.1
and
> l10n_info()
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] TRUE
$codepage
[1] 1252
7 answers
#1
19
We discussed this at length in the comments on my previous post, but I don't want it to get lost on page 3 of the comments: you have to set the locale. It works both with input from the R console (see the screenshot in the comments) and with input from a file (see this screenshot):
The file "myfile.r" contains:
russian <- function() print ("Американские с...");
The console contains:
source("myfile.r", encoding="utf-8")
> Error in source(".....
Sys.setlocale("LC_CTYPE","ru")
> [1] "Russian_Russia.1251"
russian()
[1] "Американские с..."
Note that sourcing the file fails at first, and the error points to the same character as in the original poster's error (the one after "R). I cannot test this with Chinese because I would have to install "Microsoft Pinyin IME 3.0", but the process is the same: just replace the locale with "chinese" (the naming is a bit inconsistent; consult the documentation).
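As a rough sketch of the locale approach described above, the locale can be switched only for the duration of the source() call and restored afterwards. The helper name source_with_locale is hypothetical, and the locale names accepted by Sys.setlocale() differ between platforms, so short names such as "ru" or "chinese" are assumed to work on Windows only.

```r
# Hypothetical helper (not part of R): source a file under a temporary LC_CTYPE.
source_with_locale <- function(file, locale, encoding = "UTF-8") {
  old <- Sys.getlocale("LC_CTYPE")
  # Restore the previous locale even if source() fails.
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)

  # Sys.setlocale() returns "" (with a warning) when the request is refused.
  new <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  if (is.null(new) || !nzchar(new)) {
    stop("locale '", locale, "' is not available on this system")
  }
  source(file, encoding = encoding)
}

# Example call on Windows (file name taken from the answer above):
# source_with_locale("myfile.r", locale = "ru")
```

Restoring the locale in on.exit() keeps the rest of the session unaffected, which matters if other code depends on the original LC_CTYPE setting.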
#2
21
On R/Windows, source runs into problems with any UTF-8 characters that can't be represented in the current locale (or ANSI code page, in Windows-speak). Unfortunately, Windows doesn't have UTF-8 available as an ANSI code page: Windows has a technical limitation that ANSI code pages can only be one- or two-byte-per-character encodings, not variable-byte encodings like UTF-8.
This doesn't seem to be a fundamental, unsolvable problem; there's just something wrong with the source function. You can get 90% of the way there by doing this instead:
eval(parse(filename, encoding="UTF-8"))
This works almost exactly like source() with default arguments, but won't let you use echo=T, eval.print=T, etc.
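The condition described at the start of this answer, a character that cannot be represented in the current locale, can be probed directly with iconv(). This is only a sketch: is_representable is a hypothetical helper name, and it assumes the platform's iconv reports unconvertible characters (so that iconv() returns NA) rather than silently substituting them, which not every iconv implementation does.

```r
# Hypothetical helper: TRUE if every element of x can be converted from
# UTF-8 to the target encoding (to = "" means the current locale's encoding).
is_representable <- function(x, to = "") {
  !is.na(iconv(enc2utf8(x), from = "UTF-8", to = to))
}

ascii_text   <- "R is also known as GNU S"
chinese_text <- "R\u540c\u65f6\u4e5f\u88ab"  # first characters of the question's string

is_representable(ascii_text)                   # plain ASCII, current locale
is_representable(chinese_text, to = "latin1")  # Latin-1 has no Chinese characters
```

Calling is_representable() with the default to = "" on the text of a script gives a quick hint as to whether source() is likely to hit the problem in the current session.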
#3
5
On Windows, I use:
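source.utf8 <- function(f) {
    # Read the lines and mark them as UTF-8 instead of re-encoding them
    l <- readLines(f, encoding="UTF-8")
    # Parse the text and evaluate it in the global environment
    eval(parse(text=l), envir=.GlobalEnv)
}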
It works fine.
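To try this approach without touching existing scripts, the sketch below writes a small UTF-8 file to a temporary location and evaluates it the same way. The name source_utf8, the envir argument, and the test function greeting are all made up for illustration; passing encoding = "UTF-8" to parse() is an extra precaution that the function above does not use.

```r
# Illustrative variant of the function above with a configurable environment.
source_utf8 <- function(path, envir = globalenv()) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  exprs <- parse(text = lines, encoding = "UTF-8")
  eval(exprs, envir = envir)
  invisible(NULL)
}

# Write a one-line script containing Chinese characters as raw UTF-8 bytes.
tmp  <- tempfile(fileext = ".R")
code <- enc2utf8('greeting <- function() "R\u540c\u65f6"')
writeLines(code, tmp, useBytes = TRUE)

# Evaluate it in a scratch environment so the global workspace stays clean.
env <- new.env()
source_utf8(tmp, envir = env)
env$greeting()
```

Evaluating into a scratch environment is a design choice for experimentation only; the original function targets .GlobalEnv so that it behaves like source().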
#4
3
I think the problem lies with R. I can happily source UTF-8 or UCS-2LE files containing many non-ASCII characters, but some characters cause it to fail. For example, the following
danish <- function() print("Skønt H. C. Andersens barndomsomgivelser var meget fattige, blev de i hans rige fantasi solbeskinnede.")
croatian <- function() print("Dodigović. Kako se Vi zovete?")
new_testament <- function() print("Ne provizu al vi trezorojn sur la tero, kie tineo kaj rusto konsumas, kaj jie ŝtelistoj trafosas kaj ŝtelas; sed provizu al vi trezoron en la ĉielo")
russian <- function() print ("Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями.")
is fine in both UTF-8 and UCS-2LE without the Russian line, but if that line is included, it fails. I'm pointing the finger at R. Your Chinese text also appears to be too hard for R on Windows.
Locale seems irrelevant here. It's just a file; you tell R what encoding the file is in, so why should your locale matter?
#5
1
On Windows, when you copy and paste a Unicode or UTF-8 encoded string into a text control that is set to single-byte input (ASCII, or whatever the locale dictates), the unknown bytes are replaced by question marks. If I take the first few characters of your string ("R" plus the first four Chinese characters), paste them into e.g. Notepad, and save the file, its contents in hex become:
52 3F 3F 3F 3F
What you have to do is find an editor that you can set to UTF-8 before pasting the text into it. The saved file (with the same characters) then becomes:
52 E5 90 8C E6 97 B6 E4 B9 9F E8 A2 AB
R will then recognize this as valid UTF-8.
I used "Notepad2" to try this, but I am sure there are many other suitable editors.
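The byte-level check described above can also be done from within R instead of a hex editor. In this sketch, file_hex is a hypothetical helper, the temporary file stands in for a saved script, and validUTF8() assumes R 3.3.0 or later.

```r
# Hypothetical helper: return the first n bytes of a file as upper-case hex.
file_hex <- function(path, n = 16L) {
  bytes <- readBin(path, what = "raw", n = n)
  toupper(as.character(bytes))
}

# Stand-in for a script saved as UTF-8: "R" plus four Chinese characters.
tmp <- tempfile(fileext = ".R")
writeBin(charToRaw("R\u540c\u65f6\u4e5f\u88ab"), tmp)

# Compare with the hex dumps quoted above; a file saved through a
# single-byte text control would show 3F ("?") bytes after the 52.
file_hex(tmp)

# Check that the bytes form valid UTF-8.
validUTF8(rawToChar(readBin(tmp, what = "raw", n = 16L)))
```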
#6
0
I encountered this problem when trying to source a .R file containing some Chinese characters. In my case, I found that merely setting "LC_CTYPE" to "chinese" was not enough, but setting "LC_ALL" to "chinese" worked well.
Note that getting the encoding right is not enough when you read or write a plain-text file containing non-ASCII characters in RStudio (or R?). The locale setting counts too.
PS: the command is Sys.setlocale(category = "LC_CTYPE", locale = "chinese"). Replace the locale value (and the category, e.g. "LC_ALL") as appropriate.
#7
0
Building on crow's answer, this solution makes RStudio's Source button work.
When you hit the Source button, RStudio executes source('myfile.r', encoding = 'UTF-8'), so overriding source makes the errors disappear and runs the code as expected:
source <- function(f, encoding = 'UTF-8') {
    l <- readLines(f, encoding=encoding)
    eval(parse(text=l), envir=.GlobalEnv)
}