How to source() an .R file saved using UTF-8 encoding?

Time: 2022-01-22 13:18:05

The following works fine when copied and pasted directly into R:

> character_test <- function() print("R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示...")
> character_test()
[1] "R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示..."

However, if I make a file called character_test.R containing the EXACT SAME code, save it in UTF-8 encoding (so as to retain the special Chinese characters), then when I source() it in R, I get the following error:

> source(file="C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8")
Error in source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "utf-8") : 
  C:\Users\Tony\Desktop\character_test.R:3:0: unexpected end of input
1: character.test <- function() print("R
2: 
  ^
In addition: Warning message:
In source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8") :
  invalid input found on input connection 'C:\Users\Tony\Desktop\character_test.R'

Any help in solving this, and in understanding what is going on here, would be much appreciated.

> sessionInfo() # Windows 7 Pro x64
R version 2.12.1 (2010-12-16)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

loaded via a namespace (and not attached):
[1] tools_2.12.1

and

> l10n_info()
$MBCS
[1] FALSE

$`UTF-8`
[1] FALSE

$`Latin-1`
[1] TRUE

$codepage
[1] 1252

7 answers

#1


19  

We discussed this at length in the comments on my previous post, but I don't want it to get lost on page 3 of the comments: you have to set the locale. It works both with input from the R console (see the screenshot in the comments) and with input from a file, as shown here:

The file "myfile.r" contains:

russian <- function() print ("Американские с...");

The console contains:

source("myfile.r", encoding="utf-8")
> Error in source(".....
Sys.setlocale("LC_CTYPE","ru")
> [1] "Russian_Russia.1251"
russian()
[1] "Американские с..."

Note that reading the file fails at first, and the error points to the same character as in the original poster's error (the one after "R). I cannot do this with Chinese because I would have to install "Microsoft Pinyin IME 3.0", but the process is the same: you just replace the locale with "chinese" (the naming is a bit inconsistent, so consult the documentation).
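
If you go the locale route, it may be worth restoring the previous locale afterwards. Here is a minimal sketch of that idea. The helper name with_ctype is my own and not something from R or from this thread, and the locale names you can pass are platform-specific:

```r
# Hypothetical helper (not part of R): evaluate `code` under a temporary
# LC_CTYPE locale, then put the previous locale back.
with_ctype <- function(locale, code) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  # Sys.setlocale() returns "" (with a warning) when the request fails.
  new <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  if (identical(new, "")) {
    stop("locale not available on this system: ", locale)
  }
  # `code` is a lazily evaluated argument, so it runs only now,
  # after the locale has been switched.
  force(code)
}

# Usage on Windows, with the file name from the answer above:
# with_ctype("chinese", source("myfile.r", encoding = "UTF-8"))
```

Whether "chinese" or "ru" is accepted depends on the operating system, so treat those names as examples rather than portable values.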

#2


21  

On R/Windows, source runs into problems with any UTF-8 characters that can't be represented in the current locale (or ANSI Code Page in Windows-speak). And unfortunately Windows doesn't have UTF-8 available as an ANSI code page--Windows has a technical limitation that ANSI code pages can only be one- or two-byte-per-character encodings, not variable-byte encodings like UTF-8.

This doesn't seem to be a fundamental, unsolvable problem--there's just something wrong with the source function. You can get 90% of the way there by doing this instead:

eval(parse(filename, encoding="UTF-8"))

This will work almost exactly like source() with default arguments, but won't let you use echo=TRUE, print.eval=TRUE, etc.
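
For anyone who wants to try this end to end, here is a small self-contained sketch. The temporary file and its contents are made up for the example, and the loop at the end is only a rough stand-in for echo=TRUE, not a reimplementation of what source() does:

```r
# Write a tiny UTF-8 script to a temporary file. The \u escapes keep this
# example ASCII-only; they spell the first characters of the question's string.
script <- c(
  "character_test <- function() print(\"R\u540c\u65f6\u4e5f\u88ab\")",
  "x <- 1 + 1",
  "x * 2"
)
path <- tempfile(fileext = ".R")
writeLines(enc2utf8(script), path, useBytes = TRUE)

# Parse the file as UTF-8 instead of calling source().
exprs <- parse(path, encoding = "UTF-8")

# Rough stand-in for echo = TRUE: show each expression, evaluate it in the
# global environment, and print the value only when it is visible.
for (e in exprs) {
  cat("> ", paste(deparse(e), collapse = "\n"), "\n", sep = "")
  res <- withVisible(eval(e, envir = globalenv()))
  if (res$visible) print(res$value)
}
```

How the non-ASCII text is displayed by deparse() and print() still depends on the session's locale.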

#3


5  

On Windows, I do the following:

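source.utf8 <- function(f) {
    # Read the lines and mark them as UTF-8 (the input is not re-encoded).
    l <- readLines(f, encoding="UTF-8")
    # Parse the text and evaluate it in the global environment.
    eval(parse(text=l),envir=.GlobalEnv)
}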

It works fine.
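
A possible variation, in case you don't always want the definitions to land in the global environment: the same readLines/parse approach with the target environment as an argument. The name source_utf8_into and the sample script are assumptions made up for this sketch:

```r
# Hypothetical variant of source.utf8(): same approach, but the target
# environment is a parameter instead of being fixed to .GlobalEnv.
source_utf8_into <- function(f, envir = globalenv()) {
  lines <- readLines(f, encoding = "UTF-8", warn = FALSE)
  invisible(eval(parse(text = lines), envir = envir))
}

# Usage: load a small UTF-8 script into a scratch environment and
# list what it defined.
path <- tempfile(fileext = ".R")
writeLines(enc2utf8("greeting <- \"\u4f60\u597d\""), path, useBytes = TRUE)
scratch <- new.env()
source_utf8_into(path, envir = scratch)
defined <- ls(scratch)
print(defined)
```

Keeping the default at globalenv() means it behaves like the original function unless you ask for something else.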

#4


3  

I think the problem lies with R. I can happily source UTF-8 files, or UCS-2LE files, that contain many non-ASCII characters. But some characters cause it to fail. For example, the following

danish <- function() print("Skønt H. C. Andersens barndomsomgivelser var meget fattige, blev de i hans rige fantasi solbeskinnede.")
croatian <- function() print("Dodigović. Kako se Vi zovete?")
new_testament <- function() print("Ne provizu al vi trezorojn sur la tero, kie tineo kaj rusto konsumas, kaj jie ŝtelistoj trafosas kaj ŝtelas; sed provizu al vi trezoron en la ĉielo")
russian <- function() print ("Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями.")

is fine in both UTF-8 and UCS-2LE without the Russian line. But if that is included then it fails. I'm pointing the finger at R. Your Chinese text also appears to be too hard for R on Windows.

Locale seems irrelevant here. It's just a file: you tell R what encoding the file is in, so why should your locale matter?

#5


1  

On Windows, when you copy and paste a Unicode or UTF-8 encoded string into a text control that is set to single-byte input (ASCII... depending on locale), the unknown bytes are replaced by question marks. If I take the start of your string (R followed by the first four Chinese characters), paste it into, e.g., Notepad, and save it, the file in hex becomes:

52 3F 3F 3F 3F

What you have to do is find an editor that you can set to UTF-8 before pasting the text into it. The saved file (for the same characters) then becomes:

52 E5 90 8C E6 97 B6 E4 B9 9F E8 A2 AB

This will then be recognized as valid UTF-8 by R.

I used "Notepad2" to try this, but I am sure there are many more.
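
You can also check the bytes from R itself instead of a hex editor. A minimal sketch follows; the temporary file stands in for the script your editor saved, and the variable names are just for illustration:

```r
# The first characters of the question's string, written with \u escapes so
# that this example itself stays ASCII-only.
s <- "R\u540c\u65f6\u4e5f\u88ab"

# Write the UTF-8 bytes to a temporary file (a stand-in for the saved script).
path <- tempfile(fileext = ".R")
writeBin(charToRaw(enc2utf8(s)), path)

# Read the raw bytes back and show them as hex.
bytes <- readBin(path, what = "raw", n = file.size(path))
hex <- paste(toupper(as.character(bytes)), collapse = " ")
print(hex)
# [1] "52 E5 90 8C E6 97 B6 E4 B9 9F E8 A2 AB"

# Bytes above 0x7F mean the multi-byte characters survived; a run of 3F
# bytes where the text should be suggests they were replaced by "?".
non_ascii <- sum(as.integer(bytes) > 127L)
question_marks <- sum(bytes == as.raw(0x3f))
```

Pointing readBin() at your real script instead of the temporary file gives the same kind of check. A legitimate "?" in the code also shows up as 3F, so look at where the 3F bytes sit rather than only counting them.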

#6


0  

I encountered this problem when trying to source a .R file containing some Chinese characters. In my case, I found that merely setting "LC_CTYPE" to "chinese" is not enough, but setting "LC_ALL" to "chinese" works well.

Note that getting the encoding right is not enough when you read or write a plain text file containing non-ASCII characters in RStudio (or R?). The locale setting counts too.

PS. The command is Sys.setlocale(category = "LC_CTYPE", locale = "chinese"). Replace the locale value as appropriate.

#7


0  

Building on crow's answer, this solution makes RStudio's Source button work.

When you hit the Source button, RStudio executes source('myfile.r', encoding = 'UTF-8'), so overriding source makes the errors disappear and runs the code as expected:

source <- function(f, encoding = 'UTF-8') {
    l <- readLines(f, encoding=encoding)
    eval(parse(text=l),envir=.GlobalEnv)
}
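
One caveat with this approach: the override hides base::source() for the rest of the session, and it accepts none of the other arguments (echo, local, chdir, and so on). A small sketch of how to check for the masking and undo it; is_masked and restored are just illustrative variable names:

```r
# The override from the answer, defined in the current (global) environment.
source <- function(f, encoding = "UTF-8") {
  l <- readLines(f, encoding = encoding)
  eval(parse(text = l), envir = .GlobalEnv)
}

# The base version is untouched and can still be called by its full name,
# e.g. base::source("myfile.r", echo = TRUE).
is_masked <- !identical(source, base::source)

# Removing the override makes a plain source() resolve to base::source again.
rm(source)
restored <- identical(source, base::source)
```

If you only need the workaround for some files, giving the function a different name avoids the masking altogether.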
