Importing data with special characters into R

The following picture shows how the data looks in Notepad before I import it into R, and how it looks in R after importing.

I use the following command to import it into R:

Data <- read.csv('data.csv', stringsAsFactors = FALSE, header = TRUE, quote = "")

As you can see, special characters such as æ are replaced with something like Ã¦ (line 19 on the left, line 18 on the right). Is there a way to import the CSV file as it is, using R?

1 solution

#1

Your problem is an encoding issue, and there are two aspects to it. First, the encoding Notepad++ actually used when saving the text file may not be the one you expect. Second, read.csv() may be reading the file under a different encoding. That is especially likely here: using Notepad++ suggests you are on Windows, where you may be unable to use UTF-8 as the system locale for R.
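
As a minimal sketch of the symptom, assuming the file on disk is UTF-8 and is being interpreted as Latin-1 (the question does not confirm this), the two UTF-8 bytes of æ turn into two separate Latin-1 characters. The variable names here are illustrative only.

```r
# UTF-8 stores U+00E6 (ae ligature) as the two bytes c3 a6.
utf8_bytes <- as.raw(c(0xc3, 0xa6))

# Interpreting those same bytes as Latin-1 yields one character per byte.
garbled <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")

# Code points of the result: 195 (A with tilde) and 166 (broken bar),
# which is the two-character sequence seen after a mismatched import.
utf8ToInt(garbled)
```

If the garbling in your data matches this pattern, the file is most likely UTF-8 and R is reading it as a single-byte encoding.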

Taking each issue in turn:

Getting Notepad++ to save your file in a specific encoding. You can set the encoding for the new file using these instructions. I always use UTF-8, but since your texts are Danish, Latin-1 should work too.

To verify the encoding of your texts, you may wish to use the file utility supplied with RTools. Run from the command line, it reports the probable encoding of your file, although its guess is not perfect. (OS X and Linux users already have this utility without installing anything extra.)

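
If the file utility is not at hand, a rough check can also be done from within R. The sketch below is not what the file utility does; it only tests whether the bytes form valid UTF-8, using a hypothetical temporary file that stands in for data.csv. Note that a pure-ASCII file also passes this test, so a TRUE result does not prove the file was saved as UTF-8.

```r
# Hypothetical sample file: the ae ligature is stored as the single
# Latin-1 byte e6, as Notepad++ would write it under a Latin-1 encoding.
path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("name\nK"), as.raw(0xe6), charToRaw("r\n")), path)

# Read the raw bytes and test whether they form valid UTF-8.
bytes <- readBin(path, what = "raw", n = file.size(path))
looks_utf8 <- validUTF8(rawToChar(bytes))

# FALSE here means the file cannot be UTF-8, so a single-byte
# encoding such as Latin-1 or Windows-1252 is the likelier candidate.
looks_utf8
```
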
Setting the encoding when importing the .csv file into R. When you import the file using read.csv(), specify encoding = "UTF-8" or encoding = "latin1". You might also want to check what your system encoding is, and match that. You can check it with Sys.getlocale() and set it with Sys.setlocale(). On my system, for instance:
```
> Sys.getlocale()
[1] "en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8"
```
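
For a narrower view than the full locale string, the following sketch queries only the character-type category and asks R whether the session is running in a UTF-8 locale. The variable names are illustrative; the output depends on your system, so none is shown.

```r
# LC_CTYPE is the locale category that governs character encoding.
ctype <- Sys.getlocale("LC_CTYPE")

# l10n_info() reports, among other things, whether the locale is UTF-8.
is_utf8 <- isTRUE(l10n_info()[["UTF-8"]])

message("LC_CTYPE: ", ctype)
message("UTF-8 locale: ", is_utf8)
```
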
You could of course set the locale to Windows-1252, but you might then have trouble with portability if you use the code on other platforms. UTF-8 is the best solution to this.
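
Putting the import step together, here is a minimal sketch that assumes the file was saved as Latin-1. The temporary file and its contents are hypothetical stand-ins for data.csv. The encoding argument only marks the strings as Latin-1; read.csv() also accepts fileEncoding, which re-encodes the input while reading and may suit your setup better.

```r
# Hypothetical Latin-1 file: byte e6 is U+00E6 and byte f8 is U+00F8.
path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("id,name\n1,K"), as.raw(0xe6),
           charToRaw("r\n2,S"), as.raw(0xf8), charToRaw("ren\n")), path)

# Same call as in the question, plus the encoding argument.
Data <- read.csv(path, stringsAsFactors = FALSE, header = TRUE,
                 quote = "", encoding = "latin1")

# Convert to UTF-8 so later code does not depend on the Latin-1 mark.
names_utf8 <- enc2utf8(Data$name)

# Code points of the first name: K (75), U+00E6 (230), r (114).
utf8ToInt(names_utf8[1])
```

If the file was saved as UTF-8 instead, the same call with encoding = "UTF-8" is the one to try.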
