Suppose I have a CSV file that looks like this:
Type,ID,NAME,CONTENT,RESPONSE,GRADE,SOURCE
A,3,"","I have comma, ha!",I have open double quotes",A,""
The desired output is:
df <- data.frame(Type='A', ID=3, NAME=NA, CONTENT='I have comma, ha!',
                 RESPONSE='I have open double quotes\"', GRADE='A', SOURCE=NA)
df
Type ID NAME CONTENT RESPONSE GRADE SOURCE
1 A 3 NA I have comma, ha! I have open double quotes" A NA
I tried read.csv. The data provider quotes strings that contain a comma, but did not escape double quotes in strings that have no comma, so whether or not I disable quoting in read.csv, I can't get the desired output.
How can I do this in R? Solutions using other packages are also welcome.
3 solutions
#1
7
fread from data.table handles this just fine:
library(data.table)
fread('Type,ID,NAME,CONTENT,RESPONSE,GRADE,SOURCE
A,3,"","I have comma, ha!",I have open double quotes",A,""')
# Type ID NAME CONTENT RESPONSE GRADE SOURCE
#1: A 3 I have comma, ha! I have open double quotes" A
#2
2
This is not valid CSV, so you'll have to do your own parsing. But, assuming the convention below, you can toggle scan's quote setting field by field and still take advantage of most of its abilities:
- If the field starts with a quote, it is quoted.
- If the field does not start with a quote, it is raw.
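# Read the next field from an open connection, choosing the quoting rule
# from the field's first character.
next_field <- function(stream) {
  p <- seek(stream)           # remember the current position
  d <- readChar(stream, 1)    # peek at the first character of the field
  seek(stream, p)             # rewind so scan() sees the whole field
  if (d == "\"")
    field <- scan(stream, "", 1, sep=",", quote="\"", blank=FALSE)
  else
    field <- scan(stream, "", 1, sep=",", quote="", blank=FALSE)
  return(field)
}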
Assuming the above convention, this is sufficient to parse the file:
s<-file("example.csv",open="rt")
header<-readLines(s,1)
header<-scan(what="",text=header,sep=",")
line<-replicate(length(header),next_field(s))
setNames(as.data.frame(lapply(line,type.convert)),header)
Type ID NAME CONTENT RESPONSE GRADE SOURCE
1 A 3 NA I have comma, ha! I have open double quotes" A NA
However, in practice you might want to first write the fields back to another file, quoting each one, so you can just run read.csv on the corrected format.
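The write-back step could be sketched roughly as follows. This is only an illustration under the same convention (a field that starts with a quote is quoted, anything else is raw), and it additionally assumes that quoted fields contain no embedded double quotes and that no field spans several lines. The helper names split_fields and requote, and the in-memory raw_lines vector standing in for readLines("example.csv"), are made up for this example.

```r
# Split one line using the convention above: a field that starts with a
# double quote runs to the next double quote; any other field runs to the
# next comma. Assumes quoted fields contain no embedded double quotes.
split_fields <- function(line) {
  fields <- character(0)
  repeat {
    quoted <- startsWith(line, "\"")
    pattern <- if (quoted) "^\"[^\"]*\"" else "^[^,]*"
    len <- attr(regexpr(pattern, line), "match.length")
    if (len < 0) {
      # Opening quote with no closing quote: fall back to a raw field
      quoted <- FALSE
      len <- attr(regexpr("^[^,]*", line), "match.length")
    }
    token <- substr(line, 1, len)
    if (quoted) token <- substr(token, 2, len - 1)  # drop the outer quotes
    fields <- c(fields, token)
    rest <- substr(line, len + 1, nchar(line))
    if (rest == "") break
    line <- substr(rest, 2, nchar(rest))  # skip the separating comma
  }
  fields
}

# Quote every field and double any embedded quotes, as standard CSV expects
requote <- function(fields) {
  paste0("\"", gsub("\"", "\"\"", fields, fixed = TRUE), "\"", collapse = ",")
}

# Stand-in for readLines("example.csv")
raw_lines <- c(
  'Type,ID,NAME,CONTENT,RESPONSE,GRADE,SOURCE',
  'A,3,"","I have comma, ha!",I have open double quotes",A,""'
)

fixed_lines <- vapply(raw_lines, function(l) requote(split_fields(l)),
                      character(1), USE.NAMES = FALSE)

# Write the corrected lines to a temporary file and read them normally
fixed_file <- tempfile(fileext = ".csv")
writeLines(fixed_lines, fixed_file)
df <- read.csv(fixed_file, stringsAsFactors = FALSE)
df
```

Because every field is re-quoted, the stray quote in RESPONSE ends up escaped as a doubled quote, which read.csv understands with its default settings.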
#3
0
I'm not too sure about the structure of CSV files, but you said the author had escaped the comma in the text under CONTENT.
This reads the text as is, with the " at the end.
read.csv2("Test.csv", header = TRUE, sep = ",", quote = "")