This question already has an answer here:
- Only read limited number of columns (3 answers)
I have a simple csv file called "test.csv" with the following content:
colA,colB,colC
1,"x",12
2,"y",34
3,"z",56
Let's say I want to skip reading in colA and just read in colB and colC. I want a general way to do this because I have lots of files to read in and sometimes colA is called something else altogether but colB and colC are always the same.
According to the read_csv documentation, one way to accomplish this is to pass a named list for col_types and only name the columns you want to keep:
read_csv('test.csv', col_types = list(colB = col_character(), colC = col_numeric()))
By not mentioning colA it should get dropped from the output. However, the resulting data frame is:
Source: local data frame [3 x 3]
  colA colB colC
1    1    x   12
2    2    y   34
3    3    z   56
Am I doing something wrong or is the read_csv documentation not correct? According to the help file:
If a list, it must contain one "collector" for each column. If you only want to read a subset of the columns, you can use a named list (where the names give the column names). If a column is not mentioned by name, it will not be included in the output.
2 solutions
#1
7
There is an answer out there, I just didn't search hard enough: https://github.com/hadley/readr/issues/132
Apparently this was a documentation issue that has been corrected. This functionality may eventually get added but Hadley thought it was more useful to be able to just update one column type and not drop the others.
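For readers on a newer readr: the package has since gained `cols_only()`, which keeps only the columns named in it. The following is a minimal sketch, assuming a readr version that provides `cols_only()`. The temporary file is just a stand-in for the `test.csv` from the question so that the snippet runs on its own.

```r
library(readr)

# Recreate the example file from the question in a temporary location
# (stand-in for test.csv).
path <- tempfile(fileext = ".csv")
writeLines(c('colA,colB,colC', '1,"x",12', '2,"y",34', '3,"z",56'), path)

# cols_only() reads just the named columns; every other column is skipped,
# whatever the first column happens to be called.
df <- read_csv(path, col_types = cols_only(colB = col_character(),
                                           colC = col_double()))
names(df)
```

Because the unwanted column is never named, the same call should work across files where the first column has a different header.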
#2
2
"According to the read_csv documentation, one way to accomplish this is to pass a named list for col_types and only name the columns you want to keep"
WRONG: read_csv('test.csv', col_types=list(colB='c', colC='c'))
No, the doc is misleading: you have to either specify that unnamed cols get dropped (class '_' or col_skip()), or else explicitly specify their class as NULL:
read_csv('test.csv', col_types=list('*'='_', colB='c', colC='c'))
read_csv('test.csv', col_types=list('colA'='_', colB='c', colC='c'))
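The two skip markers mentioned above can also be used without knowing the name of the unwanted column. This is a sketch rather than a definitive recipe, assuming a readr version that supports compact type strings and `cols()` with a `.default` argument. The temporary file again stands in for `test.csv`.

```r
library(readr)

# Stand-in for test.csv so the snippet is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(c('colA,colB,colC', '1,"x",12', '2,"y",34', '3,"z",56'), path)

# Option 1: compact string, one character per column by position.
# '_' skips the first column, 'c' reads the next two as character.
by_position <- read_csv(path, col_types = "_cc")

# Option 2: skip everything by default and name only the columns to keep.
by_default <- read_csv(path, col_types = cols(.default = col_skip(),
                                              colB = col_character(),
                                              colC = col_character()))

names(by_position)
names(by_default)
```

The positional form relies on the unwanted column always coming first, while the `.default` form relies only on colB and colC keeping their names, which matches the situation described in the question.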