I have to read in about 300 individual CSVs. I have managed to automate the process using a loop and structured CSV names. However, each CSV has 14-17 lines of rubbish at the start, and the count varies randomly, so hard-coding a 'skip' parameter in the read.table command won't work. The column names and number of columns are the same for each CSV.
Here is an example of what I am up against:
QUICK STATISTICS:
Directory: Data,,,,
File: Final_Comp_Zn_1
Selection: SEL{Ox*1000+Doma=1201}
Weight: None,,,
,,Variable: AG,,,
Total Number of Samples: 450212 Number of Selected Samples: 277
Statistics
VARIABLE,Min slice Y(m),Max slice Y(m),Count,Minimum,Maximum,Mean,Std.Dev.,Variance,Total Samples in Domain,Active Samples in Domain AG,
6780.00, 6840.00, 7, 3.0000, 52.5000, 23.4143, 16.8507, 283.9469, 10, 10 AG,
6840.00, 6900.00, 4, 4.0000, 5.5000, 4.9500, 0.5766, 0.3325, 13, 13 AG,
6900.00, 6960.00, 16, 1.0000, 37.0000, 8.7625, 9.0047, 81.0848, 29, 29 AG,
6960.00, 7020.00, 58, 3.0000, 73.5000, 10.6931, 11.9087, 141.8172, 132, 132 AG,
7020.00, 7080.00, 23, 3.0000, 104.5000, 15.3435, 23.2233, 539.3207, 23, 23 AG,
7080.00, 7140.00, 33, 1.0000, 15.4000, 3.8152, 2.8441, 8.0892, 35, 35 AG,
Basically I want to start reading from the line VARIABLE,Min slice Y(m),Max slice Y(m),... onwards. I can think of a few solutions but I don't know how I would go about programming them. Is there any way I can:
- Read the CSV first, somehow work out how many lines of rubbish there are, and then re-read it with the correct number of lines to skip? Or
- Tell read.table to start reading when it finds the column names (since these are the same for each CSV) and ignore everything prior to that?
I think solution (2) would be the most appropriate, but I am open to any suggestions!
2 Answers
#1 (8 votes)
The function fread from the data.table package automatically detects the number of rows to skip. (At the time of writing, the function is still in development.)
Here is some example code:
require(data.table)
cat("blah\nblah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file="myfile1.csv")
cat("blah\nVARIABLE,A1,A2\nA,1,2\n", file="myfile2.csv")
cat("blah\nblah\nVARIABLE,Z1,Z2\nA,1,2\n", file="myfile3.csv")
lapply(list.files(pattern = "myfile.*.csv"), fread)
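In more recent data.table releases (1.9.6 and later), fread's skip argument also accepts a search string, so you can point it directly at the header line and then stack everything with rbindlist. A minimal sketch, using hypothetical files (part1.csv, part2.csv) that share one header as in the real data, rather than the differing headers of the toy files above:

```r
require(data.table)

# Toy files with identical headers but a varying amount of leading rubbish
cat("junk\njunk\nVARIABLE,X1,X2\nA,1,2\n", file = "part1.csv")
cat("junk\nVARIABLE,X1,X2\nB,3,4\n",      file = "part2.csv")

# skip = "VARIABLE" makes fread start at the first line containing
# that string (supported in data.table >= 1.9.6)
dts <- lapply(list.files(pattern = "part.*\\.csv"), fread, skip = "VARIABLE")

# Stack into one table, since every file has the same columns
combined <- rbindlist(dts)
```

This sidesteps counting the rubbish lines entirely, at the cost of requiring a newer data.table than the one available when this answer was written.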
#2 (17 votes)
Here's a minimal example of one approach that can be taken.
First, let's make up some CSV files similar to the ones you describe:
cat("blah\nblah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file="myfile1.csv")
cat("blah\nVARIABLE,A1,A2\nA,1,2\n", file="myfile2.csv")
cat("blah\nblah\nVARIABLE,Z1,Z2\nA,1,2\n", file="myfile3.csv")
Second, identify where the data start:
linesToSkip <- sapply(list.files(pattern = "myfile.*.csv"),
function(x) grep("^VARIABLE", readLines(x))-1)
Third, use that information to read your files into a single list.
lapply(names(linesToSkip),
function(x) read.csv(file=x, skip = linesToSkip[x]))
# [[1]]
# VARIABLE X1 X2
# 1 A 1 2
#
# [[2]]
# VARIABLE A1 A2
# 1 A 1 2
#
# [[3]]
# VARIABLE Z1 Z2
# 1 A 1 2
Edit #1
An alternative to reading the data twice is to read it once into a list, and then perform the same type of processing:
myRawData <- lapply(list.files(pattern = "myfile.*.csv"), readLines)
lapply(myRawData, function(x) {
linesToSkip <- grep("^VARIABLE", x)-1
read.csv(text = x, skip = linesToSkip)
})
Or, for that matter:
lapply(list.files(pattern = "myfile.*.csv"), function(x) {
temp <- readLines(x)
linesToSkip <- grep("^VARIABLE", temp)-1
read.csv(text = temp, skip = linesToSkip)
})
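Since the 300 real files all share the same column names, the list produced by either variant can be stacked into one data frame with do.call(rbind, ...). A sketch under that assumption, using hypothetical files (chunk1.csv, chunk2.csv) with a common header, unlike the deliberately different headers in the toy files above:

```r
# Toy files sharing one header, preceded by a varying amount of rubbish
cat("junk\njunk\nVARIABLE,X1,X2\nA,1,2\n", file = "chunk1.csv")
cat("junk\nVARIABLE,X1,X2\nB,3,4\n",      file = "chunk2.csv")

dfs <- lapply(list.files(pattern = "chunk.*\\.csv"), function(x) {
  temp <- readLines(x)
  # The header is the first line starting with "VARIABLE"
  linesToSkip <- grep("^VARIABLE", temp) - 1
  read.csv(text = temp, skip = linesToSkip)
})

# With identical columns, the per-file data frames rbind cleanly
allData <- do.call(rbind, dfs)
```

If the files are large, rbindlist from data.table would do the same stacking more efficiently.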
Edit #2
As @PaulHiemstra notes, you can use the argument n to read only the first few lines of each file into memory rather than the whole file. So if you know for certain that there are no more than 20 lines of "rubbish" in each file, and you are using the first approach described above, you can use:
linesToSkip <- sapply(list.files(pattern = "myfile.*.csv"),
function(x) grep("^VARIABLE", readLines(x, n = 20))-1)