I have been reading up on how to read and combine multiple .xlsx files into one R data frame and have come across some very good suggestions, such as "How to read multiple xlsx file in R using loop with specific rows and columns", but none fits my data set so far.
I would like R to read in multiple xlsx files that each have multiple sheets. All sheets and files have the same columns but not the same number of rows, and NAs should be excluded. I want to skip the first 3 rows and only take in columns 1:6, 8:10, 12:17, 19.
So far I tried:
file.list <- list.files(recursive=T, pattern='*.xlsx')
dat = lapply(file.list, function(i){
  x = read.xlsx(i, sheetIndex=1, sheetName=NULL, startRow=4,
                endRow=NULL, as.data.frame=TRUE, header=F)
  # Column select
  x = x[, c(1:6,8:10,12:17,19)]
  # Create column with file name
  x$file = i
  # Return data
  x
})
dat = do.call("rbind.data.frame", dat)
But this only reads the first sheet of every file.
Does anyone know how to get all the sheets and files together in one R data frame?
Also, what packages would you recommend for large sets of data? So far I tried readxl and XLConnect.
Thanks a million!
2 Answers
#1
2
I would use a nested loop like this to go through each sheet of each file. It might not be the fastest solution but it is the simplest.
library(xlsx)
files.list <- list.files(recursive=TRUE, pattern='\\.xlsx$')  # get the list of files from the folder
for (i in seq_along(files.list)){
  wb <- loadWorkbook(files.list[i])  # select a file and load the workbook
  sheet <- getSheets(wb)             # get the list of sheets
  for (j in seq_along(sheet)){
    tmp <- read.xlsx(files.list[i], sheetIndex=j, colIndex=c(1:6,8:10,12:17,19),
                     sheetName=NULL, startRow=4, endRow=NULL,
                     as.data.frame=TRUE, header=FALSE)
    # the first sheet of the first file starts the data set; every later sheet is appended
    if (i==1 && j==1) dataset <- tmp else dataset <- rbind(dataset, tmp)
  }
}
You can clean out the NA values after the loading phase.
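As a rough sketch of that clean-up step, here are two ways it could be done in base R. The small `dataset` below is a made-up stand-in for the combined data frame, and the question does not say whether "excluding NAs" means dropping only fully blank rows or every row containing an NA, so both variants are shown.

```r
# Hypothetical stand-in for the combined `dataset` built by the loop.
dataset <- data.frame(
  X1 = c(1, NA, 3, NA),
  X2 = c("a", NA, NA, "d"),
  stringsAsFactors = FALSE
)

# Variant 1: drop only rows in which every column is NA (e.g. blank spreadsheet rows).
all_na <- rowSums(is.na(dataset)) == ncol(dataset)
no_blank_rows <- dataset[!all_na, , drop = FALSE]

# Variant 2 (stricter): drop every row that contains at least one NA.
complete_rows <- na.omit(dataset)
```

The first variant keeps partially filled rows, which is probably the safer default when some cells are legitimately empty.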
#2
3
openxlsx solution:
filename <-"myFilePath"
sheets <- openxlsx::getSheetNames(filename)
SheetList <- lapply(sheets,openxlsx::read.xlsx,xlsxFile=filename)
names(SheetList) <- sheets
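The snippet above reads all sheets of a single file into a list. One possible way to extend it to several files and a single data frame is sketched below. The helper name `read_all_sheets` and the added `file` and `sheet` columns are illustrative choices, not part of openxlsx. The sketch assumes every sheet yields the same set of columns, so that `rbind` can stack them. If some sheets have entirely empty columns in the selected range this may not hold, and the result should be checked.

```r
library(openxlsx)

# Hypothetical helper: read the selected columns from every sheet of every file
# and stack the results into one data frame.
read_all_sheets <- function(files, cols, start_row) {
  per_file <- lapply(files, function(f) {
    per_sheet <- lapply(getSheetNames(f), function(s) {
      x <- read.xlsx(f, sheet = s, startRow = start_row, cols = cols,
                     colNames = FALSE, skipEmptyCols = FALSE)
      # Skip sheets that contain no data from start_row onward.
      if (is.null(x) || nrow(x) == 0) return(NULL)
      x$file  <- f  # remember which file each row came from
      x$sheet <- s  # remember which sheet each row came from
      x
    })
    do.call(rbind, per_sheet)
  })
  do.call(rbind, per_file)
}

# Possible usage with the layout described in the question:
# files <- list.files(recursive = TRUE, pattern = "\\.xlsx$", full.names = TRUE)
# dat   <- read_all_sheets(files, cols = c(1:6, 8:10, 12:17, 19), start_row = 4)
```

Collecting the pieces in lists and binding once per level avoids growing a data frame inside a loop, which tends to matter with many files or large sheets.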