R: Assigning variable names based on imported file names

Time: 2022-07-05 14:57:04

I have a list of filenames found by searching the working directory. I want to build either one data frame whose parts can be selected individually, or several separate data frames. In either case, I want to label each part (or each data frame) using part of the associated filename.

Currently, I collect the filenames with list.files and read each file into a list of data frames using lapply with read.csv:

# "dat\\.csv$" is a regular expression: file names ending in "dat.csv"
filenames <- list.files(recursive = TRUE, pattern = "dat\\.csv$", full.names = FALSE)
data <- lapply(filenames, function(i) {
  read.csv(i, stringsAsFactors = FALSE)
})

Can someone explain to me the best way to go about this data import and name assignment?

1 solution

#1


A good way to store this would be as a single, combined data frame with a column describing the original file, let's say type:

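data_frames <- lapply(filenames, function(i) {
    ret <- read.csv(i, stringsAsFactors = FALSE)
    # basename() drops any subdirectory that recursive = TRUE put in the path
    ret$type <- sub("dat\\.csv$", "", basename(i))
    ret
})
# stack the per-file data frames; rbind needs them to have the same columns
data <- do.call(rbind, data_frames)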

Or shorter, with plyr:

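library(plyr)
# ldply only fills the .id column when its input has names, so name each
# path by the part of the file name that precedes "dat.csv"
names(filenames) <- sub("dat\\.csv$", "", basename(filenames))
data <- ldply(filenames, read.csv, stringsAsFactors = FALSE, .id = "type")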

That way you could extract whatever subset you wanted with:

# to get all lines from, say, the AAAdat.csv file
subset(data, type == "AAA")

You could store each dataset as an individual variable with a name like AAA, but you shouldn't, because it's a bad idea to use your variable names to store information.
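
If separate data frames are still preferable (for instance when the files do not share columns), a named list is a common middle ground: the names live in the list rather than in the workspace. The following is a minimal sketch, not part of the original answer. It writes two small hypothetical files (AAAdat.csv and BBBdat.csv) into a temporary directory so that it runs on its own, and data_list is a name chosen for illustration.

```r
# Hypothetical demo files in a temporary directory (not from the question)
demo_dir <- tempfile("demo")
dir.create(demo_dir)
write.csv(data.frame(x = 1:2, y = c("a", "b")),
          file.path(demo_dir, "AAAdat.csv"), row.names = FALSE)
write.csv(data.frame(x = 3:5, y = c("c", "d", "e")),
          file.path(demo_dir, "BBBdat.csv"), row.names = FALSE)

filenames <- list.files(demo_dir, pattern = "dat\\.csv$",
                        recursive = TRUE, full.names = TRUE)

# Read every file, then name each list element after its file
data_list <- lapply(filenames, read.csv, stringsAsFactors = FALSE)
names(data_list) <- sub("dat\\.csv$", "", basename(filenames))

# Pick one data frame by name
data_list[["AAA"]]
```

Each element stays an ordinary data frame, so data_list[["AAA"]] or data_list$AAA can be used wherever a separate variable named AAA would have been.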

(Note that this assumes your datasets share most, or at least some, columns. If they have entirely different structures, this is not an appropriate approach).
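
When the files share only some columns, base rbind stops with an error because the column names do not match. plyr provides rbind.fill for that case. The sketch below shows the same idea in base R under stated assumptions: the two data frames a and b are made-up stand-ins for files that have already been read and tagged with a type column.

```r
# Two hypothetical data frames that share only the columns "x" and "type"
a <- data.frame(x = 1:2, y = c("a", "b"), type = "AAA", stringsAsFactors = FALSE)
b <- data.frame(x = 3:4, z = c(0.5, 1.5), type = "BBB", stringsAsFactors = FALSE)
data_frames <- list(a, b)

# Union of all column names, in order of first appearance
all_cols <- unique(unlist(lapply(data_frames, names)))

# Add each missing column as NA, then put the columns in a common order
padded <- lapply(data_frames, function(df) {
    df[setdiff(all_cols, names(df))] <- NA
    df[all_cols]
})
data <- do.call(rbind, padded)

# Rows from one source can still be selected by type
subset(data, type == "BBB")
```

Columns that a file lacks end up as NA for its rows, so this only makes sense when the shared columns carry most of the information.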
