How to import files from sub-directories and name them with the sub-directory name in R

Time: 2021-07-26 00:32:00

I'd like to import files (of different lengths) recursively from sub-directories and put them into one data.frame, having one column with the subdirectory name and one column with the file name (minus the extension):

e.g. folder structure
IsolatedData
  00
    tap-4.out
    cl_pressure.out
  15
    tap-4.out
    cl_pressure.out

So far I have:

setwd("~/Documents/IsolatedData")
l <- list.files(pattern = ".out$",recursive = TRUE)
p <- bind_rows(lapply(1:length(l), function(i) {chars <- strsplit(l[i], "/");
cbind(data.frame(Pressure = read.table(l[i],header = FALSE,skip=2, nrow =length(readLines(l[i])))),
      Angle = chars[[1]][1], Location = chars[[1]][1])}), .id = "id")

But I get an error saying line 43 doesn't have 2 elements.

Also seen this one using dplyr which looks neat but I can't get it to work: http://www.machinegurning.com/rstats/map_df/

tbl <-
  list.files(recursive=T,pattern=".out$")%>% 
  map_df(~data_frame(x=.x),.id="id")

3 Answers

#1


6  

Here's a workflow with the map functions from purrr within the tidyverse.

I generated a bunch of csv files to work with to mimic your file structure and some simple data. I threw in 2 lines of junk data at the beginning of each file, since you said you were trying to skip the top 2 lines.

library(tidyverse)

setwd("~/_R/SO/nested")

walk(paste0("folder", 1:3), dir.create)

list.files() %>%
    walk(function(folderpath) {
        map(1:4, function(i) {
            df <- tibble(
                x1 = sample(letters[1:3], 10, replace = T),
                x2 = rnorm(10)
            )
            dummy <- tibble(
                x1 = c("junk line 1", "junk line 2"),
                x2 = c(0)
            )
            bind_rows(dummy, df) %>%
                write_csv(sprintf("%s/file%s.out", folderpath, i))
        })
    })

That gets the following file structure:

├── folder1
|  ├── file1.out
|  ├── file2.out
|  ├── file3.out
|  └── file4.out
├── folder2
|  ├── file1.out
|  ├── file2.out
|  ├── file3.out
|  └── file4.out
└── folder3
   ├── file1.out
   ├── file2.out
   ├── file3.out
   └── file4.out

Then I used list.files(recursive = T) to get the paths to these files, used str_extract to pull the folder name and file name out of each path, read each csv file while skipping the dummy lines, and added the folder and file names as columns of the data frame.

Since I did this with map_dfr, I get a single tibble back, in which the data frames from each iteration are row-bound together.

all_data <- list.files(recursive = T) %>%
    map_dfr(function(path) {
        # any characters from beginning of path until /
        foldername <- str_extract(path, "^.+(?=/)")
        # any characters between / and .out at end
        filename <- str_extract(path, "(?<=/).+(?=\\.out$)")

        # skip = 3 to skip over names and first 2 lines
        # could instead use col_names = c("x1", "x2")
        read_csv(path, skip = 3, col_names = F) %>%
            mutate(folder = foldername, file = filename)
    })

head(all_data)
#> # A tibble: 6 x 4
#>   X1        X2 folder  file 
#>   <chr>  <dbl> <chr>   <chr>
#> 1 b      0.858 folder1 file1
#> 2 b      0.544 folder1 file1
#> 3 a     -0.180 folder1 file1
#> 4 b      1.14  folder1 file1
#> 5 b      0.725 folder1 file1
#> 6 c      1.05  folder1 file1

Created on 2018-04-21 by the reprex package (v0.2.0).
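
The folder and file names can also be pulled out of a path without regular expressions. The following is only a sketch of an alternative to the str_extract calls above, using base R's dirname, basename and tools::file_path_sans_ext; the paths vector is made up for illustration and simply mimics what list.files(recursive = TRUE) returns.

```r
# Hypothetical relative paths, shaped like list.files(recursive = TRUE) output
paths <- c("folder1/file1.out", "folder2/file3.out")

# dirname() keeps everything before the last "/"
folder <- dirname(paths)

# basename() keeps the last path component;
# file_path_sans_ext() then drops the ".out" extension
file <- tools::file_path_sans_ext(basename(paths))

parts <- data.frame(folder = folder, file = file, stringsAsFactors = FALSE)
print(parts)
```

One difference worth knowing: for a file that sits directly in the working directory, dirname returns "." whereas the look-ahead pattern "^.+(?=/)" finds no match and str_extract returns NA.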

#2


1  

Can you try:

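library(tidyverse)

# Name the path vector by itself so that .id records each file's path
# rather than its position; col_names = FALSE because the files have no header row.
tbl <-
  list.files(recursive = TRUE, pattern = "\\.out$") %>% 
  set_names() %>%
  map_dfr(read_table, skip = 2, col_names = FALSE, .id = "filepath")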

#3


0  

I am guessing from your program that your ".out" files consist of a single column of data? If so, you can use scan instead of read.table. I am also guessing that you want the folder name in a column called Angle, the file name (minus extension) in a column called Location, and the data in a column called Pressure. If that is correct, the following should work:

setwd("~/Documents/IsolatedData")
l <- list.files(pattern = "\\.out$", recursive = TRUE)
p <- data.frame()
for (i in seq_along(l)){
  pt <- data.frame(Angle = strsplit(l[i], "/")[[1]][1],
                   Location = sub("\\.out$", "", basename(l[i])),
                   Pressure = scan(l[i], skip=2))
  p <- rbind(p, pt)
}

I know it is unfashionable to give an answer that just uses base R, particularly one involving a loop. However, for things like iterating through files in a directory, IMHO it is a perfectly reasonable thing to do, not least for readability and ease of debugging. Of course, as I expect you know, growing an object with rbind in a loop (or with apply, for that matter) is not a great idea if you are dealing with big data, but I suspect that is not the case here.
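
If the data did get large, one common alternative is to collect the per-file data frames in a list and bind them once at the end. Here is a minimal sketch of that idea, assuming single-column files with two header lines as guessed above; the temporary directory and the numbers written into it are made up purely so the example runs on its own.

```r
# Build a small throwaway tree to read from (hypothetical data)
root <- file.path(tempdir(), "IsolatedData")
for (angle in c("00", "15")) {
  dir.create(file.path(root, angle), recursive = TRUE, showWarnings = FALSE)
  for (loc in c("tap-4", "cl_pressure")) {
    # files deliberately differ in length: 3 values vs 5 values
    n <- if (loc == "tap-4") 3 else 5
    # two header lines followed by a single column of numbers
    writeLines(c("header 1", "header 2", as.character(seq_len(n))),
               file.path(root, angle, paste0(loc, ".out")))
  }
}

# Paths relative to root, e.g. "00/tap-4.out"
l <- list.files(root, pattern = "\\.out$", recursive = TRUE)

# Read each file into its own data frame
pieces <- lapply(l, function(path) {
  data.frame(Angle = dirname(path),
             Location = sub("\\.out$", "", basename(path)),
             Pressure = scan(file.path(root, path), skip = 2, quiet = TRUE),
             stringsAsFactors = FALSE)
})

# Bind all pieces in a single call instead of growing p inside the loop
p <- do.call(rbind, pieces)
str(p)
```

The result has the same three columns as the loop version; only the way the rows are accumulated differs.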
