Merge a list of data frames and sum different columns

Time: 2021-11-14 22:57:45

I have a list of dataframes

> df.t
[[1]]
column_A start_B stop_C column_D column_E
1        0        23       2        3 
1        23       200      1        0
2        0        55       0        1
[[2]]
column_A start_B stop_C column_D column_E
1        0        200      1        0
2        0        20       2        0
2        20       55       0        1
[[3]]
column_A start_B stop_C column_D column_E
1        0        200      0        0
2        0        55       4        2

I want to split by column_A, merge on column_A, start_B and stop_C, and sum column_D and column_E. For each unique column_A value, I want every range that can be formed from the unique start_B and stop_C values taken in order. For example, where column_A = 1, the overall range from the lowest start_B to the highest stop_C is 0-200, but df.t[[1]] breaks this range into 0-23 and 23-200. Every data frame's rows for column_A = 1 should therefore be split at 23 before summing, which leads to the expected output below after merging.

expected output:

> df.merge

column_A start_B stop_C column_D column_E
1        0        23       3        3
1        23       200      2        0   
2        0        20       6        3
2        20       55       4        4

I know that to split and merge just two of the data frames I can do:

lst1 <- split(df.t[[1]], df.t[[1]]$column_A)
lst2 <- split(df.t[[2]], df.t[[2]]$column_A)

require(survival)

df <- do.call(rbind, mapply(FUN = function(x, y) {

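    # Dummy event column required by survSplit
    x$event <- y$event <- 0
    # Split each data frame's intervals at the other's stop points
    lst1.spl <- survSplit(x, cut=y$stop_C, start='start_B', end='stop_C', event='event')
    lst2.spl <- survSplit(y, cut=x$stop_C, start='start_B', end='stop_C', event='event')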
    mrg <- merge(lst1.spl, lst2.spl, 
            by=c('column_A', 'start_B', 'stop_C'))
    mrg[c('column_A', 'start_B', 'stop_C', 'column_D', 'column_E')]
    },
lst1, lst2, SIMPLIFY=FALSE))

However, to do this across a whole list of data frames and also sum column_D and column_E, I think I should use Reduce(), though I am not sure whether that is possible or the best method.

df.merge = Reduce(function(...) merge(..., by=c('column_A', 'start_B', 'stop_C')), df.t)

This only merges on the first three columns and does not sum column_D and column_E. I know I should use ddply, but I am not sure how to use it correctly with Reduce.

Thank you!

1 answer

#1


Your expected output doesn't seem consistent with your description of what you're trying to do. Here's some code that will give you the sum of column_D and the sum of column_E for each combination of column_A, start_B, and stop_C across the three data frames. If I've misunderstood your goal, please let me know and I'll update.

Method: Combine the three data frames into a single data frame, then split the data by the three grouping columns and sum over each group.

library(dplyr) 

# Combine all three data frames into a single data frame
alldfs = bind_rows(df.t)

# Sum column_D and column_E for each combination of the grouping columns
alldfs %>% group_by(column_A, start_B, stop_C) %>%
  summarise(sum_column_D = sum(column_D),
            sum_column_E = sum(column_E))

  column_A start_B stop_C sum_column_D sum_column_E
1        1       0     23            2            3
2        1       0    200            1            0
3        1      23    200            1            0
4        2       0     20            2            0
5        2       0     55            4            3
6        2      20     55            0            1
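
If the goal is instead the one the question describes (split every range at all breakpoints seen for that column_A value, then sum), the following is a minimal base-R sketch rather than a general solution. The helper name split_group is hypothetical and the data are retyped from the question's printout. It assumes that a row's column_D and column_E values carry over unchanged to every sub-range it covers, which is what the expected output implies.

```r
# Sample data retyped from the question's printout
df.t <- list(
  data.frame(column_A = c(1, 1, 2), start_B = c(0, 23, 0),
             stop_C = c(23, 200, 55), column_D = c(2, 1, 0),
             column_E = c(3, 0, 1)),
  data.frame(column_A = c(1, 2, 2), start_B = c(0, 0, 20),
             stop_C = c(200, 20, 55), column_D = c(1, 2, 0),
             column_E = c(0, 0, 1)),
  data.frame(column_A = c(1, 2), start_B = c(0, 0),
             stop_C = c(200, 55), column_D = c(0, 4),
             column_E = c(0, 2))
)

# Stack all data frames; the list structure is not needed afterwards
alldfs <- do.call(rbind, df.t)

# Hypothetical helper: process the rows belonging to one column_A value
split_group <- function(g) {
  # Every boundary seen in this group, in increasing order
  breaks <- sort(unique(c(g$start_B, g$stop_C)))
  lo <- breaks[-length(breaks)]  # lower ends of the elementary ranges
  hi <- breaks[-1]               # upper ends of the elementary ranges
  # For each elementary range, sum the rows whose own range covers it
  do.call(rbind, lapply(seq_along(lo), function(i) {
    covers <- g$start_B <= lo[i] & g$stop_C >= hi[i]
    data.frame(column_A = g$column_A[1],
               start_B  = lo[i],
               stop_C   = hi[i],
               column_D = sum(g$column_D[covers]),
               column_E = sum(g$column_E[covers]))
  }))
}

# Apply the helper per column_A value and bind the pieces back together
df.merge <- do.call(rbind, lapply(split(alldfs, alldfs$column_A), split_group))
rownames(df.merge) <- NULL
df.merge
```

On the sample data this gives the four rows in the question's expected output. An elementary range that no input row covers would still appear, with both sums equal to 0, so filter those out if the real data has gaps.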
