使用列表中的数据aframes中的加权平均值创建新的数据aframe

时间:2022-07-13 22:56:53

I have many dataframes stored in a list, and I want to create weighted averages from these and store the results in a new dataframe. For example, with the list:

我有许多数据aframes存储在一个列表中,我希望从中创建加权平均,并将结果存储在一个新的dataframe中。例如,列表:

dfs <- structure(list(df1 = structure(list(A = 4:5, B = c(8L, 4L), Weight = c(TRUE, TRUE), Site = c("X", "X")), 
                                      .Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame"), 
                      df2 = structure(list(A = c(6L, 8L), B = c(9L, 4L), Weight = c(FALSE, TRUE), Site = c("Y", "Y")), 
                                      .Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame")), 
                 .Names = c("df1", "df2"))

In this example, I want to use columns A, B, and Weight for the weighted averages. I also want to move over related data such as Site, and want to sum the number of TRUE and FALSE. My desired result would look something like:

在这个例子中,我想用A, B和权重作为加权平均数。我还想转移有关的数据,如站点,并想要对真和假的数量进行总和。我想要的结果应该是:

result <- structure(list(Site = structure(1:2, .Label = c("X", "Y"), class = "factor"), 
    A.Weight = c(4.5, 8), B.Weight = c(6L, 4L), Sum.Weight = c(2L, 
    1L)), .Names = c("Site", "A.Weight", "B.Weight", "Sum.Weight"
), class = "data.frame", row.names = c(NA, -2L))


    Site    A.Weight    B.Weight    Sum.Weight
1   X       4.5         6           2
2   Y       8.0         4           1

The above is just a very simple example, but my real data have many dataframes in the list, and many more columns than just A and B for which I want to calculate weighted averages. I also have several columns similar to Site that are constant in each dataframe and that I want to move to the result.

上面只是一个非常简单的例子,但是我的实际数据在列表中有很多dataframes,并且比我想要计算加权平均的a和B要多很多列。我还有几个类似于Site的列,它们在每个dataframe中都是常量,我想将它们移动到结果中。

I'm able to manually calculate weighted averages using something like

我可以用类似的方法手工计算加权平均数

weighted.mean(dfs$df1$A, dfs$df1$Weight)
weighted.mean(dfs$df1$B, dfs$df1$Weight)
weighted.mean(dfs$df2$A, dfs$df2$Weight)
weighted.mean(dfs$df2$B, dfs$df2$Weight)

but I'm not sure how I can do this in a shorter, less "manual" way. Does anyone have any recommendations? I've recently learned how to lapply across dataframes in a list, but my attempts have not been so great so far.

但我不知道如何用一种更短、更少“手工”的方式来实现这一点。有人有什么建议吗?我最近学会了如何在列表中跨dataframes应用程序,但是到目前为止,我的尝试还不够多。

2 个解决方案

#1


2  

The trick is to create a function that works for a single data.frame, then use lapply to iterate across your list. Since lapply returns a list, we'll then use do.call to rbind the resulting objects together:

诀窍是为单个data.frame创建一个函数,然后使用lapply遍历列表。因为lapply返回一个列表,所以我们将使用do。调用将结果对象绑定到一起:

foo <- function(data, meanCols = LETTERS[1:2], weightCol = "Weight", otherCols = "Site") {
  means <- t(sapply(data[, meanCols], weighted.mean, w = data[, weightCol]))
  sumWeight <- sum(data[, weightCol])
  others <- data[1, otherCols, drop = FALSE] #You said all the other data was constant, so we can just grab first row
  out <- data.frame(others, means, sumWeight)
  return(out)
}

In action:

在行动:

do.call(rbind, lapply(dfs, foo))
---
    Site   A B sumWeight
df1    X 4.5 6         2
df2    Y 8.0 4         1

Since you said this was a minimal example, here's one approach to expanding this to other columns. We'll use grepl() and use regular expressions to identify the right columns. Alternatively, you could write them all out in a vector. Something like this:

既然您说过这是一个最小的示例,下面是一种将其扩展到其他列的方法。我们将使用grepl()并使用正则表达式来标识正确的列。或者,你可以把它们都写成一个向量。是这样的:

do.call(rbind, lapply(dfs, foo, 
                      meanCols = grepl("A|B", names(dfs[[1]])),
                      otherCols = grepl("Site", names(dfs[[1]]))
                      ))

#2


2  

using dplyr

使用dplyr

 library(dplyr)
 library('devtools')
 install_github('hadley/tidyr')
 library(tidyr)

 unnest(dfs) %>%
           group_by(Site) %>% 
           filter(Weight) %>% 
           mutate(Sum=n()) %>%
           select(-Weight) %>% 
           summarise_each(funs(mean=mean(., na.rm=TRUE)))

gives the result

给出了结果

 #  Site   A B Sum
 #1    X 4.5 6   2
 #2    Y 8.0 4   1

Or using data.table

或者使用data.table

 library(data.table)
 DT <- rbindlist(dfs)
 DT[(Weight)][, c(lapply(.SD, mean, na.rm = TRUE), 
                Sum=.N), by = Site, .SDcols = c("A", "B")]
 #   Site   A B Sum
 #1:    X 4.5 6   2
 #2:    Y 8.0 4   1

Update

In response to @jazzuro's comment, Using dplyr 0.3, I am getting

回复@jazzuro的评论,使用dplyr 0.3,我收到了

   unnest(dfs) %>% 
             group_by(Site) %>% 
             summarise_each(funs(weighted.mean=stats::weighted.mean(., Weight),
                    Sum.Weight=sum(Weight)), -starts_with("Weight")) %>%
             select(Site:B_weighted.mean, Sum.Weight=A_Sum.Weight) 

  #    Site A_weighted.mean B_weighted.mean Sum.Weight
  #1    X             4.5               6          2
  #2    Y             8.0               4          1

#1


2  

The trick is to create a function that works for a single data.frame, then use lapply to iterate across your list. Since lapply returns a list, we'll then use do.call to rbind the resulting objects together:

诀窍是为单个data.frame创建一个函数,然后使用lapply遍历列表。因为lapply返回一个列表,所以我们将使用do。调用将结果对象绑定到一起:

foo <- function(data, meanCols = LETTERS[1:2], weightCol = "Weight", otherCols = "Site") {
  means <- t(sapply(data[, meanCols], weighted.mean, w = data[, weightCol]))
  sumWeight <- sum(data[, weightCol])
  others <- data[1, otherCols, drop = FALSE] #You said all the other data was constant, so we can just grab first row
  out <- data.frame(others, means, sumWeight)
  return(out)
}

In action:

在行动:

do.call(rbind, lapply(dfs, foo))
---
    Site   A B sumWeight
df1    X 4.5 6         2
df2    Y 8.0 4         1

Since you said this was a minimal example, here's one approach to expanding this to other columns. We'll use grepl() and use regular expressions to identify the right columns. Alternatively, you could write them all out in a vector. Something like this:

既然您说过这是一个最小的示例,下面是一种将其扩展到其他列的方法。我们将使用grepl()并使用正则表达式来标识正确的列。或者,你可以把它们都写成一个向量。是这样的:

do.call(rbind, lapply(dfs, foo, 
                      meanCols = grepl("A|B", names(dfs[[1]])),
                      otherCols = grepl("Site", names(dfs[[1]]))
                      ))

#2


2  

using dplyr

使用dplyr

 library(dplyr)
 library('devtools')
 install_github('hadley/tidyr')
 library(tidyr)

 unnest(dfs) %>%
           group_by(Site) %>% 
           filter(Weight) %>% 
           mutate(Sum=n()) %>%
           select(-Weight) %>% 
           summarise_each(funs(mean=mean(., na.rm=TRUE)))

gives the result

给出了结果

 #  Site   A B Sum
 #1    X 4.5 6   2
 #2    Y 8.0 4   1

Or using data.table

或者使用data.table

 library(data.table)
 DT <- rbindlist(dfs)
 DT[(Weight)][, c(lapply(.SD, mean, na.rm = TRUE), 
                Sum=.N), by = Site, .SDcols = c("A", "B")]
 #   Site   A B Sum
 #1:    X 4.5 6   2
 #2:    Y 8.0 4   1

Update

In response to @jazzuro's comment, Using dplyr 0.3, I am getting

回复@jazzuro的评论,使用dplyr 0.3,我收到了

   unnest(dfs) %>% 
             group_by(Site) %>% 
             summarise_each(funs(weighted.mean=stats::weighted.mean(., Weight),
                    Sum.Weight=sum(Weight)), -starts_with("Weight")) %>%
             select(Site:B_weighted.mean, Sum.Weight=A_Sum.Weight) 

  #    Site A_weighted.mean B_weighted.mean Sum.Weight
  #1    X             4.5               6          2
  #2    Y             8.0               4          1