Taking column means over a list of data frames in R

Here's what I'm trying to do. My data frame has a factor variable, "country", and I want to split the data frame based on country. Then, I want to take the column mean over every variable for every country's data frame.

Data here: https://github.com/pourque/country-data

I've done this so far...

myList <- split(df1, df1$country)
for(i in 1:length(myList)) {
aggregate <- mapply(myList[[i]][,-c(38:39)], colMeans)
}

(I'm not including the 38th and 39th columns because those are factors.)

I've read this (function over more than one list), which makes me think mapply is the answer here... but I'm getting this error:

Error in match.fun(FUN) : 
'myList[[i]][, -c(38:39)]' is not a function, character or symbol

Maybe I'm formatting it incorrectly?

4 Solutions

#1

A data.table answer:

library(data.table)

setDT(df1)[, lapply(.SD, mean), by = country, .SDcols = -c('age', 'gender')]

Now tidier syntax with deselection in .SDcols, thanks to user Arun

To explain what's happening here:

setDT(df1): converts the data.frame to a data.table

lapply(.SD, mean): for each column in the subset of data, takes the mean

by = country: does this per group, with groups split according to country

.SDcols = -c('age', 'gender'): omits the age and gender columns from the subset of data

#2

It's straightforward in base R using aggregate without the need to split the data.frame into a list beforehand. Here's an example using the built-in iris data where you compute the mean of all variables except those in the first and second column by group of Species:

data(iris)
aggregate(. ~ Species, iris[-(1:2)], mean)
#     Species Petal.Length Petal.Width
#1     setosa        1.462       0.246
#2 versicolor        4.260       1.326
#3  virginica        5.552       2.026

The . inside aggregate is used to specify that you want to use all remaining columns of the data.frame except the grouping variable (Species in this case). And because you specify iris[-(1:2)] as input data, the first and second columns are not used either.

For your data, it should then be something like:

aggregate(. ~ country, df1[-c(38:39)], mean)
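
As a hedged variation on the call above, the columns to drop can be picked by type instead of by position, so the code does not depend on age and gender sitting in columns 38 and 39. The data frame `df_toy` and its columns are made up for illustration; they only stand in for `df1`.

```r
# Toy stand-in for df1; the column names and values are invented.
df_toy <- data.frame(
  country = factor(c("A", "A", "B", "B")),
  x       = c(1, 3, 10, 20),
  y       = c(2, 4, 30, 50),
  gender  = factor(c("f", "m", "f", "m"))
)

# Keep the grouping column plus every numeric column; other factors are dropped.
keep <- sapply(df_toy, is.numeric) | names(df_toy) == "country"

# One row per country, one mean per remaining column.
res <- aggregate(. ~ country, data = df_toy[keep], FUN = mean)
res
```

Note that the formula interface of aggregate uses na.action = na.omit by default, so rows containing an NA in any column are dropped before the means are taken.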

#3

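library(dplyr)

# Older dplyr versions wrote the last step as summarise_each(funs(mean));
# both functions are now deprecated in favour of across().
df1 %>%
    group_by(country) %>%
    select(-age, -gender) %>%
    summarise(across(everything(), mean))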

#4

If you insist on keeping everything in a list:

# Split into a list of data frames, one per country
myList <- split(df1, df1$country)

# Column means for one country's data frame, without age and gender (columns 38 and 39)
my_aggregate <- function(df_inlist) {
  aggregate(. ~ country, df_inlist[, -c(38, 39)], mean)
}

# Apply the aggregate function to every data frame in the list
out <- lapply(myList, my_aggregate)

out is a list with one data.frame per country, each holding the column means of the variables. To put it all together in a single data.frame:

composite_df <- do.call(rbind, out)
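
For comparison, here is a minimal sketch of the split-then-colMeans idea from the question. The error there comes from the argument order: the first argument of mapply is the function (FUN), so passing the data frame first makes match.fun fail. With a single list, lapply is enough. The data frame `df_toy` and its columns are invented for illustration and only stand in for `df1`.

```r
# Toy stand-in for df1; the column names and values are invented.
df_toy <- data.frame(
  country = factor(c("A", "A", "B", "B")),
  x       = c(1, 3, 10, 20),
  y       = c(2, 4, 30, 50),
  gender  = factor(c("f", "m", "f", "m"))
)

# colMeans needs numeric input, so select the numeric columns before splitting.
num_cols   <- sapply(df_toy, is.numeric)
by_country <- split(df_toy[num_cols], df_toy$country)

# One named vector of column means per country.
means_list <- lapply(by_country, colMeans)

# Stack the vectors into a matrix: one row per country, one column per variable.
means_mat <- do.call(rbind, means_list)
means_mat
```

If a data frame is preferred over a matrix, as.data.frame(means_mat) converts it, with the country names kept as row names.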
