Here's what I'm trying to do. My data frame has a factor variable, "country", and I want to split the data frame based on country. Then, I want to take the column mean over every variable for every country's data frame.
Data here: https://github.com/pourque/country-data
I've done this so far...
myList <- split(df1, df1$country)
for(i in 1:length(myList)) {
  aggregate <- mapply(myList[[i]][,-c(38:39)], colMeans)
}
(I'm not including the 38th and 39th columns because those are factors.)
I've read this (function over more than one list), which makes me think mapply is the answer here... but I'm getting this error:
Error in match.fun(FUN) :
'myList[[i]][, -c(38:39)]' is not a function, character or symbol
Maybe I'm formatting it incorrectly?
4 Solutions
#1
3
A data.table answer:
library(data.table)
setDT(df1)[, lapply(.SD, mean), by = country, .SDcols = -c('age', 'gender')]
Now with tidier syntax, using deselection in .SDcols, thanks to user Arun.
To explain what's happening here:
- setDT(df1): make the data.frame a data.table
- lapply(.SD, mean): for each column in the subset of data, take the mean
- by = country: do this by groups split according to country
- .SDcols = -c('age', 'gender'): omit the age and gender columns from the subset of data
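For anyone who wants to try the grouped lapply(.SD, mean) pattern without downloading the linked data, here is a minimal sketch on the built-in iris data. It assumes iris as a stand-in for df1 and Species as a stand-in for country, and it lists the columns to keep in .SDcols instead of deselecting them; the names dt and res are only illustrative.

```r
library(data.table)

# iris stands in for df1, and Species for country (both are assumptions).
dt <- as.data.table(iris)

# .SD holds only the columns named in .SDcols, one subset per Species.
res <- dt[, lapply(.SD, mean), by = Species,
          .SDcols = c("Petal.Length", "Petal.Width")]
res
```

The result has one row per group and one column per averaged variable, which is the shape the question asks for.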
#2
7
It's straightforward in base R using aggregate, without the need to split the data.frame into a list beforehand. Here's an example using the built-in iris data, where you compute the mean of all variables except those in the first and second columns, grouped by Species:
data(iris)
aggregate(. ~ Species, iris[-(1:2)], mean)
#     Species Petal.Length Petal.Width
#1     setosa        1.462       0.246
#2 versicolor        4.260       1.326
#3  virginica        5.552       2.026
The . inside aggregate specifies that you want to use all remaining columns of the data.frame except the grouping variable (Species in this case). And because you pass iris[-(1:2)] as the input data, the first and second columns are not used either.
For your data, it should then be something like:
aggregate(. ~ country, df1[-c(38:39)], mean)
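If the column positions (38 and 39) might shift, the same exclusion can be written by name. The following is a sketch on iris, under the assumption that the two Sepal columns play the role of age and gender; drop_cols, keep and res are illustrative names.

```r
# iris stands in for df1; the two Sepal columns play the role of age and gender.
drop_cols <- c("Sepal.Length", "Sepal.Width")
keep <- setdiff(names(iris), drop_cols)

# Same aggregate() call as in the iris example, but columns are excluded by name.
res <- aggregate(. ~ Species, iris[keep], mean)
res
```

One thing to keep in mind: the formula interface of aggregate uses na.action = na.omit by default, so a row with an NA in any of the included columns is dropped before the means are computed.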
#3
6
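library(dplyr)
df1 %>%
  select(-age, -gender) %>%
  group_by(country) %>%
  # across() needs dplyr >= 1.0.0; it replaces summarise_each(funs(mean)), which is deprecated
  summarise(across(everything(), mean))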
#4
3
If you insist on keeping everything in a list:
# split into a list of data frames, one per country
myList <- split(df1, df1$country)
# aggregate without age and gender (columns 38 and 39)
my_aggregate <- function(df_inlist) {
  aggregate(. ~ country, df_inlist[, -c(38, 39)], mean)
}
# apply the aggregate function to every data frame in the list
out <- lapply(myList, my_aggregate)
out is a list of data.frames, one per country, each holding the column means over the variables. To put it all together in a single data.frame:
composite_df <- do.call(rbind, out)
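As for the error in the question: mapply expects the function as its first argument (mapply(FUN, ...)), so passing the data frame first makes match.fun complain. With a single list, lapply or sapply is enough. Here is a minimal sketch of the split-then-colMeans idea, assuming iris in place of df1 and Species in place of country; groups and means are illustrative names.

```r
# Split the numeric columns of iris by Species (standing in for df1 and country).
groups <- split(iris[, -5], iris$Species)

# colMeans() is applied to each data frame in the list;
# sapply() simplifies the result to a matrix with one column per group.
means <- sapply(groups, colMeans)

# Transpose so that each row is a group and each column a variable.
t(means)
```

Unlike the loop in the question, which overwrites aggregate on every iteration, this keeps the result for every group.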