快速/优雅的方法来构建均值/方差汇总表

时间:2021-11-25 20:21:51

I can achieve this task, but I feel like there must be a "best" (slickest, most compact, clearest-code, fastest?) way of doing it and have not figured it out so far ...

我可以完成这项任务,但我觉得必须有一个“最好的”(最简洁,最紧凑,最清晰,最快?)的方式,到目前为止还没有弄明白......

For a specified set of categorical factors I want to construct a table of means and variances by group.

对于一组指定的分类因素,我想按组构建均值和方差表。

generate data:

生成数据:

set.seed(1001)
d <- expand.grid(f1=LETTERS[1:3],f2=letters[1:3],
                 f3=factor(as.character(as.roman(1:3))),rep=1:4)
d$y <- runif(nrow(d))
d$z <- rnorm(nrow(d))

desired output:

期望的输出:

  f1 f2  f3    y.mean      y.var
1  A  a   I 0.6502307 0.09537958
2  A  a  II 0.4876630 0.11079670
3  A  a III 0.3102926 0.20280568
4  A  b   I 0.3914084 0.05869310
5  A  b  II 0.5257355 0.21863126
6  A  b III 0.3356860 0.07943314
... etc. ...

using aggregate/merge:

使用聚合/合并:

library(reshape)
m1 <- aggregate(y~f1*f2*f3,data=d,FUN=mean)
m2 <- aggregate(y~f1*f2*f3,data=d,FUN=var)
mvtab <- merge(rename(m1,c(y="y.mean")),
      rename(m2,c(y="y.var")))

using ddply/summarise (possibly best but haven't been able to make it work):

使用ddply / summarize(可能最好但无法使其工作):

mvtab2 <- ddply(subset(d,select=-c(z,rep)),
                .(f1,f2,f3),
                summarise,numcolwise(mean),numcolwise(var))

results in

结果是

Error in output[[var]][rng] <- df[[var]] : 
  incompatible types (from closure to logical) in subassignment type fix

using melt/cast (maybe best?)

使用熔化/铸造(也许最好?)

mvtab3 <- cast(melt(subset(d,select=-c(z,rep)),
          id.vars=1:3),
     ...~.,fun.aggregate=c(mean,var))
## now have to drop "variable"
mvtab3 <- subset(mvtab3,select=-variable)
## also should rename response variables

Won't (?) work in reshape2. Explaining ...~. to someone could be tricky!

不会(?)在reshape2中工作。解释......〜。对某人来说可能很棘手!

8 个解决方案

#1


11  

I'm a bit puzzled. Does this not work:

我有点不解。这不起作用:

mvtab2 <- ddply(d,.(f1,f2,f3),
            summarise,y.mean = mean(y),y.var = var(y))

This give me something like this:

这给我这样的东西:

   f1 f2  f3    y.mean       y.var
1   A  a   I 0.6502307 0.095379578
2   A  a  II 0.4876630 0.110796695
3   A  a III 0.3102926 0.202805677
4   A  b   I 0.3914084 0.058693103
5   A  b  II 0.5257355 0.218631264

Which is in the right form, but it looks like the values are different that what you specified.

哪个是正确的形式,但看起来值与您指定的值不同。

Edit

编辑

Here's how to make your version with numcolwise work:

以下是如何使用numcolwise工作制作您的版本:

mvtab2 <- ddply(subset(d,select=-c(z,rep)),.(f1,f2,f3),summarise,
                y.mean = numcolwise(mean)(piece),
                y.var = numcolwise(var)(piece)) 

You forgot to pass the actual data to numcolwise. And then there's the little ddply trick that each piece is called piece internally. (Which Hadley points out in the comments shouldn't be relied upon as it may change in future versions of plyr.)

您忘记将实际数据传递给numcolwise。然后有一个小的ddply技巧,每个部分在内部称为片。 (不应该依赖Hadley在评论中指出的内容,因为它可能会在未来版本的plyr中发生变化。)

#2


17  

Here is a solution using data.table

这是一个使用data.table的解决方案

library(data.table)
d2 = data.table(d)
ans = d2[,list(avg_y = mean(y), var_y = var(y)), 'f1, f2, f3']

#3


12  

(I voted for Joshua's.) Here's an Hmisc::summary.formula solution. The advantage of this for me is that it is well integrated with the Hmisc::latex output "channel".

(我投票支持Joshua的。)这是一个Hmisc :: summary.formula解决方案。这对我来说的优点是它与Hmisc :: latex输出“通道”很好地集成在一起。

summary(y ~ interaction(f3,f2,f1), data=d, method="response", 
                    fun=function(y) c(mean.y=mean(y) ,var.y=var(y) ))
#-----output----------
y    N=108

+-----------------------+-------+---+---------+-----------+
|                       |       |N  |mean.y   |var.y      |
+-----------------------+-------+---+---------+-----------+
|interaction(f3, f2, f1)|I.a.A  |  4|0.6502307|0.095379578|
|                       |II.a.A |  4|0.4876630|0.110796695|

snipped output to show the latex -> PDF -> png output:

剪切输出以显示乳胶 - > PDF - > png输出:

快速/优雅的方法来构建均值/方差汇总表

#4


11  

@joran is spot-on with the ddply answer. Here's how I would do it with aggregate. Note that I avoid the formula interface (it is slower).

@joran与ddply答案相同。这是我如何用聚合做的。请注意,我避免使用公式界面(它更慢)。

aggregate(d$y, d[,c("f1","f2","f3")], FUN=function(x) c(mean=mean(x),var=var(x)))

#5


7  

I'm slightly addicted to speed comparisons even though they're largely irrelevant for me in this situation ...

我有点沉迷于速度比较,即使在这种情况下它们对我来说基本上无关紧要......

joran_ddply <- function(d) ddply(d,.(f1,f2,f3),
                                 summarise,y.mean = mean(y),y.var = var(y))
joshulrich_aggregate <- function(d) {
  aggregate(d$y, d[,c("f1","f2","f3")],
            FUN=function(x) c(mean=mean(x),var=var(x)))
}

formula_aggregate <- function(d) {
  aggregate(y~f1*f2*f3,data=d,
            FUN=function(x) c(mean=mean(x),var=var(x)))
}
library(data.table)
d2 <- data.table(d)
ramnath_datatable <- function(d) {
  d[,list(avg_y = mean(y), var_y = var(y)), 'f1, f2, f3']
}


library(Hmisc)
dwin_hmisc <- function(d) {summary(y ~ interaction(f3,f2,f1), 
                   data=d, method="response", 
                   fun=function(y) c(mean.y=mean(y) ,var.y=var(y) ))
                         }


library(rbenchmark)
benchmark(joran_ddply(d),
          joshulrich_aggregate(d),
          ramnath_datatable(d2),
          formula_aggregate(d),
          dwin_hmisc(d))

aggregate is fastest (even faster than data.table, which is a surprise to me, although things might be different with a bigger table to aggregate), even using the formula interface ...)

聚合是最快的(甚至比data.table更快,这对我来说是一个惊喜,虽然可能会有更大的表来聚合),甚至使用公式接口...)

                     test replications elapsed relative user.self sys.self
5           dwin_hmisc(d)          100   1.235 2.125645     1.168    0.044
4    formula_aggregate(d)          100   0.703 1.209983     0.656    0.036
1          joran_ddply(d)          100   3.345 5.757315     3.152    0.144
2 joshulrich_aggregate(d)          100   0.581 1.000000     0.596    0.000
3   ramnath_datatable(d2)          100   0.750 1.290878     0.708    0.000

(Now I just need Dirk to step up and post an Rcpp solution that is 1000 times faster than anything else ...)

(现在我只需要Dirk加强并发布一个比其他任何东西快1000倍的Rcpp解决方案......)

#6


4  

I've came accross with this question and found the benchmarks are done with small tables, so it's hard to tell which method is better with 100 rows.

我已经遇到了这个问题,并发现基准测试是用小表完成的,所以很难说100行哪种方法更好。

I've also modified the data a bit also to make it "unsorted", this would be a more common case, for example as the data is in a DB. I've added a few more data.table trials to see if setting a key is faster beforehand. It seems here, setting the key beforehand doesn't improve much the performance, so ramnath solution seems to be the fastest.

我还修改了一些数据以使其“未排序”,这将是一种更常见的情况,例如数据在数据库中。我已经添加了一些data.table试验,看看事先设置一个键是否更快。看来在这里,预先设置密钥并没有提高性能,所以ramnath解决方案似乎是最快的。

set.seed(1001)
d <- data.frame(f1 = sample(LETTERS[1:3], 30e5, replace = T), f2 = sample(letters[1:3], 30e5, replace = T),
                f3 = sample(factor(as.character(as.roman(1:3))), 30e5, replace = T), rep = sample(1:4, replace = T))

d$y <- runif(nrow(d))
d$z <- rnorm(nrow(d))

str(d)

require(Hmisc)
require(plyr)
require(data.table)
d2 = data.table(d)
d3 = data.table(d)

# Set key of d3 to compare how fast it is if the DT is already keyded
setkey(d3,f1,f2,f3)

joran_ddply <- function(d) ddply(d,.(f1,f2,f3),
                                 summarise,y.mean = mean(y),y.var = var(y))

formula_aggregate <- function(d) {
  aggregate(y~f1*f2*f3,data=d,
            FUN=function(x) c(mean=mean(x),var=var(x)))
}

ramnath_datatable <- function(d) {
  d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}

key_agg_datatable <- function(d) {
  setkey(d2,f1,f2,f3)
  d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}

one_key_datatable <- function(d) {
  setkey(d2,f1)
  d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}

including_3key_datatable <- function(d) {
  d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}

dwin_hmisc <- function(d) {summary(y ~ interaction(f3,f2,f1), 
                                   data=d, method="response", 
                                   fun=function(y) c(mean.y=mean(y) ,var.y=var(y) ))
}

require(rbenchmark)
benchmark(joran_ddply(d),
          joshulrich_aggregate(d),
          ramnath_datatable(d2),
          including_3key_datatable(d3),
          one_key_datatable(d2),
          key_agg_datatable(d2),
          formula_aggregate(d),
          dwin_hmisc(d)
          )

#                         test replications elapsed relative user.self sys.self
#                dwin_hmisc(d)          100 1757.28  252.121   1590.89   165.65
#         formula_aggregate(d)          100  433.56   62.204    390.83    42.50
# including_3key_datatable(d3)          100    7.00    1.004      6.02     0.98
#               joran_ddply(d)          100  173.39   24.877    119.35    53.95
#      joshulrich_aggregate(d)          100  328.51   47.132    307.14    21.22
#        key_agg_datatable(d2)          100   24.62    3.532     19.13     5.50
#        one_key_datatable(d2)          100   29.66    4.255     22.28     7.34
#        ramnath_datatable(d2)          100    6.97    1.000      5.96     1.01

#7


3  

I find the doBy package has some very convenient functions for things like this. For example, the function ?summaryBy is quite handy. Consider:

我发现doBy包有一些非常方便的功能用于这样的事情。例如,函数?summaryBy非常方便。考虑:

> summaryBy(y~f1+f2+f3, data=d, FUN=c(mean, var))
   f1 f2  f3    y.mean       y.var
1   A  a   I 0.6502307 0.095379578
2   A  a  II 0.4876630 0.110796695
3   A  a III 0.3102926 0.202805677
4   A  b   I 0.3914084 0.058693103
5   A  b  II 0.5257355 0.218631264
6   A  b III 0.3356860 0.079433136
7   A  c   I 0.3367841 0.079487973
8   A  c  II 0.6273320 0.041373836
9   A  c III 0.4532720 0.022779672
10  B  a   I 0.6688221 0.044184575
11  B  a  II 0.5514724 0.020359289
12  B  a III 0.6389354 0.104056229
13  B  b   I 0.5052346 0.138379070
14  B  b  II 0.3933283 0.050261804
15  B  b III 0.5953874 0.161943989
16  B  c   I 0.3490460 0.079286849
17  B  c  II 0.5534569 0.207381592
18  B  c III 0.4652424 0.187463143
19  C  a   I 0.3340988 0.004994589
20  C  a  II 0.3970315 0.126967554
21  C  a III 0.3580250 0.066769484
22  C  b   I 0.7676858 0.124945402
23  C  b  II 0.3613772 0.182689385
24  C  b III 0.4175562 0.095933470
25  C  c   I 0.3592491 0.039832864
26  C  c  II 0.7882591 0.084271963
27  C  c III 0.3936949 0.085758343

So the function call is simple, easy to use, and I would say, elegant.

因此函数调用简单,易于使用,我会说,优雅。

Now, if your primary concern is speed, it seems that it would be reasonable--at least with smaller sized tasks (note that I couldn't get the ramnath_datatable function to work for whatever reason):

现在,如果你主要关注的是速度,那么它似乎是合理的 - 至少对于较小的任务(注意我无法让ramnath_datatable函数因任何原因工作):

                     test replications elapsed relative user.self 
4           dwin_hmisc(d)          100    0.50    2.778      0.50 
3    formula_aggregate(d)          100    0.23    1.278      0.24 
5       gung_summaryBy(d)          100    0.34    1.889      0.35 
1          joran_ddply(d)          100    1.34    7.444      1.32 
2 joshulrich_aggregate(d)          100    0.18    1.000      0.19 

#8


2  

And here is a solution using Hadley Wickham's new dplyr library.

这是使用Hadley Wickham的新dplyr库的解决方案。

library(dplyr)
d %>% group_by(f1, f2, f3) %>% 
summarise(y.mean = mean(y), z.mean = mean(z))

#1


11  

I'm a bit puzzled. Does this not work:

我有点不解。这不起作用:

mvtab2 <- ddply(d,.(f1,f2,f3),
            summarise,y.mean = mean(y),y.var = var(y))

This give me something like this:

这给我这样的东西:

   f1 f2  f3    y.mean       y.var
1   A  a   I 0.6502307 0.095379578
2   A  a  II 0.4876630 0.110796695
3   A  a III 0.3102926 0.202805677
4   A  b   I 0.3914084 0.058693103
5   A  b  II 0.5257355 0.218631264

Which is in the right form, but it looks like the values are different that what you specified.

哪个是正确的形式,但看起来值与您指定的值不同。

Edit

编辑

Here's how to make your version with numcolwise work:

以下是如何使用numcolwise工作制作您的版本:

mvtab2 <- ddply(subset(d,select=-c(z,rep)),.(f1,f2,f3),summarise,
                y.mean = numcolwise(mean)(piece),
                y.var = numcolwise(var)(piece)) 

You forgot to pass the actual data to numcolwise. And then there's the little ddply trick that each piece is called piece internally. (Which Hadley points out in the comments shouldn't be relied upon as it may change in future versions of plyr.)

您忘记将实际数据传递给numcolwise。然后有一个小的ddply技巧,每个部分在内部称为片。 (不应该依赖Hadley在评论中指出的内容,因为它可能会在未来版本的plyr中发生变化。)

#2


17  

Here is a solution using data.table

这是一个使用data.table的解决方案

library(data.table)
d2 = data.table(d)
ans = d2[,list(avg_y = mean(y), var_y = var(y)), 'f1, f2, f3']

#3


12  

(I voted for Joshua's.) Here's an Hmisc::summary.formula solution. The advantage of this for me is that it is well integrated with the Hmisc::latex output "channel".

(我投票支持Joshua的。)这是一个Hmisc :: summary.formula解决方案。这对我来说的优点是它与Hmisc :: latex输出“通道”很好地集成在一起。

summary(y ~ interaction(f3,f2,f1), data=d, method="response", 
                    fun=function(y) c(mean.y=mean(y) ,var.y=var(y) ))
#-----output----------
y    N=108

+-----------------------+-------+---+---------+-----------+
|                       |       |N  |mean.y   |var.y      |
+-----------------------+-------+---+---------+-----------+
|interaction(f3, f2, f1)|I.a.A  |  4|0.6502307|0.095379578|
|                       |II.a.A |  4|0.4876630|0.110796695|

snipped output to show the latex -> PDF -> png output:

剪切输出以显示乳胶 - > PDF - > png输出:

快速/优雅的方法来构建均值/方差汇总表

#4


11  

@joran is spot-on with the ddply answer. Here's how I would do it with aggregate. Note that I avoid the formula interface (it is slower).

@joran与ddply答案相同。这是我如何用聚合做的。请注意,我避免使用公式界面(它更慢)。

aggregate(d$y, d[,c("f1","f2","f3")], FUN=function(x) c(mean=mean(x),var=var(x)))

#5


7  

I'm slightly addicted to speed comparisons even though they're largely irrelevant for me in this situation ...

我有点沉迷于速度比较,即使在这种情况下它们对我来说基本上无关紧要......

joran_ddply <- function(d) ddply(d,.(f1,f2,f3),
                                 summarise,y.mean = mean(y),y.var = var(y))
joshulrich_aggregate <- function(d) {
  aggregate(d$y, d[,c("f1","f2","f3")],
            FUN=function(x) c(mean=mean(x),var=var(x)))
}

formula_aggregate <- function(d) {
  aggregate(y~f1*f2*f3,data=d,
            FUN=function(x) c(mean=mean(x),var=var(x)))
}
library(data.table)
d2 <- data.table(d)
ramnath_datatable <- function(d) {
  d[,list(avg_y = mean(y), var_y = var(y)), 'f1, f2, f3']
}


library(Hmisc)
dwin_hmisc <- function(d) {summary(y ~ interaction(f3,f2,f1), 
                   data=d, method="response", 
                   fun=function(y) c(mean.y=mean(y) ,var.y=var(y) ))
                         }


library(rbenchmark)
benchmark(joran_ddply(d),
          joshulrich_aggregate(d),
          ramnath_datatable(d2),
          formula_aggregate(d),
          dwin_hmisc(d))

aggregate is fastest (even faster than data.table, which is a surprise to me, although things might be different with a bigger table to aggregate), even using the formula interface ...)

聚合是最快的(甚至比data.table更快,这对我来说是一个惊喜,虽然可能会有更大的表来聚合),甚至使用公式接口...)

                     test replications elapsed relative user.self sys.self
5           dwin_hmisc(d)          100   1.235 2.125645     1.168    0.044
4    formula_aggregate(d)          100   0.703 1.209983     0.656    0.036
1          joran_ddply(d)          100   3.345 5.757315     3.152    0.144
2 joshulrich_aggregate(d)          100   0.581 1.000000     0.596    0.000
3   ramnath_datatable(d2)          100   0.750 1.290878     0.708    0.000

(Now I just need Dirk to step up and post an Rcpp solution that is 1000 times faster than anything else ...)

(现在我只需要Dirk加强并发布一个比其他任何东西快1000倍的Rcpp解决方案......)

#6


4  

I've came accross with this question and found the benchmarks are done with small tables, so it's hard to tell which method is better with 100 rows.

我已经遇到了这个问题,并发现基准测试是用小表完成的,所以很难说100行哪种方法更好。

I've also modified the data a bit also to make it "unsorted", this would be a more common case, for example as the data is in a DB. I've added a few more data.table trials to see if setting a key is faster beforehand. It seems here, setting the key beforehand doesn't improve much the performance, so ramnath solution seems to be the fastest.

我还修改了一些数据以使其“未排序”,这将是一种更常见的情况,例如数据在数据库中。我已经添加了一些data.table试验,看看事先设置一个键是否更快。看来在这里,预先设置密钥并没有提高性能,所以ramnath解决方案似乎是最快的。

set.seed(1001)
d <- data.frame(f1 = sample(LETTERS[1:3], 30e5, replace = T), f2 = sample(letters[1:3], 30e5, replace = T),
                f3 = sample(factor(as.character(as.roman(1:3))), 30e5, replace = T), rep = sample(1:4, replace = T))

d$y <- runif(nrow(d))
d$z <- rnorm(nrow(d))

str(d)

require(Hmisc)
require(plyr)
require(data.table)
d2 = data.table(d)
d3 = data.table(d)

# Set key of d3 to compare how fast it is if the DT is already keyded
setkey(d3,f1,f2,f3)

joran_ddply <- function(d) ddply(d,.(f1,f2,f3),
                                 summarise,y.mean = mean(y),y.var = var(y))

formula_aggregate <- function(d) {
  aggregate(y~f1*f2*f3,data=d,
            FUN=function(x) c(mean=mean(x),var=var(x)))
}

ramnath_datatable <- function(d) {
  d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}

key_agg_datatable <- function(d) {
  setkey(d2,f1,f2,f3)
  d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}

one_key_datatable <- function(d) {
  setkey(d2,f1)
  d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}

including_3key_datatable <- function(d) {
  d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}

dwin_hmisc <- function(d) {summary(y ~ interaction(f3,f2,f1), 
                                   data=d, method="response", 
                                   fun=function(y) c(mean.y=mean(y) ,var.y=var(y) ))
}

require(rbenchmark)
benchmark(joran_ddply(d),
          joshulrich_aggregate(d),
          ramnath_datatable(d2),
          including_3key_datatable(d3),
          one_key_datatable(d2),
          key_agg_datatable(d2),
          formula_aggregate(d),
          dwin_hmisc(d)
          )

#                         test replications elapsed relative user.self sys.self
#                dwin_hmisc(d)          100 1757.28  252.121   1590.89   165.65
#         formula_aggregate(d)          100  433.56   62.204    390.83    42.50
# including_3key_datatable(d3)          100    7.00    1.004      6.02     0.98
#               joran_ddply(d)          100  173.39   24.877    119.35    53.95
#      joshulrich_aggregate(d)          100  328.51   47.132    307.14    21.22
#        key_agg_datatable(d2)          100   24.62    3.532     19.13     5.50
#        one_key_datatable(d2)          100   29.66    4.255     22.28     7.34
#        ramnath_datatable(d2)          100    6.97    1.000      5.96     1.01

#7


3  

I find the doBy package has some very convenient functions for things like this. For example, the function ?summaryBy is quite handy. Consider:

我发现doBy包有一些非常方便的功能用于这样的事情。例如,函数?summaryBy非常方便。考虑:

> summaryBy(y~f1+f2+f3, data=d, FUN=c(mean, var))
   f1 f2  f3    y.mean       y.var
1   A  a   I 0.6502307 0.095379578
2   A  a  II 0.4876630 0.110796695
3   A  a III 0.3102926 0.202805677
4   A  b   I 0.3914084 0.058693103
5   A  b  II 0.5257355 0.218631264
6   A  b III 0.3356860 0.079433136
7   A  c   I 0.3367841 0.079487973
8   A  c  II 0.6273320 0.041373836
9   A  c III 0.4532720 0.022779672
10  B  a   I 0.6688221 0.044184575
11  B  a  II 0.5514724 0.020359289
12  B  a III 0.6389354 0.104056229
13  B  b   I 0.5052346 0.138379070
14  B  b  II 0.3933283 0.050261804
15  B  b III 0.5953874 0.161943989
16  B  c   I 0.3490460 0.079286849
17  B  c  II 0.5534569 0.207381592
18  B  c III 0.4652424 0.187463143
19  C  a   I 0.3340988 0.004994589
20  C  a  II 0.3970315 0.126967554
21  C  a III 0.3580250 0.066769484
22  C  b   I 0.7676858 0.124945402
23  C  b  II 0.3613772 0.182689385
24  C  b III 0.4175562 0.095933470
25  C  c   I 0.3592491 0.039832864
26  C  c  II 0.7882591 0.084271963
27  C  c III 0.3936949 0.085758343

So the function call is simple, easy to use, and I would say, elegant.

因此函数调用简单,易于使用,我会说,优雅。

Now, if your primary concern is speed, it seems that it would be reasonable--at least with smaller sized tasks (note that I couldn't get the ramnath_datatable function to work for whatever reason):

现在,如果你主要关注的是速度,那么它似乎是合理的 - 至少对于较小的任务(注意我无法让ramnath_datatable函数因任何原因工作):

                     test replications elapsed relative user.self 
4           dwin_hmisc(d)          100    0.50    2.778      0.50 
3    formula_aggregate(d)          100    0.23    1.278      0.24 
5       gung_summaryBy(d)          100    0.34    1.889      0.35 
1          joran_ddply(d)          100    1.34    7.444      1.32 
2 joshulrich_aggregate(d)          100    0.18    1.000      0.19 

#8


2  

And here is a solution using Hadley Wickham's new dplyr library.

这是使用Hadley Wickham的新dplyr库的解决方案。

library(dplyr)
d %>% group_by(f1, f2, f3) %>% 
summarise(y.mean = mean(y), z.mean = mean(z))