ddply + summarize: repeating the same statistical function across a large number of columns

Time: 2022-03-20 09:17:26

Ok, second R question in quick succession.

My data:

           Timestamp    St_01  St_02 ...
1 2008-02-08 00:00:00  26.020 25.840 ...
2 2008-02-08 00:10:00  25.985 25.790 ...
3 2008-02-08 00:20:00  25.930 25.765 ...
4 2008-02-08 00:30:00  25.925 25.730 ...
5 2008-02-08 00:40:00  25.975 25.695 ...
...

Normally I would use a combination of ddply and summarize to calculate ensemble statistics (e.g. the mean for every hour across the whole year).

In the case above, I would create a category, e.g. hour (e.g. data$hour <- format(as.POSIXct(data$Timestamp), "%H")), and then use that category in ddply, like ddply(data, "hour", summarize, St_01=mean(St_01), St_02=mean(St_02), ...) to average by category across each of the columns.

But here is where it gets sticky. I have more than 40 columns to deal with, and I'm not prepared to type them all one by one as parameters to the summarize function. I used to write a loop in shell to generate this code, but that's not how programmers solve problems, is it?

So pray tell, does anyone have a better way of achieving the same result with fewer keystrokes?

2 solutions

#1


36  

You can use numcolwise() to run a summary over all numeric columns.

Here is an example using iris:

library(plyr)
ddply(iris, .(Species), numcolwise(mean))
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

Similarly, there is catcolwise() to summarise over all categorical columns.

See ?numcolwise for more help and examples.
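
Applied to data shaped like the question's, this might look like the sketch below. The data frame is made up for illustration (four readings, two per hour, with invented values), and the hour column is derived with format(), which is one way among several to extract the hour. Since Timestamp and hour are not numeric, numcolwise() should leave them out of the averaging.

```r
library(plyr)

# Made-up sample in the shape of the question's data (values are illustrative).
data <- data.frame(
  Timestamp = as.POSIXct(c("2008-02-08 00:00:00", "2008-02-08 00:30:00",
                           "2008-02-08 01:00:00", "2008-02-08 01:30:00"),
                         tz = "UTC"),
  St_01 = c(26.0, 25.0, 24.0, 23.0),
  St_02 = c(25.5, 24.5, 23.5, 22.5)
)

# Derive the grouping category: hour of day as a two-digit string.
data$hour <- format(data$Timestamp, "%H")

# Average every numeric column within each hour, however many
# St_* columns there are; no column names need to be typed out.
hourly <- ddply(data, "hour", numcolwise(mean))
hourly
```

With 40 or more station columns the call stays the same, which is the point of the colwise helpers.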


EDIT

An alternative approach is to use reshape2 (proposed by @gsk3). This takes more keystrokes in this example, but gives you enormous flexibility:

library(reshape2)
miris <- melt(iris, id.vars="Species")
x <- ddply(miris, .(Species, variable), summarize, mean=mean(value))

dcast(x, Species~variable, value.var="mean")
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
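
As one illustration of that flexibility, the long format makes it easy to compute several statistics at once and then cast whichever one is needed. This is only a sketch; the choice of sd as the second statistic and the names stats, wide_mean, and wide_sd are assumptions made for the example.

```r
library(plyr)
library(reshape2)

# Long format: one row per (Species, variable, value) triple.
miris <- melt(iris, id.vars = "Species")

# Several statistics per Species/variable pair in a single ddply call.
stats <- ddply(miris, .(Species, variable), summarize,
               mean = mean(value), sd = sd(value))

# Cast each statistic back to wide format separately.
wide_mean <- dcast(stats, Species ~ variable, value.var = "mean")
wide_sd   <- dcast(stats, Species ~ variable, value.var = "sd")
wide_mean
```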

#2


7  

You can even simplify the second approach proposed by Andrie by omitting the ddply call completely. Just specify mean as the aggregation function in the dcast call:

library(reshape2)
miris <- melt(iris, id.vars="Species")
dcast(miris, Species ~ variable, mean)

     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

The same result can also be calculated very quickly using the data.table package. In the j expression, .SD is a special data.table variable that holds the subset of data for each group, excluding the columns used in by.

library(data.table)
dt_iris <- as.data.table(iris)
dt_iris[, lapply(.SD, mean), by = Species]

     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1:     setosa        5.006       3.428        1.462       0.246
2: versicolor        5.936       2.770        4.260       1.326
3:  virginica        6.588       2.974        5.552       2.026
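
For data like the question's, .SD would by default also include the Timestamp column, so it may help to restrict it with .SDcols. The following is a sketch on a made-up table (invented values); the name st_cols and the "^St_" pattern are assumptions based on the column names shown in the question.

```r
library(data.table)

# Made-up sample in the shape of the question's data (values are illustrative).
dt <- data.table(
  Timestamp = as.POSIXct(c("2008-02-08 00:00:00", "2008-02-08 00:30:00",
                           "2008-02-08 01:00:00", "2008-02-08 01:30:00"),
                         tz = "UTC"),
  St_01 = c(26.0, 25.0, 24.0, 23.0),
  St_02 = c(25.5, 24.5, 23.5, 22.5)
)

# Add the grouping category by reference: hour of day as a string.
dt[, hour := format(Timestamp, "%H")]

# Pick the station columns by name pattern instead of typing them out.
st_cols <- grep("^St_", names(dt), value = TRUE)

# .SDcols limits .SD to the station columns, so Timestamp is not averaged.
hourly <- dt[, lapply(.SD, mean), by = hour, .SDcols = st_cols]
hourly
```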

Yet another option is version 0.2 of Hadley's dplyr package (new at the time of writing):

library(dplyr)
group_by(iris, Species) %>% summarise_each(funs(mean))

Source: local data frame [3 x 5]

     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
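
Note that summarise_each() and funs() were deprecated in later dplyr releases. Assuming dplyr 1.0.0 or newer is installed, a roughly equivalent sketch uses across(); the name by_species is chosen here only for illustration.

```r
library(dplyr)

# Group by Species, then apply mean() to every numeric column.
# where(is.numeric) would also skip non-numeric columns such as a timestamp.
by_species <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), mean))

by_species
```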
