Aggregate data by date and apply different functions to the respective columns?

I have the following data frame "DF" which is part of a much larger one:

             X1  X2            X3 X4 X5
4468 2010-03-24   3  1.000000e+00  1  2
7662 2010-03-24   9  3.000000e+00  2  1
1272 2010-03-25   8  2.000000e+00  1  1
1273 2010-03-26   9  0.000000e+00  1  1
1274 2010-03-27   8  0.000000e+00  1  1
4469 2010-03-28   4  0.000000e+00  1  2
7663 2010-03-28   4  3.000000e+00  3  1
8734 2010-03-28   7  4.000000e+00  2  3
1275 2010-03-29   8  0.000000e+00  1  1

As you can see, the first column contains a date. I want to transform this data frame into a new one, "DF2", with only one row per date and the following column values:

X2, the average 
X3, the sum
X4, the maximum

of all the values for that date. X5 is not relevant and can be removed. This would be the result:

             X1  X2            X3 X4
7662 2010-03-24   6  4.000000e+00  2  
1272 2010-03-25   8  2.000000e+00  1  
1273 2010-03-26   9  0.000000e+00  1  
1274 2010-03-27   8  0.000000e+00  1  
8734 2010-03-28   5  7.000000e+00  3  
1275 2010-03-29   8  0.000000e+00  1

Does anyone know how to accomplish this? Help would be much appreciated!

3 Solutions

#1

You can use the ddply function from the plyr package to do arbitrary aggregations or other transforms by some grouping variable.

For your question the code would look something like:

library(plyr)
# One output row per date (X1); ddply adds the grouping column itself
result <- ddply(DF, .(X1), function(df) {
  with(df, data.frame( X2=mean(X2), X3=sum(X3), X4=max(X4) ) )
} )

If this is a medium-to-large project, you may want to set the .progress argument of ddply to show a progress bar. For a really large problem, the .parallel argument can be set to use parallel processing.
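
As a minimal sketch of the progress-bar option mentioned above, the call below rebuilds the sample data from the question and passes .progress = "text" to ddply. The summary columns follow what the question asks for (mean of X2, sum of X3, max of X4); the parallel variant is only described in a comment, since it assumes a foreach backend has been registered separately.

```r
library(plyr)

# Sample data from the question; the first (unnamed) column becomes row names
DF <- read.table(text = "             X1  X2            X3 X4 X5
4468 2010-03-24   3  1.000000e+00  1  2
7662 2010-03-24   9  3.000000e+00  2  1
1272 2010-03-25   8  2.000000e+00  1  1
1273 2010-03-26   9  0.000000e+00  1  1
1274 2010-03-27   8  0.000000e+00  1  1
4469 2010-03-28   4  0.000000e+00  1  2
7663 2010-03-28   4  3.000000e+00  3  1
8734 2010-03-28   7  4.000000e+00  2  3
1275 2010-03-29   8  0.000000e+00  1  1", header = TRUE)

# Summarise each date group; .progress = "text" draws a console progress bar.
# .parallel = TRUE is the other option, but it assumes a foreach backend
# (for example one registered via doParallel) is already set up.
result <- ddply(DF, .(X1), function(df) {
  with(df, data.frame(X2 = mean(X2), X3 = sum(X3), X4 = max(X4)))
}, .progress = "text")
```

On nine rows the bar finishes immediately; it only becomes useful when there are many groups.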

#2

DF <- read.table(text="             X1  X2            X3 X4 X5
4468 2010-03-24   3  1.000000e+00  1  2
7662 2010-03-24   9  3.000000e+00  2  1
1272 2010-03-25   8  2.000000e+00  1  1
1273 2010-03-26   9  0.000000e+00  1  1
1274 2010-03-27   8  0.000000e+00  1  1
4469 2010-03-28   4  0.000000e+00  1  2
7663 2010-03-28   4  3.000000e+00  3  1
8734 2010-03-28   7  4.000000e+00  2  3
1275 2010-03-29   8  0.000000e+00  1  1",header=TRUE)

library(data.table)

DT <- as.data.table(DF)

DT[,list(X2=mean(X2),X3=sum(X3),X4=max(X4)),by=X1]

#            X1 X2 X3 X4
# 1: 2010-03-24  6  4  2
# 2: 2010-03-25  8  2  1
# 3: 2010-03-26  9  0  1
# 4: 2010-03-27  8  0  1
# 5: 2010-03-28  5  7  3
# 6: 2010-03-29  8  0  1

#3

There are many ways to do this but here is an sqldf solution:

library(sqldf)
sqldf("select X1, avg(X2), sum(X3), max(X4) from DF group by X1")

The result is:

          X1 avg(X2) sum(X3) max(X4)
1 2010-03-24       6       4       2
2 2010-03-25       8       2       1
3 2010-03-26       9       0       1
4 2010-03-27       8       0       1
5 2010-03-28       5       7       3
6 2010-03-29       8       0       1
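
The column names in that output are the raw SQL expressions. If the result should carry the X2, X3 and X4 names the question asks for, one possible variation is to add SQL aliases. This is a sketch on the sample data from the question, with DF2 used as the result name only because the question calls it that.

```r
library(sqldf)

# Sample data from the question; the first (unnamed) column becomes row names
DF <- read.table(text = "             X1  X2            X3 X4 X5
4468 2010-03-24   3  1.000000e+00  1  2
7662 2010-03-24   9  3.000000e+00  2  1
1272 2010-03-25   8  2.000000e+00  1  1
1273 2010-03-26   9  0.000000e+00  1  1
1274 2010-03-27   8  0.000000e+00  1  1
4469 2010-03-28   4  0.000000e+00  1  2
7663 2010-03-28   4  3.000000e+00  3  1
8734 2010-03-28   7  4.000000e+00  2  3
1275 2010-03-29   8  0.000000e+00  1  1", header = TRUE)

# "as" renames each aggregate so the result keeps the original column names;
# "order by" makes the row order explicit instead of relying on group by
DF2 <- sqldf("select X1,
                     avg(X2) as X2,
                     sum(X3) as X3,
                     max(X4) as X4
              from DF
              group by X1
              order by X1")
```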
