I want to compute the mean over the third dimension of a multidimensional array. Since this dimension represents time, I want to compute monthly means. I tried to use apply for that, but I am not sure where the problem is. Let's say my data looks like this:
#Creating a sample
m <-array(1:12, dim=c(20,4,36))
#number of months
months <- seq(1:12)
#Compute the mean over each month (dimension of the result should be [20,4,12])
monmean <- apply(m,1:2,function(x) for(i in 1:12) mean(x[,,months==i],na.rm=TRUE))
Any ideas? Thanks in advance.
1 Solution
#1
I think I understand what you're after. This is actually slightly more complex than it may seem, because months are not regular periods of time; they vary in number of days, and February varies between years due to leap years. Thus a simple regular logical or numeric index vector will not be sufficient to calculate this result precisely. You need to take into account the exact dates that are covered by the z-dimension of your array.
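To make the irregularity concrete, here is a minimal sketch that counts the days falling in each calendar month. It assumes daily data over an arbitrary ten-year range; the names days_per_month and feb_by_year are made up for this illustration.

```r
# Build one Date per day over an assumed example range.
dates <- seq(as.Date('2000-01-01'), as.Date('2009-12-31'), by = 'day')

# Number of days contributed by each calendar month across all years.
days_per_month <- table(format(dates, '%m'))
print(days_per_month)

# February alone changes length between years because of leap years.
feb <- dates[format(dates, '%m') == '02']
feb_by_year <- table(format(feb, '%Y'))
print(feb_by_year)
```

The group sizes differ from month to month, and February differs from year to year, which is why the grouping has to come from the dates themselves rather than from a fixed-stride index.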
Solution 1
What you can do is separately compute a date vector that identifies the date corresponding to each z-index of your array. Within the apply() call for each z-line, you can then call strftime() to extract the month of each date, and group by that month value using tapply() to take monthly mean()s. Here's how it could be done:
set.seed(1);
R <- 48;
C <- 39;
Z <- 3653;
N <- R*C*Z;
a1 <- array(rnorm(N,10,2),c(R,C,Z));
dates <- seq(as.Date('2000-01-01'),as.Date('2009-12-31'),1);
a2 <- aperm(apply(a1,1:2,function(x) tapply(x,strftime(dates,'%m'),mean)),c(2,3,1));
Here's a demo showing a few specific proofs of correctness:
for (r in sample(1:nrow(a2),2)) for (c in sample(1:ncol(a2),2)) for (m in sample(1:dim(a2)[3],2)) cat(sprintf('[%02d,%02d,%3s] %f %f\n',r,c,month.abb[m],mean(a1[r,c,strftime(dates,'%m')==sprintf('%02d',m)]),a2[r,c,m]));
## [14,05,Aug] 10.030313 10.030313
## [14,05,Apr] 10.200982 10.200982
## [14,25,Jan] 9.957879 9.957879
## [14,25,Apr] 10.185447 10.185447
## [26,34,Oct] 10.056931 10.056931
## [26,34,Nov] 9.876327 9.876327
## [26,17,Apr] 10.005423 10.005423
## [26,17,Sep] 10.009785 10.009785
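The same pattern can be applied to the array from the question. This is only a sketch under an assumption the question does not state: that the 36 time steps are already monthly values covering three consecutive years starting in January. The name month_of_step is hypothetical.

```r
# Sample array from the question: 20 x 4 grid with 36 time steps.
m <- array(1:12, dim = c(20, 4, 36))

# Assumed month label for each time step: 1..12 repeated for three years.
month_of_step <- rep(1:12, length.out = dim(m)[3])

# Group each z-line by month, then move the month dimension to the end.
monmean <- aperm(
  apply(m, 1:2, function(x) tapply(x, month_of_step, mean)),
  c(2, 3, 1)
)
dim(monmean)
```

If the steps are daily, or the series does not start in January, month_of_step would have to be derived from the real dates instead.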
Notes
- I arbitrarily chose a date range of 2000-01-01 to 2009-12-31 because it covers a 10-year period during which (due to leap years) there were exactly 3653 days, but obviously you should use whatever dates are actually covered by your real data.
- As you can see, you were on the right track by calling apply() with 1:2 as the margins, because that allows you to operate independently on each z-line, such that you can group that z-line by month and compute the mean for each month along that z-line.
- Unfortunately, apply() has an annoying habit of returning the result in a different dimension order than people generally expect: the values returned by the function come first, followed by the margins, so here the raw result has dimensions [12,R,C]. For two-dimensional usages, this is normally solved with a simple call to t(), but since we're working in three dimensions here, we need to call aperm() to fix the dimension order.
- tapply() converts the grouping vector to a factor, whose levels are sorted, so the '%m' labels '01' to '12' come out in calendar order. IOW, z-indexes 1:12 in a2 correspond to months Jan-Dec, and this holds even if your dates do not begin with January. The correspondence only breaks if some calendar month never occurs in your dates, in which case the result has fewer than 12 slices. My "proof of correctness" code assumes that indexes 1:12 correspond to months Jan-Dec, so it would not be correct in that case; checking the dimnames of the result is the safe way to confirm which slice is which month.
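The dimension order and the month labels can be inspected on a small case. This is a minimal sketch with hypothetical toy data (a 2 x 3 grid and one year of daily values starting in March); the names a, month_key, raw and res are made up for the illustration.

```r
set.seed(1)
# Hypothetical toy data: 2 x 3 grid, one year of daily values starting in March.
dates <- seq(as.Date('2001-03-01'), as.Date('2002-02-28'), by = 'day')
a <- array(rnorm(2 * 3 * length(dates)), c(2, 3, length(dates)))
month_key <- format(dates, '%m')

# apply() puts the per-line result (12 monthly means) in the first dimension.
raw <- apply(a, 1:2, function(x) tapply(x, month_key, mean))
dim(raw)

# aperm() moves that dimension to the end, giving rows x columns x months.
res <- aperm(raw, c(2, 3, 1))
dim(res)

# The month labels produced by tapply() are carried along as dimnames.
dimnames(res)[[3]]

# Indexing by label avoids relying on the position of a month.
res[1, 2, '03']
mean(a[1, 2, month_key == '03'])
```

Indexing the third dimension by its label, as in the last two lines, sidesteps any doubt about which position belongs to which month.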
Solution 2
While writing this answer I actually thought of a slightly different, and one could argue slightly better, solution. You can call tapply() just once and group by rows, then columns, and finally months. Unfortunately, tapply() doesn't seem to be designed to naturally cycle its group vectors to cover the input vector, so we have to cycle them ourselves using carefully crafted calls to rep() (using the each and times arguments carefully; I suppose tapply() wouldn't even know how to do this properly for our input data), but other than that, it's fairly straightforward:
a3 <- tapply(a1,list(rep(1:R,C*Z),rep(1:C,each=R,times=Z),rep(strftime(dates,'%m'),each=R*C)),mean);
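The rep() calls work because R stores arrays in column-major order: the row index varies fastest, then the column, then the z-index. A minimal sketch on a tiny array (the sizes and the names row_idx, col_idx and z_idx are made up for the illustration) checks the three index vectors against base R's slice.index():

```r
# Tiny hypothetical dimensions so the vectors are easy to print.
R <- 2; C <- 3; Z <- 4
a <- array(seq_len(R * C * Z), c(R, C, Z))

# Index of each element of as.vector(a) along each dimension.
row_idx <- rep(1:R, times = C * Z)        # 1 2 1 2 ...
col_idx <- rep(1:C, each = R, times = Z)  # 1 1 2 2 3 3 1 1 ...
z_idx   <- rep(1:Z, each = R * C)         # 1 x6, 2 x6, ...

# slice.index() computes the same thing directly from the array.
identical(row_idx, as.vector(slice.index(a, 1)))
identical(col_idx, as.vector(slice.index(a, 2)))
identical(z_idx,   as.vector(slice.index(a, 3)))
```

In the full solution the z-index vector is replaced by the month label of each date, repeated R*C times per date, so that every element of the array is tagged with its row, column and month.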
Here's a proof that the result is identical to my first method (the dimnames() have to be fixed first to get the identical() call to work, but that's trivial):
dimnames(a3) <- dimnames(a2);
identical(a3,a2);
## [1] TRUE
Performance
Here's some basic performance testing using system.time(). On this data the second solution takes roughly 2.1 seconds elapsed, against roughly 3.7 seconds for the first:
first <- function() a2 <- aperm(apply(a1,1:2,function(x) tapply(x,strftime(dates,'%m'),mean)),c(2,3,1));
second <- function() a3 <- tapply(a1,list(rep(1:R,C*Z),rep(1:C,each=R,times=Z),rep(strftime(dates,'%m'),each=R*C)),mean);
system.time({ first() });
## user system elapsed
## 3.672 0.015 3.719
system.time({ first() });
## user system elapsed
## 3.672 0.016 3.720
system.time({ second() });
## user system elapsed
## 1.797 0.344 2.135
system.time({ second() });
## user system elapsed
## 1.719 0.391 2.124