通过基于时间的窗口优化不规则时间序列的滚动函数

Is there some way to use rollapply (from zoo package or something similar) optimized functions (rollmean, rollmedian etc) to compute rolling functions with a time-based window, instead of one based on a number of observations? What I want is simple: for each element in an irregular time series, I want to compute a rolling function with a N-days window. That is, the window should include all the observations up to N days before the current observation. Time series may also contain duplicates.

有没有办法使用rollapply（来自动物园包或类似的东西）优化函数（rollmean，rollmedian等）来计算具有基于时间的窗口的滚动函数，而不是基于大量观察的窗口？我想要的很简单：对于不规则时间序列中的每个元素，我想计算一个带有N天窗口的滚动函数。也就是说，窗口应包括当前观察前N天的所有观察结果。时间序列也可能包含重复项。

Here follows an example. Given the following time series:

以下是一个例子。鉴于以下时间序列：

      date  value
 1/11/2011      5
 1/11/2011      4
 1/11/2011      2
 8/11/2011      1
13/11/2011      0
14/11/2011      0
15/11/2011      0
18/11/2011      1
21/11/2011      4
 5/12/2011      3

A rolling median with a 5-day window, aligned to the right, should result in the following calculation:

具有5天窗口的滚动中位数（右侧对齐）应导致以下计算：

> c(
    median(c(5)),
    median(c(5,4)),
    median(c(5,4,2)),
    median(c(1)),
    median(c(1,0)), 
    median(c(0,0)),
    median(c(0,0,0)),
    median(c(0,0,0,1)),
    median(c(1,4)),
    median(c(3))
   )

 [1] 5.0 4.5 4.0 1.0 0.5 0.0 0.0 0.0 2.5 3.0

I already found some solutions out there but they are usually tricky, which usually means slow. I managed to implement my own rolling function calculation. The problem is that for very long time series the optimized version of median (rollmedian) can make a huge time difference, since it takes into account the overlap between windows. I would like to avoid reimplementing it. I suspect there are some trick with rollapply parameters that will make it work, but I cannot figure it out. Thanks in advance for the help.

我已经找到了一些解决方案，但它们通常很棘手，通常意味着很慢。我设法实现了自己的滚动函数计算。问题是，对于非常长的时间序列，中值（rollmedian）的优化版本可以产生巨大的时间差，因为它考虑了窗口之间的重叠。我想避免重新实现它。我怀疑rollapply参数有一些技巧可以使它工作，但我无法弄明白。先谢谢您的帮助。

4 个解决方案

#1

Most of the answers suggest to insert NA to make the time series regular. However, this can be slow in case of long time series. Additionally, it does not work for functions which can not be used with NA.

大多数答案建议插入NA以使时间序列规则。但是，在长时间序列的情况下，这可能会很慢。此外，它不适用于不能与NA一起使用的功能。

The width argument of rollapply (zoo package) can be a list (see help of rollapply for details). Based on this I wrote a function which creates a list to be used with rollapply as width parameter. The function extracts indexes for irregular zoo objects if the moving window is to be time and not index based. Therefore the index of the zoo object should be the actual time.

rollapply（zoo包）的width参数可以是一个列表（有关详细信息，请参阅rollapply的帮助）。基于此，我编写了一个函数，该函数创建一个与rollapply一起使用的列表作为width参数。如果移动窗口是时间而不是基于索引，则该函数提取不规则动物园对象的索引。因此，zoo对象的索引应该是实际时间。

# Create a zoo object where index represents time (e.g. in seconds) 

d <- zoo(c(1,1,1,1,1,2,2,2,2,2,16,25,27,27,27,27,27,31),     
         c(1:5,11:15,16,25:30,31))

# Create function 

createRollapplyWidth = function(zoodata, steps, window ){   

  mintime =  min(time(zoodata))     

  maxtime =  max(time(zoodata)) 

  spotstime = seq(from = mintime , to = maxtime, by = steps)

  spotsindex = list() 

    for (i in 1:length(spotstime)){
    spotsindex[[i]] =  as.numeric(which(spotstime[i]  <=  time(zoodata) & time(zoodata) < spotstime[i] + window))}

  rollapplywidth = list()
    for (i in 1:length(spotsindex)){
    if (!is.na(median(spotsindex[[i]])) ){ 
      rollapplywidth[[round(median(spotsindex[[i]]))]] = spotsindex[[i]] - round(median(spotsindex[[i]]))}
  }
  return(rollapplywidth)
  }


# Create width parameter for rollapply using function

rollwidth =  createRollapplyWidth(zoodata = d, steps = 5, window = 5) 

# Use parameter in rollapply 

result = rollapply(d, width = rollwidth , FUN =  sum, na.rm = T) 
result

Limitation: not based on dated but on time in seconds. Parameter "partial" of rollapply does not work.

限制：不是基于过时的，而是基于秒的时间。 rollapply的参数“partial”不起作用。

#2

Here is my tinkering with the problem. If that sort of gets at what you wanted (I don't know if it's satisfactory in terms of speed), I can write it up as a more detailed answer (even though it's based on @rbatt's idea).

这是我修补这个问题。如果那种得到你想要的东西（我不知道它在速度方面是否令人满意），我可以把它写成一个更详细的答案（即使它是基于@ rbatt的想法）。

library(zoo)
library(dplyr)

# create a long time series
start <- as.Date("1800-01-01")
end <- as.Date(Sys.Date())

df <- data.frame(V1 = seq.Date(start, end, by = "day"))
df$V2 <- sample(1:10, nrow(df), replace = T)

# make it an irregular time series by sampling 10000 rows
# including allowing for duplicates (replace = T)
df2 <- df %>% 
  sample_n(10000, replace = T)

# create 'complete' time series & join the data & compute the rolling median
df_rollmed <- data.frame(V1 = seq.Date(min(df$V1), max(df$V1), by = "day")) %>% 
  left_join(., df2) %>% 
  mutate(rollmed = rollapply(V2, 5, median, na.rm = T, align = "right", partial = T)) %>% 
  filter(!is.na(V2)) # throw out the NAs from the complete dataset

#3

Haven't check the speed but if no date has more than max.dup occurences then it must be that the last 5 * max.dup entries contain the last 5 days so the one-line function fn shown below passed to rollapplyr will do it:

没有检查速度，但如果没有日期超过max.dup出现，则必须是最后5 * max.dup条目包含最后5天，所以下面显示的单行函数fn传递给rollapplyr将执行此操作：

k <- 5

dates <- as.numeric(DF$date)
values <- DF$value

max.dup <- max(table(dates))

fn <- function(ix, d = dates[ix], v = values[ix], n = length(ix)) median(v[d >= d[n]-k])

rollapplyr(1:nrow(DF), max.dup * k, fn, partial = TRUE)
## [1] 5.0 4.5 4.0 1.0 0.5 0.0 0.0 0.0 2.5 3.0

Note: We used this for DF:

注意：我们将此用于DF：

 Lines <- "
      date  value
 1/11/2011      5
 1/11/2011      4
 1/11/2011      2
 8/11/2011      1
13/11/2011      0
14/11/2011      0
15/11/2011      0
18/11/2011      1
21/11/2011      4
 5/12/2011      3
"
DF <- read.table(text = Lines, header = TRUE)
DF$date <- as.Date(DF$date, format = "%d/%m/%Y")

#4

We can do this just using base apply as follows:

我们可以使用base apply执行此操作，如下所示：

First set up the data (based on the note by @g-grothendieck)

首先设置数据（基于@ g-grothendieck的注释）

library(data.table)
Lines <- "
      date  value
1/11/2011      5
1/11/2011      4
1/11/2011      2
8/11/2011      1
13/11/2011      0
14/11/2011      0
15/11/2011      0
18/11/2011      1
21/11/2011      4
5/12/2011      3
"
DT <- as.data.table(read.table(text = Lines, header = TRUE))
DT$date <- as.Date(DF$date, format = "%d/%m/%Y")
DT$row <- 1:NROW(DF)
setkey(DT, row, date) #mark columns as sorted, for speed

Note that I added a vector to the data table containing the row number, so that we can pass row number into the apply function. I also used data table to simplify syntax for the next step, and to speed up the function if it is applied to large arrays. Now, we use apply as follows:

请注意，我向包含行号的数据表添加了一个向量，以便我们可以将行号传递给apply函数。我还使用数据表来简化下一步的语法，并在将函数应用于大型数组时加速该函数。现在，我们使用如下申请：

roll.median.DT <- function(x){
  this.date <- as.Date(x[1])
  this.row <- as.numeric(x[3])
  median(DT[row <= this.row & date >= (this.date-5)]$value) #NB DT is not defined within function, so it is found from parent scope
}
apply(DT, FUN=roll.median.DT, MARGIN = 1)
[1] 5.0 4.5 4.0 1.0 0.5 0.0 0.0 0.0 2.5 3.0

#1