using data.table to speed up rollapply

时间:2021-03-29 02:43:55

I am new to data.tables so apologies if this is a very basic question.

我是data.tables的新手,如果这是一个非常基本的问题,请道歉。

I have heard that data.tables significantly improves computational times when working with large amounts data, and so would like to see if data.table is able to help in speeding up the rollapply function.

我听说data.tables在处理大量数据时显着提高了计算时间,因此希望看看data.table是否能够帮助加速rollapply函数。

if we have some univariate data

如果我们有一些单变量数据

xts.obj <- xts(rnorm(1e6), order.by=as.POSIXct(Sys.time()-1e6:1), tz="GMT") 
colnames(xts.obj) <- "rtns" 

a simple rolling quantile with width of 100 and a p of 0.75 takes a surprisingly long time...

一个简单的滚动分位数,宽度为100,p为0.75需要相当长的时间......

i.e. the line of code

即代码行

xts.obj$quant.75 <- rollapply(xts.obj$rtns,width=100, FUN='quantile', p=0.75) 

seems to take forever...

似乎永远......

is there anything that data.table can do to speed things up? i.e. is there a generic roll function that can be applied?

data.table可以做些什么来加快速度吗?即是否有可以应用的通用滚动功能?

perhaps a routine to convert an xts object to a data.table object to carry out the function in a speeded up manner and then reconvert back to xts at the end?

也许是一个例程,将xts对象转换为data.table对象,以加速的方式执行函数,然后在最后重新转换回xts?

thanks in advance

提前致谢

hlm

p.s. I didn't seem to be getting much of a response on the data.table mailing list so am posting up here, to see if I get a better response.

附:我似乎没有在data.table邮件列表上得到很多回复,所以我在这里发帖,看看我是否得到了更好的回复。

p.p.s having a quick go with another example using dataframes the data.table solution seems to take longer than the rollapply function, i.e. shown below:

p.p.s快速使用数据帧的另一个例子data.table解决方案似乎需要比rollapply函数更长的时间,如下所示:

> x <- data.frame(x=rnorm(10000))
> x.dt <- data.table(x)
> system.time(l1 <- as.numeric(rollapply(x,width=10,FUN=quantile,p=0.75)))   
   user  system elapsed 
   2.69    0.00    2.68 
> system.time(l <- as.numeric(unlist(x.dt[,lapply(1:((nrow(x.dt))-10+1), function(i){ x.dt[i:(i+10-1),quantile(x,p=0.75)]})])))
   user  system elapsed 
  11.22    0.00   11.51 
> identical(l,l1)
[1] TRUE

2 个解决方案

#1


7  

datatable is quite irrelevant here - you're essentially running sapply on a vector, that is pretty much the fastest operation you can get (other than going to C). data frames and data tables will always be slower than vectors. You can gain a bit by using a straight vector (without xts dispatch), but the only easy way to get this done quickly is to parallelize:

数据表在这里是无关紧要的 - 你实际上是在向量上运行sapply,这几乎是你可以获得的最快的操作(除了转到C)。数据帧和数据表总是比矢量慢。您可以通过使用直接向量(没有xts dispatch)获得一点,但是快速完成此操作的唯一简单方法是并行化:

> x = as.vector(xts.obj$rtns)
> system.time(unclass(mclapply(1:(length(x) - 99),
                      function(i) quantile(x[i:(i + 99)], p=0.75), mc.cores=32)))
   user  system elapsed 
325.481  15.533  11.221 

If you need that even faster, then you may want to write a specialized function: the naive apply approach re-sorts every chunk which is obviously wasteful - all you need to do is to drop the one element and sort in the next one to obtain the quantile so you can expect roughly 50x speedup if you do that - but you'll have to code that yourself (so it's only worth if you use it more often ...).

如果你需要更快,那么你可能想要编写一个专门的函数:天真的apply方法重新排序每个明显浪费的块 - 所有你需要做的就是删除一个元素并在下一个元素中排序以获得分位数,所以你可以期望大约50倍的加速,如果你这样做 - 但你必须自己编码(所以只有你经常使用它才有价值......)。

#2


5  

data.table works quickly by splitting the data by a key. I don't think data.table currently supports a rolling key, or an expression that in the by or i arguments that would do this.

data.table通过按键分割数据来快速工作。我不认为data.table目前支持滚动键,或者在执行此操作的by或i参数中的表达式。

You could use the fact that subsetting is faster for data.table than a data.frame

您可以使用以下事实:data.table的子集化比data.frame更快

DT <- as.data.table(x)
.x <- 1:(nrow(DT)-9)
system.time(.xl <- unlist(lapply(.x, function(.i) DT[.i:(.i+10),quantile(x,0.75, na.rm = T)])))
   user  system elapsed 
   8.77    0.00    8.77 

Or you could construct key variables that will uniquely identify the rolling ids. Width = 10, therefore we need 10 columns (padded with NA_real_)

或者您可以构造唯一标识滚动ID的关键变量。宽度= 10,因此我们需要10列(用NA_real_填充)

library(plyr) # for as.quoted
.j <- paste0('x',1:10, ':= c(rep(NA_real_,',0:9,'),rep(seq(',1:10,',9991,by=10),each=10), rep(NA_real_,',c(0,9:1),'))')


 datatable <- function(){
   invisible(lapply(.j, function(.jc) x.dt[,eval(as.quoted(.jc)[[1]])]))
x_roll <- rbind(x.dt[!is.na(x1),quantile(x,0.75),by=x1],
  x.dt[!is.na(x2),quantile(x,0.75),by=x2],
  x.dt[!is.na(x3),quantile(x,0.75),by=x3],
  x.dt[!is.na(x4),quantile(x,0.75),by=x4],
      x.dt[!is.na(x5),quantile(x,0.75),by=x5],
      x.dt[!is.na(x6),quantile(x,0.75),by=x6],
      x.dt[!is.na(x7),quantile(x,0.75),by=x7],
      x.dt[!is.na(x8),quantile(x,0.75),by=x8],
      x.dt[!is.na(x9),quantile(x,0.75),by=x9],
      x.dt[!is.na(x10),quantile(x,0.75),by=x10],use.names =F)


setkeyv(x_roll,'x1')

invisible(x.dt[,x1:= 1:10000])
setkeyv(x.dt,'x1')
x_roll[x.dt][, list(x,V1)]}

l1 <- function()as.numeric(rollapply(x,width=10,FUN=quantile,p=0.75))
lapply_only <- function() unclass(lapply(1:(nrow(x) - 9), function(i) quantile(x[['x']][i:(i + 9)], p=0.75)))


benchmark(datatable(),l1(),lapply_only(), replications = 5)
##            test replications elapsed relative user.self 
## 1   datatable()            5    9.41 1.000000      9.40      
## 2          l1()            5   10.97 1.165781     10.85        
## 3 lapply_only()            5   10.39 1.104145     10.35 

EDIT --- benchmarking

data.table is quicker than rollapply and raw lapply. I can't test the parallel solution.

data.table比rollapply和raw lapply更快。我无法测试并行解决方案。

#1


7  

datatable is quite irrelevant here - you're essentially running sapply on a vector, that is pretty much the fastest operation you can get (other than going to C). data frames and data tables will always be slower than vectors. You can gain a bit by using a straight vector (without xts dispatch), but the only easy way to get this done quickly is to parallelize:

数据表在这里是无关紧要的 - 你实际上是在向量上运行sapply,这几乎是你可以获得的最快的操作(除了转到C)。数据帧和数据表总是比矢量慢。您可以通过使用直接向量(没有xts dispatch)获得一点,但是快速完成此操作的唯一简单方法是并行化:

> x = as.vector(xts.obj$rtns)
> system.time(unclass(mclapply(1:(length(x) - 99),
                      function(i) quantile(x[i:(i + 99)], p=0.75), mc.cores=32)))
   user  system elapsed 
325.481  15.533  11.221 

If you need that even faster, then you may want to write a specialized function: the naive apply approach re-sorts every chunk which is obviously wasteful - all you need to do is to drop the one element and sort in the next one to obtain the quantile so you can expect roughly 50x speedup if you do that - but you'll have to code that yourself (so it's only worth if you use it more often ...).

如果你需要更快,那么你可能想要编写一个专门的函数:天真的apply方法重新排序每个明显浪费的块 - 所有你需要做的就是删除一个元素并在下一个元素中排序以获得分位数,所以你可以期望大约50倍的加速,如果你这样做 - 但你必须自己编码(所以只有你经常使用它才有价值......)。

#2


5  

data.table works quickly by splitting the data by a key. I don't think data.table currently supports a rolling key, or an expression that in the by or i arguments that would do this.

data.table通过按键分割数据来快速工作。我不认为data.table目前支持滚动键,或者在执行此操作的by或i参数中的表达式。

You could use the fact that subsetting is faster for data.table than a data.frame

您可以使用以下事实:data.table的子集化比data.frame更快

DT <- as.data.table(x)
.x <- 1:(nrow(DT)-9)
system.time(.xl <- unlist(lapply(.x, function(.i) DT[.i:(.i+10),quantile(x,0.75, na.rm = T)])))
   user  system elapsed 
   8.77    0.00    8.77 

Or you could construct key variables that will uniquely identify the rolling ids. Width = 10, therefore we need 10 columns (padded with NA_real_)

或者您可以构造唯一标识滚动ID的关键变量。宽度= 10,因此我们需要10列(用NA_real_填充)

library(plyr) # for as.quoted
.j <- paste0('x',1:10, ':= c(rep(NA_real_,',0:9,'),rep(seq(',1:10,',9991,by=10),each=10), rep(NA_real_,',c(0,9:1),'))')


 datatable <- function(){
   invisible(lapply(.j, function(.jc) x.dt[,eval(as.quoted(.jc)[[1]])]))
x_roll <- rbind(x.dt[!is.na(x1),quantile(x,0.75),by=x1],
  x.dt[!is.na(x2),quantile(x,0.75),by=x2],
  x.dt[!is.na(x3),quantile(x,0.75),by=x3],
  x.dt[!is.na(x4),quantile(x,0.75),by=x4],
      x.dt[!is.na(x5),quantile(x,0.75),by=x5],
      x.dt[!is.na(x6),quantile(x,0.75),by=x6],
      x.dt[!is.na(x7),quantile(x,0.75),by=x7],
      x.dt[!is.na(x8),quantile(x,0.75),by=x8],
      x.dt[!is.na(x9),quantile(x,0.75),by=x9],
      x.dt[!is.na(x10),quantile(x,0.75),by=x10],use.names =F)


setkeyv(x_roll,'x1')

invisible(x.dt[,x1:= 1:10000])
setkeyv(x.dt,'x1')
x_roll[x.dt][, list(x,V1)]}

l1 <- function()as.numeric(rollapply(x,width=10,FUN=quantile,p=0.75))
lapply_only <- function() unclass(lapply(1:(nrow(x) - 9), function(i) quantile(x[['x']][i:(i + 9)], p=0.75)))


benchmark(datatable(),l1(),lapply_only(), replications = 5)
##            test replications elapsed relative user.self 
## 1   datatable()            5    9.41 1.000000      9.40      
## 2          l1()            5   10.97 1.165781     10.85        
## 3 lapply_only()            5   10.39 1.104145     10.35 

EDIT --- benchmarking

data.table is quicker than rollapply and raw lapply. I can't test the parallel solution.

data.table比rollapply和raw lapply更快。我无法测试并行解决方案。