Time in getting single elements from data.table and data.frame objects

Date: 2022-03-17 22:54:58

In my work I usually have several tables (customer details, transaction records, etc.). Because some of them are very big (millions of rows), I've recently switched to the data.table package (thanks Matthew). However, some of them are quite small (a few hundred rows and 4 or 5 columns) and are accessed many times. I therefore started to think about the overhead of [.data.table when retrieving data, as opposed to set()ting a value, which ?set already describes clearly: regardless of the size of the table, one item is set in around 2 microseconds (depending on the CPU).


However, there doesn't seem to be an equivalent of set() for getting a value from a data.table when the exact row and column are known: a sort of loopable [.data.table.


library(data.table)
library(microbenchmark)

m = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)  # same data used in ?set

> microbenchmark(DF[3450,1] , DT[3450, V1], times=1000) # much more overhead in DT

Unit: microseconds
         expr     min      lq   median      uq      max neval
  DF[3450, 1]  32.745  36.166  40.5645  43.497  193.533  1000
 DT[3450, V1] 788.791 803.453 813.2270 832.287 5826.982  1000

> microbenchmark(DF$V1[3450], DT[3450, 1, with=F], times=1000)  # using atomic vector and
                                                                # removing part of DT overhead
Unit: microseconds
                  expr     min      lq  median      uq      max neval
           DF$V1[3450]   2.933   3.910   5.865   6.354   36.166  1000
 DT[3450, 1, with = F] 297.629 303.494 305.938 309.359 1878.632  1000

> microbenchmark(DF$V1[3450], DT$V1[3450], times=1000) # using only atomic vectors
Unit: microseconds
        expr   min    lq median    uq    max neval
 DF$V1[3450] 2.933 2.933  3.421 3.422 40.565  1000    # DF seems still a bit faster (23%)
 DT$V1[3450] 3.910 3.911  4.399 4.399 16.128  1000
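
As a rough sketch of what the atomic-vector approach looks like inside a loop (the table size and the `rows` vector are made up for illustration, and the table is assumed not to change while the loop runs), the column can be pulled out once and then indexed as a plain vector:

```r
library(data.table)

# Smaller table than in the benchmarks above, to keep the sketch quick to run
DT <- as.data.table(matrix(1, nrow = 1000, ncol = 5))

rows <- c(10L, 345L, 999L)   # hypothetical row indices to look up

# Slower pattern: every iteration goes through `[.data.table`
slow <- numeric(length(rows))
for (k in seq_along(rows)) {
  slow[k] <- DT[rows[k], V1]
}

# Faster pattern: extract the column once, then index the plain vector
v1 <- DT$V1
fast <- numeric(length(rows))
for (k in seq_along(rows)) {
  fast[k] <- v1[rows[k]]
}

identical(slow, fast)   # both patterns return the same values
```

Both loops read the same cells; only the per-iteration overhead differs.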

The last approach (indexing the column as an atomic vector) is indeed the best way to quickly retrieve a single element many times. However, set() is even faster:


> microbenchmark(set(DT,1L,1L,5L), times=1000)
Unit: microseconds
                expr   min    lq median    uq    max neval
 set(DT, 1L, 1L, 5L) 1.955 1.956  2.444 2.444 24.926  1000
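
For readers unfamiliar with set(), here is a minimal sketch of what the call being timed does. The small table is made up for illustration, and the value is written as a double (5 rather than the 5L used in the benchmark) so that it matches the column type:

```r
library(data.table)

DT <- as.data.table(matrix(1, nrow = 10, ncol = 3))

# set(x, i, j, value) assigns by reference: DT itself is modified,
# and the call does not go through `[.data.table`
set(DT, i = 1L, j = 1L, value = 5)

# Read the same cell back through the column vector
DT$V1[1]
DT$V1[2]   # the other rows are untouched
```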

The question is: if we can set a value in 2.444 microseconds, shouldn't it be possible to get a value in a smaller (or at least similar) amount of time? Thanks.


EDIT: adding two more options, as suggested:


> microbenchmark(`[.data.frame`(DT,3450,1), DT[["V1"]][3450], times=1000)
Unit: microseconds
                        expr    min     lq median     uq      max neval
 `[.data.frame`(DT, 3450, 1) 46.428 47.895 48.383 48.872 2165.509  1000
            DT[["V1"]][3450] 20.038 21.504 23.459 24.437  116.316  1000

Unfortunately, neither is faster than the best of the previous attempts (DT$V1[3450]).


1 Answer

#1


7  

Thanks to @hadley we have the solution!


> microbenchmark(DT$V1[3450], set(DT,1L,1L,5L), .subset2(DT, "V1")[3450], times=1000, unit="us")
Unit: microseconds
                     expr   min    lq median    uq    max neval
              DT$V1[3450] 2.566 3.208  3.208 3.528 27.582  1000
      set(DT, 1L, 1L, 5L) 1.604 1.925  1.925 2.246 15.074  1000
 .subset2(DT, "V1")[3450] 0.000 0.321  0.322 0.642  8.339  1000
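
A minimal sketch of how this could be wrapped for repeated use. `get_cell()` is a hypothetical helper name, not part of data.table or base R, and the sample table is made up. `.subset2()` is base R's internal version of `[[` that does not dispatch methods, which is presumably where the saving comes from:

```r
library(data.table)

DT <- as.data.table(matrix(1, nrow = 1000, ncol = 5))
set(DT, i = 345L, j = "V2", value = 42)   # mark one cell so the lookup is visible

# Hypothetical helper: fetch the column without method dispatch,
# then index the resulting atomic vector by row number
get_cell <- function(x, i, j) .subset2(x, j)[i]

get_cell(DT, 345L, "V2")   # column given by name
get_cell(DT, 345L, 2L)     # column given by position
```

Because `.subset2()` treats the table as a plain list of columns, this only suits lookups by known row number and column; it uses none of data.table's keys or expressions.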
