R: grouping by condition in data.table

Time: 2022-10-13 07:38:46

In R, I have a large data.table. For every row, I want to count rows with a similar value of x1 (+/- some tolerance, tol). I can get this to work using adply, but it's too slow. It seems like the sort of thing data.table would be good for - in fact, I'm already using data.table for part of the computation.

Is there a way to do this entirely with data.table? Here is an example:

library(data.table)
library(plyr)
my.df = data.table(x1 = 1:1000,
                   x2 = 4:1003)
tol = 3
adply(my.df, 1, function(df) my.df[x1 > (df$x1 - tol) & x1 < (df$x1 + tol), .N])

Results:

        x1   x2 V1
   1:    1    4  3
   2:    2    5  4
   3:    3    6  5
   4:    4    7  5
   5:    5    8  5
  ---             
 996:  996  999  5
 997:  997 1000  5
 998:  998 1001  5
 999:  999 1002  4
1000: 1000 1003  3

Update:

Here's a sample dataset that is a little closer to my real data:

set.seed(10)
x = seq(1,100000000,100000)
x = x + sample(1:50000, length(x), replace=T)
x2 = x + sample(1:50000, length(x), replace=T)
my.df = data.table(x1 = x,
                   x2 = x2)
setkey(my.df,x1)
tol = 100000

og = function(my.df) {
  adply(my.df, 1, function(df) my.df[x1 > (df$x1 - tol) & x1 < (df$x1 + tol), .N])
}

library(microbenchmark)
# ed and ar are the functions from the answers below
microbenchmark(r_ed <- ed(copy(my.df)),
               r_ar <- ar(copy(my.df)),
               r_og <- og(copy(my.df)),
               times = 1)

Unit: milliseconds
                    expr         min          lq      median          uq         max neval
 r_ed <- ed(copy(my.df))    8.553137    8.553137    8.553137    8.553137    8.553137     1
 r_ar <- ar(copy(my.df))   10.229438   10.229438   10.229438   10.229438   10.229438     1
 r_og <- og(copy(my.df)) 1424.472844 1424.472844 1424.472844 1424.472844 1424.472844     1

Obviously, solutions from both @eddi and @Arun are much faster than mine. Now I just have to try to understand rolling joins.

4 Answers

#1 (4 votes)

Here's a faster data.table solution. The idea is to use data.table's rolling-merge functionality, but before we do that we need to modify the data slightly and make the column x1 numeric instead of integer. This is because the OP uses strict inequalities, and to express those with rolling joins we have to shrink the tolerance by a tiny amount, which makes it a floating-point number.

my.df[, x1 := as.numeric(x1)]

# set the key to x1 for the merges and to sort
# (note, if data already sorted can make this step instantaneous using setattr)
setkey(my.df, x1)

# and now we're going to do two rolling merges, one with the upper bound
# and one with lower, then get the index of the match and subtract the ends
# (+1, to get the count)
my.df[, res := my.df[J(x1 + tol - 1e-6), list(ind = .I), roll = Inf]$ind -
               my.df[J(x1 - tol + 1e-6), list(ind = .I), roll = -Inf]$ind + 1]
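
To see what roll is doing in those two merges, here is a toy example of my own (not part of the original answer). With roll = Inf, a lookup value that has no exact match takes the nearest key at or below it; with roll = -Inf, the nearest key at or above it:

dt = data.table(x = c(1, 5, 10), key = "x")
# no row has x == 7, so the join rolls to a neighbouring key
dt[J(7), list(ind = .I), roll = Inf]    # ind = 2: rolled back to x = 5
dt[J(7), list(ind = .I), roll = -Inf]   # ind = 3: rolled forward to x = 10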


# and here's the bench vs @Arun's solution
ed = function(my.df) {
  my.df[, x1 := as.numeric(x1)]
  setkey(my.df, x1)
  my.df[, res := my.df[J(x1 + tol - 1e-6), list(ind = .I), roll = Inf]$ind -
                 my.df[J(x1 - tol + 1e-6), list(ind = .I), roll = -Inf]$ind + 1]
}

microbenchmark(ed(copy(my.df)), ar(copy(my.df)))
#Unit: milliseconds
#            expr       min       lq   median       uq      max neval
# ed(copy(my.df))  7.297928 10.09947 10.87561 11.80083 23.05907   100
# ar(copy(my.df)) 10.825521 15.38151 16.36115 18.15350 21.98761   100

Note: as both Arun and Matthew pointed out, if x1 is integer there is no need to convert to numeric and subtract a small amount from tol; one can use tol - 1L instead of tol - 1e-6 above.
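
A minimal sketch of that integer-only variant (my own adaptation of the code above; it assumes x1 is left as integer and tol is a whole number):

setkey(my.df, x1)   # x1 stays integer, no as.numeric() step needed
my.df[, res := my.df[J(x1 + as.integer(tol) - 1L), list(ind = .I), roll = Inf]$ind -
               my.df[J(x1 - as.integer(tol) + 1L), list(ind = .I), roll = -Inf]$ind + 1L]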

#2 (9 votes)

See @eddi's answer for a faster solution (to this particular problem). It also works when x1 is not an integer.

The algorithm you're looking for is an interval tree, and there's a Bioconductor package called IRanges that accomplishes this task. It's hard to beat that.

require(IRanges)
require(data.table)
my.df[, res := countOverlaps(IRanges(my.df$x1, width=1), 
           IRanges(my.df$x1-tol+1, my.df$x1+tol-1))]

Some explanation:

If you break down the code, you can write it in three lines:

ir1 <- IRanges(my.df$x1, width=1)
ir2 <- IRanges(my.df$x1-tol+1, my.df$x1+tol-1)
cnt <- countOverlaps(ir1, ir2)

What we essentially do is create two "ranges" (just print ir1 and ir2 to see what they look like). Then we ask, for each entry in ir1, how many entries of ir2 it overlaps (this is the "interval tree" part), and this is very efficient. The type argument to countOverlaps defaults to "any"; you can explore the other types if you want. It's extremely useful. Also of relevance is the findOverlaps function.

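A tiny worked check of what countOverlaps computes here (my own example, not from the answer), with x1 = 1, 2, 3 and tol = 2:

ir1 <- IRanges(c(1, 2, 3), width = 1)                    # each value as a point
ir2 <- IRanges(c(1, 2, 3) - 2 + 1, c(1, 2, 3) + 2 - 1)   # closed windows [x - tol + 1, x + tol - 1]
countOverlaps(ir1, ir2)
# [1] 2 3 2   e.g. the value 2 is within +/- 2 (exclusive) of 1, 2 and 3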

Note: there can be faster solutions for this particular case, where the width of ir1 is 1 (in fact there is one, see @eddi's). But for problems where the widths are variable and/or greater than 1, this should be the fastest.

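For instance, the variable-width case needs no change to the call. Counting how many of the [x1, x2] intervals from the question each interval overlaps (my own illustration, not part of the answer) is just:

ir <- IRanges(my.df$x1, my.df$x2)
my.df[, res2 := countOverlaps(ir, ir)]   # each interval also counts itself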


Benchmarking:

ag <- function(my.df) my.df[, res := sum(abs(my.df$x1-x1) < tol), by=x1]
ro <- function(my.df) {
            my.df[,res:= { y = my.df$x1
            sum(y > (x1 - tol) & y < (x1 + tol))
            }, by=x1]
      }
ar <- function(my.df) {
           my.df[, res := countOverlaps(IRanges(my.df$x1, width=1), 
            IRanges(my.df$x1-tol+1, my.df$x1+tol-1))]
      }


require(microbenchmark)
microbenchmark(r1 <- ag(copy(my.df)), r2 <- ro(copy(my.df)), 
               r3 <- ar(copy(my.df)), times=100)

Unit: milliseconds
                  expr      min       lq   median       uq       max neval
 r1 <- ag(copy(my.df)) 33.15940 39.63531 41.61555 44.56616 208.99067   100
 r2 <- ro(copy(my.df)) 69.35311 76.66642 80.23917 84.67419 344.82031   100
 r3 <- ar(copy(my.df)) 11.22027 12.14113 13.21196 14.72830  48.61417   100 <~~~

identical(r1, r2) # TRUE
identical(r1, r3) # TRUE

#3 (2 votes)

Here is a pure data.table solution:

my.df[, res := sum(my.df$x1 > (x1 - tol) & my.df$x1 < (x1 + tol)), by = x1]

# compare against the adply result from the question
# (assign to a new variable so my.df itself isn't overwritten)
check <- adply(my.df, 1,
           function(df) my.df[x1 > (df$x1 - tol) & x1 < (df$x1 + tol), .N])

identical(my.df[, res], check$V1)
#[1] TRUE

However, this will still be relatively slow if you have many unique x1 values. After all, you need to do a huge number of comparisons, and I can't think of a way to avoid that right now.

#4 (2 votes)

Using the fact that

 abs(x - y) < tol  <=>  y - tol < x < y + tol

you can enhance performance by a factor of 2.
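
A quick numeric sanity check of that equivalence (my own, not part of the answer):

x <- 7; tol <- 3
y <- 1:15
identical(abs(x - y) < tol, (y - tol < x) & (x < y + tol))
# [1] TRUE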

## wrap the code in two functions for benchmarking
library(data.table)
library(plyr)
my.df = data.table(x1 = 1:1000,
                   x2 = 4:1003)
tol = 3
ag <- function()
my.df[, res := sum(abs(my.df$x1-x1) < tol), by=x1]
ro <- function()
  my.df[,res:= { y = my.df$x1
          sum(y > (x1 - tol) & y < (x1 + tol))
          }, by=x1]
## check that both give the same result
identical(ag(), ro())
# [1] TRUE
library(microbenchmark)
## benchmarks 
microbenchmark(ag(),
               ro(),times=1)

Unit: milliseconds
 expr      min       lq   median       uq      max neval
 ag() 32.75638 32.75638 32.75638 32.75638 32.75638     1
 ro() 63.50043 63.50043 63.50043 63.50043 63.50043     1
