R data.table计算直到达到值的行

时间:2021-11-29 21:27:38

I would like to return a new column in a data.table which shows how many rows down until a value lower than the current value (of Temp) is reached.

我想在data.table中返回一个新列,该列显示向下的行数,直到达到低于当前值(Temp)的值。

library(data.table)
set.seed(123)
DT <- data.table( Temp = runif(10,0,20) )

This is how I would like it to look:

这就是我希望它看起来的样子:

set.seed(123)
DT <- data.table(
        Temp = runif(10,0,20),
        Day_Below_Temp = c("5","1","3","2","1","NA","3","1","1","NA")
)

3 个解决方案

#1


4  

Using the newly implemented non-equi joins in the current development version, this can be accomplished in a straightforward manner as follows:

使用当前开发版本中新实现的非equi连接,可以通过以下简单方式完成:

require(data.table) # v1.9.7+
DT[, row := .I] # add row numbers
DT[DT, x.row-i.row, on = .(row > row, Temp < Temp), mult="first"]
# [1]  5  1  3  2  1 NA  3  1  1 NA

The row number is necessary since we need to find indices lower than the current index, hence needs to be a condition in the join. We perform a self-join, i.e., for each row in DT (inner), based on condition provided to on argument, we find the first matching row index in DT (outer). Then we subtract the row indices to get the position from the current row. x.row refers to the index of outer DT and i.row to the inner DT.

行号是必要的,因为我们需要找到低于当前索引的索引,因此需要成为连接中的条件。我们执行自连接,即对于DT(内部)中的每一行,基于提供给on参数的条件,我们在DT(外部)中找到第一个匹配的行索引。然后我们减去行索引以从当前行获取位置。 x.row指外DT的索引和内DT的i.row。

To get the devel version, see installation instructions here.

要获得devel版本,请参阅此处的安装说明。


On 1e5 rows:

在1e5行:

set.seed(123)
DT <- data.table(Temp = runif(1e5L, 0L, 20L))

DT[, row := .I]
system.time({
    ans = DT[DT, x.row-i.row, on = .(row > row, Temp < Temp), mult="first", verbose=TRUE]
})
# Non-equi join operators detected ... 
#   forder took ... 0.001 secs
#   Generating non-equi group ids ... done in 0.452 secs
#   Recomputing forder with non-equi ids ... done in 0.001 secs
#   Found 623 non-equi group(s) ...
# Starting bmerge ...done in 8.118 secs
# Detected that j uses these columns: x.row,i.row 
#    user  system elapsed 
#   8.492   0.038   8.577 

head(ans)
# [1]  5  1  3  2  1 12
tail(ans)
# [1]  2  1  1  2  1 NA

#2


2  

Here's a dplyr method:

这是一个dplyr方法:

library(dplyr)
set.seed(123)
dt <- data.frame( Temp = runif(10,0,20) )
dt %>% mutate(Day_Below_Temp = 
                 sapply(1:length(Temp), function(x) min(which(.$Temp[x:length(.$Temp)] < .$Temp[x]))-1))

        Temp Day_Below_Temp
1   5.751550              5
2  15.766103              1
3   8.179538              3
4  17.660348              2
5  18.809346              1
6   0.911130            Inf
7  10.562110              3
8  17.848381              1
9  11.028700              1
10  9.132295            Inf

#3


1  

This does the job - not very fast though

这样做的工作 - 虽然不是很快

DT[, rowN := .I]

DT[, Day_Below_Temp := which(DT$Temp[rowN:nrow(DT)] < Temp)[1] - 1, 
   by = rowN
   ][, rowN := NULL]

#1


4  

Using the newly implemented non-equi joins in the current development version, this can be accomplished in a straightforward manner as follows:

使用当前开发版本中新实现的非equi连接,可以通过以下简单方式完成:

require(data.table) # v1.9.7+
DT[, row := .I] # add row numbers
DT[DT, x.row-i.row, on = .(row > row, Temp < Temp), mult="first"]
# [1]  5  1  3  2  1 NA  3  1  1 NA

The row number is necessary since we need to find indices lower than the current index, hence needs to be a condition in the join. We perform a self-join, i.e., for each row in DT (inner), based on condition provided to on argument, we find the first matching row index in DT (outer). Then we subtract the row indices to get the position from the current row. x.row refers to the index of outer DT and i.row to the inner DT.

行号是必要的,因为我们需要找到低于当前索引的索引,因此需要成为连接中的条件。我们执行自连接,即对于DT(内部)中的每一行,基于提供给on参数的条件,我们在DT(外部)中找到第一个匹配的行索引。然后我们减去行索引以从当前行获取位置。 x.row指外DT的索引和内DT的i.row。

To get the devel version, see installation instructions here.

要获得devel版本,请参阅此处的安装说明。


On 1e5 rows:

在1e5行:

set.seed(123)
DT <- data.table(Temp = runif(1e5L, 0L, 20L))

DT[, row := .I]
system.time({
    ans = DT[DT, x.row-i.row, on = .(row > row, Temp < Temp), mult="first", verbose=TRUE]
})
# Non-equi join operators detected ... 
#   forder took ... 0.001 secs
#   Generating non-equi group ids ... done in 0.452 secs
#   Recomputing forder with non-equi ids ... done in 0.001 secs
#   Found 623 non-equi group(s) ...
# Starting bmerge ...done in 8.118 secs
# Detected that j uses these columns: x.row,i.row 
#    user  system elapsed 
#   8.492   0.038   8.577 

head(ans)
# [1]  5  1  3  2  1 12
tail(ans)
# [1]  2  1  1  2  1 NA

#2


2  

Here's a dplyr method:

这是一个dplyr方法:

library(dplyr)
set.seed(123)
dt <- data.frame( Temp = runif(10,0,20) )
dt %>% mutate(Day_Below_Temp = 
                 sapply(1:length(Temp), function(x) min(which(.$Temp[x:length(.$Temp)] < .$Temp[x]))-1))

        Temp Day_Below_Temp
1   5.751550              5
2  15.766103              1
3   8.179538              3
4  17.660348              2
5  18.809346              1
6   0.911130            Inf
7  10.562110              3
8  17.848381              1
9  11.028700              1
10  9.132295            Inf

#3


1  

This does the job - not very fast though

这样做的工作 - 虽然不是很快

DT[, rowN := .I]

DT[, Day_Below_Temp := which(DT$Temp[rowN:nrow(DT)] < Temp)[1] - 1, 
   by = rowN
   ][, rowN := NULL]