用data.table提取行中最后一个非缺失值

时间:2020-12-31 19:21:03

I have a data.table of factor columns, and I want to pull out the label of the last non-missing value in each row. It's kindof a typical max.col situation, but I don't want to needlessly be coercing as I am trying to optimize this code using data.table. The real data has other types of columns as well.

我有一个数据。因子列的表,我想提取每行中最后一个非缺失值的标签。这是典型的最大值。col的情况,但我不想不必要地强制使用data.table来优化这段代码。真正的数据也有其他类型的列。

Here is the example,

这是例子,

## Some sample data
set.seed(0)
dat <- sapply(split(letters[1:25], rep.int(1:5, 5)), sample, size=8, replace=TRUE)
dat[upper.tri(dat)] <- NA
dat[4:5, 4:5] <- NA                              # the real data isnt nice and upper.triangular
dat <- data.frame(dat, stringsAsFactors = TRUE)  # factor columns

## So, it looks like this
setDT(dat)[]
#    X1 X2 X3 X4 X5
# 1:  u NA NA NA NA
# 2:  f  q NA NA NA
# 3:  f  b  w NA NA
# 4:  k  g  h NA NA
# 5:  u  b  r NA NA
# 6:  f  q  w  x  t
# 7:  u  g  h  i  e
# 8:  u  q  r  n  t

## I just want to get the labels of the factors
## that are 'rightmost' in each row.  I tried a number of things 
## that probably don't make sense here.
## This just about gets the column index
dat[, colInd := sum(!is.na(.SD)), by=1:nrow(dat)]

This is the goal though, to extract these labels, here using regular base functions.

这是我们的目标,使用常规的基本函数提取这些标签。

## Using max.col and a data.frame
df1 <- as.data.frame(dat)
inds <- max.col(is.na(as.matrix(df1)), ties="first")-1
inds[inds==0] <- ncol(df1)
df1[cbind(1:nrow(df1), inds)]
# [1] "u" "q" "w" "h" "r" "t" "e" "t"

5 个解决方案

#1


10  

Here's another way:

这是另一种方式:

dat[, res := NA_character_]
for (v in rev(names(dat))[-1]) dat[is.na(res), res := get(v)]


   X1 X2 X3 X4 X5 res
1:  u NA NA NA NA   u
2:  f  q NA NA NA   q
3:  f  b  w NA NA   w
4:  k  g  h NA NA   h
5:  u  b  r NA NA   r
6:  f  q  w  x  t   t
7:  u  g  h  i  e   e
8:  u  q  r  n  t   t

Benchmarks Using the same data as @alexis_laz and making (apparently) superficial changes to the functions, I see different results. Just showing them here in case anyone is curious. Alexis' answer (with small modifications) still comes out ahead.

使用与@alexis_laz相同的数据并(显然)对函数进行表面更改的基准测试,我看到了不同的结果。给他们看以防有人好奇。亚历克西斯的回答(稍加修改)仍然遥遥领先。

Functions:

功能:

alex = function(x, ans = rep_len(NA, length(x[[1L]])), wh = seq_len(length(x[[1L]]))){
    if(!length(wh)) return(ans)
    ans[wh] = as.character(x[[length(x)]])[wh]
    Recall(x[-length(x)], ans, wh[is.na(ans[wh])])
}   

alex2 = function(x){
    x[, res := NA_character_]
    wh = x[, .I]
    for (v in (length(x)-1):1){
      if (!length(wh)) break
      set(x, j="res", i=wh, v = x[[v]][wh])
      wh = wh[is.na(x$res[wh])]
    }
    x$res
}

frank = function(x){
    x[, res := NA_character_]
    for(v in rev(names(x))[-1]) x[is.na(res), res := get(v)]
    return(x$res)       
}

frank2 = function(x){
    x[, res := NA_character_]
    for(v in rev(names(x))[-1]) x[is.na(res), res := .SD, .SDcols=v]
    x$res
}

Example data and benchmark:

示例数据和基准:

DAT1 = as.data.table(lapply(ceiling(seq(0, 1e4, length.out = 1e2)), 
                     function(n) c(rep(NA, n), sample(letters, 3e5 - n, TRUE))))
DAT2 = copy(DAT1)
DAT3 = as.list(copy(DAT1))
DAT4 = copy(DAT1)

library(microbenchmark)
microbenchmark(frank(DAT1), frank2(DAT2), alex(DAT3), alex2(DAT4), times = 30)

Unit: milliseconds
         expr       min        lq      mean    median         uq        max neval
  frank(DAT1) 850.05980 909.28314 985.71700 979.84230 1023.57049 1183.37898    30
 frank2(DAT2)  88.68229  93.40476 118.27959 107.69190  121.60257  346.48264    30
   alex(DAT3)  98.56861 109.36653 131.21195 131.20760  149.99347  183.43918    30
  alex2(DAT4)  26.14104  26.45840  30.79294  26.67951   31.24136   50.66723    30

#2


9  

Another idea -similar to Frank's- that tries (1) to avoid subsetting 'data.table' rows (which I assume must have some cost) and (2) to avoid checking a length == nrow(dat) vector for NAs in every iteration.

另一个想法——类似于Frank的想法——尝试(1)避免细分“数据”。表的行(我认为必须有一些代价)和(2)避免在每次迭代中检查NAs的长度== nrow(dat)向量。

alex = function(x, ans = rep_len(NA, length(x[[1L]])), wh = seq_len(length(x[[1L]])))
{
    if(!length(wh)) return(ans)
    ans[wh] = as.character(x[[length(x)]])[wh]
    Recall(x[-length(x)], ans, wh[is.na(ans[wh])])
}   
alex(as.list(dat)) #had some trouble with 'data.table' subsetting
# [1] "u" "q" "w" "h" "r" "t" "e" "t"

And to compare with Frank's:

和弗兰克的相比:

frank = function(x)
{
    x[, res := NA_character_]
    for(v in rev(names(x))[-1]) x[is.na(res), res := get(v)]
    return(x$res)       
}

DAT1 = as.data.table(lapply(ceiling(seq(0, 1e4, length.out = 1e2)), 
                     function(n) c(rep(NA, n), sample(letters, 3e5 - n, TRUE))))
DAT2 = copy(DAT1)
microbenchmark::microbenchmark(alex(as.list(DAT1)), 
                               { frank(DAT2); DAT2[, res := NULL] }, 
                               times = 30)
#Unit: milliseconds
#                                            expr       min        lq    median        uq       max neval
#                             alex(as.list(DAT1))  102.9767  108.5134  117.6595  133.1849  166.9594    30
# {     frank(DAT2)     DAT2[, `:=`(res, NULL)] } 1413.3296 1455.1553 1497.3517 1540.8705 1685.0589    30
identical(alex(as.list(DAT1)), frank(DAT2))
#[1] TRUE

#3


4  

We convert the 'data.frame' to 'data.table' and create a row id column (setDT(df1, keep.rownames=TRUE)). We reshape the 'wide' to 'long' format with melt. Grouped by 'rn', if there is no NA element in 'value' column, we get the last element of 'value' (value[.N]) or else, we get the element before the first NA in the 'value' to get the 'V1' column, which we extract ($V1).

我们将“data.frame”转换为“data”。创建一个行id列(setDT(df1, keep.rownames=TRUE))。我们用熔融将“宽”格式重塑为“长”格式。根据'rn'进行分组,如果'value'列中没有NA元素,我们得到'value' (value[. n])的最后一个元素,否则,我们得到'value'中的第一个NA之前的元素,得到'V1'列,我们提取它($V1)。

melt(setDT(df1, keep.rownames=TRUE), id.var='rn')[,
     if(!any(is.na(value))) value[.N] 
     else value[which(is.na(value))[1]-1], by =  rn]$V1
#[1] "u" "q" "w" "h" "r" "t" "e" "t"

In case, the data is already a data.table

在这种情况下,数据已经是一个data.table

dat[, rn := 1:.N]#create the 'rn' column
melt(dat, id.var='rn')[, #melt from wide to long format
     if(!any(is.na(value))) value[.N] 
     else value[which(is.na(value))[1]-1], by =  rn]$V1
#[1] "u" "q" "w" "h" "r" "t" "e" "t"

Here is another option

这是另一个选择

dat[, colInd := sum(!is.na(.SD)), by=1:nrow(dat)][
   , as.character(.SD[[.BY[[1]]]]), by=colInd]

Or as @Frank mentioned in the comments, we can use na.rm=TRUE from melt and make it more compact

或者正如@Frank在评论中提到的,我们可以使用na。rm=真实的熔体,使它更紧凑

 melt(dat[, r := .I], id="r", na.rm=TRUE)[, value[.N], by=r]

#4


3  

Here is a one liner base R approach:

这里有一个线性基准R方法:

sapply(split(dat, seq(nrow(dat))), function(x) tail(x[!is.na(x)],1))
#  1   2   3   4   5   6   7   8 
#"u" "q" "w" "h" "r" "t" "e" "t" 

#5


3  

I'm not sure how to improve upon @alexis's answer beyond what @Frank has already done, but your original approach with base R wasn't too far off of something that is reasonably performant.

我不知道如何改进@alexis的答案,超出了@Frank已经做过的事情,但是你的原始的R的方法并没有太离谱。

Here's a variant of your approach that I liked because (1) it's reasonably quick and (2) it doesn't require too much thought to figure out what's going on:

这是我喜欢你的方法的一个变体,因为(1)它是相当快的;(2)它不需要太多的思考就能弄明白发生了什么:

as.matrix(dat)[cbind(1:nrow(dat), max.col(!is.na(dat), "last"))] 

The most expensive part of this seems to be the as.matrix(dat) part, but otherwise, it seems to be faster than the melt approach that @akrun shared.

其中最昂贵的部分似乎是asn .matrix(dat)部分,但除此之外,它似乎比@akrun共享的melt方法要快。

#1


10  

Here's another way:

这是另一种方式:

dat[, res := NA_character_]
for (v in rev(names(dat))[-1]) dat[is.na(res), res := get(v)]


   X1 X2 X3 X4 X5 res
1:  u NA NA NA NA   u
2:  f  q NA NA NA   q
3:  f  b  w NA NA   w
4:  k  g  h NA NA   h
5:  u  b  r NA NA   r
6:  f  q  w  x  t   t
7:  u  g  h  i  e   e
8:  u  q  r  n  t   t

Benchmarks Using the same data as @alexis_laz and making (apparently) superficial changes to the functions, I see different results. Just showing them here in case anyone is curious. Alexis' answer (with small modifications) still comes out ahead.

使用与@alexis_laz相同的数据并(显然)对函数进行表面更改的基准测试,我看到了不同的结果。给他们看以防有人好奇。亚历克西斯的回答(稍加修改)仍然遥遥领先。

Functions:

功能:

alex = function(x, ans = rep_len(NA, length(x[[1L]])), wh = seq_len(length(x[[1L]]))){
    if(!length(wh)) return(ans)
    ans[wh] = as.character(x[[length(x)]])[wh]
    Recall(x[-length(x)], ans, wh[is.na(ans[wh])])
}   

alex2 = function(x){
    x[, res := NA_character_]
    wh = x[, .I]
    for (v in (length(x)-1):1){
      if (!length(wh)) break
      set(x, j="res", i=wh, v = x[[v]][wh])
      wh = wh[is.na(x$res[wh])]
    }
    x$res
}

frank = function(x){
    x[, res := NA_character_]
    for(v in rev(names(x))[-1]) x[is.na(res), res := get(v)]
    return(x$res)       
}

frank2 = function(x){
    x[, res := NA_character_]
    for(v in rev(names(x))[-1]) x[is.na(res), res := .SD, .SDcols=v]
    x$res
}

Example data and benchmark:

示例数据和基准:

DAT1 = as.data.table(lapply(ceiling(seq(0, 1e4, length.out = 1e2)), 
                     function(n) c(rep(NA, n), sample(letters, 3e5 - n, TRUE))))
DAT2 = copy(DAT1)
DAT3 = as.list(copy(DAT1))
DAT4 = copy(DAT1)

library(microbenchmark)
microbenchmark(frank(DAT1), frank2(DAT2), alex(DAT3), alex2(DAT4), times = 30)

Unit: milliseconds
         expr       min        lq      mean    median         uq        max neval
  frank(DAT1) 850.05980 909.28314 985.71700 979.84230 1023.57049 1183.37898    30
 frank2(DAT2)  88.68229  93.40476 118.27959 107.69190  121.60257  346.48264    30
   alex(DAT3)  98.56861 109.36653 131.21195 131.20760  149.99347  183.43918    30
  alex2(DAT4)  26.14104  26.45840  30.79294  26.67951   31.24136   50.66723    30

#2


9  

Another idea -similar to Frank's- that tries (1) to avoid subsetting 'data.table' rows (which I assume must have some cost) and (2) to avoid checking a length == nrow(dat) vector for NAs in every iteration.

另一个想法——类似于Frank的想法——尝试(1)避免细分“数据”。表的行(我认为必须有一些代价)和(2)避免在每次迭代中检查NAs的长度== nrow(dat)向量。

alex = function(x, ans = rep_len(NA, length(x[[1L]])), wh = seq_len(length(x[[1L]])))
{
    if(!length(wh)) return(ans)
    ans[wh] = as.character(x[[length(x)]])[wh]
    Recall(x[-length(x)], ans, wh[is.na(ans[wh])])
}   
alex(as.list(dat)) #had some trouble with 'data.table' subsetting
# [1] "u" "q" "w" "h" "r" "t" "e" "t"

And to compare with Frank's:

和弗兰克的相比:

frank = function(x)
{
    x[, res := NA_character_]
    for(v in rev(names(x))[-1]) x[is.na(res), res := get(v)]
    return(x$res)       
}

DAT1 = as.data.table(lapply(ceiling(seq(0, 1e4, length.out = 1e2)), 
                     function(n) c(rep(NA, n), sample(letters, 3e5 - n, TRUE))))
DAT2 = copy(DAT1)
microbenchmark::microbenchmark(alex(as.list(DAT1)), 
                               { frank(DAT2); DAT2[, res := NULL] }, 
                               times = 30)
#Unit: milliseconds
#                                            expr       min        lq    median        uq       max neval
#                             alex(as.list(DAT1))  102.9767  108.5134  117.6595  133.1849  166.9594    30
# {     frank(DAT2)     DAT2[, `:=`(res, NULL)] } 1413.3296 1455.1553 1497.3517 1540.8705 1685.0589    30
identical(alex(as.list(DAT1)), frank(DAT2))
#[1] TRUE

#3


4  

We convert the 'data.frame' to 'data.table' and create a row id column (setDT(df1, keep.rownames=TRUE)). We reshape the 'wide' to 'long' format with melt. Grouped by 'rn', if there is no NA element in 'value' column, we get the last element of 'value' (value[.N]) or else, we get the element before the first NA in the 'value' to get the 'V1' column, which we extract ($V1).

我们将“data.frame”转换为“data”。创建一个行id列(setDT(df1, keep.rownames=TRUE))。我们用熔融将“宽”格式重塑为“长”格式。根据'rn'进行分组,如果'value'列中没有NA元素,我们得到'value' (value[. n])的最后一个元素,否则,我们得到'value'中的第一个NA之前的元素,得到'V1'列,我们提取它($V1)。

melt(setDT(df1, keep.rownames=TRUE), id.var='rn')[,
     if(!any(is.na(value))) value[.N] 
     else value[which(is.na(value))[1]-1], by =  rn]$V1
#[1] "u" "q" "w" "h" "r" "t" "e" "t"

In case, the data is already a data.table

在这种情况下,数据已经是一个data.table

dat[, rn := 1:.N]#create the 'rn' column
melt(dat, id.var='rn')[, #melt from wide to long format
     if(!any(is.na(value))) value[.N] 
     else value[which(is.na(value))[1]-1], by =  rn]$V1
#[1] "u" "q" "w" "h" "r" "t" "e" "t"

Here is another option

这是另一个选择

dat[, colInd := sum(!is.na(.SD)), by=1:nrow(dat)][
   , as.character(.SD[[.BY[[1]]]]), by=colInd]

Or as @Frank mentioned in the comments, we can use na.rm=TRUE from melt and make it more compact

或者正如@Frank在评论中提到的,我们可以使用na。rm=真实的熔体,使它更紧凑

 melt(dat[, r := .I], id="r", na.rm=TRUE)[, value[.N], by=r]

#4


3  

Here is a one liner base R approach:

这里有一个线性基准R方法:

sapply(split(dat, seq(nrow(dat))), function(x) tail(x[!is.na(x)],1))
#  1   2   3   4   5   6   7   8 
#"u" "q" "w" "h" "r" "t" "e" "t" 

#5


3  

I'm not sure how to improve upon @alexis's answer beyond what @Frank has already done, but your original approach with base R wasn't too far off of something that is reasonably performant.

我不知道如何改进@alexis的答案,超出了@Frank已经做过的事情,但是你的原始的R的方法并没有太离谱。

Here's a variant of your approach that I liked because (1) it's reasonably quick and (2) it doesn't require too much thought to figure out what's going on:

这是我喜欢你的方法的一个变体,因为(1)它是相当快的;(2)它不需要太多的思考就能弄明白发生了什么:

as.matrix(dat)[cbind(1:nrow(dat), max.col(!is.na(dat), "last"))] 

The most expensive part of this seems to be the as.matrix(dat) part, but otherwise, it seems to be faster than the melt approach that @akrun shared.

其中最昂贵的部分似乎是asn .matrix(dat)部分,但除此之外,它似乎比@akrun共享的melt方法要快。