I have a data.table of factor columns, and I want to pull out the label of the last non-missing value in each row. It's kindof a typical max.col
situation, but I don't want to needlessly be coercing as I am trying to optimize this code using data.table. The real data has other types of columns as well.
我有一个数据。因子列的表,我想提取每行中最后一个非缺失值的标签。这是典型的最大值。col的情况,但我不想不必要地强制使用data.table来优化这段代码。真正的数据也有其他类型的列。
Here is the example,
这是例子,
## Some sample data
set.seed(0)
dat <- sapply(split(letters[1:25], rep.int(1:5, 5)), sample, size=8, replace=TRUE)
dat[upper.tri(dat)] <- NA
dat[4:5, 4:5] <- NA # the real data isnt nice and upper.triangular
dat <- data.frame(dat, stringsAsFactors = TRUE) # factor columns
## So, it looks like this
setDT(dat)[]
# X1 X2 X3 X4 X5
# 1: u NA NA NA NA
# 2: f q NA NA NA
# 3: f b w NA NA
# 4: k g h NA NA
# 5: u b r NA NA
# 6: f q w x t
# 7: u g h i e
# 8: u q r n t
## I just want to get the labels of the factors
## that are 'rightmost' in each row. I tried a number of things
## that probably don't make sense here.
## This just about gets the column index
dat[, colInd := sum(!is.na(.SD)), by=1:nrow(dat)]
This is the goal though, to extract these labels, here using regular base functions.
这是我们的目标,使用常规的基本函数提取这些标签。
## Using max.col and a data.frame
df1 <- as.data.frame(dat)
inds <- max.col(is.na(as.matrix(df1)), ties="first")-1
inds[inds==0] <- ncol(df1)
df1[cbind(1:nrow(df1), inds)]
# [1] "u" "q" "w" "h" "r" "t" "e" "t"
5 个解决方案
#1
10
Here's another way:
这是另一种方式:
dat[, res := NA_character_]
for (v in rev(names(dat))[-1]) dat[is.na(res), res := get(v)]
X1 X2 X3 X4 X5 res
1: u NA NA NA NA u
2: f q NA NA NA q
3: f b w NA NA w
4: k g h NA NA h
5: u b r NA NA r
6: f q w x t t
7: u g h i e e
8: u q r n t t
Benchmarks Using the same data as @alexis_laz and making (apparently) superficial changes to the functions, I see different results. Just showing them here in case anyone is curious. Alexis' answer (with small modifications) still comes out ahead.
使用与@alexis_laz相同的数据并(显然)对函数进行表面更改的基准测试,我看到了不同的结果。给他们看以防有人好奇。亚历克西斯的回答(稍加修改)仍然遥遥领先。
Functions:
功能:
alex = function(x, ans = rep_len(NA, length(x[[1L]])), wh = seq_len(length(x[[1L]]))){
if(!length(wh)) return(ans)
ans[wh] = as.character(x[[length(x)]])[wh]
Recall(x[-length(x)], ans, wh[is.na(ans[wh])])
}
alex2 = function(x){
x[, res := NA_character_]
wh = x[, .I]
for (v in (length(x)-1):1){
if (!length(wh)) break
set(x, j="res", i=wh, v = x[[v]][wh])
wh = wh[is.na(x$res[wh])]
}
x$res
}
frank = function(x){
x[, res := NA_character_]
for(v in rev(names(x))[-1]) x[is.na(res), res := get(v)]
return(x$res)
}
frank2 = function(x){
x[, res := NA_character_]
for(v in rev(names(x))[-1]) x[is.na(res), res := .SD, .SDcols=v]
x$res
}
Example data and benchmark:
示例数据和基准:
DAT1 = as.data.table(lapply(ceiling(seq(0, 1e4, length.out = 1e2)),
function(n) c(rep(NA, n), sample(letters, 3e5 - n, TRUE))))
DAT2 = copy(DAT1)
DAT3 = as.list(copy(DAT1))
DAT4 = copy(DAT1)
library(microbenchmark)
microbenchmark(frank(DAT1), frank2(DAT2), alex(DAT3), alex2(DAT4), times = 30)
Unit: milliseconds
expr min lq mean median uq max neval
frank(DAT1) 850.05980 909.28314 985.71700 979.84230 1023.57049 1183.37898 30
frank2(DAT2) 88.68229 93.40476 118.27959 107.69190 121.60257 346.48264 30
alex(DAT3) 98.56861 109.36653 131.21195 131.20760 149.99347 183.43918 30
alex2(DAT4) 26.14104 26.45840 30.79294 26.67951 31.24136 50.66723 30
#2
9
Another idea -similar to Frank's- that tries (1) to avoid subsetting 'data.table' rows (which I assume must have some cost) and (2) to avoid checking a length == nrow(dat)
vector for NA
s in every iteration.
另一个想法——类似于Frank的想法——尝试(1)避免细分“数据”。表的行(我认为必须有一些代价)和(2)避免在每次迭代中检查NAs的长度== nrow(dat)向量。
alex = function(x, ans = rep_len(NA, length(x[[1L]])), wh = seq_len(length(x[[1L]])))
{
if(!length(wh)) return(ans)
ans[wh] = as.character(x[[length(x)]])[wh]
Recall(x[-length(x)], ans, wh[is.na(ans[wh])])
}
alex(as.list(dat)) #had some trouble with 'data.table' subsetting
# [1] "u" "q" "w" "h" "r" "t" "e" "t"
And to compare with Frank's:
和弗兰克的相比:
frank = function(x)
{
x[, res := NA_character_]
for(v in rev(names(x))[-1]) x[is.na(res), res := get(v)]
return(x$res)
}
DAT1 = as.data.table(lapply(ceiling(seq(0, 1e4, length.out = 1e2)),
function(n) c(rep(NA, n), sample(letters, 3e5 - n, TRUE))))
DAT2 = copy(DAT1)
microbenchmark::microbenchmark(alex(as.list(DAT1)),
{ frank(DAT2); DAT2[, res := NULL] },
times = 30)
#Unit: milliseconds
# expr min lq median uq max neval
# alex(as.list(DAT1)) 102.9767 108.5134 117.6595 133.1849 166.9594 30
# { frank(DAT2) DAT2[, `:=`(res, NULL)] } 1413.3296 1455.1553 1497.3517 1540.8705 1685.0589 30
identical(alex(as.list(DAT1)), frank(DAT2))
#[1] TRUE
#3
4
We convert the 'data.frame' to 'data.table' and create a row id column (setDT(df1, keep.rownames=TRUE)
). We reshape the 'wide' to 'long' format with melt
. Grouped by 'rn', if
there is no NA
element in 'value' column, we get the last element of 'value' (value[.N]
) or else
, we get the element before the first NA in the 'value' to get the 'V1' column, which we extract ($V1
).
我们将“data.frame”转换为“data”。创建一个行id列(setDT(df1, keep.rownames=TRUE))。我们用熔融将“宽”格式重塑为“长”格式。根据'rn'进行分组,如果'value'列中没有NA元素,我们得到'value' (value[. n])的最后一个元素,否则,我们得到'value'中的第一个NA之前的元素,得到'V1'列,我们提取它($V1)。
melt(setDT(df1, keep.rownames=TRUE), id.var='rn')[,
if(!any(is.na(value))) value[.N]
else value[which(is.na(value))[1]-1], by = rn]$V1
#[1] "u" "q" "w" "h" "r" "t" "e" "t"
In case, the data is already a data.table
在这种情况下,数据已经是一个data.table
dat[, rn := 1:.N]#create the 'rn' column
melt(dat, id.var='rn')[, #melt from wide to long format
if(!any(is.na(value))) value[.N]
else value[which(is.na(value))[1]-1], by = rn]$V1
#[1] "u" "q" "w" "h" "r" "t" "e" "t"
Here is another option
这是另一个选择
dat[, colInd := sum(!is.na(.SD)), by=1:nrow(dat)][
, as.character(.SD[[.BY[[1]]]]), by=colInd]
Or as @Frank mentioned in the comments, we can use na.rm=TRUE
from melt
and make it more compact
或者正如@Frank在评论中提到的,我们可以使用na。rm=真实的熔体,使它更紧凑
melt(dat[, r := .I], id="r", na.rm=TRUE)[, value[.N], by=r]
#4
3
Here is a one liner base R
approach:
这里有一个线性基准R方法:
sapply(split(dat, seq(nrow(dat))), function(x) tail(x[!is.na(x)],1))
# 1 2 3 4 5 6 7 8
#"u" "q" "w" "h" "r" "t" "e" "t"
#5
3
I'm not sure how to improve upon @alexis's answer beyond what @Frank has already done, but your original approach with base R wasn't too far off of something that is reasonably performant.
我不知道如何改进@alexis的答案,超出了@Frank已经做过的事情,但是你的原始的R的方法并没有太离谱。
Here's a variant of your approach that I liked because (1) it's reasonably quick and (2) it doesn't require too much thought to figure out what's going on:
这是我喜欢你的方法的一个变体,因为(1)它是相当快的;(2)它不需要太多的思考就能弄明白发生了什么:
as.matrix(dat)[cbind(1:nrow(dat), max.col(!is.na(dat), "last"))]
The most expensive part of this seems to be the as.matrix(dat)
part, but otherwise, it seems to be faster than the melt
approach that @akrun shared.
其中最昂贵的部分似乎是asn .matrix(dat)部分,但除此之外,它似乎比@akrun共享的melt方法要快。
#1
10
Here's another way:
这是另一种方式:
dat[, res := NA_character_]
for (v in rev(names(dat))[-1]) dat[is.na(res), res := get(v)]
X1 X2 X3 X4 X5 res
1: u NA NA NA NA u
2: f q NA NA NA q
3: f b w NA NA w
4: k g h NA NA h
5: u b r NA NA r
6: f q w x t t
7: u g h i e e
8: u q r n t t
Benchmarks Using the same data as @alexis_laz and making (apparently) superficial changes to the functions, I see different results. Just showing them here in case anyone is curious. Alexis' answer (with small modifications) still comes out ahead.
使用与@alexis_laz相同的数据并(显然)对函数进行表面更改的基准测试,我看到了不同的结果。给他们看以防有人好奇。亚历克西斯的回答(稍加修改)仍然遥遥领先。
Functions:
功能:
alex = function(x, ans = rep_len(NA, length(x[[1L]])), wh = seq_len(length(x[[1L]]))){
if(!length(wh)) return(ans)
ans[wh] = as.character(x[[length(x)]])[wh]
Recall(x[-length(x)], ans, wh[is.na(ans[wh])])
}
alex2 = function(x){
x[, res := NA_character_]
wh = x[, .I]
for (v in (length(x)-1):1){
if (!length(wh)) break
set(x, j="res", i=wh, v = x[[v]][wh])
wh = wh[is.na(x$res[wh])]
}
x$res
}
frank = function(x){
x[, res := NA_character_]
for(v in rev(names(x))[-1]) x[is.na(res), res := get(v)]
return(x$res)
}
frank2 = function(x){
x[, res := NA_character_]
for(v in rev(names(x))[-1]) x[is.na(res), res := .SD, .SDcols=v]
x$res
}
Example data and benchmark:
示例数据和基准:
DAT1 = as.data.table(lapply(ceiling(seq(0, 1e4, length.out = 1e2)),
function(n) c(rep(NA, n), sample(letters, 3e5 - n, TRUE))))
DAT2 = copy(DAT1)
DAT3 = as.list(copy(DAT1))
DAT4 = copy(DAT1)
library(microbenchmark)
microbenchmark(frank(DAT1), frank2(DAT2), alex(DAT3), alex2(DAT4), times = 30)
Unit: milliseconds
expr min lq mean median uq max neval
frank(DAT1) 850.05980 909.28314 985.71700 979.84230 1023.57049 1183.37898 30
frank2(DAT2) 88.68229 93.40476 118.27959 107.69190 121.60257 346.48264 30
alex(DAT3) 98.56861 109.36653 131.21195 131.20760 149.99347 183.43918 30
alex2(DAT4) 26.14104 26.45840 30.79294 26.67951 31.24136 50.66723 30
#2
9
Another idea -similar to Frank's- that tries (1) to avoid subsetting 'data.table' rows (which I assume must have some cost) and (2) to avoid checking a length == nrow(dat)
vector for NA
s in every iteration.
另一个想法——类似于Frank的想法——尝试(1)避免细分“数据”。表的行(我认为必须有一些代价)和(2)避免在每次迭代中检查NAs的长度== nrow(dat)向量。
alex = function(x, ans = rep_len(NA, length(x[[1L]])), wh = seq_len(length(x[[1L]])))
{
if(!length(wh)) return(ans)
ans[wh] = as.character(x[[length(x)]])[wh]
Recall(x[-length(x)], ans, wh[is.na(ans[wh])])
}
alex(as.list(dat)) #had some trouble with 'data.table' subsetting
# [1] "u" "q" "w" "h" "r" "t" "e" "t"
And to compare with Frank's:
和弗兰克的相比:
frank = function(x)
{
x[, res := NA_character_]
for(v in rev(names(x))[-1]) x[is.na(res), res := get(v)]
return(x$res)
}
DAT1 = as.data.table(lapply(ceiling(seq(0, 1e4, length.out = 1e2)),
function(n) c(rep(NA, n), sample(letters, 3e5 - n, TRUE))))
DAT2 = copy(DAT1)
microbenchmark::microbenchmark(alex(as.list(DAT1)),
{ frank(DAT2); DAT2[, res := NULL] },
times = 30)
#Unit: milliseconds
# expr min lq median uq max neval
# alex(as.list(DAT1)) 102.9767 108.5134 117.6595 133.1849 166.9594 30
# { frank(DAT2) DAT2[, `:=`(res, NULL)] } 1413.3296 1455.1553 1497.3517 1540.8705 1685.0589 30
identical(alex(as.list(DAT1)), frank(DAT2))
#[1] TRUE
#3
4
We convert the 'data.frame' to 'data.table' and create a row id column (setDT(df1, keep.rownames=TRUE)
). We reshape the 'wide' to 'long' format with melt
. Grouped by 'rn', if
there is no NA
element in 'value' column, we get the last element of 'value' (value[.N]
) or else
, we get the element before the first NA in the 'value' to get the 'V1' column, which we extract ($V1
).
我们将“data.frame”转换为“data”。创建一个行id列(setDT(df1, keep.rownames=TRUE))。我们用熔融将“宽”格式重塑为“长”格式。根据'rn'进行分组,如果'value'列中没有NA元素,我们得到'value' (value[. n])的最后一个元素,否则,我们得到'value'中的第一个NA之前的元素,得到'V1'列,我们提取它($V1)。
melt(setDT(df1, keep.rownames=TRUE), id.var='rn')[,
if(!any(is.na(value))) value[.N]
else value[which(is.na(value))[1]-1], by = rn]$V1
#[1] "u" "q" "w" "h" "r" "t" "e" "t"
In case, the data is already a data.table
在这种情况下,数据已经是一个data.table
dat[, rn := 1:.N]#create the 'rn' column
melt(dat, id.var='rn')[, #melt from wide to long format
if(!any(is.na(value))) value[.N]
else value[which(is.na(value))[1]-1], by = rn]$V1
#[1] "u" "q" "w" "h" "r" "t" "e" "t"
Here is another option
这是另一个选择
dat[, colInd := sum(!is.na(.SD)), by=1:nrow(dat)][
, as.character(.SD[[.BY[[1]]]]), by=colInd]
Or as @Frank mentioned in the comments, we can use na.rm=TRUE
from melt
and make it more compact
或者正如@Frank在评论中提到的,我们可以使用na。rm=真实的熔体,使它更紧凑
melt(dat[, r := .I], id="r", na.rm=TRUE)[, value[.N], by=r]
#4
3
Here is a one liner base R
approach:
这里有一个线性基准R方法:
sapply(split(dat, seq(nrow(dat))), function(x) tail(x[!is.na(x)],1))
# 1 2 3 4 5 6 7 8
#"u" "q" "w" "h" "r" "t" "e" "t"
#5
3
I'm not sure how to improve upon @alexis's answer beyond what @Frank has already done, but your original approach with base R wasn't too far off of something that is reasonably performant.
我不知道如何改进@alexis的答案,超出了@Frank已经做过的事情,但是你的原始的R的方法并没有太离谱。
Here's a variant of your approach that I liked because (1) it's reasonably quick and (2) it doesn't require too much thought to figure out what's going on:
这是我喜欢你的方法的一个变体,因为(1)它是相当快的;(2)它不需要太多的思考就能弄明白发生了什么:
as.matrix(dat)[cbind(1:nrow(dat), max.col(!is.na(dat), "last"))]
The most expensive part of this seems to be the as.matrix(dat)
part, but otherwise, it seems to be faster than the melt
approach that @akrun shared.
其中最昂贵的部分似乎是asn .matrix(dat)部分,但除此之外,它似乎比@akrun共享的melt方法要快。