Given a list a containing vectors of unequal length and a vector b containing some elements from the vectors in a, I want to get a vector of the same length as b that gives, for each element of b, the index of the vector in a that contains it (this is a bad explanation, I know)...
The following code does the job:
a <- list(1:3, 4:5, 6:9)
b <- c(2, 3, 5, 8)
sapply(b, function(x, list) which(unlist(lapply(list, function(y, z) z %in% y, z=x))), list=a)
[1] 1 1 2 3
Replacing the sapply with a for loop achieves the same result, of course.
The problem is that this code will be used with lists and vectors with a length above 1000. On a real-life data set the function takes around 15 seconds (both the for loop and the sapply version).
Does anyone have an idea how to speed this up, save for a parallel approach? I have failed to see a vectorized approach (and I cannot program in C, though that would probably be the fastest).
Edit:
I just want to emphasize Aaron's elegant solution using match(), which gave a speed increase on the order of 1667 times (from 15 seconds to 0.009 seconds).
I expanded on it a bit to allow multiple matches (the return value is then a list):
a <- list(1:3, 3:5, 3:7)
b <- c(3, 5)
g <- rep(seq_along(a), sapply(a, length))
sapply(b, function(x) g[which(unlist(a) %in% x)])
[[1]]
[1] 1 2 3
[[2]]
[1] 2 3
The runtime for this was 0.169 seconds, which is arguably a fair bit slower, but on the other hand it is more flexible.
2 Solutions
#1
13
Here's one possibility using match:
> a <- list(1:3, 4:5, 6:9)
> b <- c(2, 3, 5, 8)
> g <- rep(seq_along(a), sapply(a, length))
> g[match(b, unlist(a))]
[1] 1 1 2 3
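To make the behaviour of the one-liner above explicit, here is a minimal sketch that wraps it in a helper. The name first_group is made up for illustration and is not part of the original answer. It assumes each element of b should map to the first vector in a that contains it, and it uses lengths(), which needs R 3.2.0 or later (sapply(a, length) works the same way).

```r
# Hypothetical helper; the name is an assumption, not from the answer.
first_group <- function(a, b) {
  # g[i] is the index of the list element that unlist(a)[i] came from
  g <- rep(seq_along(a), lengths(a))
  # match() gives the position of the first occurrence, or NA if absent
  g[match(b, unlist(a))]
}

a <- list(1:3, 4:5, 6:9)
b <- c(2, 3, 5, 8, 42)   # 42 does not occur anywhere in a
first_group(a, b)
# [1]  1  1  2  3 NA
```

Elements of b that occur nowhere in a come back as NA, which may or may not be what you want.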
findInterval is another option:
> findInterval(match(b, unlist(a)), cumsum(c(0,sapply(a, length)))+1)
[1] 1 1 2 3
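Spelling out the intermediate values may help to see why this works. The variable names pos and starts below are only for illustration; the logic is the same as in the line above.

```r
a <- list(1:3, 4:5, 6:9)
b <- c(2, 3, 5, 8)

# position of each element of b in the flattened vector: 2 3 5 8
pos <- match(b, unlist(a))

# first flattened position of each vector in a, plus one past the end: 1 4 6 10
starts <- cumsum(c(0, lengths(a))) + 1

# findInterval() reports which block [starts[i], starts[i + 1]) each position falls in
findInterval(pos, starts)
# [1] 1 1 2 3
```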
For returning a list, try this:
a <- list(1:3, 4:5, 5:9)
b <- c(2,3,5,8,5)
g <- rep(seq_along(a), sapply(a, length))
aa <- unlist(a)
au <- unique(aa)
af <- factor(aa, levels=au)
gg <- split(g, af)
gg[match(b, au)]
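As a rough illustration of what comes back, the sketch below repeats the same steps and inspects a single element. The unname() call and the name res are additions for readability only; without unname() the list elements are named after the matched values.

```r
a <- list(1:3, 4:5, 5:9)
b <- c(2, 3, 5, 8, 5)

g  <- rep(seq_along(a), lengths(a))       # source index for every flattened value
aa <- unlist(a)
au <- unique(aa)
gg <- split(g, factor(aa, levels = au))   # one entry per distinct value in a

res <- unname(gg[match(b, au)])           # one list element per element of b
res[[3]]                                  # 5 occurs in a[[2]] and a[[3]]
# [1] 2 3
```

The split() is done once over the distinct values of a, so repeated values in b (5 appears twice here) are only looked up, not recomputed.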
#2
0
As a comment to your post suggests, it depends on what you want to do if/when the same element appears in multiple vectors in a. Assuming that you want the lowest index, you could do:
apply(sapply(a, function(vec) {b %in% vec}), 1, which.max)
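One thing to watch with this approach: which.max() returns 1 for a row that is all FALSE, so an element of b that is in none of the vectors would be reported as belonging to the first one. A possible guard is sketched below; the names hits and idx are made up for illustration, and vapply() plus matrix() are used here only to keep a matrix when b has a single element.

```r
a <- list(1:3, 4:5, 6:9)
b <- c(2, 3, 5, 8, 42)   # 42 is in none of the vectors

# rows correspond to elements of b, columns to the vectors in a
hits <- vapply(a, function(vec) b %in% vec, logical(length(b)))
hits <- matrix(hits, nrow = length(b))   # stays a matrix even if length(b) == 1

idx <- apply(hits, 1, which.max)         # lowest matching index per row
idx[rowSums(hits) == 0] <- NA            # no match at all: report NA instead of 1
idx
# [1]  1  1  2  3 NA
```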