I have a vector of values, call it X, and a data frame, call it dat.fram. I want to run something like "grep" or "which" to find all the indices of dat.fram[,3] which match each of the elements of X.
Below is the very inefficient for loop I currently use. Note that X has many elements, and each member of "match.ind" can hold zero or more matches. Also, dat.fram has over 1 million observations. Is there a vectorized function in R that would make this process more efficient?
Ultimately, I need a list, since I will pass it to another function that retrieves the appropriate values from dat.fram.
Code:
match.ind=list()
for(i in 1:150000){
match.ind[[i]]=which(dat.fram[,3]==X[i])
}
1 solution
#1
UPDATE:
Ok, wow, I just found an awesome way of doing this... it's really slick. Wondering if it's useful in other contexts...?!
### define v as a sample column of data - you should define v to be
### the column in the data frame you mentioned (dat.fram[,3])
v = sample(1:150000, 1500000, replace=TRUE)
### now here's the trick: concatenate the indices for each possible value of v,
### to form mybiglist - the names of mybiglist give you the possible values
### of v, and the values in mybiglist give you the index points
mybiglist = tapply(seq_along(v),v,c)
### now you just want the parts of this that intersect with X... again I'll
### generate a random X but use whatever X you need to
X = sample(1:200000, 150000)
mylist = mybiglist[names(mybiglist)%in%X]
And that's it! As a check, let's look at the first 3 elements of mylist:
> mylist[1:3]
$`1`
[1] 401143 494448 703954 757808 1364904 1485811
$`2`
[1] 230769 332970 389601 582724 804046 997184 1080412 1169588 1310105
$`4`
[1] 149021 282361 289661 456147 774672 944760 969734 1043875 1226377
There's a gap at 3, as 3 doesn't appear in X (even though it occurs in v). And the numbers listed against 4 are the index points in v where 4 appears:
> which(X==3)
integer(0)
> which(v==3)
[1] 102194 424873 468660 593570 713547 769309 786156 828021 870796
883932 1036943 1246745 1381907 1437148
> which(v==4)
[1] 149021 282361 289661 456147 774672 944760 969734 1043875 1226377
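The same check can be repeated on a vector small enough to verify by hand. This is only a sketch: the tiny v and X are made-up stand-ins, and by_value and by_value_split are illustrative names. It also shows that split() builds the same grouping as the tapply call, returning a plain named list.

```r
# Tiny, hand-checkable stand-ins for the data column and the lookup vector
v <- c(5L, 3L, 5L, 7L, 3L)
X <- c(3L, 4L, 5L)

# Group the positions of v by the value found at each position
by_value <- tapply(seq_along(v), v, c)

# split() produces the same groups as a plain named list
by_value_split <- split(seq_along(v), v)

# Keep only the groups whose value also occurs in X
mylist <- by_value_split[names(by_value_split) %in% X]
print(mylist)
```

Here 7 is dropped because it is not in X, and 4 has no entry because it never occurs in v.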
Finally, it's worth noting that values that appear in X but not in v won't have an entry in the list. That is presumably what you want anyway, as there is nothing to retrieve for them!
Extra note: You can use the code below to create an NA entry for each member of X not in v...
blanks = sort(setdiff(X,names(mylist)))
mylist_extras = rep(list(NA),length(blanks))
names(mylist_extras) = blanks
mylist_all = c(mylist,mylist_extras)
mylist_all = mylist_all[order(as.numeric(names(mylist_all)))]
Fairly self-explanatory: mylist_extras is a list holding the extra entries you need (its names are the values of X that do not appear in names(mylist), and each entry is simply NA). The final two lines first merge mylist and mylist_extras, and then reorder the result so that the names in mylist_all are in numeric order. These names should then match exactly the (unique) values in the vector X, in sorted order.
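If the list has to line up with X element by element, the way the loop in the question does, one possible alternative is to split the positions of v by a factor whose levels are taken from X. This is a sketch and not part of the approach above: idx_by_X and loop_result are illustrative names, and it assumes v and X hold the same type of value (both integer here). Duplicates in X are collapsed by unique().

```r
# Small stand-ins; v would be dat.fram[, 3] and X the lookup vector
v <- c(5L, 3L, 5L, 7L, 3L)
X <- c(5L, 4L, 3L)

# Levels come from X, so values of v outside X become NA and are dropped,
# while values of X that never occur in v keep an empty integer(0) slot
idx_by_X <- split(seq_along(v), factor(v, levels = unique(X)))

# Reference result: the loop from the question, written with lapply
loop_result <- lapply(X, function(x) which(v == x))

# Compare the two results, ignoring the names that split() adds
print(identical(unname(idx_by_X), loop_result))
```

Because empty matches come back as integer(0) rather than NA, the result can be used for indexing without a special case.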
Cheers! :)
ORIGINAL POST BELOW... superseded by the above, obviously!
Here's a toy example with tapply that might well run significantly quicker... I made X and d relatively small so you could see what's going on:
X = 3:7
n = 100
d = data.frame(a = sample(1:10,n,replace=TRUE), b = sample(1:10,n,replace=TRUE),
c = sample(1:10,n,replace=TRUE), stringsAsFactors = FALSE)
tapply(X,X,function(x) {which(d[,3]==x)})
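For what it's worth, this toy version still calls which() once per value of X, so it does the same amount of scanning as the original loop. The sketch below checks that the two return the same index sets. The seed and the names via_tapply, via_loop and same are assumptions added for illustration, and the comparison relies on X having no repeated values.

```r
set.seed(1)  # assumed seed, only so the toy data is reproducible
X <- 3:7
n <- 100
d <- data.frame(a = sample(1:10, n, replace = TRUE),
                b = sample(1:10, n, replace = TRUE),
                c = sample(1:10, n, replace = TRUE))

# The toy tapply call: one which() scan of d[, 3] per value of X
via_tapply <- tapply(X, X, function(x) which(d[, 3] == x))

# The loop from the question, written with lapply for comparison
via_loop <- lapply(X, function(x) which(d[, 3] == x))

# Compare the index sets element by element
same <- all(vapply(seq_along(X),
                   function(i) identical(via_tapply[[i]], via_loop[[i]]),
                   logical(1)))
print(same)
```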