R: How to find and extract values in a data frame

Time: 2021-08-31 16:18:18

I have a character vector in R with 330,000 values, e.g.:

amp184660
amp947
amp53303
amp364886
amp121615

and a data frame like this:

[Image: the data frame; its first column is "Assay Name" and it includes a "Chrom" column]

I want to find each value from my character vector in the first column of the data frame, i.e. "Assay Name", and then output its corresponding chromosome position, i.e. "Chrom", into a new vector. I want to do this as quickly as possible, as there are about 330k entries and doing this via grep in a loop will take about 12 hours to finish.

Any ideas? Thanks Jason.

3 solutions

#1


1  

I would suggest %in%, which is likely to be faster than merge. Here's a toy example:

## Assume that "x" is your data.frame
set.seed(1)
x <- data.frame(Assay = sample(letters, 30, replace = TRUE), 
                Chrom = 4, ChromPos = rnorm(30))

## And that "y" is your vector you want to match
y <- c("a", "b", "c", "d", "e")

## Here's how you can use %in%
x[x$Assay %in% y, ]
#    Assay Chrom   ChromPos
# 10     b     4  0.6198257
# 12     e     4 -0.1557955
# 24     d     4  1.1000254
# 27     a     4 -0.2533617

## And can also directly extract a specific column
x[x$Assay %in% y, "ChromPos"]
# [1]  0.6198257 -0.1557955  1.1000254 -0.2533617
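
Note that %in% returns the matching rows in the data frame's order, not in the order of the vector. If one result per element of the vector is wanted, a lookup with match() may be a better fit. This is only a minimal sketch, assuming the Assay values are unique in the data frame; the names x2, y2, idx and pos and the toy values are hypothetical, not taken from the question:

```r
## Hypothetical toy data; the real column names and values may differ
x2 <- data.frame(Assay = c("amp947", "amp53303", "amp184660"),
                 Chrom = c(4, 7, 1),
                 stringsAsFactors = FALSE)
y2 <- c("amp184660", "amp947", "amp000")

## match() gives, for each element of y2, the row index of its first
## occurrence in x2$Assay (NA if it does not occur)
idx <- match(y2, x2$Assay)

## Index the Chrom column with those row numbers: one result per element of y2
pos <- x2$Chrom[idx]
pos
# [1]  1  4 NA
```

Values that are missing from the data frame show up as NA, so the result always has the same length as the input vector.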

#2


0  

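# Assume your data frame is called your_data_frame and your vector is called your_character_vector.
# Assumption: the key column is named Assay.Name in your_data_frame, since R converts
# "Assay Name" to a syntactic name by default; adjust if your column name differs.

vector_frame <- data.frame(Assay.Name = your_character_vector)
merge(vector_frame, your_data_frame, by = "Assay.Name")[, 3]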

Note: I changed the column notation from $Chrom to [,3] because I saw you wanted the third column, and R will rename that column for use with $, e.g. to Chrom.Pos..bp. or something similar. If you type $ and press TAB in the RStudio editor, it will show you the options.

#3


0  

Just in case runtime is still a problem, using the data.table package is approx. 100x faster than merge and 50x faster than %in%:

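library(data.table)
dt <- as.data.table(yourDataFrame)  # convert the data frame to a data.table
setkey(dt, Assay)                   # sort by Assay and mark it as the key
dt[J(yourVector)]                   # keyed join: rows whose Assay matches yourVector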
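
To get just the chromosome column as a vector rather than whole rows, the keyed join can also take a column expression. This is a sketch under the assumption that the columns are named Assay and Chrom and that Assay values are unique; dt2, v and chrom are hypothetical names with made-up toy values:

```r
library(data.table)

## Hypothetical toy data standing in for the real data frame
dt2 <- data.table(Assay = c("amp947", "amp53303", "amp184660"),
                  Chrom = c(4, 7, 1))
setkey(dt2, Assay)

v <- c("amp184660", "amp947", "amp000")

## One result per element of v, in the order of v;
## values with no match in the key give NA (the default nomatch behaviour)
chrom <- dt2[J(v), Chrom]
chrom
```

If unmatched values should be dropped instead of returned as NA, nomatch = 0L can be passed in the same call.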
