R regex using a vector and a two-column data frame

Time: 2022-04-22 19:35:43

Suppose I have a vector and a two-column data.frame.

motif <- c("DAGTACTHV","AGT","WSAT")

motif_ref <- data.frame("sym"=c("W","S","M","K","R","Y","B","D","H","V","N"),
                              "bases"=c("(A|T)","(C|G)","(A|C)","(G|T)","(A|G)","(C|T)","(C|G|T)","(A|G|T)","(A|C|T)","(A|C|G)","(A|C|G|T)"))

I'm trying to use stri_replace_all_regex to replace every occurrence in motif of a symbol from motif_ref$sym with the corresponding element of motif_ref$bases.

library(stringi)

m <- stri_replace_all_regex(motif, motif_ref$sym, motif_ref$bases)

However, this gives me:

> m
[1] "DAGTACTHV"       "DAGTACTHV"       "DAGTACTHV"       "DAGTACTHV"       "DAGTACTHV"       "DAGTACTHV"       "DAGTACTHV"      
 [8] "(A|G|T)AGTACTHV" "DAGTACT(A|C|T)V" "DAGTACTH(A|C|G)" "DAGTACTHV"      

when what I actually want is something like:

> m 
[1] "(A|G|T)AGTACT(A|C|T)(A|C|G)" "AGT" "(A|T)(C|G)AT"

I was thinking about using chartr, but I don't know whether it can replace single characters with longer strings.

Thanks everyone


1 solution

#1



This is a perfect use case for the vectorize_all argument of the stri_replace_all_* functions.

library(stringi)

stri_replace_all_fixed(motif, motif_ref$sym, motif_ref$bases, vectorize_all = FALSE)
# [1] "(A|G|T)AGTACT(A|C|T)(A|C|G)" "AGT"                         "(A|T)(C|G)AT"

The stri_replace_all_fixed call can also be written a bit more clearly using with():

with(motif_ref, {
    stri_replace_all_fixed(motif, sym, bases, vectorize_all = FALSE)
})
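
To see what vectorize_all = FALSE does here, a rough base R equivalent is sketched below: every symbol/replacement pair is applied in turn to the whole motif vector. This is only an illustration of the idea, not stringi's actual implementation, and the name expanded is made up for the example. Applying the pairs one after another is safe in this case because the replacements contain only A, C, G, T, | and parentheses, none of which is itself a symbol in motif_ref$sym.

```r
motif <- c("DAGTACTHV", "AGT", "WSAT")

motif_ref <- data.frame(
  sym   = c("W", "S", "M", "K", "R", "Y", "B", "D", "H", "V", "N"),
  bases = c("(A|T)", "(C|G)", "(A|C)", "(G|T)", "(A|G)", "(C|T)",
            "(C|G|T)", "(A|G|T)", "(A|C|T)", "(A|C|G)", "(A|C|G|T)"),
  stringsAsFactors = FALSE  # keep both columns as character vectors
)

# Start from the original motifs and apply each (sym, bases) pair to all of them.
expanded <- motif
for (i in seq_len(nrow(motif_ref))) {
  # fixed = TRUE treats the symbol as a literal string, not a regex
  expanded <- gsub(motif_ref$sym[i], motif_ref$bases[i], expanded, fixed = TRUE)
}

expanded
# [1] "(A|G|T)AGTACT(A|C|T)(A|C|G)" "AGT"                         "(A|T)(C|G)AT"
```

The result has one element per motif, unlike the default vectorize_all = TRUE, which pairs the inputs element by element.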

Note that stri_replace_all_fixed is more efficient than stri_replace_all_regex here, since we are searching for exact (literal) matches.
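
As for chartr: it translates characters one-to-one, so each character in old maps to exactly one character in new, and a symbol cannot be expanded into a longer string such as "(A|T)". A small base R illustration follows; the single-letter targets are arbitrary and chosen only to show the behaviour.

```r
# chartr maps W -> A and S -> C, one character each,
# so the length of the string never changes.
x <- chartr("WS", "AC", "WSAT")
x
# [1] "ACAT"

nchar(x) == nchar("WSAT")
# [1] TRUE
```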
