Searching for a vector of patterns in a vector of sequences, with data frame output

Time: 2022-06-14 16:21:43

I have a set of nucleotide sequences in a vector of strings called x.

I want to check whether some (say 10) motifs are present in x. I want to produce a data frame or table where the rows are the sequences in x and the columns are the patterns/motifs in the vector sdseqs.

sdframe <- data.frame
sdseqs = c("AGGAG.+ATG", 
"AGAAG.+ATG","AAAGG.+ATG","GGAGG.+ATG","GAAGA.+ATG",
"GGAGA.+ATG","AAGGT.+ATG","AGGAA.+ATG","AAGGA.+ATG","GTGGA.+ATG")
for (i in 1:10) {
sdframe <- cbind(sdframe,(grepl(sdseqs[i], x)))
}

This code mostly works, but the first column of the data frame is empty and shows question marks. The other columns are populated with TRUE and FALSE, which is what I want.

I tried to define an empty data frame outside the loop at the beginning. I am new to R and I am coming from Perl. This is what I usually did in Perl: you define the variables to be used within a loop outside of it. How can I do this in R?

Also, a viable option would be to delete the first column from my data frame, but that does not seem so straightforward to me.

Any help is appreciated.

The output I get with my code now:

  sdframe                                                            
[1,] ?       TRUE  FALSE TRUE  TRUE  FALSE TRUE  TRUE  TRUE  TRUE  FALSE
[2,] ?       FALSE TRUE  FALSE FALSE FALSE FALSE FALSE FALSE TRUE  TRUE 
[3,] ?       FALSE FALSE TRUE  FALSE TRUE  FALSE TRUE  TRUE  TRUE  TRUE 
[4,] ?       TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,] ?       FALSE TRUE  FALSE FALSE TRUE  FALSE FALSE FALSE FALSE FALSE
[6,] ?       FALSE FALSE FALSE TRUE  FALSE FALSE FALSE TRUE  FALSE TRUE 
[7,] ?       FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE  FALSE FALSE
[8,] ?       FALSE FALSE TRUE  FALSE FALSE TRUE  FALSE FALSE TRUE  FALSE
[9,] ?       FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10,] ?       FALSE FALSE FALSE FALSE TRUE  FALSE FALSE FALSE FALSE FALSE
[11,] ?       FALSE FALSE TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE

I want the same but without the first column of ?. Note that my x has 11 sequences, and the motifs I checked for are the columns (10 columns, or 11 counting the first one with ?).

2 solutions

#1


A common R solution is to use a function from the apply family to apply a function over a vector.

sdseqs = c(
  "AGGAG.+ATG",
  "AGAAG.+ATG",
  "AAAGG.+ATG",
  "GGAGG.+ATG",
  "GAAGA.+ATG",
  "GGAGA.+ATG",
  "AAGGT.+ATG",
  "AGGAA.+ATG",
  "AAGGA.+ATG",
  "GTGGA.+ATG"
)

sdframe <- sapply(sdseqs, function(one.motif) {
  grepl(one.motif, x = x)
})

sdframe

     AGGAG.+ATG AGAAG.+ATG AAAGG.+ATG GGAGG.+ATG GAAGA.+ATG GGAGA.+ATG AAGGT.+ATG AGGAA.+ATG AAGGA.+ATG GTGGA.+ATG
[1,]      FALSE       TRUE      FALSE      FALSE       TRUE       TRUE       TRUE      FALSE       TRUE      FALSE
[2,]      FALSE       TRUE      FALSE      FALSE       TRUE       TRUE       TRUE      FALSE       TRUE      FALSE
[3,]      FALSE       TRUE      FALSE      FALSE       TRUE       TRUE       TRUE      FALSE       TRUE      FALSE

sdframe.t <- t(sdframe)

sdframe.t

            [,1]  [,2]  [,3]
AGGAG.+ATG FALSE FALSE FALSE
AGAAG.+ATG  TRUE  TRUE  TRUE
AAAGG.+ATG FALSE FALSE FALSE
GGAGG.+ATG FALSE FALSE FALSE
GAAGA.+ATG  TRUE  TRUE  TRUE
GGAGA.+ATG  TRUE  TRUE  TRUE
AAGGT.+ATG  TRUE  TRUE  TRUE
AGGAA.+ATG FALSE FALSE FALSE
AAGGA.+ATG  TRUE  TRUE  TRUE
GTGGA.+ATG FALSE FALSE FALSE
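
Note that sapply returns a logical matrix here, not a data frame. If a data frame is needed, as the question asks, the matrix can be converted afterwards. The following is a minimal sketch under stated assumptions: the three sequences in x and the shortened motif list are made up for illustration and are not the asker's data.

```r
# Hypothetical example input (not the data from the question)
x <- c("TTAGGAGGTCATGAA", "CCAGAAGTTTATGCC", "GGGGGGGG")
sdseqs <- c("AGGAG.+ATG", "AGAAG.+ATG", "GTGGA.+ATG")

# One column per motif, one row per sequence (a logical matrix)
hits <- sapply(sdseqs, function(one.motif) grepl(one.motif, x))

# Convert to a data frame; check.names = FALSE keeps the regular
# expressions as column names instead of sanitising them
sdframe <- data.frame(hits, check.names = FALSE)

print(sdframe)
```

With rows as sequences and columns as motifs, no transpose is needed for the layout requested in the question. If x contained only a single sequence, sapply would return a plain vector rather than a matrix, so this sketch assumes at least two sequences.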

#2


In the first line you do not actually create a data frame: data.frame without parentheses refers to the function itself, so your output is a list, not a data frame.

Create the empty data frame with data.frame() and, instead of cbind, use rbind to add rows:

sdframe <- data.frame()
sdseqs = c("AGGAG.+ATG", 
       "AGAAG.+ATG","AAAGG.+ATG","GGAGG.+ATG","GAAGA.+ATG",
       "GGAGA.+ATG","AAGGT.+ATG","AGGAA.+ATG","AAGGA.+ATG","GTGGA.+ATG")
for (i in 1:10) {
sdframe <- rbind(sdframe,(grepl(sdseqs[i], x)))
}
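
For the Perl-style pattern of defining the container before the loop, one possible approach is to preallocate a logical matrix with one row per sequence and one column per motif, then fill it column by column. This keeps the layout the question asks for (sequences as rows), whereas adding one row per motif gives the transposed layout. This is only a sketch: the sequences in x, the shortened motif list, and the name sdmat are assumptions made for illustration.

```r
# Without parentheses, data.frame is the function object itself;
# with parentheses it is an (empty) data frame
is.function(data.frame)      # TRUE
is.data.frame(data.frame())  # TRUE

# Hypothetical example input (not the data from the question)
x <- c("TTAGGAGGTCATGAA", "CCAGAAGTTTATGCC", "GGGGGGGG")
sdseqs <- c("AGGAG.+ATG", "AGAAG.+ATG", "GTGGA.+ATG")

# Preallocate the result outside the loop: rows = sequences, columns = motifs
sdmat <- matrix(FALSE, nrow = length(x), ncol = length(sdseqs),
                dimnames = list(NULL, sdseqs))

# Fill one column per motif
for (i in seq_along(sdseqs)) {
  sdmat[, i] <- grepl(sdseqs[i], x)
}

# Convert to a data frame, keeping the motifs as column names
sdframe <- data.frame(sdmat, check.names = FALSE)
print(sdframe)
```

Using seq_along(sdseqs) instead of a hard-coded 1:10 keeps the loop correct if the number of motifs changes. Separately, an unwanted first column can be dropped with negative indexing, for example sdframe[, -1].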
