Function to sample substrings based on the length of a given string

Time: 2022-02-08 21:41:29

I'm trying to write an R function that samples a variable number of 5-character substrings from each row of a data frame, where the number of samples depends on the length of the original string in that row. I have already calculated how many draws I want for each row and stored it in a "num_draws" column, and I would like the function to use that row's value as the number of samples. My thought was to write the function for a single generic row and then use an apply statement outside the function to act on each row, but I can't figure out how to make the function refer to column 3 generically (without picking up either just the first row's value or the values of all rows).

example data frame:

  BP                             TF                                  num_draws
1 CGGCGCATGTTCGGTAATGA           TFTTTFTTTFFTTFTTTTTF                6
2 ATAAGATGCCCAGAGCCTTTTCATGTACTA TFTFTFTFFFFFFTTFTTTTFTTTTFFTTT      9
3 TCTTAGGAAGGATTC                FTTTTTTTTTFFFFF                     4

desired output:

[1]GGCGC FTTTF 
   AATGA TTTTF 
   TGTTC TTFFT
   TAATG TTTTT
   AATGA TTTTF   
   CGGCG TFTTT

[2]AGATG FTFTF
   ATAAG TFTFT
   ATGCC FTFFF
   GCCCA FFFFF
   ATAAG TFTFT
   GTACT TFFTT
   GCCCA FFFFF
   TGCCC TFFFF
   AGATG FTFTF

[3]TTAGG TTTTT
   CTTAG TTTTT
   GGAAG TTTTT
   GGATT TTFFF

example code:

#make example data frame
BaseP1 <- paste(sample(size = 20, x = c("A","C","T","G"), replace = TRUE), collapse = "")
BaseP2 <- paste(sample(size = 30, x = c("A","C","T","G"), replace = TRUE), collapse = "")
BaseP3 <- paste(sample(size = 15, x = c("A","C","T","G"), replace = TRUE), collapse = "")
TrueFalse1 <- paste(sample(size = 20, x = c("T","F"), replace = TRUE), collapse = "")
TrueFalse2 <- paste(sample(size = 30, x = c("T","F"), replace = TRUE), collapse = "")
TrueFalse3 <- paste(sample(size = 15, x = c("T","F"), replace = TRUE), collapse = "")
my_df <- data.frame(c(BaseP1,BaseP2,BaseP3), c(TrueFalse1, TrueFalse2, TrueFalse3)) 


#calculate number of draws by length 
frag_length<- 5 
my_df<- cbind(my_df, (round((nchar(my_df[,1]) / frag_length) * 1.5, digits = 0)))
colnames(my_df) <- c("BP", "TF", "num_draws")

#function to sample x number of draws per row (this does not work)
Fragment = function(string) {
  nStart = sample(1:(nchar(string) -5), 1)
  samp<- substr(string, nStart, nStart + 4)
replicate(n= string[,3], expr = samp)
  }


apply(my_df[,1:2], c(1,2), Fragment)
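
As a quick check of the num_draws formula above, the string lengths 20, 30 and 15 from the example data give 6, 9 and 4 draws. The last value is worth a second look: 15 / 5 * 1.5 is 4.5, and R's round() rounds a half to the even digit, so the result is 4 rather than 5. The variable `lens` below is only an illustration and is not part of the original post.

```r
# Reproduce the num_draws arithmetic for the three example string lengths.
frag_length <- 5
lens <- c(20, 30, 15)  # illustrative: nchar() of the three example BP strings

num_draws <- round((lens / frag_length) * 1.5, digits = 0)
print(num_draws)
# [1] 6 9 4
```

If 5 draws are wanted for the 15-character row, ceiling() or floor(x + 0.5) could be used in place of round().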

1 solution

#1

One option would be to give the function a second argument, n, and to create nStart inside the replicate call, so that a new start position is drawn for every sample.

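Fragment = function(string, n) {
   # draw n fragments; a new start position is sampled on every replicate
   replicate(n = n, {
      nStart <- sample(1:(nchar(string) - 4), 1)  # last valid start is nchar - 4
      substr(string, nStart, nStart + 4)
   })
}

# apply() passes each row as a character vector, so num_draws is converted back to a number
apply(my_df, 1, function(x) data.frame(lapply(x[1:2], Fragment, n = as.numeric(x[3]))))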
#$`1`
#     BP    TF
#1 GGCGC FFTTF
#2 GGTAA TFFTT
#3 GCGCA TTFTT
#4 CGCAT TFFTT
#5 GGCGC FTTTF
#6 TGTTC FTTFT

#$`2`
#     BP    TF
#1 GTACT TTTTF
#2 ATAAG FTTFT
#3 GTACT TFTFF
#4 TAAGA TTTTF
#5 CCTTT FFTTF
#6 TCATG TTTTF
#7 CCAGA TFTFT
#8 TTCAT TFTFT
#9 CCCAG FTFTF

#$`3`
#     BP    TF
#1 AAGGA TTTFF
#2 AGGAT TTTTT
#3 CTTAG TFFFF
#4 TAGGA TTTFF
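
Note that Fragment is called separately on the BP and TF strings, so the two columns of each result come from different, unrelated positions. In the desired output of the question the pairs appear to be position-matched (for example, GGCGC and FTTTF both start at position 2 of row 1). If that is the intent, one possible variation is to draw the start positions once per row and cut both strings at the same places. The sketch below assumes this reading; `sample_aligned` is a hypothetical helper name, not part of the original answer, and the data frame is rebuilt from the example rows in the question.

```r
# Hypothetical helper (not from the original answer): draws n start positions
# once and cuts the same windows out of both strings.
sample_aligned <- function(bp, tf, n, frag_length = 5) {
  stopifnot(nchar(bp) == nchar(tf), nchar(bp) >= frag_length)
  # valid start positions run from 1 to nchar - frag_length + 1
  starts <- sample.int(nchar(bp) - frag_length + 1, size = n, replace = TRUE)
  data.frame(
    BP = substring(bp, starts, starts + frag_length - 1),
    TF = substring(tf, starts, starts + frag_length - 1),
    stringsAsFactors = FALSE
  )
}

# The three example rows from the question
my_df <- data.frame(
  BP = c("CGGCGCATGTTCGGTAATGA", "ATAAGATGCCCAGAGCCTTTTCATGTACTA", "TCTTAGGAAGGATTC"),
  TF = c("TFTTTFTTTFFTTFTTTTTF", "TFTFTFTFFFFFFTTFTTTTFTTTTFFTTT", "FTTTTTTTTTFFFFF"),
  num_draws = c(6, 9, 4),
  stringsAsFactors = FALSE
)

set.seed(1)  # only to make the sketch reproducible
# Map() walks the three columns in parallel, one row at a time
result <- Map(sample_aligned, my_df$BP, my_df$TF, my_df$num_draws)
names(result) <- rownames(my_df)
result
```

Map() is used here instead of apply() because it keeps num_draws numeric; apply() first converts the data frame to a character matrix.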
