I'm trying to write an R function that samples a variable number of 5-character substrings from each row of a data frame, where the number of samples depends on the length of the original string in that row. I have already calculated how many draws I'd like for each row, and I would like the function to use the "num_draws" value of that row. My thought was to write the function for a single generic row and then call it on each row with an apply statement outside the function, but I can't figure out how to make the function refer to column 3 for the current row only (rather than the value in the first row, or the values of all rows).
example data frame:
BP TF num_draws
1 CGGCGCATGTTCGGTAATGA TFTTTFTTTFFTTFTTTTTF 6
2 ATAAGATGCCCAGAGCCTTTTCATGTACTA TFTFTFTFFFFFFTTFTTTTFTTTTFFTTT 9
3 TCTTAGGAAGGATTC FTTTTTTTTTFFFFF 4
desired output:
[1]GGCGC FTTTF
AATGA TTTTF
TTFFT TGTTC
TAATG TTTTT
AATGA TTTTF
CGGCG TFTTT
[2]AGATG FTFTF
ATAAG TFTFT
ATGCC FTFFF
GCCCA FFFFF
ATAAG TFTFT
GTACT TFFTT
GCCCA FFFFF
TGCCC TFFFF
AGATG FTFTF
[3]TTAGG TTTTT
CTTAG TTTTT
GGAAG TTTTT
GGATT TTFFF
example code:
#make example data frame
BaseP1 <- paste(sample(size = 20, x = c("A","C","T","G"), replace = TRUE), collapse = "")
BaseP2 <- paste(sample(size = 30, x = c("A","C","T","G"), replace = TRUE), collapse = "")
BaseP3 <- paste(sample(size = 15, x = c("A","C","T","G"), replace = TRUE), collapse = "")
TrueFalse1 <- paste(sample(size = 20, x = c("T","F"), replace = TRUE), collapse = "")
TrueFalse2 <- paste(sample(size = 30, x = c("T","F"), replace = TRUE), collapse = "")
TrueFalse3 <- paste(sample(size = 15, x = c("T","F"), replace = TRUE), collapse = "")
my_df <- data.frame(c(BaseP1,BaseP2,BaseP3), c(TrueFalse1, TrueFalse2, TrueFalse3))
#calculate number of draws by length
frag_length<- 5
my_df<- cbind(my_df, (round((nchar(my_df[,1]) / frag_length) * 1.5, digits = 0)))
colnames(my_df) <- c("BP", "TF", "num_draws")
#function to sample x number of draws per row (this does not work)
Fragment = function(string) {
nStart = sample(1:(nchar(string) -5), 1)
samp<- substr(string, nStart, nStart + 4)
replicate(n= string[,3], expr = samp)
}
apply(my_df[,1:2], c(1,2), Fragment)
1 solution
One option would be to change the function to have another argument n and create the nStart inside the replicate call:
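# Draw n random 5-character fragments from a single string
Fragment = function(string, n) {
  replicate(n = n, {
    # nStart is sampled inside replicate(), so every draw gets a new start position
    nStart <- sample(1:(nchar(string) - 5), 1)
    substr(string, nStart, nStart + 4)
  })
}
# apply() passes each row as a character vector, so num_draws is converted back to integer
apply(my_df, 1, function(x) data.frame(lapply(x[1:2], Fragment, n = as.integer(x[3]))))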
#$`1`
# BP TF
#1 GGCGC FFTTF
#2 GGTAA TFFTT
#3 GCGCA TTFTT
#4 CGCAT TFFTT
#5 GGCGC FTTTF
#6 TGTTC FTTFT
#$`2`
# BP TF
#1 GTACT TTTTF
#2 ATAAG FTTFT
#3 GTACT TFTFF
#4 TAAGA TTTTF
#5 CCTTT FFTTF
#6 TCATG TTTTF
#7 CCAGA TFTFT
#8 TTCAT TFTFT
#9 CCCAG FTFTF
#$`3`
# BP TF
#1 AAGGA TTTFF
#2 AGGAT TTTTT
#3 CTTAG TFFFF
#4 TAGGA TTTFF
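Note that the call above samples the BP and TF columns independently, so the two fragments in a row of the output generally come from different positions. The desired output in the question appears to pair each BP fragment with the TF fragment at the same position. If that is what is wanted, a minimal sketch could sample the start positions once and reuse them for both strings. `FragmentPair` and its `frag_length` argument are hypothetical names introduced here for illustration, and the sketch assumes the BP and TF strings in a row have equal length. It also uses `nchar(bp) - frag_length + 1` as the upper bound so that the last possible window can be drawn.

```r
# Hypothetical helper: draw n fragments from the same start positions in both strings
FragmentPair <- function(bp, tf, n, frag_length = 5) {
  # valid starts run from 1 to nchar - frag_length + 1 (assumes bp and tf have equal length)
  starts <- sample.int(nchar(bp) - frag_length + 1, size = n, replace = TRUE)
  data.frame(
    BP = substring(bp, starts, starts + frag_length - 1),
    TF = substring(tf, starts, starts + frag_length - 1),
    stringsAsFactors = FALSE
  )
}

# Fixed copy of the example data frame shown in the question
my_df <- data.frame(
  BP = c("CGGCGCATGTTCGGTAATGA", "ATAAGATGCCCAGAGCCTTTTCATGTACTA", "TCTTAGGAAGGATTC"),
  TF = c("TFTTTFTTTFFTTFTTTTTF", "TFTFTFTFFFFFFTTFTTTTFTTTTFFTTT", "FTTTTTTTTTFFFFF"),
  num_draws = c(6, 9, 4),
  stringsAsFactors = FALSE
)

# Map() walks the three columns in parallel, so each row uses its own num_draws
result <- Map(FragmentPair, my_df$BP, my_df$TF, my_df$num_draws)
names(result) <- seq_len(nrow(my_df))
result
```

Because `Map()` works on the columns directly, num_draws stays numeric and no conversion from character is needed, unlike with `apply()` over rows.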