I have one table with coordinates (start
, end
) of ca. 500000 fragments and another table with 60000 single coordinates that I would like to match with the former fragments. I.e., for each record from dtCoords
table I need to search a record in dtFrags
table having the same chr
and start
<=coord
<=end
(and retrieve the type
from this record of dtFrags
). Is it good idea at all to use R for this, or I should rather look to other languages?
我有一个表格的坐标(开始,结束)。 500000个片段和另一个60000单坐标表,我想与前片段匹配。即,对于来自dtCoords表的每条记录,我需要在具有相同chr的dtFrags表中搜索记录并启动<= coord <= end(并从此dtFrags记录中检索类型)。完全使用R来做这个是好主意,还是我应该选择其他语言?
Here is my example:
这是我的例子:
require(data.table)
dtFrags <- fread(
"id,chr,start,end,type
1,1,100,200,exon
2,2,300,500,intron
3,X,400,600,intron
4,2,250,600,exon
")
dtCoords <- fread(
"id,chr,coord
10,1,150
20,2,300
30,Y,500
")
At the end, I would like to have something like this:
最后,我想有这样的事情:
"idC,chr,coord,idF,type
10, 1, 150, 1, exon
20, 2, 300, 2, intron
20, 2, 300, 4, exon
30, Y, 500, NA, NA
"
I can simplify a bit the task by splitting the table to subtables by chr
, so I would concentrate only on coordinates
我可以通过chr将表拆分为子表来简化任务,所以我只关注坐标
setkey(dtCoords, 'chr')
setkey(dtFrags, 'chr')
for (chr in unique(dtCoords$chr)) {
dtCoordsSub <- dtCoords[chr];
dtFragsSub <- dtFrags[chr];
dtCoordsSub[, {
# ????
}, by=id]
}
but it's still not clear for me how should I work inside... I would be very grateful for any hints.
但是我仍然不清楚我应该如何在里面工作...我会非常感激任何提示。
UPD. just in case, I put my real table in the archive here. After unpacking to your working directory, tables can be loaded with the following code:
UPD。为了以防万一,我把我的真实表放在档案中。解压缩到工作目录后,可以使用以下代码加载表:
dtCoords <- fread("dtCoords.txt", sep="\t", header=TRUE)
dtFrags <- fread("dtFrags.txt", sep="\t", header=TRUE)
2 个解决方案
#1
7
In general, it's very appropriate to use the bioconductor package IRanges
to deal with problems related to intervals. It does so efficiently by implementing interval tree. GenomicRanges
is another package that builds on top of IRanges
, specifically for handling, well, "Genomic Ranges".
通常,使用bioconductor包IRanges来处理与间隔相关的问题是非常合适的。它通过实现区间树来有效地完成。 GenomicRanges是另一个基于IRanges的软件包,专门用于处理“Genomic Ranges”。
require(GenomicRanges)
gr1 = with(dtFrags, GRanges(Rle(factor(chr,
levels=c("1", "2", "X", "Y"))), IRanges(start, end)))
gr2 = with(dtCoords, GRanges(Rle(factor(chr,
levels=c("1", "2", "X", "Y"))), IRanges(coord, coord)))
olaps = findOverlaps(gr2, gr1)
dtCoords[, grp := seq_len(nrow(dtCoords))]
dtFrags[subjectHits(olaps), grp := queryHits(olaps)]
setkey(dtCoords, grp)
setkey(dtFrags, grp)
dtFrags[, list(grp, id, type)][dtCoords]
grp id type id.1 chr coord
1: 1 1 exon 10 1 150
2: 2 2 intron 20 2 300
3: 2 4 exon 20 2 300
4: 3 NA NA 30 Y 500
#2
3
Does this work? You can use merge
first and then subset
这有用吗?您可以先使用merge,然后再使用子集
kk<-merge(dtFrags,dtCoords,by="chr",all.x=TRUE)
> kk
chr id.x start end type id.y coord
1: 1 1 100 200 exon 10 150
2: 2 2 300 500 intron 20 300
3: 2 4 250 600 exon 20 300
4: X 3 400 600 intron NA NA
kk[coord>=start & coord<=end]
chr id.x start end type id.y coord
1: 1 1 100 200 exon 10 150
2: 2 4 250 600 exon 20 300
#1
7
In general, it's very appropriate to use the bioconductor package IRanges
to deal with problems related to intervals. It does so efficiently by implementing interval tree. GenomicRanges
is another package that builds on top of IRanges
, specifically for handling, well, "Genomic Ranges".
通常,使用bioconductor包IRanges来处理与间隔相关的问题是非常合适的。它通过实现区间树来有效地完成。 GenomicRanges是另一个基于IRanges的软件包,专门用于处理“Genomic Ranges”。
require(GenomicRanges)
gr1 = with(dtFrags, GRanges(Rle(factor(chr,
levels=c("1", "2", "X", "Y"))), IRanges(start, end)))
gr2 = with(dtCoords, GRanges(Rle(factor(chr,
levels=c("1", "2", "X", "Y"))), IRanges(coord, coord)))
olaps = findOverlaps(gr2, gr1)
dtCoords[, grp := seq_len(nrow(dtCoords))]
dtFrags[subjectHits(olaps), grp := queryHits(olaps)]
setkey(dtCoords, grp)
setkey(dtFrags, grp)
dtFrags[, list(grp, id, type)][dtCoords]
grp id type id.1 chr coord
1: 1 1 exon 10 1 150
2: 2 2 intron 20 2 300
3: 2 4 exon 20 2 300
4: 3 NA NA 30 Y 500
#2
3
Does this work? You can use merge
first and then subset
这有用吗?您可以先使用merge,然后再使用子集
kk<-merge(dtFrags,dtCoords,by="chr",all.x=TRUE)
> kk
chr id.x start end type id.y coord
1: 1 1 100 200 exon 10 150
2: 2 2 300 500 intron 20 300
3: 2 4 250 600 exon 20 300
4: X 3 400 600 intron NA NA
kk[coord>=start & coord<=end]
chr id.x start end type id.y coord
1: 1 1 100 200 exon 10 150
2: 2 4 250 600 exon 20 300