查找两个间隔数据之间的重叠范围

时间:2021-10-29 12:47:51

I have one table with coordinates (start, end) of ca. 500000 fragments and another table with 60000 single coordinates that I would like to match with the former fragments. I.e., for each record from dtCoords table I need to search a record in dtFrags table having the same chr and start<=coord<=end (and retrieve the type from this record of dtFrags). Is it good idea at all to use R for this, or I should rather look to other languages?

我有一个表格的坐标(开始,结束)。 500000个片段和另一个60000单坐标表,我想与前片段匹配。即,对于来自dtCoords表的每条记录,我需要在具有相同chr的dtFrags表中搜索记录并启动<= coord <= end(并从此dtFrags记录中检索类型)。完全使用R来做这个是好主意,还是我应该选择其他语言?

Here is my example:

这是我的例子:

require(data.table)

dtFrags <- fread(
  "id,chr,start,end,type
 1,1,100,200,exon
 2,2,300,500,intron
 3,X,400,600,intron
 4,2,250,600,exon
")

dtCoords <- fread(
"id,chr,coord
 10,1,150
 20,2,300
 30,Y,500
")

At the end, I would like to have something like this:

最后,我想有这样的事情:

"idC,chr,coord,idF,type
 10,  1,  150,  1, exon
 20,  2,  300,  2, intron
 20,  2,  300,  4, exon
 30,  Y,  500, NA, NA
"

I can simplify a bit the task by splitting the table to subtables by chr, so I would concentrate only on coordinates

我可以通过chr将表拆分为子表来简化任务,所以我只关注坐标

setkey(dtCoords, 'chr')
setkey(dtFrags,  'chr')

for (chr in unique(dtCoords$chr)) {
  dtCoordsSub <- dtCoords[chr];
  dtFragsSub  <-  dtFrags[chr];
  dtCoordsSub[, {
    # ????  
  }, by=id]  
}

but it's still not clear for me how should I work inside... I would be very grateful for any hints.

但是我仍然不清楚我应该如何在里面工作...我会非常感激任何提示。

UPD. just in case, I put my real table in the archive here. After unpacking to your working directory, tables can be loaded with the following code:

UPD。为了以防万一,我把我的真实表放在档案中。解压缩到工作目录后,可以使用以下代码加载表:

dtCoords <- fread("dtCoords.txt", sep="\t", header=TRUE)
dtFrags  <- fread("dtFrags.txt",  sep="\t", header=TRUE)

2 个解决方案

#1


7  

In general, it's very appropriate to use the bioconductor package IRanges to deal with problems related to intervals. It does so efficiently by implementing interval tree. GenomicRanges is another package that builds on top of IRanges, specifically for handling, well, "Genomic Ranges".

通常,使用bioconductor包IRanges来处理与间隔相关的问题是非常合适的。它通过实现区间树来有效地完成。 GenomicRanges是另一个基于IRanges的软件包,专门用于处理“Genomic Ranges”。

require(GenomicRanges)
gr1 = with(dtFrags, GRanges(Rle(factor(chr, 
          levels=c("1", "2", "X", "Y"))), IRanges(start, end)))
gr2 = with(dtCoords, GRanges(Rle(factor(chr, 
          levels=c("1", "2", "X", "Y"))), IRanges(coord, coord)))
olaps = findOverlaps(gr2, gr1)
dtCoords[, grp := seq_len(nrow(dtCoords))]
dtFrags[subjectHits(olaps), grp := queryHits(olaps)]
setkey(dtCoords, grp)
setkey(dtFrags, grp)
dtFrags[, list(grp, id, type)][dtCoords]

   grp id   type id.1 chr coord
1:   1  1   exon   10   1   150
2:   2  2 intron   20   2   300
3:   2  4   exon   20   2   300
4:   3 NA     NA   30   Y   500

#2


3  

Does this work? You can use merge first and then subset

这有用吗?您可以先使用merge,然后再使用子集

   kk<-merge(dtFrags,dtCoords,by="chr",all.x=TRUE)
> kk
   chr id.x start end   type id.y coord
1:   1    1   100 200   exon   10   150
2:   2    2   300 500 intron   20   300
3:   2    4   250 600   exon   20   300
4:   X    3   400 600 intron   NA    NA


 kk[coord>=start & coord<=end]
   chr id.x start end type id.y coord
1:   1    1   100 200 exon   10   150
2:   2    4   250 600 exon   20   300

#1


7  

In general, it's very appropriate to use the bioconductor package IRanges to deal with problems related to intervals. It does so efficiently by implementing interval tree. GenomicRanges is another package that builds on top of IRanges, specifically for handling, well, "Genomic Ranges".

通常,使用bioconductor包IRanges来处理与间隔相关的问题是非常合适的。它通过实现区间树来有效地完成。 GenomicRanges是另一个基于IRanges的软件包,专门用于处理“Genomic Ranges”。

require(GenomicRanges)
gr1 = with(dtFrags, GRanges(Rle(factor(chr, 
          levels=c("1", "2", "X", "Y"))), IRanges(start, end)))
gr2 = with(dtCoords, GRanges(Rle(factor(chr, 
          levels=c("1", "2", "X", "Y"))), IRanges(coord, coord)))
olaps = findOverlaps(gr2, gr1)
dtCoords[, grp := seq_len(nrow(dtCoords))]
dtFrags[subjectHits(olaps), grp := queryHits(olaps)]
setkey(dtCoords, grp)
setkey(dtFrags, grp)
dtFrags[, list(grp, id, type)][dtCoords]

   grp id   type id.1 chr coord
1:   1  1   exon   10   1   150
2:   2  2 intron   20   2   300
3:   2  4   exon   20   2   300
4:   3 NA     NA   30   Y   500

#2


3  

Does this work? You can use merge first and then subset

这有用吗?您可以先使用merge,然后再使用子集

   kk<-merge(dtFrags,dtCoords,by="chr",all.x=TRUE)
> kk
   chr id.x start end   type id.y coord
1:   1    1   100 200   exon   10   150
2:   2    2   300 500 intron   20   300
3:   2    4   250 600   exon   20   300
4:   X    3   400 600 intron   NA    NA


 kk[coord>=start & coord<=end]
   chr id.x start end type id.y coord
1:   1    1   100 200 exon   10   150
2:   2    4   250 600 exon   20   300