Identifying consecutive overlapping segments in R

Time: 2021-11-06 07:41:12

I need to aggregate overlapping segments into a single segment spanning all connected segments.

Note that a simple foverlaps() cannot detect connections between segments that do not overlap each other directly but are connected through other segments; see the example for clarification. Put differently: if it rained on the segments in my plot, I would be looking for the stretches of dry ground.
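To make the point about indirect connections concrete, here is a minimal base R sketch. It is only an illustration, not foverlaps() itself: it runs a plain pairwise overlap test on three segments taken from the example data, and the names seg and overlaps are made up for this sketch.

```r
# Three segments taken from the example data (47-53, 52-55 and 54-57).
seg <- data.frame(start = c(47, 52, 54), end = c(53, 55, 57))
n <- nrow(seg)

# Pairwise test: segments a and b overlap if each one starts
# no later than the other one ends.
overlaps <- outer(seq_len(n), seq_len(n), function(a, b) {
  seg$start[a] <= seg$end[b] & seg$start[b] <= seg$end[a]
})
dimnames(overlaps) <- list(paste0("seg", 1:n), paste0("seg", 1:n))
print(overlaps)

# seg1 and seg3 do not overlap each other, yet both overlap seg2,
# so all three belong to the same connected stretch.
```

A pairwise result like this still has to be chained together to find the connected stretches, which is the part I am trying to do elegantly.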

So far I solve this problem with an iterative algorithm, but I'm wondering whether there is a more elegant and straightforward way. I'm surely not the first one to face it.

I was thinking about a non-equi rolling join, but failed to implement it.

library(data.table)
(x <- data.table(start = c(41,43,43,47,47,48,51,52,54,55,57,59), 
                  end = c(42,44,45,53,48,50,52,55,57,56,58,60)))

#     start end
#  1:    41  42
#  2:    43  44
#  3:    43  45
#  4:    47  53
#  5:    47  48
#  6:    48  50
#  7:    51  52
#  8:    52  55
#  9:    54  57
# 10:    55  56
# 11:    57  58
# 12:    59  60

setorder(x, start)[, i := .I] # i is just a helper for plotting segments
plot(NA, xlim = range(x[,.(start,end)]), ylim = rev(range(x$i)))
do.call(segments, list(x$start, x$i, x$end, x$i))

x$grp <- c(1,3,3,2,2,2,2,2,2,2,2,4) # the grouping I am looking for
do.call(segments, list(x$start, x$i, x$end, x$i, col = x$grp))
(y <- x[, .(start = min(start), end = max(end)), keyby = grp])

#    grp start end
# 1:   1    41  42
# 2:   2    47  58
# 3:   3    43  45
# 4:   4    59  60

do.call(segments, list(y$start, 12.2, y$end, 12.2, col = 1:4, lwd = 3))

EDIT:

That's brilliant, thanks. cummax() and cumsum() do the job; Uwe's answer is slightly better than David's comment.

  • end[.N] can give wrong results; try the example data x below. max(end) is correct in all cases, and faster.

    x <- data.table(start = c(11866, 12696, 13813, 14011, 14041), end = c(13140, 14045, 14051, 14039, 14045))

  • min(start) and start[1L] give the same result (as x is ordered by start); the latter is faster.

  • Computing grp on the fly is significantly faster; unfortunately, I need grp assigned.

  • cumsum(cummax(shift(end, fill = 0)) < start) is significantly faster than cumsum(c(0, start[-1L] > cummax(head(end, -1L)))).

  • I did not test the GenomicRanges solution.
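The first point in the list above can be checked by hand. The following is a minimal sketch in base R (so it runs without data.table); c(0, head(end, -1)) stands in for shift(end, fill = 0), and the names grp, last_end and max_end are chosen just for this illustration.

```r
# Example data from the first bullet, already ordered by start.
x <- data.frame(start = c(11866, 12696, 13813, 14011, 14041),
                end   = c(13140, 14045, 14051, 14039, 14045))

# Base R stand-in for cumsum(cummax(shift(end, fill = 0)) < start).
grp <- cumsum(cummax(c(0, head(x$end, -1))) < x$start)

# All five segments chain into one group, but the segment that ends
# furthest to the right (14051) is not the last row of that group.
last_end <- tapply(x$end, grp, function(e) e[length(e)])  # like end[.N]
max_end  <- tapply(x$end, grp, max)                        # like max(end)

print(data.frame(grp = names(max_end), last_end, max_end))
```

Here the last row of the group ends at 14045 while the group really extends to 14051, which is why taking the last end per group is not safe when a long segment contains later, shorter ones.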

2 solutions

#1


6  

The OP wants to aggregate overlapping segments into a single segment spanning all connected segments.

Here is another solution which uses cummax() and cumsum() to identify groups of overlapping or adjacent segments:

x[order(start, end), grp := cumsum(cummax(shift(end, fill = 0)) < start)][
  , .(start = min(start), end = max(end)), by = grp]
   grp start end
1:   1    41  42
2:   2    43  45
3:   3    47  58
4:   4    59  60
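To see why this works, the one-liner can be unrolled into its intermediate steps. The following is a sketch in base R rather than data.table, so c(0, head(end, -1)) replaces shift(end, fill = 0); the helper names prev_end, reach and is_new are made up for this walkthrough. Note the assumption that all coordinates are positive, since 0 is used as the fill value.

```r
# Example data from the question.
start <- c(41, 43, 43, 47, 47, 48, 51, 52, 54, 55, 57, 59)
end   <- c(42, 44, 45, 53, 48, 50, 52, 55, 57, 56, 58, 60)

# Sort by start, then by end (as order(start, end) does in the answer).
ord   <- order(start, end)
start <- start[ord]
end   <- end[ord]

# End of the previous segment; 0 for the first row
# (assumes all coordinates are positive).
prev_end <- c(0, head(end, -1))

# Furthest end reached by any earlier segment.
reach <- cummax(prev_end)

# A new group begins whenever a segment starts strictly beyond that reach.
is_new <- reach < start
grp    <- cumsum(is_new)

print(data.frame(start, end, reach, is_new, grp))

# Collapse each group to its overall range.
merged <- data.frame(
  grp   = sort(unique(grp)),
  start = as.vector(tapply(start, grp, min)),
  end   = as.vector(tapply(end, grp, max))
)
print(merged)
```

Because the comparison is strict, a segment starting exactly where the reach ends (57 in the example) joins the current group, whereas a segment starting at 43 after a reach of 42 opens a new one.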

Disclaimer: I have seen that clever approach somewhere else on SO but I cannot remember exactly where.

Edit:

As David Arenburg has pointed out, it is not necessary to create the grp variable separately. This can be done on the fly in the by = parameter:

x[order(start, end), .(start = min(start), end = max(end)), 
  by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))]

Visualisation

OP's plot can be amended to also show the aggregated segments (quick and dirty):

plot(NA, xlim = range(x[,.(start,end)]), ylim = rev(range(x$i)))
do.call(segments, list(x$start, x$i, x$end, x$i))
x[order(start, end), .(start = min(start), end = max(end)), 
  by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))][
    , segments(start, grp + 0.5, end, grp + 0.5, col = "red", lwd = 4)]

#2


5  

You can try a GenomicRanges approach. In the output each row is a group.

library(GenomicRanges)
x_gr <- with(x, GRanges(1, IRanges(start, end)))
as.data.table(reduce(x_gr, min.gapwidth=0))[,2:3]
   start end
1:    41  42
2:    43  45
3:    47  58
4:    59  60

A visual inspection can be done using Gviz. Note that the package was built for biologists working with genetic information: the underlying positions are DNA bases. Hence, one has to subtract 1 from the segment ends to get the correct plot.

library(Gviz)
ga <- GenomeAxisTrack()
xgr <- with(x, GRanges(1, IRanges(start, end = end - 1)))
xgr_red <- reduce(xgr, min.gapwidth=1)
GT <- lapply(xgr, GeneRegionTrack)
GT_red <- lapply(xgr_red, GeneRegionTrack, fill = "lightblue")
plotTracks(c(ga, GT, GT_red),from = min(x$start), to = max(x$start)+2)
