I need to aggregate overlapping segments into a single segment ranging all connected segments.
Note that a simple foverlaps cannot detect connections between segments that do not overlap each other directly but are connected through other segments; see the example for clarification. If it rained on the segments in my plot, I would be looking for the stretches of dry ground.
So far I have solved this with an iterative algorithm, but I'm wondering whether there is a more elegant and straightforward way. I'm surely not the first one to face this problem.
I was thinking about a non-equi rolling join, but failed to implement it.
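For reference, an iterative solution of the kind mentioned above might look like the following sketch. The helper name merge_segments is hypothetical and this is not necessarily the algorithm actually used so far; it assumes non-empty numeric input and treats segments that share an endpoint as connected.

```r
# Hypothetical helper (not from the question): merge segments in one sorted pass.
merge_segments <- function(start, end) {
  o <- order(start, end)
  start <- start[o]
  end <- end[o]
  out_start <- numeric(0)
  out_end <- numeric(0)
  cur_start <- start[1L]
  cur_end <- end[1L]
  for (k in seq_along(start)[-1L]) {
    if (start[k] <= cur_end) {
      # Segment touches or overlaps the current stretch: extend it.
      cur_end <- max(cur_end, end[k])
    } else {
      # Gap found: close the current stretch and open a new one.
      out_start <- c(out_start, cur_start)
      out_end <- c(out_end, cur_end)
      cur_start <- start[k]
      cur_end <- end[k]
    }
  }
  # Append the stretch that is still open after the loop.
  data.frame(start = c(out_start, cur_start), end = c(out_end, cur_end))
}

res <- merge_segments(
  start = c(41, 43, 43, 47, 47, 48, 51, 52, 54, 55, 57, 59),
  end   = c(42, 44, 45, 53, 48, 50, 52, 55, 57, 56, 58, 60)
)
print(res)
```

The loop is easy to follow but grows the result vectors element by element, which is one reason a vectorised formulation is attractive for larger inputs.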
library(data.table)
(x <- data.table(start = c(41,43,43,47,47,48,51,52,54,55,57,59),
end = c(42,44,45,53,48,50,52,55,57,56,58,60)))
#     start end
#  1:    41  42
#  2:    43  44
#  3:    43  45
#  4:    47  53
#  5:    47  48
#  6:    48  50
#  7:    51  52
#  8:    52  55
#  9:    54  57
# 10:    55  56
# 11:    57  58
# 12:    59  60
setorder(x, start)[, i := .I] # i is just a helper for plotting segments
plot(NA, xlim = range(x[,.(start,end)]), ylim = rev(range(x$i)))
do.call(segments, list(x$start, x$i, x$end, x$i))
x$grp <- c(1,3,3,2,2,2,2,2,2,2,2,4) # the grouping I am looking for
do.call(segments, list(x$start, x$i, x$end, x$i, col = x$grp))
(y <- x[, .(start = min(start), end = max(end)), keyby = grp])
#    grp start end
# 1:   1    41  42
# 2:   2    47  58
# 3:   3    43  45
# 4:   4    59  60
do.call(segments, list(y$start, 12.2, y$end, 12.2, col = 1:4, lwd = 3))
EDIT:
That's brilliant, thanks. cummax and cumsum do the job; Uwe's answer is slightly better than David's comment.
- end[.N] can give wrong results; try the example data x below. max(end) is correct in all cases, and faster.
x <- data.table(start = c(11866, 12696, 13813, 14011, 14041), end = c(13140, 14045, 14051, 14039, 14045))
- min(start) and start[1L] give the same result (as x is ordered by start); the latter is faster.
- grp on the fly is significantly faster; unfortunately, I need grp assigned.
- cumsum(cummax(shift(end, fill = 0)) < start) is significantly faster than cumsum(c(0, start[-1L] > cummax(head(end, -1L)))).
- I did not test the GenomicRanges package solution.
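The first and third points can be checked on the example data with a short sketch like the one below. The variable names grp_shift, grp_head, res_last and res_max are made up for illustration; timings are not measured here.

```r
library(data.table)

# Example data from the edit above: all five segments form one connected stretch.
x <- data.table(start = c(11866, 12696, 13813, 14011, 14041),
                end   = c(13140, 14045, 14051, 14039, 14045))
setorder(x, start)

# The two grouping expressions compared above
grp_shift <- x[, cumsum(cummax(shift(end, fill = 0)) < start)]
grp_head  <- x[, cumsum(c(0, start[-1L] > cummax(head(end, -1L))))]

# Aggregate once with end[.N] and once with max(end)
x[, grp := grp_shift]
res_last <- x[, .(start = start[1L], end = end[.N]), by = grp]
res_max  <- x[, .(start = start[1L], end = max(end)), by = grp]

print(res_last)  # end is 14045: the last row's end, not the furthest one
print(res_max)   # end is 14051: the furthest end within the group
```

On this data the two grouping expressions produce the same partition; the labels differ only by a constant, because the shift variant starts counting at 1 and the head variant at 0. end[.N] fails because the segment with the largest end (13813 to 14051) is not the last one after ordering by start.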
2 Answers
#1
6
The OP has requested to aggregate overlapping segments into a single segment ranging all connected segments.
Here is another solution which uses cummax() and cumsum() to identify groups of overlapping or adjacent segments:
x[order(start, end), grp := cumsum(cummax(shift(end, fill = 0)) < start)][
, .(start = min(start), end = max(end)), by = grp]
   grp start end
1:   1    41  42
2:   2    43  45
3:   3    47  58
4:   4    59  60
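To see why the expression works, the intermediate steps can be written out as separate columns. This is only an illustrative sketch; the column names prev_end, max_prev_end and new_grp are made up here and do not appear in the solution above.

```r
library(data.table)

x <- data.table(start = c(41, 43, 43, 47, 47, 48, 51, 52, 54, 55, 57, 59),
                end   = c(42, 44, 45, 53, 48, 50, 52, 55, 57, 56, 58, 60))
setorder(x, start, end)

x[, prev_end     := shift(end, fill = 0)]  # end of the previous segment
x[, max_prev_end := cummax(prev_end)]      # furthest end reached by any earlier segment
x[, new_grp      := max_prev_end < start]  # TRUE where a gap opens before this segment
x[, grp          := cumsum(new_grp)]       # running count of gaps = group id
print(x)
```

The cummax() step is what distinguishes this from comparing each segment with its direct predecessor only. Segment 51 to 52 starts after the previous segment (48 to 50) has ended, yet it is still covered by the earlier segment 47 to 53, so no new group is opened there.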
Disclaimer: I have seen that clever approach somewhere else on SO but I cannot remember exactly where.
Edit:
As David Arenburg has pointed out, it is not necessary to create the grp variable separately. This can be done on the fly in the by = parameter:
x[order(start, end), .(start = min(start), end = max(end)),
by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))]
Visualisation
The OP's plot can be amended to also show the aggregated segments (quick and dirty):
plot(NA, xlim = range(x[,.(start,end)]), ylim = rev(range(x$i)))
do.call(segments, list(x$start, x$i, x$end, x$i))
x[order(start, end), .(start = min(start), end = max(end)),
by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))][
, segments(start, grp + 0.5, end, grp + 0.5, "red", , 4)]
#2
5
You can try a GenomicRanges approach. In the output, each row is a group.
library(GenomicRanges)
x_gr <- with(x, GRanges(1, IRanges(start, end)))
as.data.table(reduce(x_gr, min.gapwidth=0))[,2:3]
start end
1: 41 42
2: 43 45
3: 47 58
4: 59 60
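The role of min.gapwidth can be illustrated with the IRanges package alone, which GenomicRanges builds on. The sketch below assumes IRanges is installed from Bioconductor; the variable names ir, strict and loose are chosen for illustration.

```r
library(IRanges)

ir <- IRanges(start = c(41, 43, 43, 47, 47, 48, 51, 52, 54, 55, 57, 59),
              end   = c(42, 44, 45, 53, 48, 50, 52, 55, 57, 56, 58, 60))

# min.gapwidth = 0: only ranges that share at least one position are merged
strict <- reduce(ir, min.gapwidth = 0L)

# default min.gapwidth = 1: directly adjacent integer ranges are merged as well
loose <- reduce(ir)

print(data.frame(start = start(strict), end = end(strict)))
print(data.frame(start = start(loose), end = end(loose)))
```

Because IRanges works on integer positions, a range ending at 42 and one starting at 43 count as adjacent with no gap, which is why the default setting can merge more than the question asks for.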
A visual inspection can be done using Gviz. Note that the package was built for biologists and genetic data: the underlying positions are DNA bases. Hence, one has to subtract 1 from the segment ends to get the correct plot.
library(Gviz)
# subtract 1 from the segment ends, as explained above
xgr <- with(x, GRanges(1, IRanges(start, end = end - 1)))
xgr_red <- reduce(xgr, min.gapwidth=1)
ga <- GenomeAxisTrack()
GT <- lapply(xgr, GeneRegionTrack)
GT_red <- lapply(xgr_red, GeneRegionTrack, fill = "lightblue")
plotTracks(c(ga, GT, GT_red),from = min(x$start), to = max(x$start)+2)