Greatest n per group with intervals, in R or SQL

Date: 2021-05-29 12:33:32

My (non-trivial) problem is described below. This is my first post, now in a modified version. Any input or proposed solutions would be helpful.

There are several dimensions to this: determining an optimal solution to the small-scale problem (several suggestions are already below), time (the data.table solution below seems to check that box) and memory management. The problem concerns tags that are enumerated in one table and represented by clusters in another (tags belong to the same cluster if they are within 30 bp on the same strand).

The challenge boils down to determining an efficient procedure for allocating a given tag to the appropriate interval. We are dealing with genomic data, which means that tag coordinates are determined by start position, end position (= start position + 1), chromosome (25 values in the full data set) and strand (the position is on either the plus or minus strand of the double-stranded DNA). Clusters are thus non-overlapping on the same strand, but the fact that cluster coordinates can overlap when their intervals are on different strands complicates things.

This is a modified version of my post from Jan 9th that better encapsulates the inherent difficulty of the problem. A fast solution to the small-scale problem is shown later. If anyone wants to work on the full dataset, let me know. Many thanks in advance!

Regards,

Nick Clark

Background

The issue involves intervals and greatest n per group. I have two tables: one containing clustered gene coordinates (clusters) and one containing the input data (tags). The clusters table contains the summed tags from each covered non-overlapping interval on the same strand of the tags table. The full clusters table has 1.6 million rows and the tags table about 4 million rows, so the solution should ideally be vectorised. See below for some sample data. The setting is a data set on transcription initiation sites (CAGE) for human.

As I am working in R, I am looking for a solution based in R or SQL. I have made previous unsuccessful attempts via both the plyr and sqldf packages in R.

The Challenge

What I am missing is a row in the clustered table identifying the start coordinate from the input data table associated with the largest tag contribution.

Note that 1) clusters from different strands can have overlapping coordinates, 2) chr / chr_clst can take on 25 different values (not shown in the example), and 3) a solution needs to account for both strand and chr / chr_clst.

My Ideal Solution

Vectorized R code or an improvement on the SQL statement below. A version of the solution below that accounts for the memory problem would do the trick, as would an improved SQL statement that efficiently determines the appropriate row from the clusters table.

Status so far

Here is the apparently best solution so far. Hat tip and cool points to user1935457 for the code and to mnel for the subsequently suggested modification. The snag here is that moving from the toy example to the full-scale tables crashes both R (and RStudio) due to excessive demands on memory.

# Convert sample data provided in question
clusters <- as.data.table(clusters)
tags <- as.data.table(tags)

# Rename chr and strand for easier joining
setnames(clusters, c("chr_clst", "strand_clst"), c("chr", "strand"))

# Set key on each table for next step
setkey(clusters, chr, strand)
setkey(tags, chr, strand)

# Merge on the keys
tmp <- merge(clusters, tags, by = c("chr", "strand"))

# Find index (in merged table, tmp) of largest tag_count in each
# group subject to start_clst <= end <= end_clst
idx <- tmp[between(end, start_clst, end_clst),
       list(IDX=.I[which.max(tag_count)]),
       by=list(chr, start_clst,end_clst,strand)]$IDX

# Get those rows from merged table
tmp[idx]

I initially created a basic SQL query using the sqldf package in R (this version finds the max and not the coordinate associated with the max). The query takes forever to run, despite placing (hopefully) appropriate indices on both tables.

output_tablename <- sqldf(c(
"create index ai1 on clusters(chr_clst, start_clst, end_clst, strand_clst)",
"create index ai2 on tags(chr, start, end, strand)",
"select a.chr_clst, a.start_clst, a.end_clst, a.strand_clst, sum(b.tag_count)
from main.clusters a
inner join main.tags b on a.chr_clst=b.chr and a.strand_clst = b.strand 
and b.end between a.start_clst and a.end_clst
group by a.chr_clst, a.start_clst, a.end_clst, a.strand_clst
order by a.chr_clst, a.start_clst, a.end_clst, a.strand_clst"
))
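
For reference, here is a hypothetical greatest-n-per-group rewrite (a sketch, not my original attempt, and untested at scale) that would also return the start coordinate of the max. It uses a correlated subquery, which the SQLite engine behind sqldf supports; note that ties on tag_count would return multiple rows per cluster, and it may well be just as slow:

output_with_coords <- sqldf(c(
"create index bi1 on clusters(chr_clst, strand_clst, start_clst, end_clst)",
"create index bi2 on tags(chr, strand, end, tag_count)",
"select a.chr_clst, a.start_clst, a.end_clst, a.strand_clst,
        b.start as max_tag_start, b.tag_count as max_tag_count
from main.clusters a
inner join main.tags b on a.chr_clst = b.chr and a.strand_clst = b.strand
and b.end between a.start_clst and a.end_clst
where b.tag_count = (select max(b2.tag_count)
                     from main.tags b2
                     where b2.chr = a.chr_clst
                     and b2.strand = a.strand_clst
                     and b2.end between a.start_clst and a.end_clst)"
))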

Table structure
clusters: chr_clst, start_clst, end_clst, strand_clst, tags_clst.
tags: chr, start, end, strand, tag_count.

Sample Data in R format

If anyone wants to work on the full dataset, let me know.

clusters:

chr_clst <- c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1")
start_clst <- c(568911, 569233, 569454, 569793, 569877, 569926, 569972, 570048, 570166, 713987)
end_clst <- c(568941, 569256, 569484, 569803, 569926, 569952, 569973, 570095, 570167, 714049)
strand_clst <- c("+", "+", "+", "+", "+", "-", "+", "+", "+", "-")
tags_clst <- c(37, 4, 6, 3, 80, 25, 1, 4, 1, 46)

clusters <- data.frame(cbind(chr_clst, start_clst, end_clst, strand_clst, tags_clst))
clusters$start_clst <- as.numeric(as.character(clusters$start_clst))
clusters$end_clst <- as.numeric(as.character(clusters$end_clst))
clusters$tags_clst <- as.numeric(as.character(clusters$tags_clst))
rm(chr_clst, start_clst, end_clst, strand_clst, tags_clst)

tags:

chr <- c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1")

start <- c(568911, 568912, 568913, 568913, 568914, 568916, 568917, 568918, 568929, 
568929, 568932, 568933, 568935, 568937, 568939, 568940, 568940, 569233, 569247, 
569255, 569454, 569469, 569471, 569475, 569483, 569793, 569802, 569877, 569880, 
569887, 569889, 569890, 569891, 569893, 569894, 569895, 569895, 569896, 569897, 
569898, 569898, 569899, 569900, 569900, 569901, 569901, 569903, 569905, 569906, 
569907, 569907, 569908, 569908, 569909, 569910, 569910, 569911, 569911, 569912, 
569914, 569914, 569915, 569916, 569917, 569918, 569919, 569920, 569920, 569925, 
569926, 569936, 569938, 569939, 569939, 569940, 569941, 569941, 569942, 569942, 
569943, 569944, 569948, 569948, 569951, 569972, 570048, 570057, 570078, 570094, 
570166, 713987, 713989, 713995, 714001, 714001, 714007, 714008, 714010, 714011, 
714011, 714011, 714013, 714015, 714015, 714017, 714018, 714019, 714023, 714025, 
714029, 714034, 714034, 714037, 714038, 714039, 714039, 714040, 714042, 714048, 
714048)

end <- c(568912, 568913, 568914, 568914, 568915, 568917, 568918, 568919, 568930, 
568930, 568933, 568934, 568936, 568938, 568940, 568941, 568941, 569234, 569248,
569256, 569455, 569470, 569472, 569476, 569484, 569794, 569803, 569878, 569881, 
569888, 569890, 569891, 569892, 569894, 569895, 569896, 569896, 569897, 569898, 
569899, 569899, 569900, 569901, 569901, 569902, 569902, 569904, 569906, 569907, 
569908, 569908, 569909, 569909, 569910, 569911, 569911, 569912, 569912, 569913, 
569915, 569915, 569916, 569917, 569918, 569919, 569920, 569921, 569921, 569926, 
569927, 569937, 569939, 569940, 569940, 569941, 569942, 569942, 569943, 569943, 
569944, 569945, 569949, 569949, 569952, 569973, 570049, 570058, 570079, 570095, 
570167, 713988, 713990, 713996, 714002, 714002, 714008, 714009, 714011, 714012, 
714012, 714012, 714014, 714016, 714016, 714018, 714019, 714020, 714024, 714026, 
714030, 714035, 714035, 714038, 714039, 714040, 714040, 714041, 714043, 714049, 
714049)

strand <- c("+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", 
"+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", 
"+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", 
"+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", 
"+", "+", "+", "+", "+", "+", "+", "+", "-", "-", "-", "-", "-", "-", "-", "-", 
"-", "-", "-", "-", "-", "-", "-", "+", "+", "+", "+", "+", "+", "-", "-", "-", 
"-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", 
"-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-")

tag_count <- c(1, 1, 1, 2, 3, 2, 3, 1, 1, 1, 1, 1, 2, 1, 6, 2, 8, 1, 1, 2, 1, 1, 2, 
1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 4, 4, 1, 1, 1, 1, 1, 3, 2, 1, 1, 2, 4, 2, 4, 2, 4, 
1, 1, 1, 1, 3, 2, 1, 3, 1, 2, 3, 1, 1, 3, 2, 1, 1, 1, 5, 1, 2, 1, 2, 1, 1, 2, 2, 
4, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 3, 2, 4, 2, 1, 1, 1, 
2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4, 1, 2)

tags <- data.frame(cbind(chr, start, end, strand, tag_count))    
tags$start <- as.numeric(as.character(tags$start))
tags$end <- as.numeric(as.character(tags$end))
tags$tag_count <- as.numeric(as.character(tags$tag_count))
rm(chr, start, end, strand, tag_count)

4 Answers

#1 (2 votes)

Here's an attempt with the data.table package:

# Convert sample data provided in question
clusters <- as.data.table(clusters)
tags <- as.data.table(tags)

# Rename chr and strand for easier joining
setnames(clusters, c("chr_clst", "strand_clst"), c("chr", "strand"))

# Set key on each table for next step
setkey(clusters, chr, strand)
setkey(tags, chr, strand)

# Merge on the keys
tmp <- merge(clusters, tags, by = c("chr", "strand"))

# Find index (in merged table, tmp) of largest tag_count in each
# group subject to start_clst <= end <= end_clst
idx <- tmp[between(end, start_clst, end_clst),
           list(IDX=.I[which.max(tag_count)]),
           by=list(chr, start_clst,end_clst,strand)]$IDX

# Get those rows from merged table
tmp[idx]

Output of last line:

     chr strand start_clst end_clst tags_clst  start    end tag_count
 1: chr1      -     569926   569952        25 569942 569943         4
 2: chr1      -     713987   714049        46 714011 714012         4
 3: chr1      +     568911   568941        37 568940 568941         8
 4: chr1      +     569233   569256         4 569255 569256         2
 5: chr1      +     569454   569484         6 569471 569472         2
 6: chr1      +     569793   569803         3 569793 569794         2
 7: chr1      +     569877   569926        80 569925 569926         5
 8: chr1      +     569972   569973         1 569972 569973         1
 9: chr1      +     570048   570095         4 570048 570049         1
10: chr1      +     570166   570167         1 570166 570167         1

Edit

Based on the memory issues discussed in comments below, here's another attempt. I use the intervals package to find overlapping intervals between the two tables. You could also explore parallelizing the for loop to gain speed.

require(data.table)
require(intervals)
clusters <- data.table(clusters)
tags <- data.table(tags)

#  Find all unique combinations of chr and strand...
setkey(clusters, chr_clst, strand_clst)
setkey(tags, chr, strand)

unique.keys <- unique(rbind(clusters[, key(clusters), with=FALSE],
                            tags[, key(tags), with=FALSE], use.names=FALSE))

# ... and then work on each pair individually to avoid creating
# enormous objects in memory
result.list <- vector("list", nrow(unique.keys))
for(i in seq_len(nrow(unique.keys))) {
  tmp.clst <- clusters[unique.keys[i]]
  tmp.tags <- tags[unique.keys[i]]

  # Keep track of each row for later
  tmp.clst[, row.id := seq_len(nrow(tmp.clst))]
  tmp.tags[, row.id := seq_len(nrow(tmp.tags))]

  # Use intervals package to find all overlapping [start, end] 
  # intervals between the two tables
  clst.intervals <- Intervals(tmp.clst[, list(start_clst, end_clst)],
                              type = "Z")
  tags.intervals <- Intervals(tmp.tags[, list(start, end)],
                              type = "Z")
  rownames(tags.intervals) <- tmp.tags$row.id

  # This goes to C++ code in intervals package; 
  # I didn't spend too much time looking over how it works
  overlaps <- interval_overlap(tags.intervals,
                               clst.intervals,
                               check_valid = FALSE)

  # Retrieve rows from clusters table with overlaps and add a column
  # indicating which intervals in tags table they overlapped with
  matches <- lapply(as.integer(names(overlaps)), function(n) {
    ans <- tmp.clst[overlaps[[n]]]
    ans[, match.in.tags := n]
  })

  # List back to one table...
  matches <- rbindlist(matches)

  # ... and join each match from tags to its relevant row from tags
  setkey(matches, match.in.tags)
  setkey(tmp.tags, row.id)

  # add the rows for max of tag_count by start_clst and
  # end_clst from this particular unique key to master list...
  result.list[[i]] <- tmp.tags[matches][, .SD[which.max(tag_count)],
                                        by = list(start_clst, end_clst)]
}

# and concatenate master list into one table,
# getting rid of the helper columns
rbindlist(result.list)[, c("row.id", "row.id.1") := NULL][]

The last line gives:

    start_clst end_clst  chr strand  start    end tag_count chr_clst strand_clst tags_clst
 1:     569926   569952 chr1      - 569942 569943         4     chr1           -        25
 2:     713987   714049 chr1      - 714011 714012         4     chr1           -        46
 3:     568911   568941 chr1      + 568940 568941         8     chr1           +        37
 4:     569233   569256 chr1      + 569255 569256         2     chr1           +         4
 5:     569454   569484 chr1      + 569471 569472         2     chr1           +         6
 6:     569793   569803 chr1      + 569793 569794         2     chr1           +         3
 7:     569877   569926 chr1      + 569925 569926         5     chr1           +        80
 8:     569972   569973 chr1      + 569972 569973         1     chr1           +         1
 9:     570048   570095 chr1      + 570048 570049         1     chr1           +         4
10:     570166   570167 chr1      + 570166 570167         1     chr1           +         1

#2 (2 votes)

Just a quick answer to give some hints, building on other answers and comments.

If X[Y] (or merge(X,Y)) returns a large number of rows, larger than max(nrow(X),nrow(Y)) (such as nrow(X)*nrow(Y), for example), then X[Y][where] (i.e. X[Y] followed by a subset) isn't going to help. The final result is much smaller, but the large X[Y] has to be created first.

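A toy, made-up illustration of that blow-up (it assumes a data.table recent enough to have the allow.cartesian argument mentioned at the end of this answer): joining only on non-unique keys matches every row of Y to every matching row of X before any subset can run.

library(data.table)
X <- data.table(chr = "chr1", strand = "+", start = 1:1000, key = c("chr", "strand"))
Y <- data.table(chr = "chr1", strand = "+", id = 1:2000, key = c("chr", "strand"))
nrow(X[Y, allow.cartesian = TRUE])  # 2,000,000 rows = nrow(X) * nrow(Y)
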
If ranges are required, then one way is w = X[Y,roll=TRUE,which=TRUE] or w = X[Y,mult="first",which=TRUE], or something like that, maybe twice: once for first and once for last. Once you've got the row locations (w) of each range, you can seq or vecseq between beginning and end, and then select the result. There are some examples in other S.O. questions in this tag. It would be nice to build this into data.table, of course, and there is a feature request proposing that join columns could themselves be 2-column list columns containing the bounds of the range query, per row per column.

Or, by-without-by can be used. This is where j is evaluated for each row of i when there is no by clause. Search ?data.table for by-without-by and see the examples. That's how you can stick with the cartesian-then-subset thinking without actually creating the entire cartesian result first. Something like: X[Y,.SD[start<=i.value & i.value<=end]]. That's probably slower than the roll or mult approach, though, due to the 3 vector scans (&, <= and <=). But at least it would avoid the large memory allocation. Note that the i. prefix can be used to refer to non-join i columns explicitly. The between function in data.table could be coded in C to do that much more efficiently, similar to the clamp function in Rcpp. But currently between() is written as R vector scans, and is therefore just as slow.

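As a rough, untested sketch of that idea applied to the question's tables (assuming clusters has had its columns renamed to chr / strand as in the first answer, and noting that data.table 1.9.4+ replaced the old by-without-by default with an explicit by = .EACHI; the on= argument needs v1.9.6+):

library(data.table)
ans <- tags[clusters, on = c("chr", "strand"), {
  ok <- end >= i.start_clst & end <= i.end_clst          # tags falling inside this cluster
  w  <- if (any(ok)) which(ok)[which.max(tag_count[ok])] else NA_integer_
  .(start_clst = i.start_clst, end_clst = i.end_clst,    # carry the cluster bounds through
    start = start[w], end = end[w], tag_count = tag_count[w])
}, by = .EACHI]
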
Hope that helps. I tried to explain the current thinking, right or wrong as it may be.

And we'll improve data.table to catch cartesian allocations with a graceful error giving some tips, as mentioned in the comments [EDIT: allow.cartesian=FALSE now added in v1.8.7]. Thanks!

To expand on paragraph 2:

# Key both tables by chromosome, strand and position so the rolling joins line up
setkey(clusters, chr, strand, end_clst)
setkey(tags, chr, strand, end)

# For each cluster, locate the first tag row at or after start_clst (roll backwards)
# and the last tag row at or before end_clst (roll forwards)
begin = tags[clusters[, list(chr, strand, start_clst)], roll = -Inf, mult = "first", which = TRUE]
end = tags[clusters[, list(chr, strand, end_clst)], roll = +Inf, mult = "last", which = TRUE]

# Within each run of tag rows [begin, end], pick the row with the largest tag_count
idx = mapply(function(x, y) { .i = seq.int(x, y); .i[which.max(tags$tag_count[.i])] }, begin, end)
cbind(clusters, tags[idx])
     chr start_clst end_clst strand tags_clst  chr  start    end strand tag_count
 1: chr1     569926   569952      -        25 chr1 569942 569943      -         4
 2: chr1     713987   714049      -        46 chr1 714011 714012      -         4
 3: chr1     568911   568941      +        37 chr1 568940 568941      +         8
 4: chr1     569233   569256      +         4 chr1 569255 569256      +         2
 5: chr1     569454   569484      +         6 chr1 569471 569472      +         2
 6: chr1     569793   569803      +         3 chr1 569793 569794      +         2
 7: chr1     569877   569926      +        80 chr1 569925 569926      +         5
 8: chr1     569972   569973      +         1 chr1 569972 569973      +         1
 9: chr1     570048   570095      +         4 chr1 570048 570049      +         1
10: chr1     570166   570167      +         1 chr1 570166 570167      +         1

This avoids the cartesian memory allocation issue mentioned in other answers and comments. It uses the following new feature of v1.8.7:

In addition to TRUE / FALSE, roll may now be a positive number (roll forwards/LOCF) or a negative number (roll backwards/NOCB). A finite number limits the distance a value is rolled (limited staleness). roll=TRUE and roll=+Inf are equivalent.

rollends is a new parameter holding two logicals. The first observation is rolled backwards if rollends[1] is TRUE. The last observation is rolled forwards if rollends[2] is TRUE. If roll is a finite number, the same limit applies to the ends.

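A tiny, made-up example of those roll semantics:

library(data.table)
DT <- data.table(x = c(2L, 6L), v = c("a", "b"), key = "x")
DT[.(4L), roll = +Inf]  # rolls x = 2 forwards (LOCF): returns v == "a"
DT[.(4L), roll = -Inf]  # rolls x = 6 backwards (NOCB): returns v == "b"
DT[.(4L), roll = 1]     # forward roll limited to distance 1: no match, so NA
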
#3 (1 vote)

Here is one proposal using apply:

# For each cluster row, pick the start of the tag with the largest tag_count
# within that cluster's chr / strand / [start_clst, end_clst] interval
transform(
  clusters,
  start = apply(clusters[c("chr_clst", "start_clst", "end_clst", "strand_clst")],
                1, function(x) {
                     tmp <- tags[tags$start >= as.numeric(x[2]) &
                                 tags$end <= as.numeric(x[3]) & 
                                 tags$chr == x[1] & 
                                 tags$strand == x[4], c("tag_count", "start")]
                     tmp$start[which.max(tmp$tag_count)]}))

Basically, for each row of clusters, the function looks for the largest value of tag_count inside the relevant subset of tags. The appropriate value of start is selected, and this new vector of start values is used as a new column for clusters.

The result:

   chr_clst start_clst end_clst strand_clst tags_clst  start
1      chr1     568911   568941           +        37 568940
2      chr1     569233   569256           +         4 569255
3      chr1     569454   569484           +         6 569471
4      chr1     569793   569803           +         3 569793
5      chr1     569877   569926           +        80 569925
6      chr1     569926   569952           -        25 569942
7      chr1     569972   569973           +         1 569972
8      chr1     570048   570095           +         4 570048
9      chr1     570166   570167           +         1 570166
10     chr1     713987   714049           -        46 714011   

#4 (1 vote)

This can be done quite efficiently using the foverlaps() function from data.table, which has been available since v1.9.4:

require(data.table) #v1.9.4+
setDT(clusters, key=c("chr_clst", "strand_clst", "start_clst", "end_clst"))
setDT(tags, key=c("chr", "strand", "start", "end"))

ans = foverlaps(clusters, tags)[, .SD[which.max(tag_count)], by=.(chr_clst, strand_clst, start_clst, end_clst)]

#     chr_clst strand_clst start_clst end_clst  start    end tag_count tags_clst
#  1:     chr1           -     569926   569952 569942 569943         4        25
#  2:     chr1           -     713987   714049 714011 714012         4        46
#  3:     chr1           +     568911   568941 568940 568941         8        37
#  4:     chr1           +     569233   569256 569255 569256         2         4
#  5:     chr1           +     569454   569484 569471 569472         2         6
#  6:     chr1           +     569793   569803 569793 569794         2         3
#  7:     chr1           +     569877   569926 569925 569926         5        80
#  8:     chr1           +     569972   569973 569972 569973         1         1
#  9:     chr1           +     570048   570095 570048 570049         1         4
# 10:     chr1           +     570166   570167 570166 570167         1         1

foverlaps() will also eventually be able to perform overlapping range joins without having to key the tables first, similar to the new on= argument (v1.9.6+).
