Efficient ways to go from long to wide format - similar to dcast

Time: 2021-09-18 04:26:42

This question pertains to creating "wide" tables, similar to what you can create with dcast from reshape2. I know this has been discussed many times before, but my question is about how to make the process more efficient. I have provided several examples below, which might make the question seem lengthy, but most of it is just test code for benchmarking.


Starting with a simple example,


> z <- data.table(col1=c(1,1,2,3,4), col2=c(10,10,20,20,30), 
                  col3=c(5,2,2.3,2.4,100), col4=c("a","a","b","c","a"))

> z
     col1 col2  col3 col4
1:    1   10   5.0    a      # col1 = 1, col2 = 10
2:    1   10   2.0    a      # col1 = 1, col2 = 10
3:    2   20   2.3    b
4:    3   20   2.4    c
5:    4   30 100.0    a

We need to create a "wide" table that has the values of col4 as column names and, for each combination of col1 and col2, the value of sum(col3) in the matching column.


> ulist = unique(z$col4) # These will be the additional column names

# Create long table with sum
> z2 <- z[,list(sumcol=sum(col3)), by='col1,col2,col4']

# Pivot the long table
> z2 <- z2[,as.list((sumcol[match(ulist,col4)])), by=c("col1","col2")]

# Add column names
> setnames(z2[],c("col1","col2",ulist))

> z2
   col1 col2   a   b   c
1:    1   10   7  NA  NA  # a = 5.0 + 2.0 = 7 corresponding to col1=1, col2=10
2:    2   20  NA 2.3  NA
3:    3   20  NA  NA 2.4
4:    4   30 100  NA  NA
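
For comparison, the same small pivot can be written as a single dcast call. This is just a sketch using reshape2's interface; fill = NA_real_ keeps empty cells as NA, since the default fill would be sum(numeric(0)), i.e. 0.

library(reshape2)

# One-step pivot of the toy table z: rows keyed by (col1, col2), one column
# per value of col4, cells holding sum(col3); empty cells stay NA via fill
dcast(z, col1 + col2 ~ col4, value.var = "col3",
      fun.aggregate = sum, fill = NA_real_)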

The issue I have is that while the above method is fine for smaller tables, it's virtually impossible to run it on very large tables (unless you are fine with waiting x hours, maybe).


I believe this is likely related to the fact that the pivoted / wide table is much larger than the original table, since each row in the wide table has n columns corresponding to the unique values of the pivot column, regardless of whether any value corresponds to that cell (these are the NA values above). The size of the new table is therefore often 2x+ that of the original "long" table.

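To make that concrete on the toy example above (a rough sketch; with ~20-25 pivot values and millions of (col1, col2) groups the gap is far larger), you can compare the number of cells the wide table allocates with the number that actually hold data:

# Using z, z2 and ulist from the example above
n.groups <- nrow(z2)                       # unique (col1, col2) combinations: 4
n.pivot  <- length(unique(z$col4))         # unique col4 values: 3
n.groups * n.pivot                         # value cells allocated in the wide table: 12
sum(!is.na(z2[, ulist, with = FALSE]))     # cells that actually hold a value: 4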

My original table has ~500 million rows and about 20 unique values in the pivot column. I have attempted to run the above using only 5 million rows and it takes forever in R (too long to wait for it to complete).


For benchmarking purposes, the example (using 5 million rows) completes in about 1 minute on production RDBMS systems running multithreaded. It completes in about 8 "seconds" on a single core using KDB+/Q (http://www.kx.com). It might not be a fair comparison, but it gives a sense that these operations can be done much faster by alternative means. KDB+ doesn't have sparse rows, so it allocates memory for all the cells and is still much faster than anything else I have tried.


What I need, however, is an R solution :) and so far I haven't found an efficient way to perform similar operations.


If you have experience with any alternative / more optimal solution and could share it, I'd be interested in hearing about it. Sample code is provided below; you can vary the value of n to simulate the results. The number of unique values for the pivot column (column c3) has been fixed at 25.


n = 100 # Increase this to benchmark

z <- data.table(c1=sample(1:10000,n,replace=T),
    c2=sample(1:100000,n,replace=T),
    c3=sample(1:25,n,replace=T),
    price=runif(n)*10)

c3.unique <- 1:25

z <- z[,list(sumprice=sum(price)), by='c1,c2,c3'][,as.list(sumprice[match(c3.unique,c3)]), by='c1,c2']
setnames(z, c("c1","c2",as.character(c3.unique)))  # setnames requires character names
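
To time this as n grows, one option is to wrap the reshape step in system.time(). The snippet below is a self-contained sketch of that (it regenerates the table under a different name, dt, so the code above is left untouched); the numbers will of course vary by machine and data.table version.

library(data.table)

n <- 1e6
dt <- data.table(c1 = sample(1:10000,  n, replace = TRUE),
                 c2 = sample(1:100000, n, replace = TRUE),
                 c3 = sample(1:25,     n, replace = TRUE),
                 price = runif(n) * 10)
c3.unique <- 1:25

# Time only the aggregate + pivot + rename steps
system.time({
  w <- dt[, list(sumprice = sum(price)), by = "c1,c2,c3"][
          , as.list(sumprice[match(c3.unique, c3)]), by = "c1,c2"]
  setnames(w, c("c1", "c2", as.character(c3.unique)))
})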

Thanks,


  • Raj.

1 Answer

#1 (score: 4)

For n=1e6 the following takes about 10 seconds with plain dcast and about 4 seconds with dcast.data.table:


library(reshape2)

dcast(z[, sum(price), by = list(c1, c2, c3)], c1 + c2 ~ c3)

# or with 1.8.11
dcast.data.table(z, c1 + c2 ~ c3, fun = sum)
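
(A side note, not part of the original answer: in current data.table releases dcast is an S3 generic that dispatches on data.tables, so dcast.data.table is no longer called directly. Assuming the long benchmark table z from the question, the modern equivalent should look roughly like this; fill = NA_real_ keeps empty cells as NA rather than 0.)

library(data.table)

# Modern spelling of the dcast.data.table call above
dcast(z, c1 + c2 ~ c3, value.var = "price",
      fun.aggregate = sum, fill = NA_real_)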
