My problem is very large, and looping through the data.table to do what I want is too slow, so I am trying to avoid looping. Let's assume I have a data.table as follows:
library(data.table)
a <- data.table(i = c(1,2,3), j = c(2,2,6), k = list(c("a","b"),c("a","c"),c("b")))
> a
   i j   k
1: 1 2 a,b
2: 2 2 a,c
3: 3 6   b
I want to group by the values in k, with something like this:
a[, sum(j), by = k]
Right now I am getting the following error:
Error in `[.data.table`(a, , sum(i), by = k) :
The items in the 'by' or 'keyby' list are length (2,2,1). Each must be same length as rows in x or number of rows returned by i (3).
What I am looking for is to first group all rows that have "a" in column k and calculate sum(j), then all rows that have "b", and so on. So the desired answer would be:
k V1
a 4
b 8
c 2
Any hints on how to do this efficiently? I can't melt column k by repeating the rows, since the resulting data.table would be too big in my case.
3 Answers
#1
8
I think this might work:
a[, .(k = unlist(k)), by=.(i,j)][,sum(j),by=k]
   k V1
1: a  4
2: b  8
3: c  2
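To make it easier to see what the chain above does, here is the same idea split into two steps on the toy table from the question. This is only a sketch; the names `long` and `res` are my own and do not come from the answer.

```r
library(data.table)

a <- data.table(i = c(1, 2, 3), j = c(2, 2, 6),
                k = list(c("a", "b"), c("a", "c"), c("b")))

# Step 1: one output row per element of k, carrying i and j along
# (`long` is a hypothetical name for the intermediate table)
long <- a[, .(k = unlist(k)), by = .(i, j)]

# Step 2: k is now an ordinary character column, so a grouped sum works
res <- long[, sum(j), by = k]
```

The intermediate `long` table has one row per (row, k element) pair, which is exactly the unnested layout the question hoped to avoid, so this approach trades memory for simplicity.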
#2
4
If we are using tidyr, a compact option would be:
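library(tidyr)
# as.data.table() ensures the result is a data.table whatever class unnest() returns
as.data.table(unnest(a, k))[, sum(j), by = k]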
# k V1
#1: a 4
#2: b 8
#3: c 2
Or using dplyr/tidyr pipes:
library(dplyr)
unnest(a, k) %>%
group_by(k) %>%
summarise(V1 = sum(j))
# k V1
# <chr> <dbl>
#1 a 4
#2 b 8
#3 c 2
#3
2
Since by-group operations can be slow, I'd consider...
dat = a[rep(1:.N, lengths(k)), c(.SD, .(k = unlist(a$k))), .SDcols=setdiff(names(a), "k")]
i j k
1: 1 2 a
2: 1 2 b
3: 2 2 a
4: 2 2 c
5: 3 6 b
We're repeating the rows of columns i:j to match the unlisted k. The data should probably be kept in this long format instead of using a list column. From there, as in @MikeyMike's answer, we can run dat[, sum(j), by=k].
In data.table 1.9.7+, we can similarly do:
dat = a[, c(.SD[rep(.I, lengths(k))], .(k = unlist(k))), .SDcols=i:j]
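Since the question says that repeating every row would be too large, one possible variation is to expand only the column being summed instead of all columns. This is my own sketch and not part of the answers above; `long_j` and `res` are hypothetical names, and the long vector of k values still has to fit in memory.

```r
library(data.table)

a <- data.table(i = c(1, 2, 3), j = c(2, 2, 6),
                k = list(c("a", "b"), c("a", "c"), c("b")))

# Two-column long table: each j value is repeated once per element of its k,
# and no other columns are copied
long_j <- a[, .(k = unlist(k), j = rep(j, lengths(k)))]

# Grouped sum on the atomic k column
res <- long_j[, .(V1 = sum(j)), by = k]
```

With many wide columns in the original table, this keeps the expanded intermediate to two columns, which may be enough to make the unnesting affordable.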