I have a large (millions of rows) melted data.table with the usual melt-style unrolling in the variable and value columns. I need to cast the table in wide form (rolling the variables up). The problem is that the data table also has a list column called data, which I need to preserve. This makes it impossible to use reshape2, because dcast cannot deal with non-atomic columns. Therefore, I need to do the rolling up myself.
The answer from a previous question about working with melted data tables does not apply here because of the list column.
I am not satisfied with the solution I've come up with. I'm looking for suggestions for a simpler/faster implementation.
library(data.table)

x <- LETTERS[1:3]
dt <- data.table(
  x=rep(x, each=2),
  y='d',
  data=list(list(), list(), list(), list(), list(), list()),
  variable=rep(c('var.1', 'var.2'), 3),
  value=seq(1,6)
)
# Column template set up
list_template <- Reduce(
  function(l, col) { l[[col]] <- col; l },
  unique(dt$variable),
  list())

# Expression set up
q <- substitute({
  l <- lapply(
    list_template,
    function(col) .SD[variable==as.character(col)]$value)
  l$data = .SD[1,]$data
  l
}, list(list_template=list_template))

# Roll up
dt[, eval(q), by=list(x, y)]
   x y var.1 var.2   data
1: A d     1     2 <list>
2: B d     3     4 <list>
3: C d     5     6 <list>
2 Answers
#1
I have a somewhat hacky method that might do the trick. Importantly, I assume that each x, y, list combination is unique! If not, please disregard.
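If it is unclear whether that assumption holds, it can be checked before casting. The following is only a sketch on toy data (the table below and the names check and n_distinct are made up for illustration); it counts the distinct list entries per x, y group using base R's unique(), which accepts lists.

```r
library(data.table)

# Toy data (an assumption for illustration): every row carries the same list entry
dt <- data.table(
  x = rep(LETTERS[1:3], each = 2),
  y = "d",
  data = rep(list(list("a", "b")), 6),
  variable = rep(c("var.1", "var.2"), 3),
  value = 1:6
)

# Count distinct list entries per (x, y) group; base unique() works on lists
check <- dt[, .(n_distinct = length(unique(data))), by = .(x, y)]

# TRUE only when every group carries exactly one distinct `data` entry
assumption_holds <- all(check$n_distinct == 1L)
```

If assumption_holds is FALSE, taking one row per group would silently drop some list entries, so the approach below would need rethinking.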
I'm going to create two separate data.tables: the first is dcast without the data list objects, and the second holds only the unique data list objects plus a key. Then I just merge them together to get the desired result.
require(data.table)
require(stringr)
require(reshape2)

x <- LETTERS[1:3]
dt <- data.table(
  x=rep(x, each=2),
  y='d',
  # one list entry per row (spelled out rather than relying on recycling)
  data=rep(list(list("a","b"), list("c","d")), 3),
  variable=rep(c('var.1', 'var.2'), 3),
  value=seq(1,6)
)

# First create the dcasted datatable without the pesky list objects:
dt_nolist <- dt[,list(x,y,variable,value)]
dt_dcast <- data.table(dcast(dt_nolist,x+y~variable,value.var="value"),
                       key=c("x","y"))

# Second: create a datatable with only unique "groups" of x,y, list
dt_list <- dt[,list(x,y,data)]

# Rows are duplicated, so I'd like to use unique() to get rid of them, but
# unique() doesn't work when there are list objects in the data.table.
# Instead, I cheat by applying a value to each row within an x,y "group"
# that is unique within EACH group, but present within EVERY group.
# Then simply subselect based on that unique value.
# I've chosen rank(), but no doubt there are other options
dt_list <- dt_list[,rank:=rank(str_c(x,y),ties.method="first"),by=str_c(x,y)]

# now keep only one row per x,y "group"
dt_list <- dt_list[rank==1]
setkeyv(dt_list,c("x","y"))

# drop the rank since we no longer need it
dt_list[,rank:=NULL]

# Finally just merge back together
dt_final <- merge(dt_dcast,dt_list)
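As one of the "other options" mentioned in the comments, the rank() step could probably be replaced by picking the first row number of each group with data.table's .I symbol, which avoids the stringr dependency. This is a sketch under the same assumption (one distinct list entry per x, y group would be kept); first_rows is a name introduced here for illustration.

```r
library(data.table)

dt <- data.table(
  x = rep(LETTERS[1:3], each = 2),
  y = "d",
  data = rep(list(list("a", "b"), list("c", "d")), 3),
  variable = rep(c("var.1", "var.2"), 3),
  value = 1:6
)

# .I holds the row numbers of dt; .I[1L] is the first row of each (x, y) group
first_rows <- dt[, .I[1L], by = .(x, y)]$V1

# Subsetting by row number leaves the list column untouched
dt_list <- dt[first_rows, .(x, y, data)]
setkeyv(dt_list, c("x", "y"))
```

The resulting dt_list has one row per x, y group and can be merged with dt_dcast as before.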
#2
This old question piqued my curiosity, as data.table has been improved significantly since 2013.
However, even with data.table version 1.11.4
dcast(dt, x + y + data ~ variable)
still returns an error
Columns specified in formula can not be of type list
The workaround follows the general outline of jonsedar's answer:
- Reshape the non-list columns from long to wide format
- Aggregate the list column data grouped by x and y
- Join the two partial results on x and y
but uses features of the current data.table syntax, e.g., the on parameter:
dcast(dt, x + y ~ variable)[
  dt[, .(data = .(first(data))), by = .(x, y)], on = .(x, y)]

   x y var.1 var.2   data
1: A d     1     2 <list>
2: B d     3     4 <list>
3: C d     5     6 <list>
The list column data is aggregated by taking the first element. This is in line with OP's code line
l$data = .SD[1,]$data
which also picks the first element.
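For reuse, the three steps could be wrapped in a small function. This is only a sketch: cast_keep_list and its id_cols argument are hypothetical names, the column names variable, value and data are taken from the question, and the first list entry per group is kept via data[1L] rather than first().

```r
library(data.table)

# Hypothetical helper: cast `value` wide and carry the first `data` entry per group
cast_keep_list <- function(dt, id_cols = c("x", "y")) {
  # Step 1: reshape the atomic columns from long to wide
  f <- as.formula(paste(paste(id_cols, collapse = " + "), "~ variable"))
  wide <- dcast(dt, f, value.var = "value")

  # Step 2: keep the first list entry of each group (data[1L] stays a list)
  lists <- dt[, .(data = data[1L]), by = id_cols]

  # Step 3: join the two partial results on the id columns
  wide[lists, on = id_cols]
}

dt <- data.table(
  x = rep(LETTERS[1:3], each = 2),
  y = "d",
  data = rep(list(list("a", "b")), 6),
  variable = rep(c("var.1", "var.2"), 3),
  value = 1:6
)

res <- cast_keep_list(dt)
```

Passing value.var explicitly avoids relying on dcast guessing the value column.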