I have a large data.table (9 M rows) with two columns: fcombined and value. fcombined is a factor, but it is actually the result of interacting two factors. The question is: what is the most efficient way to split the one factor column back into two? I have already come up with a solution that works OK, but maybe there is a more straightforward way that I have missed. The working example is:
library(data.table)
library(stringr)
f1=1:20
f2=1:20
g=expand.grid(f1,f2)
combinedfactor=as.factor(paste(g$Var1,g$Var2,sep="_"))
largedata=1:10^6
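# repeat the 400 combined levels explicitly so both columns have the same length
DT=data.table(fcombined=rep(combinedfactor,length.out=length(largedata)),value=largedata)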
splitfactorcol = function(res, colname, splitby = "_", namesofnewcols) {
  # the number of columns retained is length(namesofnewcols)
  lev = levels(res[[colname]])
  # lookup table: one row per level, holding its integer code and its split parts
  helptable = data.table(.factid = seq_along(lev), str_split_fixed(lev, splitby, length(namesofnewcols)))
  setnames(helptable, colnames(helptable), c(".factid", namesofnewcols))
  setkey(helptable, .factid)
  # integer code of each row's level, used as the join key
  res$.factid = as.integer(res[[colname]])
  setkey(res, .factid)
  m = merge(res, helptable)
  m$.factid = NULL
  m
}
splitfactorcol(DT,"fcombined",splitby="_",c("f1","f2"))
1 Answer
I think this does the trick and is about 5x faster.
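# key the table on the combined factor so it can be joined on that column
setkey(DT, fcombined)
# build a lookup table from the levels (one row per level) and join it back onto DT
DT[DT[, data.table(fcombined = levels(fcombined),
                   do.call(rbind, strsplit(levels(fcombined), "_")))]]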
I split the levels and then simply merged that result back into the original data.table.
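The same idea of splitting only the levels can also be sketched without a join, by indexing the split pieces with the factor's integer codes. This is a variation rather than the method above: it assumes data.table's tstrsplit() is available, uses a smaller stand-in table than the 9 M row original, and the column names f1 and f2 are simply the ones used in the question.

```r
library(data.table)

# Small stand-in for the question's data: 400 combined levels, repeated.
g <- expand.grid(1:20, 1:20)
combined <- factor(paste(g$Var1, g$Var2, sep = "_"))
DT <- data.table(fcombined = rep(combined, length.out = 1e4),
                 value = seq_len(1e4))

# Split each level once; tstrsplit() returns one character vector per piece.
parts <- tstrsplit(levels(DT$fcombined), "_", fixed = TRUE)

# The factor's integer codes index straight into the per-level pieces,
# so no join is needed. f1 and f2 are added by reference as character columns.
codes <- as.integer(DT$fcombined)
DT[, c("f1", "f2") := lapply(parts, function(p) p[codes])]
```

Unlike the keyed join, this sketch modifies DT in place and leaves the row order untouched. The new columns are character vectors; wrap them in factor() if factors are needed.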
Btw, in my tests strsplit was about 2x faster (for this task) than the stringr function.
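For anyone who wants to check that comparison on their own machine, here is a rough timing sketch. It assumes the stringr function in question is str_split_fixed (the one used in the question), and the repetition count n_rep is an arbitrary choice for illustration. Timings depend on the machine and package versions, so no particular ratio is implied.

```r
library(stringr)

# Same 400 levels as in the question's example.
g <- expand.grid(1:20, 1:20)
lev <- levels(factor(paste(g$Var1, g$Var2, sep = "_")))

# Both calls produce a 400 x 2 character matrix of the split pieces.
m_base <- do.call(rbind, strsplit(lev, "_"))
m_stringr <- str_split_fixed(lev, "_", 2)

# Repeat each split so the elapsed time is large enough to measure.
n_rep <- 200  # arbitrary repetition count (an assumption, not from the answer)
t_base <- system.time(for (i in seq_len(n_rep)) do.call(rbind, strsplit(lev, "_")))
t_stringr <- system.time(for (i in seq_len(n_rep)) str_split_fixed(lev, "_", 2))

# Elapsed seconds for each approach.
print(c(base = t_base[["elapsed"]], stringr = t_stringr[["elapsed"]]))
```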