I have a script that reads data from a CSV file into a data.table and then splits the text in one column into several new columns. I am currently using the lapply and strsplit functions to do this. Here's an example:
    library("data.table")

    df = data.table(PREFIX = c("A_B","A_C","A_D","B_A","B_C","B_D"), VALUE = 1:6)
    dt = as.data.table(df)

    # split PREFIX into new columns
    dt$PX = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 1))
    dt$PY = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 2))

    dt
    #    PREFIX VALUE PX PY
    # 1:    A_B     1  A  B
    # 2:    A_C     2  A  C
    # 3:    A_D     3  A  D
    # 4:    B_A     4  B  A
    # 5:    B_C     5  B  C
    # 6:    B_D     6  B  D
In the example above the column PREFIX is split into two new columns, PX and PY, on the "_" character.
Even though this works just fine, I was wondering if there is a better (more efficient) way to do this using data.table. My real datasets have 10M+ rows, so time/memory efficiency becomes really important.
UPDATE:
Following @Frank's suggestion I created a larger test case and used the suggested commands, but stringr::str_split_fixed takes a lot longer than the original method.
    library("data.table")
    library("stringr")

    system.time({
      df = data.table(PREFIX = rep(c("A_B","A_C","A_D","B_A","B_C","B_D"), 1000000),
                      VALUE = rep(1:6, 1000000))
      dt = data.table(df)
    })
    #    user  system elapsed
    #   0.682   0.075   0.758

    system.time({
      dt[, c("PX","PY") := data.table(str_split_fixed(PREFIX,"_",2))]
    })
    #    user  system elapsed
    # 738.283   3.103 741.674

    rm(dt)

    system.time({
      df = data.table(PREFIX = rep(c("A_B","A_C","A_D","B_A","B_C","B_D"), 1000000),
                      VALUE = rep(1:6, 1000000))
      dt = as.data.table(df)
    })
    #    user  system elapsed
    #   0.123   0.000   0.123

    # split PREFIX into new columns
    system.time({
      dt$PX = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 1))
      dt$PY = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 2))
    })
    #    user  system elapsed
    #  33.185   0.000  33.191
So the str_split_fixed method takes about 20 times longer.
4 answers
#1
70
Update: From version 1.9.6 (on CRAN as of Sep '15), we can use the function tstrsplit() to get the results directly (and in a much more efficient manner):
    require(data.table) ## v1.9.6+
    dt[, c("PX", "PY") := tstrsplit(PREFIX, "_", fixed=TRUE)]
    #    PREFIX VALUE PX PY
    # 1:    A_B     1  A  B
    # 2:    A_C     2  A  C
    # 3:    A_D     3  A  D
    # 4:    B_A     4  B  A
    # 5:    B_C     5  B  C
    # 6:    B_D     6  B  D
tstrsplit() is basically a wrapper for transpose(strsplit()), where the transpose() function, also recently implemented, transposes a list. Please see ?tstrsplit() and ?transpose() for examples.
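To make the wrapper relationship concrete, here is a minimal sketch that builds the two columns in separate strsplit() and transpose() steps and then with a single tstrsplit() call. It assumes data.table v1.9.6 or later and reuses the small table from the question; the names parts, cols, and direct are made up for illustration.

```r
library(data.table)  # assumes v1.9.6+, where tstrsplit() and transpose() exist

dt <- data.table(PREFIX = c("A_B", "A_C", "A_D", "B_A", "B_C", "B_D"),
                 VALUE  = 1:6)

# Step 1: strsplit() returns one character vector per row, e.g. c("A", "B").
parts <- strsplit(dt$PREFIX, "_", fixed = TRUE)

# Step 2: transpose() turns the per-row list into a per-column list.
cols <- transpose(parts)

# tstrsplit() performs both steps in one call.
direct <- tstrsplit(dt$PREFIX, "_", fixed = TRUE)

# Both routes should give the same list of columns.
stopifnot(identical(cols, direct))

# Assign the columns by reference.
dt[, c("PX", "PY") := direct]
```

Passing fixed = TRUE makes strsplit() treat "_" as a literal string rather than a regular expression, which is typically cheaper when the separator is a plain character.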
See history for old answers.
#2
12
I'm adding an answer for anyone who does not use data.table v1.9.5 and also wants a one-line solution.
    dt[, c('PX','PY') := do.call(Map, c(f = c, strsplit(PREFIX, '_')))]
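The one-liner is compact, so a step-by-step sketch of the same idea may help. It uses the "_" separator from the question's data, and the names pieces and columns are illustrative only.

```r
library(data.table)

dt <- data.table(PREFIX = c("A_B", "A_C", "A_D", "B_A", "B_C", "B_D"),
                 VALUE  = 1:6)

# One character vector per row: list(c("A", "B"), c("A", "C"), ...)
pieces <- strsplit(dt$PREFIX, "_", fixed = TRUE)

# do.call() passes every row as a separate argument to Map(), so the call is
# effectively Map(f = c, row1, row2, ...). Map() then combines the first
# elements of all rows, then the second elements, giving one vector per column.
columns <- do.call(Map, c(f = c, pieces))

# unname() drops the names that Map() derives from its first argument.
dt[, c("PX", "PY") := unname(columns)]
```

This sketch assumes every row splits into the same number of pieces. Map() recycles shorter arguments instead of padding them with NA, so ragged input would need separate handling.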
#3
5
Using the splitstackshape package:
    library(splitstackshape)
    cSplit(df, splitCols = "PREFIX", sep = "_", direction = "wide", drop = FALSE)
    #    PREFIX VALUE PREFIX_1 PREFIX_2
    # 1:    A_B     1        A        B
    # 2:    A_C     2        A        C
    # 3:    A_D     3        A        D
    # 4:    B_A     4        B        A
    # 5:    B_C     5        B        C
    # 6:    B_D     6        B        D
#4
0
With tidyr the solution is:
    library(tidyr)
    separate(df, col = "PREFIX", into = c("PX", "PY"), sep = "_")