Splitting a text string in a data.table column

I have a script that reads in data from a CSV file into a data.table and then splits the text in one column into several new columns. I am currently using the lapply and strsplit functions to do this. Here's an example:

    library("data.table")

    df = data.table(PREFIX = c("A_B","A_C","A_D","B_A","B_C","B_D"),
                    VALUE  = 1:6)
    dt = as.data.table(df)

    # split PREFIX into new columns
    dt$PX = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 1))
    dt$PY = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 2))

    dt
    #    PREFIX VALUE PX PY
    # 1:    A_B     1  A  B
    # 2:    A_C     2  A  C
    # 3:    A_D     3  A  D
    # 4:    B_A     4  B  A
    # 5:    B_C     5  B  C
    # 6:    B_D     6  B  D

In the example above the column PREFIX is split into two new columns PX and PY on the "_" character.

Even though this works just fine, I was wondering if there is a better (more efficient) way to do this using data.table. My real datasets have 10M+ rows, so time and memory efficiency become really important.

UPDATE:

Following @Frank's suggestion I created a larger test case and used the suggested commands, but the stringr::str_split_fixed takes a lot longer than the original method.

    library("data.table")
    library("stringr")

    system.time({
        df = data.table(PREFIX = rep(c("A_B","A_C","A_D","B_A","B_C","B_D"), 1000000),
                        VALUE  = rep(1:6, 1000000))
        dt = data.table(df)
    })
    #   user  system elapsed
    #  0.682   0.075   0.758

    system.time({ dt[, c("PX","PY") := data.table(str_split_fixed(PREFIX,"_",2))] })
    #    user  system elapsed
    # 738.283   3.103 741.674

    rm(dt)
    system.time({
        df = data.table(PREFIX = rep(c("A_B","A_C","A_D","B_A","B_C","B_D"), 1000000),
                        VALUE  = rep(1:6, 1000000))
        dt = as.data.table(df)
    })
    #    user  system elapsed
    #   0.123   0.000   0.123

    # split PREFIX into new columns
    system.time({
        dt$PX = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 1))
        dt$PY = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 2))
    })
    #    user  system elapsed
    #  33.185   0.000  33.191

So the str_split_fixed method takes about 20 times longer.

4 Answers

#1

Update: From version 1.9.6 (on CRAN as of Sep'15), we can use the function tstrsplit() to get the results directly (and in a much more efficient manner):

    require(data.table) ## v1.9.6+
    dt[, c("PX", "PY") := tstrsplit(PREFIX, "_", fixed=TRUE)]
    #    PREFIX VALUE PX PY
    # 1:    A_B     1  A  B
    # 2:    A_C     2  A  C
    # 3:    A_D     3  A  D
    # 4:    B_A     4  B  A
    # 5:    B_C     5  B  C
    # 6:    B_D     6  B  D

tstrsplit() is basically a wrapper for transpose(strsplit()), where the transpose() function, also recently implemented, transposes a list. Please see ?tstrsplit and ?transpose for examples.
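To make the wrapper relationship concrete, here is a minimal sketch that runs the two steps separately and then in one call. The vector name `prefix` and the ragged input are made up for illustration, and the sketch assumes data.table v1.9.6 or later is installed.

```r
library(data.table)  # assumes v1.9.6 or later

prefix <- c("A_B", "A_C", "B_D")  # illustrative input, not the question's data

# strsplit() returns one character vector per row
by_row <- strsplit(prefix, "_", fixed = TRUE)

# transpose() regroups the pieces into one vector per output column
by_col <- transpose(by_row)

# tstrsplit() performs both steps in a single call
direct <- tstrsplit(prefix, "_", fixed = TRUE)

# rows with fewer pieces are padded with NA (the default fill)
ragged <- tstrsplit(c("A_B", "C"), "_", fixed = TRUE)
```

Because the result is a list with one element per new column, it can be assigned directly on the left-hand side of `:=`, as in the answer above.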

See history for old answers.

#2

I am adding an answer for anyone who does not use data.table v1.9.5 and also wants a one-line solution.

    dt[, c('PX','PY') := do.call(Map, c(f = c, strsplit(PREFIX, '_'))) ]
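The one-liner is easier to follow when unpacked. The sketch below uses only base R, with a made-up vector `prefix` and intermediate names chosen for illustration; it shows how `do.call(Map, ...)` turns the per-row pieces into per-column vectors.

```r
prefix <- c("A_B", "A_C", "B_D")  # illustrative input, not the question's data

# one character vector per row: c("A","B"), c("A","C"), c("B","D")
parts <- strsplit(prefix, "_", fixed = TRUE)

# Map(c, parts[[1]], parts[[2]], ...) combines the i-th piece of every row,
# so each row is passed to Map() as a separate argument
cols <- do.call(Map, c(f = c, parts))

# Map() names the result after the first row's pieces; drop those names
cols <- unname(cols)
```

Since every row becomes its own argument to `Map()`, this is probably best suited to small or moderate tables; for very large ones the `tstrsplit()` approach from the first answer is likely the better fit.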

#3

Using splitstackshape package:

    library(splitstackshape)
    cSplit(df, splitCols = "PREFIX", sep = "_", direction = "wide", drop = FALSE)
    #    PREFIX VALUE PREFIX_1 PREFIX_2
    # 1:    A_B     1        A        B
    # 2:    A_C     2        A        C
    # 3:    A_D     3        A        D
    # 4:    B_A     4        B        A
    # 5:    B_C     5        B        C
    # 6:    B_D     6        B        D

#4

With tidyr the solution is:

    library(tidyr)
    separate(df, col = "PREFIX", into = c("PX", "PY"), sep = "_")
