R: Fast string split on the first occurrence of a delimiter

Time: 2021-03-23 22:08:13

I have a file with ~ 40 million rows that I need to split based on the first comma delimiter.

The following, which uses the stringr function str_split_fixed, gives the right result but is very slow.

library(data.table)
library(stringr)

df1 <- data.frame(id = 1:1000, letter1 = rep(letters[sample(1:25, 1000, replace = TRUE)], 40))
df1$combCol1 <- paste(df1$id, ',',df1$letter1, sep = '')
df1$combCol2 <- paste(df1$combCol1, ',', df1$combCol1, sep = '')

st1 <- str_split_fixed(df1$combCol2, ',', 2)

Any suggestions for a faster way to do this?

1 answer

#1



Update

The stri_split_fixed function in more recent versions of "stringi" has a simplify argument that can be set to TRUE to return a matrix. Thus, the updated solution would be:

stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)
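
To make the effect of the third argument concrete, here is a minimal sketch on two made-up strings (the vector x is an illustration, not data from the question). Passing 2 as n limits the result to two pieces, so only the first comma is used as a split point and the remainder stays intact:

```r
library(stringi)

# Two illustrative inputs shaped like df1$combCol2 (hypothetical values)
x <- c("1,a,1,a", "2,b,2,b")

# n = 2: split at most once, i.e. at the first comma only;
# simplify = TRUE: return a character matrix instead of a list
parts <- stri_split_fixed(x, ",", n = 2, simplify = TRUE)

parts[, 1]  # text before the first comma: "1" "2"
parts[, 2]  # everything after it:         "a,1,a" "b,2,b"
```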

Original answer (with updated benchmarks)

If you are comfortable with the "stringr" syntax and don't want to veer too far from it, but you also want to benefit from a speed boost, try the "stringi" package instead:

library(stringr)
library(stringi)
system.time(temp1 <- str_split_fixed(df1$combCol2, ',', 2))
#    user  system elapsed 
#    3.25    0.00    3.25 
system.time(temp2a <- do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2)))
#    user  system elapsed 
#    0.04    0.00    0.05 
system.time(temp2b <- stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE))
#    user  system elapsed 
#    0.01    0.00    0.01

Most "stringr" functions have "stringi" counterparts, but as this example shows, without simplify = TRUE the "stringi" output requires one extra step: binding the list elements together so that the result is a matrix instead of a list.
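
As a small sketch of that extra step (again on made-up input, not the question's data): without simplify, stri_split_fixed returns a list with one character vector per input string, and do.call(rbind, ...) stacks those vectors into the rows of a matrix.

```r
library(stringi)

# Hypothetical input for illustration
x <- c("1,a,1,a", "2,b,2,b")

# Default output: a list, one character vector of length 2 per input string
pieces <- stri_split_fixed(x, ",", n = 2)

# Bind the list elements row-wise to get a 2-column character matrix
mat <- do.call(rbind, pieces)
```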


Here's how it compares with @RichardScriven's suggestion in the comments:

fun1a <- function() do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2))
fun1b <- function() stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)
fun2 <- function() {
  do.call(rbind, regmatches(df1$combCol2, regexpr(",", df1$combCol2), 
                            invert = TRUE))
} 

library(microbenchmark)
microbenchmark(fun1a(), fun1b(), fun2(), times = 10)
# Unit: milliseconds
#     expr       min        lq      mean    median        uq       max neval
#  fun1a()  42.72647  46.35848  59.56948  51.94796  69.29920  98.46330    10
#  fun1b()  17.55183  18.59337  20.09049  18.84907  22.09419  26.85343    10
#   fun2() 370.82055 404.23115 434.62582 439.54923 476.02889 480.97912    10
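
If adding a package is not an option, a vectorized base-R approach along these lines might be worth trying. This is only a sketch: split_first is a hypothetical helper name, it has not been benchmarked against the functions above, it assumes a literal (non-regex) delimiter, and returning NA in the second column when no delimiter is found is an arbitrary choice.

```r
# Hypothetical helper: split each string at the first literal delimiter
split_first <- function(x, delim = ",") {
  # Position of the first delimiter in each string (-1 if absent);
  # as.integer() drops the attributes that regexpr() attaches
  pos <- as.integer(regexpr(delim, x, fixed = TRUE))
  found <- pos > 0L

  # Text before the delimiter, or the whole string if there is none
  first <- ifelse(found, substr(x, 1L, pos - 1L), x)
  # Text after the delimiter, or NA if there is none
  rest <- ifelse(found, substr(x, pos + nchar(delim), nchar(x)), NA_character_)

  # Two-column character matrix without column names
  cbind(first, rest, deparse.level = 0)
}

res <- split_first(c("1,a,1,a", "nocomma"))
```

Unlike fun2, this avoids building a list and binding it, since regexpr and substr both work on the whole vector at once.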
