I have a file with ~ 40 million rows that I need to split based on the first comma delimiter.
The following, using the stringr function str_split_fixed, works well but is very slow.
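library(data.table)
library(stringr)
# Example data: 40,000 rows (id is recycled to match the length of letter1)
df1 <- data.frame(id = 1:1000, letter1 = rep(letters[sample(1:25, 1000, replace = TRUE)], 40))
# combCol2 holds strings such as "1,a,1,a"; only the first comma should split
df1$combCol1 <- paste(df1$id, ',', df1$letter1, sep = '')
df1$combCol2 <- paste(df1$combCol1, ',', df1$combCol1, sep = '')
# Split each string into 2 pieces at the first comma
st1 <- str_split_fixed(df1$combCol2, ',', 2)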
Any suggestions for a faster way to do this?
1 Solution
#1
Update
The stri_split_fixed function in more recent versions of "stringi" has a simplify argument that can be set to TRUE to return a matrix. Thus, the updated solution would be:
stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)
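To make the shape of the result concrete, here is a minimal sketch on a tiny input. The vector x and the name parts are made up for illustration and are not part of the original data; the call itself is the same one shown above.

```r
library(stringi)

# Hypothetical sample input (not from the original data)
x <- c("1,a,1,a", "2,b,2,b", "3,c")

# n = 2 splits on the first comma only, so the second piece keeps any
# remaining commas; simplify = TRUE returns a character matrix
parts <- stri_split_fixed(x, ",", n = 2, simplify = TRUE)

parts[, 1]  # text before the first comma
parts[, 2]  # everything after the first comma
```

Each row of parts corresponds to one input string, so the two columns can be assigned straight back to a data frame if needed.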
Original answer (with updated benchmarks)
If you are comfortable with the "stringr" syntax and don't want to veer too far from it, but you also want to benefit from a speed boost, try the "stringi" package instead:
library(stringr)
library(stringi)
system.time(temp1 <- str_split_fixed(df1$combCol2, ',', 2))
# user system elapsed
# 3.25 0.00 3.25
system.time(temp2a <- do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2)))
# user system elapsed
# 0.04 0.00 0.05
system.time(temp2b <- stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE))
# user system elapsed
# 0.01 0.00 0.01
Most "stringr" functions have "stringi" counterparts, but as this example shows, without simplify = TRUE the "stringi" output needs one extra step (binding the list elements with rbind) to produce a matrix instead of a list.
Here's how it compares with @RichardScriven's suggestion in the comments:
fun1a <- function() do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2))
fun1b <- function() stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)
fun2 <- function() {
  do.call(rbind, regmatches(df1$combCol2, regexpr(",", df1$combCol2),
                            invert = TRUE))
}
library(microbenchmark)
microbenchmark(fun1a(), fun1b(), fun2(), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fun1a() 42.72647 46.35848 59.56948 51.94796 69.29920 98.46330 10
# fun1b() 17.55183 18.59337 20.09049 18.84907 22.09419 26.85343 10
# fun2() 370.82055 404.23115 434.62582 439.54923 476.02889 480.97912 10
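Timings are only comparable if the approaches return the same thing, so it may be worth checking that first. The following is a rough sketch of such a check on a small, made-up vector x standing in for df1$combCol2; the names res1a, res1b and res2 are hypothetical and simply mirror fun1a, fun1b and fun2.

```r
library(stringi)

# Hypothetical sample input standing in for df1$combCol2
x <- c("1,a,1,a", "2,b,2,b", "3,c,3,c")

# The three approaches from the benchmark, applied to x
res1a <- do.call(rbind, stri_split_fixed(x, ",", 2))
res1b <- stri_split_fixed(x, ",", 2, simplify = TRUE)
res2  <- do.call(rbind, regmatches(x, regexpr(",", x), invert = TRUE))

# Each result should be a two-column character matrix with the same contents
stopifnot(
  identical(dim(res1a), dim(res1b)),
  identical(dim(res1a), dim(res2)),
  all(res1a == res1b),
  all(res1a == res2)
)
```

If stopifnot() passes silently, the three results agree on this input, and the benchmark is comparing like with like.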