R: An efficient way to replace strings in a data frame (data.table)

Time: 2021-03-14 01:11:09

Here's the code for the issue:

library(data.table)  # needed for data.table() below

set.seed(1234)
y <- 1e7  # number of rows

# Replace every string that contains a given pattern with a fixed label
renamer <- function(text){
  text[grep("ac", text)] <- "aaa"
  text[grep("gf", text)] <- "bbb"
  text[grep("er", text)] <- "ccc"
  text[grep("hy", text)] <- "ddd"
  text[grep("nh", text)] <- "eee"
  text[grep("oi", text)] <- "fff"
  text[grep("nu", text)] <- "ggg"
  text[grep("vf", text)] <- "hhh"
  text[grep("cd", text)] <- "iii"
  text[grep("po", text)] <- "jjj"
  return(text)
}

# Build 100 random 15-letter strings
smp <- NULL
for(i in 1:100){
  smp <- c(smp, paste0(sample(letters, 15, TRUE), collapse = ""))
}

# Sample them with replacement, so the values repeat many times
df <- data.table(a = sample(smp, y, TRUE))

# > system.time(renamer(text = df$a))
# user  system elapsed 
# 15.54    0.08   15.70 

Problem: I have a large data set in which most of the values need to be replaced in a time-efficient manner. My code does the trick, but I could really use a faster solution.

Note that the values recur many times. And (as sometimes happens) while I was writing this question, I probably came up with a solution: convert the column to a factor and replace the level values. I decided to post the question anyway, since someone else might need help with this problem, or there may be a cleverer alternative.

Here's the factor solution, for benchmarking:

# > system.time({
# +   df$a <- factor(df$a)
# +   levels(df$a) <- renamer(levels(df$a))
# +   df$a <- as.character(df$a)
# + })
# user  system elapsed 
# 1.25    0.14    1.42 
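
To illustrate why the factor route is so much cheaper, here is a minimal sketch on toy data. It uses a shortened, hypothetical stand-in for renamer() (two patterns only, named renamer_small here) so that it is self-contained. The point is that the renaming function only sees the distinct levels, not every row, and that levels which end up with the same label are merged:

```r
# Shortened, hypothetical stand-in for renamer() with two patterns only.
renamer_small <- function(text) {
  text[grep("ac", text)] <- "aaa"
  text[grep("gf", text)] <- "bbb"
  text
}

x <- c("xxacxx", "yyacyy", "qqqq", "xxacxx")  # toy data with a repeated value

f <- factor(x)                          # three distinct levels for four values
levels(f) <- renamer_small(levels(f))   # rename the levels, not the values
res <- as.character(f)                  # levels that became equal are merged
res
# [1] "aaa"  "aaa"  "qqqq" "aaa"
```

On the real data this means renamer() runs over at most 100 distinct strings instead of 1e7 rows, which is presumably where most of the speedup comes from.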

1 answer

#1

I would suggest creating a simple lookup table and using the excellent stringi::stri_detect_fixed function (this gives me a roughly 100x speedup):

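library(data.table)
library(stringi)

# Patterns to look for, in the same order as in renamer()
Lookup <- c("ac", "gf", "er", "hy", "nh", "oi", "nu", "vf", "cd", "po")
# Matching labels: "aaa", "bbb", ..., "jjj"
Rename <- substring(paste(rep(letters[1:10], each = 3), collapse = ""), 
                    seq(1, 30, 3), seq(3, 30, 3))

# Grouping by a means every distinct string is tested only once.
# Note: Rename[...] yields a single value only when exactly one pattern matches.
system.time(setDT(df)[, Result := Rename[stri_detect_fixed(a, Lookup)], by = a])
# user  system elapsed 
# 0.10    0.05    0.14 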
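
The one-liner above works per distinct value, which is the same idea as the factor trick in the question. As a hedged, base-R sketch of that idea (not the answerer's code), the hypothetical helper rename_unique() below renames each distinct string once and then maps the results back with match(). It is written under two assumptions taken from renamer(): strings that match no pattern stay unchanged, and when a string contains several patterns the earliest one in the lookup wins. The names lookup, rename and rename_unique are made up for this illustration.

```r
lookup <- c("ac", "gf", "er", "hy", "nh", "oi", "nu", "vf", "cd", "po")
rename <- c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii", "jjj")

# Hypothetical helper: rename the distinct values, then expand to full length.
rename_unique <- function(text, lookup, rename) {
  u <- unique(text)
  out <- u
  # Walk the patterns in reverse so that the earliest pattern in `lookup`
  # is assigned last and therefore wins, mirroring renamer().
  for (i in rev(seq_along(lookup))) {
    out[grepl(lookup[i], u, fixed = TRUE)] <- rename[i]
  }
  out[match(text, u)]  # map each original element to its renamed value
}

x <- c("xxacxx", "zzgfac", "qqqq", "xxacxx")
rename_unique(x, lookup, rename)
# [1] "aaa"  "aaa"  "qqqq" "aaa"
```

With a data.table this could be used as df[, Result := rename_unique(a, lookup, rename)]. I have not benchmarked it against the stringi version, so treat it as a correctness-oriented sketch rather than a faster alternative.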
