The gist of the problem is the following:
A function I wrote takes one argument, an alphanumeric string, and should output a string in which each character of the input is replaced by its value under a certain 'mapping'. MRE as follows:
library(data.table)
#the map between the original characters (0-9, A-Z) and the values they are switched to (0-35)
map <- data.table(mapped = 0:35, original = c(0:9, LETTERS))
#the function that I'm using:
as_numbers <- function(string) {
  #split the string into individual characters
  vector_unlisted <- unlist(strsplit(string, ""))
  #look up each character in the map, one element at a time
  for (i in seq_along(vector_unlisted)) {
    vector_unlisted[i] <- subset(map, map$original == vector_unlisted[i])[[1]][1]
  }
  #collapse the mapped values back into a single string
  vector_unlisted <- paste0(vector_unlisted, collapse = "")
  return(vector_unlisted)
}
I am trying to move away from the for loop towards something that performs better: the function works, but it is quite slow for the number of elements I am supplying in this form:
unlist(lapply(dat$alphanum, function(x) as_numbers(x)))
An example of the input strings would be "549300JV8KEETQJYUG13". This should result in a string like "5493001931820141429261934301613".
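For reference, each character maps independently: the digits 0-9 map to themselves and the letters A-Z map to 10-35, so the letters in the example above translate as follows:
#look up the example's letters in the map (digits are unchanged)
setNames(map$mapped, map$original)[c("J", "V", "K", "E", "T", "Q", "Y", "U", "G")]
# J  V  K  E  T  Q  Y  U  G
#19 31 20 14 29 26 34 30 16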
Supplying just one string in this case:
> as_numbers("549300JV8KEETQJYUG13")
[1] "5493001931820141429261934301613"
4 Answers
#1 (score: 6)
Using Reduce and gsub, you could define the following function:
replacer <- function(x) Reduce(function(x, r) gsub(map$original[r],
                                                   map$mapped[r], x, fixed = TRUE),
                               seq_len(nrow(map)), x)
# Let's test it
replacer("549300JV8KEETQJYUG13")
#[1] "5493001931820141429261934301613"
#2 (score: 18)
We can use base conversion:
#input and expected output
x <- "549300JV8KEETQJYUG13"
# "5493001931820141429261934301613"
#output
res <- paste0(strtoi(unlist(strsplit(x, "")), base = 36), collapse = "")
#test output
as_numbers(x) == res
# [1] TRUE
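This works because strtoi() with base = 36 treats the digits 0-9 as themselves and the letters A-Z (case-insensitively) as 10-35, which is exactly the mapping defined in map. For example:
strtoi(c("5", "J", "V", "8", "K"), base = 36)
#[1]  5 19 31  8 20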
Performance
Since this post is about performance, here is a benchmark* of the 3 solutions:
#input set up
map = data.table(mapped = c(0:35), original = c(0:9,LETTERS))
x <- rep(c("549300JV8KEETQJYUG13", "5493V8KE300J"), 1000)
#define functions
base_f <- function(string) {
  sapply(string, function(x) {
    paste0(strtoi(unlist(strsplit(x, "")), base = 36), collapse = "")
  })
}
match_f <- function(string) {
  mapped <- map$mapped
  original <- map$original
  sapply(strsplit(string, ""), function(y) {
    paste0(mapped[match(y, original)], collapse = "")
  })
}
reduce_f <- function(string) {
  Reduce(function(string, r)
           gsub(map$original[r],
                map$mapped[r], string, fixed = TRUE),
         seq_len(nrow(map)), string)
}
#test if all return same output
all(base_f(x) == match_f(x))
# [1] TRUE
all(base_f(x) == reduce_f(x))
# [1] TRUE
library(rbenchmark)
benchmark(replications = 1000,
          base_f(x),
          match_f(x),
          reduce_f(x))
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 base_f(x) 1000 22.15 4.683 22.12 0 NA NA
# 2 match_f(x) 1000 19.18 4.055 19.11 0 NA NA
# 3 reduce_f(x) 1000 4.73 1.000 4.72 0 NA NA
*Note: microbenchmark() kept throwing warnings, hence rbenchmark() was used instead. Feel free to test with other libraries and update this post.
#3 (score: 4)
Seems like a merge:
#here string is a single input string, e.g. "549300JV8KEETQJYUG13"
map[as.data.table(unlist(strsplit(string, ""))),
    .(mapped), on = c(original = "V1")][, paste0(mapped, collapse = "")]
Note that both "D1" and "1V" will be mapped to "131"...
请注意,“D1”和“1V”都将映射到“131”......
On your example output is: "5493001931820141429261934301613"
在您的示例输出是:“5493001931820141429261934301613”
You can use a separator such as "." (i.e. collapse = "." in the final paste) if you actually want this to be a reversible mapping...
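A minimal sketch of that reversible variant, assuming the separator is applied via collapse:
#keep a "." between mapped values so that e.g. "D1" -> "13.1" and "1V" -> "1.31"
map[as.data.table(unlist(strsplit(string, ""))),
    .(mapped), on = c(original = "V1")][, paste(mapped, collapse = ".")]
#[1] "5.4.9.3.0.0.19.31.8.20.14.14.29.26.19.34.30.16.1.3"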
#4 (score: 4)
I would use match:
as_numbers <- function(string) {
  lapply(strsplit(string, ""), function(y) {
    paste0(map$mapped[match(y, map$original)], collapse = "")
  })
}
as_numbers(c("549300JV8KEETQJYUG13", "5493V8KE300J"))
#[[1]]
#[1] "5493001931820141429261934301613"
#
#[[2]]
#[1] "5493318201430019"
Added an lapply call to handle length > 1 input correctly.
If you need a further speed-up, you can store map$mapped and map$original in separate vectors and use those in the match call instead of map$..., so that you do not subset the data.frame/data.table so many times (which is quite costly). A minimal sketch of that refactor is shown below.
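This is essentially what match_func in the benchmark further down does; the name as_numbers_fast is just for illustration:
as_numbers_fast <- function(string) {
  #pull the columns out of the data.table once instead of on every lookup
  mapped <- map$mapped
  original <- map$original
  lapply(strsplit(string, ""), function(y) {
    paste0(mapped[match(y, original)], collapse = "")
  })
}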
Since the Q was about performance, here's a benchmark of two of the solutions:
map = data.table(mapped = c(0:35), original = c(0:9,LETTERS))
x <- rep(c("549300JV8KEETQJYUG13", "5493V8KE300J"), 1000)
ascii_func <- function(string) {
  lapply(string, function(x) {
    #ASCII codes: letters A-Z are 65-90 (mapped to 10-35), digits 0-9 are 48-57 (mapped to 0-9)
    x_ascii <- strtoi(charToRaw(x), 16)
    paste(ifelse(x_ascii >= 65 & x_ascii <= 90,
                 x_ascii - 55, x_ascii - 48),
          collapse = "")
  })
}
match_func <- function(string) {
  mapped <- map$mapped
  original <- map$original
  lapply(strsplit(string, ""), function(y) {
    paste0(mapped[match(y, original)], collapse = "")
  })
}
library(microbenchmark)
microbenchmark(ascii_func(x), match_func(x), times = 25L)
#Unit: milliseconds
# expr min lq mean median uq max neval
# ascii_func(x) 83.47 92.55 96.91 96.82 103.06 112.07 25
# match_func(x) 24.30 24.74 26.86 26.11 28.67 31.55 25
identical(ascii_func(x), match_func(x))
#[1] TRUE