I am trying to find a simple way to use something like Perl's hash functions in R (essentially caching), as I intended to do both Perl-style hashing and write my own memoisation of calculations. However, others have beaten me to the punch and have packages for memoisation. The more I dig, the more I find, e.g. memoise and R.cache, but the differences aren't readily clear. In addition, it's not clear how else one can get Perl-style hashes (or Python-style dictionaries) and write one's own memoization, other than to use the hash package, which doesn't seem to underpin the two memoization packages.
Since I can find no information on CRAN or elsewhere to distinguish between the options, perhaps this should be a community wiki question on SO: What are the options for memoization and caching in R, and what are their differences?
As a basis for comparison, here is a list of the options I've found. Also, it seems to me that all depend on hashing, so I'll note the hashing options as well. Key/value storage is somewhat related, but opens a huge can of worms regarding DB systems (e.g. BerkeleyDB, Redis, MemcacheDB and scores of others).
It looks like the options are:
Hashing
- digest - provides hashing for arbitrary R objects.
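As a rough illustration of what hashing an arbitrary R object looks like, here is a minimal sketch that assumes the digest package is installed. The variable names are made up for the example; the point is only that equal values produce equal hashes, which is what makes a hash usable as a cache key.

```r
# Sketch only: runs the digest calls if the package is available.
if (requireNamespace("digest", quietly = TRUE)) {
  key1 <- digest::digest(list(2.3, 3.0))                 # default algorithm is MD5
  a <- 2.3
  b <- 3.0
  key2 <- digest::digest(list(a, b))                     # same values, so the same hash
  key3 <- digest::digest(list(2.3, 3.5), algo = "sha1")  # a different algorithm
  print(key1 == key2)
  print(c(md5 = nchar(key1), sha1 = nchar(key3)))        # length of the hex strings
}
```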
Memoization
- memoise - a very simple tool for memoization of functions.
- R.cache - offers more functionality for memoization, though it seems some of the functions lack examples.
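For comparison, this is roughly what the simplest use of memoise looks like. It is a sketch that assumes the memoise package is installed; slow_square is a made-up stand-in for an expensive function, and the cache here lives in memory for the current session only.

```r
# Sketch only: runs the memoise calls if the package is available.
if (requireNamespace("memoise", quietly = TRUE)) {
  slow_square <- function(x) {
    Sys.sleep(0.2)  # stand-in for an expensive calculation
    x^2
  }
  fast_square <- memoise::memoise(slow_square)

  t_first  <- system.time(r1 <- fast_square(4))[["elapsed"]]  # computed
  t_second <- system.time(r2 <- fast_square(4))[["elapsed"]]  # looked up in the cache
  print(c(first = t_first, second = t_second))

  print(memoise::is.memoised(fast_square))
  memoise::forget(fast_square)  # drop the cached results
}
```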
Caching
- hash - Provides caching functionality akin to Perl's hashes and Python dictionaries.
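A minimal sketch of Perl-style counting with the hash package, assuming it is installed from CRAN. The strings are invented for the example.

```r
# Sketch only: runs the hash calls if the package is available.
if (requireNamespace("hash", quietly = TRUE)) {
  library(hash)
  h <- hash()
  for (s in c("apple", "pear", "apple")) {
    # Increment the count if the key exists, otherwise start at 1
    h[[s]] <- if (has.key(s, h)) h[[s]] + 1L else 1L
  }
  print(keys(h))       # the distinct strings seen so far
  print(h[["apple"]])  # count for one key
}
```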
Key/value storage
These are basic options for external storage of R objects.
Checkpointing
- cacher - this seems to be more akin to checkpointing.
- CodeDepends - An OmegaHat project that underpins cacher and provides some useful functionality.
- DMTCP (not an R package) - appears to support checkpointing in a bunch of languages, and a developer recently sought assistance testing DMTCP checkpointing in R.
Other
- Base R supports: named vectors and lists, row and column names of data frames, and names of items in environments. It seems to me that using a list is a bit of a kludge. (There's also pairlist, but it is deprecated.)
- The data.table package supports rapid lookups of elements in a data table.
Use case
Although I'm mostly interested in knowing the options, I have two basic use cases that arise:
- Caching: Simple counting of strings. [Note: This isn't for NLP, but general use, so NLP libraries are overkill; tables are inadequate because I prefer not to wait until the entire set of strings is loaded into memory. Perl-style hashes are at the right level of utility.]
- Memoization of monstrous calculations.
These really arise because I'm digging into the profiling of some slooooow code and I'd really like to just count simple strings and see if I can speed up some calculations via memoization. Being able to hash the input values, even if I don't memoize, would let me see if memoization can help.
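To make the second use case concrete, here is a minimal base-R sketch of a hand-rolled memoizer that also counts cache hits and misses, so one can see how often the same inputs recur. The helper name memoize_with_stats and the deparse-based key are assumptions made for this illustration, not part of any package mentioned above.

```r
# Hypothetical helper: wrap f so results are cached in an environment,
# keyed by a string built from the arguments.
memoize_with_stats <- function(f) {
  cache <- new.env(hash = TRUE, parent = emptyenv())
  stats <- c(hits = 0L, misses = 0L)
  wrapped <- function(...) {
    # Turn the arguments into a string key. This is fine for small inputs;
    # for large objects a real hash (e.g. from digest) would be more suitable.
    key <- paste(deparse(list(...)), collapse = "")
    if (exists(key, envir = cache, inherits = FALSE)) {
      stats[["hits"]] <<- stats[["hits"]] + 1L
    } else {
      stats[["misses"]] <<- stats[["misses"]] + 1L
      assign(key, f(...), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
  # Accessor that reports the current hit/miss counts
  attr(wrapped, "stats") <- function() stats
  wrapped
}

# Usage: a stand-in for an expensive function, called with repeated inputs
slow_square <- function(x) x^2
msq <- memoize_with_stats(slow_square)
inputs <- c(2, 3, 2, 2, 3, 5)
results <- vapply(inputs, msq, numeric(1))
print(results)
print(attr(msq, "stats")())  # a high hit count suggests memoization would pay off
```

If the hit count stays near zero on real inputs, memoization is unlikely to help, whichever package is used.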
Note 1: The CRAN Task View on Reproducible Research lists a couple of the packages (cacher and R.cache), but there is no elaboration on usage options.
Note 2: To aid others looking for related code, here are a few notes on some of the authors or packages. Some of the authors use SO. :)
- Dirk Eddelbuettel: digest - a lot of other packages depend on this.
- Roger Peng: cacher, filehash, stashR - these address different problems in different ways; see Roger's site for more packages.
- Christopher Brown: hash - Seems to be a useful package, but the links to ODG are down, unfortunately.
- Henrik Bengtsson: R.cache & Hadley Wickham: memoise -- it's not yet clear when to prefer one package over the other.
Note 3: Some people use memoise/memoisation; others use memoize/memoization. Just a note if you're searching around. Henrik uses "z" and Hadley uses "s".
3 Solutions
#1
9
For simple counting of strings (and not using table or similar), a multiset data structure seems like a good fit. The environment object can be used to emulate this.
# Define the insert function for a multiset
msetInsert <- function(mset, s) {
  if (exists(s, mset, inherits = FALSE)) {
    mset[[s]] <- mset[[s]] + 1L
  } else {
    mset[[s]] <- 1L
  }
}

# First we generate a bunch of strings
n <- 1e5L   # Total number of strings
nus <- 1e3L # Number of unique strings
ustrs <- paste("Str", seq_len(nus))
set.seed(42)
strs <- sample(ustrs, n, replace = TRUE)

# Now we use an environment as our multiset
mset <- new.env(TRUE, emptyenv()) # Ensure hashing is enabled
# ...and insert the strings one by one...
for (s in strs) {
  msetInsert(mset, s)
}

# Now we should have nus unique strings in the multiset
identical(nus, length(mset))
# And the names should be correct
identical(sort(ustrs), sort(names(as.list(mset))))
# ...And an example of getting the count for a specific string
mset[["Str 3"]] # "Str 3" instance count (97 in the original run)
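Building on the environment-as-multiset idea, the following sketch adds two small helpers: one that returns 0 for strings never seen, and one that lists the most frequent strings. The names msetCount and msetTop are hypothetical, introduced only for this illustration.

```r
# Hypothetical helper: count for one string, 0 if it was never inserted
msetCount <- function(mset, s) {
  if (exists(s, envir = mset, inherits = FALSE)) mset[[s]] else 0L
}

# Hypothetical helper: the k most frequent strings as a named integer vector
msetTop <- function(mset, k = 5L) {
  counts <- unlist(as.list(mset, all.names = TRUE))
  if (is.null(counts)) return(integer(0))  # empty multiset
  head(sort(counts, decreasing = TRUE), k)
}

# Small demonstration with a fresh environment
word_counts <- new.env(hash = TRUE, parent = emptyenv())
for (s in c("a", "b", "a", "c", "a", "b")) {
  word_counts[[s]] <- msetCount(word_counts, s) + 1L
}
print(msetCount(word_counts, "a"))
print(msetCount(word_counts, "zzz"))
print(msetTop(word_counts, 2L))
```

Because the counts are updated one string at a time, this works on a stream of input without first loading every string into memory.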
#2
9
I did not have luck with memoise because it gave a "too deep recursion" problem with some functions of a package I tried it with. With R.cache I had better luck. The following is more annotated code I adapted from the R.cache documentation. The code shows different options for caching.
# Workaround to avoid the interactive question when loading the R.cache library
dir.create(path = "~/.Rcache", showWarnings = FALSE)
library("R.cache")
setCacheRootPath(path = "./.Rcache") # Create .Rcache in the current working dir
# The cache path, in case we need it (not used in this example)
cache.root <- getCacheRootPath()

simulate <- function(mean, sd) {
  # 1. Try to load cached data, if already generated
  key <- list(mean, sd)
  data <- loadCache(key)
  if (!is.null(data)) {
    cat("Loaded cached data\n")
    return(data)
  }
  # 2. If not available, generate it.
  cat("Generating data from scratch...")
  data <- rnorm(1000, mean = mean, sd = sd)
  Sys.sleep(1) # Emulate slow algorithm
  cat("ok\n")
  saveCache(data, key = key, comment = "simulate()")
  data
}

data <- simulate(2.3, 3.0)
data <- simulate(2.3, 3.5)
a <- 2.3
b <- 3.0
data <- simulate(a, b) # Will load cached data; params are compared by value
# Clean up
file.remove(findCache(key = list(2.3, 3.0)))
file.remove(findCache(key = list(2.3, 3.5)))
simulate2 <- function(mean, sd) {
  data <- rnorm(1000, mean = mean, sd = sd)
  Sys.sleep(1) # Emulate slow algorithm
  cat("Done generating data from scratch\n")
  data
}

# Easy way to memoize a function.
# This would work with any function from external packages.
mzs <- addMemoization(simulate2)
data <- mzs(2.3, 3.0)
data <- mzs(2.3, 3.5)
data <- mzs(2.3, 3.0) # Will load cached data

# Also possible to reassign the function name,
# but different memoizations of the same
# function will return the same cached result
# if the input params are the same
simulate2 <- addMemoization(simulate2)
data <- simulate2(2.3, 3.0)
# If the expression being evaluated depends on
# "input" objects, then these must be specified
# explicitly as "key" objects.
for (ii in 1:2) {
  for (kk in 1:3) {
    cat(sprintf("Iteration #%d:\n", kk))
    res <- evalWithMemoization({
      cat("Evaluating expression...")
      a <- kk
      Sys.sleep(1)
      cat("done\n")
      a
    }, key = list(kk = kk))
    # The expression assigned to 'res' is skipped on the repeated run
    print(res)
    # Sanity checks
    stopifnot(a == kk)
    # Clean up
    rm(a)
  } # for (kk ...)
} # for (ii ...)
#3
1
Related to @biocyperman's solution: R.cache provides a wrapper, evalWithMemoization, that takes care of loading, evaluating, and saving the cache for you. You can simplify the code like this:
simulate <- function(mean, sd) {
  key <- list(mean, sd)
  data <- evalWithMemoization(key = key, expr = {
    cat("Generating data from scratch...")
    data <- rnorm(1000, mean = mean, sd = sd)
    Sys.sleep(1) # Emulate slow algorithm
    cat("ok\n")
    data
  })
  data # Return the result visibly
}