Ensuring reproducibility in an R environment

Time: 2021-04-03 23:35:34

I work in a computational biology lab, where we have several folks working on multiple projects, mostly in R (which is what I care about for this post). In the past, people would simply develop their code for each project, which may or may not involve boilerplate code copied over from previous projects. One thing that I've pushed over the years was to bring some centralized structure to this mess and have people identify common patterns such that we can turn these repeated/common blocks of code into packages for all of the many reasons one might think that is a good thing to do. So now our folks are using a mix of centralized packages/routines within their project specific scripts.

There's one gotcha here. We have a mandate from the powers that be that every script for every project needs to be 100% reproducible over time to the best of our ability (and this includes 100% of all code we have direct access to, including our packages). That is, if I call function foo in package bar with parameter A to get result X today, 4 years from now I should get the exact same result. (Erroneous output due to bugs is excepted here.)

The topic of reproducibility has come up now and then in R within various circles, but typically it seems to be discussed in terms of reproducibility of process (e.g. vignettes). This is not the same thing - I can run a vignette today and then run the same code 6 months from now using updated packages and receive wildly different results.

The solution that's been agreed upon (which I'm not a fan of) is that if a function or package needs a non-backwards-compatible change, it simply gets a new name. Thus, if we needed to radically change function foo(), it'd be called foo2(), and if that needs a radical change it gets called foo3(). This ensures that any script that called foo() will always get the original result, while allowing things to march forward within the package repository. It works, but I really dislike this - it seems aesthetically extremely cluttered, and I worry that it will lead to mass confusion over time, with packages bar, bar2, bar3, bar4 ... and functions foo1, foo2, foo3, etc.

The problem is that I haven't come up with an alternate solution that's really better. One possibility would be to note version numbers of packages, R, etc and make sure those are loaded, but that has multiple problems - not the least of which is that it relies on proper package versioning discipline and that's prone to error. Also, this alternative was already rejected ;) Ideally what we'd have is some sort of notion of devel & release as most of these changes tend to happen earlier on and then level off with changes happening much less frequently. OTOH what devel really means here is "not actually in a package yet" (which we do), but it can be hard to determine exactly at what point is the right one to transport stuff over. Invariably the moment you think you're safe, that's when you realize you're not.
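
For illustration, the rejected "note the version numbers and make sure those are loaded" idea could look roughly like this. It is only a sketch: require_versions() and the recorded list are hypothetical names, and the two packages are used merely because they ship with every R installation.

```r
# Hypothetical helper: stop early if the running environment differs from
# the versions recorded when the analysis was first run.
require_versions <- function(required) {
  for (pkg in names(required)) {
    installed <- as.character(utils::packageVersion(pkg))
    if (installed != required[[pkg]]) {
      stop(sprintf("%s: need version %s, found %s",
                   pkg, required[[pkg]], installed))
    }
  }
  invisible(TRUE)
}

# Versions recorded "today" (here simply read from the running session).
recorded <- list(
  stats = as.character(utils::packageVersion("stats")),
  utils = as.character(utils::packageVersion("utils"))
)

# A script would call this before doing any real work.
require_versions(recorded)
```

As the question notes, a check like this is only as good as the versioning discipline behind it: two different builds carrying the same version number would pass unnoticed.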

So with all this in mind, I'm curious if anyone else out there has dealt with similar situations, and how they might have resolved things.

edit: just to be clear, by non-backwards compatible, I'm not just talking about APIs and such, but also outputs for a given set of inputs.

9 answers

#1


20  

This is indeed an important thing to think about and I think ultimately requires the institutionalization of a couple of different processes.

  1. Version Control (svn, git, bzr, cvs, etc)
  2. Unit Tests

My first reaction is that you need to institutionalize some sort of code management system. This will make it easier, because the old version of foo() is still available, if you really want it. From what you have said, it sounds like you need to package up your common functions and institute some sort of a release schedule. Scripts which require backward compatibility must include the package name and release information. This way it is possible to ALWAYS obtain foo() exactly as it was when the script was written. You should also make sure people only use official release versions in their work, because otherwise this could become quite a pain.

I agree, having a collection of foo:foo99 is doomed to failure. But at least it will be a gloriously confusing failure. Aesthetics aside, it will drive you all bonkers. If foo2() is an improvement (more accurate, faster, etc) of foo(), then it should be called foo() and released for use according to your company-wide release schedule. If it does something different, it is no longer foo(). It might be fooo() or superFoo() or fooMe(), but it ain't foo().

Finally, you need to start testing your functions. (Unit Tests) For each function that is published and made available for others, you should have a clearly defined test suite. Unless someone fixes a bug in foo(), the results should stay the same. If someone fixes a bug, then the results should be more accurate and will probably be more desirable in most cases. If you do need to reproduce the old, incorrect results, you can dig out an old version of foo() from your version control system. By instituting rigorous unit tests, you will know if/when the results of foo() have changed. This knowledge should help minimize the number of foo() functions you need. Rather than create a version every time someone tweaks something, you can test the new version to see whether or not the results conform to expectations. But this is tricky, because you have to make sure that your tests cover anything the function is ever likely to see, including bizarre edge cases. In a research setting, I would imagine that could become a challenge.
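
A minimal sketch of such a regression test, using only base R so that it does not depend on any testing package. The stand-in foo(), the reference list and check_foo() are hypothetical; in practice the expected values would be captured once at release time and stored with the package.

```r
# Stand-in for a published package function.
foo <- function(x, y) x + y

# Reference results captured at release time (hypothetical values).
reference <- list(
  list(args = list(x = 10, y = 20), expected = 30),
  list(args = list(x = -1, y = 1),  expected = 0)
)

# Re-run every stored call and fail loudly if any result has drifted.
check_foo <- function() {
  for (case in reference) {
    actual <- do.call(foo, case$args)
    stopifnot(isTRUE(all.equal(actual, case$expected)))
  }
  invisible(TRUE)
}

check_foo()
```

If someone later changes foo() so that it returns x * y, check_foo() stops with an error, which is the signal that old scripts would no longer reproduce.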

#2


8  

I'm not sure about integrating it with R, but Sumatra might be worth looking into. It appears to allow you to keep track of code and results. So if you need to go back and re-run that simulation from 4 years ago, the code should be there.

#3


5  

Well, ask yourself how you would do that in any other language. There's really nothing more to it than good bookkeeping I'm afraid:

  • record version numbers of all software involved
  • put the code in manageable chunks, say in packages
  • make sure you have all software/packages involved still available in 5 years

R can easily be made portable, including all installed packages. Keep a portable version of R together with the used packages, the code and the data on a CD-ROM for each analysis, and you're sure you can reproduce whenever you want. OK, you miss the OS, but can't have them all. In any case, if the OS makes a difference important enough to call the analysis not reproducible, the problem is very likely your analysis. You don't want to tell anybody your result is dependent on the version of Windows you use, do you?
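
The bookkeeping step ("record version numbers of all software involved") can be automated. The following is a sketch rather than a prescribed workflow: record_environment() is a hypothetical helper, and tempdir() stands in for wherever the analysis results are archived.

```r
# Write the R version and the full session information (attached packages
# with their versions, platform, locale) to a text file.
record_environment <- function(path) {
  info <- utils::capture.output(print(utils::sessionInfo()))
  writeLines(c(R.version.string, info), path)
  invisible(path)
}

# Placeholder location; a real analysis would write next to its results.
log_file <- file.path(tempdir(), "session-info.txt")
record_environment(log_file)
```

Archiving this file with the code and data gives a later reader a concrete list of what has to be reinstalled.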

PS: please get it into people's heads that they should never ever in their life copy-paste code. They should wrap it in functions and use those. A whole lot easier and far less error-prone. I mean, what's the difference between copying

x <- read.table("sometable")
y <- colSums(x)/4.3

and adjusting the values, or typing

myfun <- function(i, j){
  x <- read.table(i)
  colSums(x)/j  # return the scaled column sums
}

Saves you and a lot of other people a whole lot of copy-paste trouble. (How so, object not found? What object?)

#4


5  

Whenever you want to freeze your code in a way that needs to be reproducible "forever", e.g., when your paper has been published, the safest way to do this is to create a virtual machine containing all your code and data and the software needed to run it (including the operating system). There's an example here on the University of Washington site.

#5


3  

This is exactly the kind of thinking that causes Microsoft to maintain bug compatibility in Excel. Rather than attempting to conform to such a request you should be doing your best to show that it's not a good idea.

This thinking means that all errors remain errors in order to maintain consistency. It's thinking transferred from corporate bureaucracy and has no business in a science lab.

The only way to do this is to save the copy of all your packages and version of R with your code. There's no central corporation beholden to bug compatibility that's going to take care of that for you.
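
One way to keep a copy of the packages with the code is a project-specific library directory. This is only a sketch under assumptions: the directory layout is hypothetical, tempdir() is a placeholder for the project folder, and the install call is shown as a comment because the package name is made up.

```r
# Hypothetical project layout: a library directory kept beside the scripts.
project_lib <- file.path(tempdir(), "my-analysis", "library")
dir.create(project_lib, recursive = TRUE, showWarnings = FALSE)

# Put the project library first so its packages take precedence.
.libPaths(c(project_lib, .libPaths()))

# Packages installed from here on can be directed to the project library:
# install.packages("somePackage", lib = project_lib)   # not run

.libPaths()[1]
```

Archiving that directory together with the matching R installation is what preserves the exact package builds; the sketch alone does not protect against changes in the operating system.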

#6


3  

What if a change in result is due to a change in your operating system? Perhaps Microsoft fix a bug in Windows XP for Windows 7 and then when you upgrade - all your outputs are different.

If you want to handle this then I think the best way of working is to keep snapshots of virtual machines when you close out an analysis, and store the VM images for later use. Of course in five years time you won't have a license to run Windows XP so that's another problem - one solved by using an open-source operating system, such as Linux.

#7


2  

I would go with Docker images.
This is a pretty convenient way to reproduce the OS and all dependencies.
You build an image and can later deploy it to Docker at any time, fully configured.
Multiple R Docker images are available, so you can easily build your image on top of them.
Once the image is built, you can use it to deploy to a test environment and later to production.

#8


1  

This may be a late answer, but I have found it useful to create a generic wrapper like the following, especially when iterating quickly in my development of a new function:

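myFunction <- function(..., version = "latest"){
  if((version == "latest") || (version == 6)){
    return(myFunction06(...))
  }
  # ... (branches for the intermediate versions are elided here)
  if(version == 1){
    return(myFunction01(...))
  }
  # fail loudly instead of silently returning NULL
  stop("Unsupported version: ", version)
}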

Then, code should simply state which version it wants. Once the actual function stabilizes, I remove support for the older versions of the function, and a quick search through my code lets me find any offending calls. Use of "latest" means I can assure that the caller and the function match some fairly fixed definitions.

Naturally, all code is maintained in a version control system, so even when I remove the earlier code, it is only from the currently available source. I can reproduce any behavior from any point in time, including errors, as long as the data from that point in time is obtainable.

#9


1  

A solution might be to use S4 methods and let R's internal dispatcher do the work for you (see example below). That way, you're somewhat "bulletproof" with respect to being able to systematically update your code without running the risk of breaking something.

Key benefits

The key thing here is that S4 methods support multiple dispatch.

That way your function will always be foo (as opposed to having to keep track of foo1, foo2 etc.) while new functionality can be easily implemented (by adding respective methods) without touching "old" methods (that other people/packages might rely on).

Key functions you'll need:

  • setGeneric
  • setMethod
  • setRefClass (S4 Reference Classes; personal recommendation) or setClass (S4 Class; I wouldn't use them for the reason described in the "Additional remarks" at the very end)

The "downsides"

  • You need to switch from a S3 to a S4 logic

  • This implies that you need to write a bit more code than you might be used to (generic method definitions, method definitions and possibly your own class definitions; see example below). But this buys you and your code much more structure and makes it more robust.

  • It might also imply that you'll eventually dig deeper and deeper into the world of Object-Oriented Programming or Object-Oriented Design. While I personally consider this to be a good thing (my personal rule of thumb: the more complex/distributed your application, the better off you are using OOP), some would consider these approaches to be atypical for R (I strongly disagree, as R does have superb OO features that are maintained by the Core Team) or "unsuited" for R (this might be true depending on how much you rely on "non-OOP" packages/code). If you're willing to go that way, you might want to familiarize yourself with the SOLID principles of Object-Oriented Design. You also might want to check out the following books: Clean Coder and The Pragmatic Programmer.

  • If computational efficiency (e.g. when estimating statistical models) is really critical, using S4 methods and S4 Reference Classes might slow you down a bit. After all, there's more code involved compared to S3. But I'd recommend testing the impact of this from case to case via system.time() and/or microbenchmark::microbenchmark() instead of picking "ideological" sides (S3 vs. S4).
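
A rough way to run the case-by-case timing check with system.time() is sketched here. The function names foo_s3 and foo_s4 and the repetition count n are arbitrary choices for illustration, and the timings will differ between machines and R versions, so no expected numbers are shown.

```r
library(methods)

# S3 version of a trivial function.
foo_s3 <- function(x, y) UseMethod("foo_s3")
foo_s3.numeric <- function(x, y) x + y

# S4 version of the same function.
setGeneric("foo_s4", function(x, y) standardGeneric("foo_s4"))
setMethod("foo_s4", signature("numeric", "numeric"), function(x, y) x + y)

# Time many repeated calls so that dispatch overhead becomes visible.
n <- 1e5
t_s3 <- system.time(for (i in seq_len(n)) foo_s3(1, 2))
t_s4 <- system.time(for (i in seq_len(n)) foo_s4(1, 2))

# User, system and elapsed seconds side by side.
rbind(S3 = t_s3[1:3], S4 = t_s4[1:3])
```

For functions that do real work, the dispatch overhead measured this way is often small compared with the computation itself, which is why measuring beats guessing.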


Example

Initial function

Let's suppose you're in department A and someone in your team started out by creating a function called foo():

foo <- function(x, y) {
    x + y
}
foo(x=10, y=20)

First change request

You would like to be able to extend it without breaking "old" code that relies on foo().

Now, I think we all agree that this can be quite hard to do.

You either need to explicitly modify the source code of foo() (each time running the risk of breaking something that used to work; this violates the "O" in SOLID, the Open/Closed Principle) or you need to come up with alternative names such as foo1, foo2, etc. (which makes it really hard to keep track of which function is doing what).

foo <- function(x, y, type=c("old", "new")) {
    type <- match.arg(type, choices=c("old", "new")) 
    if (type == "old") {
        x + y
    } else if (type == "new") {
        x * y    
    }
}
foo(x=10, y=20)
[1] 30
foo(x=10, y=20, type="new")
[1] 200

foo1 <- function(x, y) {
    x * y
}
foo1(x=10, y=20)
[1] 200

Let's see how S4 methods and multiple dispatch can really help us out here.

Generic method

You need to start out by turning foo() into a generic method.

setGeneric(
    name="foo",
    signature=c("x", "y", ".ctx", ".ns"),
    def=function(x, y, ..., .ctx, .ns) {
        standardGeneric("foo")
    }
)

In simplified words: a generic method itself doesn't do anything yet. It's simply a precondition for being able to specify "actual" methods for its signature arguments that do something useful.

Signature arguments

The degree of flexibility with respect to the original problem is directly linked to the number of signature arguments that you declare (signature=c("x", "y", ".ctx", ".ns")): the more signature arguments, the more flexibility you have, but the more complex your code might get as well (with respect to how much code you have to write).

Again, in simplified words: signature arguments (and their classes) are used by the method dispatcher to retrieve the correct method that does the actual work.

Think of the method dispatcher as being like the clerk in a ski rental business: you present him with an arbitrarily large set of signature information (i.e. information that "clearly distinguishes you from others": your age, height, shoe size and skill level) and he uses that information to provide you with the right equipment to hit the slopes. Think of R's method dispatcher as being the clerk that has access to the storage room of the ski rental. But instead of ski equipment it will return methods.
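
To make the analogy concrete, here is a small self-contained sketch that is separate from the foo() example. The generic rent() and its arguments are hypothetical; existsMethod() from the methods package asks whether a method is registered for an exact signature, without calling the generic.

```r
library(methods)

# Hypothetical generic: the "clerk" that hands out equipment.
setGeneric("rent", function(height, skill) standardGeneric("rent"))

# Two registered "signature scenarios".
setMethod("rent", signature("numeric", "character"),
          function(height, skill) paste("skis for a", skill, "skier"))
setMethod("rent", signature("numeric", "missing"),
          function(height, skill) "default skis")

# Ask what is on file for an exact signature.
existsMethod("rent", signature("numeric", "character"))    # TRUE
existsMethod("rent", signature("character", "character"))  # FALSE

rent(180, "beginner")  # dispatches on (numeric, character)
rent(180)              # dispatches on (numeric, missing)
```

Calling rent() with a combination that has no registered method, such as two character arguments, fails with an "unable to find an inherited method" error rather than silently picking something.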

Notice that we said that our "old" arguments x and y are from now on supposed to be signature arguments while there are also two new arguments: .ctx and .ns. I'll get to these in a minute. It's those arguments that will provide us with the flexibility that we're after.

Initial method definition

We now define a "variant" (a method) of the generic method for the following "signature scenario":

  1. x is numeric
  2. y is numeric
  3. .ctx will simply not be provided when calling the method and is thus missing
  4. .ns will simply not be provided when calling the method and is thus missing

Think of it as registering your signature information with explicit equipment of the ski rental. Once you did that and ask for your equipment, the only thing the clerk has to do is to go to the storage room and look up which equipment is linked to your personal information.

setMethod(
    f="foo", 
    signature=signature(x="numeric", y="numeric", .ctx="missing", .ns="missing"), 
    definition=function(x, y, ..., .ctx, .ns) {
        x + y
    }
)

When we call foo with this "signature scenario" (asking for the method that we registered for this scenario), the method dispatcher knows exactly which actual method it needs to get out of the storage room:

foo(x=10, y=20)
[1] 30

First update

Now someone from department B comes along, looks at foo(), likes it but decides that foo() needs to be updated (x * y instead of x + y) if it is to be used in his department.

That's when .ctx (short for context) comes into play: it's an argument by which we are able to distinguish application contexts.

Defining a class that represents the new application context

setRefClass("ApplicationContextDepartmentB")

When calling foo(), we'll provide it with an instance of this class (.ctx=new("ApplicationContextDepartmentB"))

Defining a new method for the new application context

Notice how we register signature argument .ctx to our new class ApplicationContextDepartmentB:

setMethod(
    f="foo", 
    signature=signature(x="numeric", y="numeric", 
        .ctx="ApplicationContextDepartmentB", .ns="missing"), 
    definition=function(x, y, ..., .ctx, .ns) {
        out <- x * y
        attributes(out)$description <- "I'm different from the original foo()"
        return(out)
    }
)

That way, the method dispatcher knows exactly that it should return the "new" method instead of the "old" method when we call foo() like this:

foo(x=1, y=10, .ctx=new("ApplicationContextDepartmentB"))
[1] 10
attr(,"description")
[1] "I'm different from the original foo()"

The "old" method is not affected at all:

foo(x=1, y=10)
[1] 11

Second update

Suppose that someone from department C comes along and suggests yet another "configuration" or version for foo(). You can easily provide that without breaking anything that you've realized for departments A and B so far by following the same routine as for department B.

But we'll even take it one step further here: we'll define two additional classes that let us distinguish different "namespaces" (that's where .ns comes into play).

Think of namespaces as a way of distinguishing different runtime scenarios for a specific method for a specific application context (i.e. "testing" and "productive mode").

Defining the classes

setRefClass("ApplicationContextDepartmentC")
setRefClass("TestNamespace")
setRefClass("ProductionNamespace")

Defining a new method for the new application context and a "test" scenario

Notice how we register signature arguments .ctx to our new class ApplicationContextDepartmentC and .ns to our new class TestNamespace:

setMethod(
    f="foo", 
    signature=signature(x="character", y="numeric", 
        .ctx="ApplicationContextDepartmentC", .ns="TestNamespace"), 
    definition=function(x, y, ..., .ctx, .ns) {
        data.frame(x, y, test.ok=rep(TRUE, length(x)))
    }
)

Again, the method dispatcher will look up the correct method when calling foo() like this:

foo(x=letters[1:5], y=11:15, .ctx=new("ApplicationContextDepartmentC"), 
    .ns=new("TestNamespace"))
  x  y test.ok
1 a 11    TRUE
2 b 12    TRUE
3 c 13    TRUE
4 d 14    TRUE
5 e 15    TRUE

Defining a new method for the new application context and a "productive" scenario

setMethod(
    f="foo", 
    signature=signature(x="character", y="numeric", 
        .ctx="ApplicationContextDepartmentC", .ns="ProductionNamespace"), 
    definition=function(x, y, ..., .ctx, .ns) {
        data.frame(x, y)
    }
)

We tell the method dispatcher that we now want the method registered for this scenario or namespace like this:

foo(x=letters[1:5], y=11:15, .ctx=new("ApplicationContextDepartmentC"), 
    .ns=new("ProductionNamespace"))

  x  y
1 a 11
2 b 12
3 c 13
4 d 14
5 e 15

Notice that you're free to use the classes TestNamespace and ProductionNamespace anywhere you'd like. These classes are not bound to ApplicationContextDepartmentC in any way, so you can, for example, also use them for all your other application scenarios.

Additional remarks for method definitions

Something that's often quite useful is to start out with a method that accepts ANY class for its signature arguments and to define more restrictive methods as your software evolves:

setMethod(
    f="foo", 
    signature=signature(x="ANY", y="ANY", .ctx="missing", .ns="missing"), 
    definition=function(x, y, ..., .ctx, .ns) {
        message("Value of x:")
        print(x)
        message("Value of y:")
        print(y)
    }
)
foo(x="Hello World!", y=rep(TRUE, 3))
Value of x:
[1] "Hello World!"
Value of y:
[1] TRUE TRUE TRUE

Additional remarks for class definitions

I prefer S4 Reference Classes over S4 Classes because of the self-referencing capabilities of S4 Reference Classes:

setRefClass(
    Class="A", 
    fields=list(
        x1="numeric",
        x2="logical"
    ),
    methods=list(
        getX1=function() {
            .self$x1
        },
        getX2=function() {
            .self$x2
        },
        setX1=function(x) {
            .self$x1 <- x
        },
        setX2=function(x) {
            .self$field("x2", x)
        },
        addX1AndX2=function() {
            .self$getX1() + .self$getX2()
        }
    )
)
x <- new("A", x1=10, x2=TRUE)
x$getX1()
[1] 10
x$getX2()
[1] TRUE
x$addX1AndX2()
[1] 11

S4 Classes don't have that feature.

Subsequent modifications of field values:

x$setX1(100)
x$addX1AndX2()
[1] 101
x$x1 <- 1000
x$addX1AndX2()
[1] 1001
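
The in-place updates shown above rely on reference semantics. As a hedged side-by-side sketch (the class names PlainS4 and RefS4 and both helper functions are made up for this illustration), a plain S4 object is copied when a function modifies it, whereas a Reference Class object is changed for the caller as well:

```r
library(methods)

# Plain S4 class: objects are copied when modified inside a function.
setClass("PlainS4", representation(x1 = "numeric"))
bump_s4 <- function(obj) {
  obj@x1 <- obj@x1 + 1   # changes a local copy only
  obj
}

# Reference class: objects are modified in place.
RefS4 <- setRefClass("RefS4", fields = list(x1 = "numeric"))
bump_ref <- function(obj) {
  obj$x1 <- obj$x1 + 1   # changes the caller's object
  invisible(obj)
}

a <- new("PlainS4", x1 = 10)
bump_s4(a)
a@x1   # still 10; use a <- bump_s4(a) to keep the change

b <- RefS4$new(x1 = 10)
bump_ref(b)
b$x1   # now 11
```

In a reproducibility setting this cuts both ways: in-place modification is convenient, but copy semantics make it easier to reason about which values a function call could have changed.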

Additional remarks for documenting methods and classes

I strongly recommend using packages roxygen2 and devtools to document your methods and classes. You possibly might also want to look into package roxygen3.

Documenting generic methods with roxygen2:

#' Foo
#'
#' This method takes \code{x} and \code{y} and adds them.
#' 
#' Some details here
#' 
#' @param x \strong{Signature argument}.
#' @param y \strong{Signature argument}.
#' @param ... Further arguments to be passed to subsequent functions.
#' @param .ctx \strong{Signature argument}.
#'      Application context.
#' @param .ns \strong{Signature argument}.
#'      Application namespace. Usually used to distinguish different context 
#'      versions or configurations.
#' @author Janko Thyson \email{john.doe@@something.com}
#' @references \url{http://www.something.com/}
#' @example inst/examples/foo.R
#' @docType methods
#' @rdname foo-methods
#' @export

setGeneric(
    name="foo",
    signature=c("x", "y", ".ctx", ".ns"),
    def=function(x, y, ..., .ctx, .ns) {
        standardGeneric("foo")
    }
)

Documenting methods with roxygen2:

#' @param x \code{\link{character}}. Character vector.
#' @param y \code{\link{numeric}}. Numerical vector.  
#' @param .ctx \code{\link{ApplicationContextDepartmentC}}. 
#' @param .ns \code{\link{ProductionNamespace}}.  
#' @return \code{\link{data.frame}}. Some data frame.
#' @rdname foo-methods
#' @aliases foo,character,numeric,ApplicationContextDepartmentC,ProductionNamespace-method
#' @export

setMethod(
    f="foo", 
    signature=signature(x="character", y="numeric", 
        .ctx="ApplicationContextDepartmentC", .ns="ProductionNamespace"), 
    definition=function(x, y, ..., .ctx, .ns) {
        data.frame(x, y)
    }
)

#1


20  

This is indeed an important thing to think about and I think ultimately requires the institutionalization of a couple of different processes.

这确实是一个需要考虑的重要问题,我认为最终需要一些不同过程的制度化。

  1. Version Control (svn, git, bzr, cvs, etc)
  2. 版本控制(svn、git、bzr、cvs等)
  3. Unit Tests
  4. 单元测试

My first reaction is that you need to institutionalize some sort of code management system. This will make it easier, because the old version of foo() is still available, if you really want it. From what you have said, it sounds like you need to package up your common functions and institute some sort of a release schedule. Scripts which require backward compatibility must include the package name and release information. This way it is possible to ALWAYS obtain foo() exactly as it was when the script was written. You should also make sure people only use official release versions in their work, because otherwise this could become quite a pain.

我的第一反应是您需要将某种代码管理系统制度化。这将使它更容易,因为如果您确实需要,旧版本的foo()仍然可用。根据您所说的,您似乎需要打包常见的功能并制定某种发布计划。需要向后兼容的脚本必须包含包名和发布信息。通过这种方式,可以始终获得与编写脚本时完全相同的foo()。您还应该确保人们在他们的工作中只使用官方版本,否则这将是非常痛苦的。

I agree, having a collection of foo:foo99 is doomed to failure. But at least it will be a gloriously confusing failure. Aesthetics aside, it will drive you all bonkers. If foo2() is an improvement (more accurate, faster, etc) of foo(), then it should be called foo() and released for use according to your company-wide release schedule. If it does something different, it is no longer foo(). It might be fooo() or superFoo() or fooMe(), but it ain't foo().

我同意,收藏foo:foo99是注定要失败的。但至少这将是一个令人眼花缭乱的失败。撇开美学不谈,它会让你发疯的。如果foo2()是foo()的改进(更准确、更快等),那么它应该被命名为foo()并根据公司范围内的发布计划发布。如果它做了一些不同的事情,它就不再是foo()了。可能是fooo()或superFoo()或fooMe(),但不是foo()。

Finally, you need to start testing your functions. (Unit Tests) For each function that is published and made available for others, you should have a clearly defined test suite. Unless someone fixes a bug in foo(), the results should stay the same. If someone fixes a bug, then the results should be more accurate and will probably more desirable in most cases. If you do need to reproduce the old, incorrect, results, you can dig out an old version of foo() from your version control system. By instituting rigorous unit tests, you will know if/when the results of foo have changed. This knowledge should help minimize the number of foo() functions you need. Rather than create a version every time someone tweaks something, you can test the new version to see whether or not the results conform to expectations. But, this is tricky, because you have to make sure that your tests cover anything the function is ever likely to see, including bizarre edge cases. In a research setting, I would imagine that could become a challenge.

最后,需要开始测试函数。(单元测试)对于每个已经发布并提供给其他人的函数,您应该有一个明确定义的测试套件。除非有人修复了foo()中的错误,否则结果应该保持不变。如果有人修复了错误,那么结果应该更准确,在大多数情况下可能更可取。如果您确实需要复制旧的、不正确的结果,您可以从您的版本控制系统中挖掘出一个旧版本的foo()。通过建立严格的单元测试,您将知道foo的结果是否/何时发生了变化。这些知识应该有助于减少foo()函数的数量。你可以测试新版本,看看结果是否符合预期,而不是每次都创建一个版本。但是,这很棘手,因为您必须确保您的测试覆盖了函数可能看到的任何内容,包括奇怪的边缘情况。在研究环境中,我认为这可能会成为一个挑战。

#2


8  

I'm not sure about integrating it with R, but Sumatra might be worth looking into. It appears to allow you to keep track of code and results. So if you need to go back and re-run that simulation from 4 years ago, the code should be there.

#3


5  

Well, ask yourself how you would do that in any other language. There's really nothing more to it than good bookkeeping I'm afraid:

  • record version numbers of all software involved.
  • put the code in manageable chunks, say in packages.
  • make sure all the software/packages involved are still available in 5 years.
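
As a rough sketch of the first point, the R version and the versions of all loaded packages can be dumped to a text file that is archived with the analysis. The helper name record_session() and the file name session-info.txt are my own choices for illustration, not an established convention.

```r
# Write the R version and the versions of all attached/loaded packages
# to a plain-text file that is stored next to the analysis.
record_session <- function(path) {
    info <- c(
        R.version.string,
        paste("Recorded:", format(Sys.time(), "%Y-%m-%d %H:%M:%S")),
        capture.output(print(sessionInfo()))
    )
    writeLines(info, path)
    invisible(path)
}

# "session-info.txt" is a hypothetical file name; tempdir() is used here
# only so that the sketch does not write into the working directory.
out_file <- file.path(tempdir(), "session-info.txt")
record_session(out_file)

# Individual package versions can also be queried directly.
as.character(packageVersion("stats"))
```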

R can easily be made portable, including all installed packages. Keep a portable version of R together with the packages used, the code and the data on a CD-ROM for each analysis, and you can be sure you can reproduce it whenever you want. OK, you miss the OS, but you can't have it all. In any case, if the OS makes a difference important enough to call the analysis not reproducible, the problem is very likely your analysis. You don't want to tell anybody your result is dependent on the version of Windows you use, do you?

PS: please get it into people's heads that they should never ever in their life copy-paste code. They should wrap it in functions and use those. A whole lot easier and far less error-prone. I mean, what's the difference between copying

x <- read.table("sometable")
y <- colSums(x)/4.3

and adjusting the values, or typing

myfun <- function(i, j) {
  x <- read.table(i)
  # return the scaled column sums instead of assigning them to an unused variable
  colSums(x) / j
}

Saves you and a lot of other people a whole lot of copy-paste trouble. (How so, object not found? What object?)

#4


5  

Whenever you want to freeze your code in a way that needs to be reproducible "forever", e.g., when your paper has been published, the safest way to do this is to create a virtual machine containing all your code and data and the software needed to run it (including the operating system). There's an example here on the University of Washington site.

#5


3  

This is exactly the kind of thinking that causes Microsoft to maintain bug compatibility in Excel. Rather than attempting to conform to such a request you should be doing your best to show that it's not a good idea.

This thinking means that all errors remain errors in order to maintain consistency. It's thinking transferred from corporate bureaucracy and has no business in a science lab.

The only way to do this is to save a copy of all your packages and your version of R with your code. There's no central corporation beholden to bug compatibility that's going to take care of that for you.
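
One possible way to keep the packages next to the code is a project-local library. The sketch below is only an illustration under assumed names (the my-analysis directory and its r-lib subfolder are hypothetical), and it does not freeze R itself.

```r
# Hypothetical project layout: packages live in "<project>/r-lib".
# tempdir() is used here only to keep the sketch self-contained.
project_dir <- file.path(tempdir(), "my-analysis")
lib_dir <- file.path(project_dir, "r-lib")
dir.create(lib_dir, recursive = TRUE, showWarnings = FALSE)

# Put the project library first so that library() finds the frozen copies
# before anything installed system-wide.
.libPaths(c(lib_dir, .libPaths()))

# Packages would then be installed into that directory, for example
# (not run here because it needs network access or local source tarballs):
# install.packages("somePackage", lib = lib_dir)

.libPaths()[1]
```

Archiving the whole project directory then carries the package versions along with the scripts.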

#6


3  

What if a change in result is due to a change in your operating system? Perhaps Microsoft fix a bug in Windows XP for Windows 7 and then when you upgrade - all your outputs are different.

If you want to handle this then I think the best way of working is to keep snapshots of virtual machines when you close out an analysis, and store the VM images for later use. Of course in five years' time you won't have a license to run Windows XP, so that's another problem, one solved by using an open-source operating system such as Linux.

#7


2  

I would go with Docker images.
This is a pretty convenient way to reproduce the OS and all dependencies.
You build an image and can later deploy it to Docker at any time, fully configured.
Multiple R Docker images are available, so you can easily build your own image on top of them.
Once the image is built, you can deploy it to a test environment and later to production.

#8


1  

This may be a late answer, but I have found it useful to create a generic wrapper like the following, especially when iterating quickly in my development of a new function:

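myFunction <- function(..., version = "latest") {
  if ((version == "latest") || (version == 6)) {
    return(myFunction06(...))
  }
  # ... intermediate versions are handled the same way ...
  if (version == 1) {
    return(myFunction01(...))
  }
  # fail loudly instead of silently returning NULL for an unknown version
  stop("Unsupported version: ", version)
}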

Then, code should simply state which version it wants. Once the actual function stabilizes, I remove support for the older versions of the function, and a quick search through my code lets me find any offending calls. Use of "latest" means I can assure that the caller and the function match some fairly fixed definitions.
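
For anyone who wants to try the pattern, here is a self-contained variation using switch(). The two implementations are made up for illustration, and only two versions are shown.

```r
# Hypothetical versioned implementations; only two are shown.
myFunction01 <- function(x) x + 1
myFunction02 <- function(x) x + 2

# Map version labels to implementations; "latest" is an alias for the newest.
myFunction <- function(..., version = "latest") {
    impl <- switch(
        as.character(version),
        "latest" = ,
        "2" = myFunction02,
        "1" = myFunction01,
        stop("Unsupported version: ", version)
    )
    impl(...)
}

myFunction(10)               # uses the newest implementation
myFunction(10, version = 1)  # pins the original behaviour
```

Scripts that must stay reproducible pass an explicit version, while exploratory code can rely on the default.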

Naturally, all code is maintained in a version control system, so even when I remove the earlier code, it is only from the currently available source. I can reproduce any behavior from any point in time, including errors, as long as the data from that point in time is obtainable.

#9


1  

A solution might be to use S4 methods and let R's internal dispatcher do the work for you (see example below). That way, you're somewhat "bulletproof" with respect to being able to systematically update your code without running the risk of breaking something.

Key benefits

The key thing here is that S4 methods support multiple dispatch.

That way your function will always be foo (as opposed to having to keep track of foo1, foo2 etc.) while new functionality can be easily implemented (by adding respective methods) without touching "old" methods (that other people/packages might rely on).
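
As a quick illustration of what multiple dispatch means, here is a toy sketch. The generic name describePair is made up for this purpose; the point is only that the method is chosen from the classes of both arguments.

```r
library(methods)

# Hypothetical generic that dispatches on the classes of BOTH arguments.
setGeneric("describePair", function(x, y) standardGeneric("describePair"))

setMethod("describePair", signature(x = "numeric", y = "numeric"),
    function(x, y) "two numbers")

setMethod("describePair", signature(x = "character", y = "numeric"),
    function(x, y) "a string and a number")

describePair(1, 2)    # picks the numeric/numeric method
describePair("a", 2)  # picks the character/numeric method
```

Adding the second method did not require touching the first one, which is the property the rest of this answer relies on.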

Key functions you'll need:

  • setGeneric
  • setMethod
  • setRefClass (S4 Reference Classes; personal recommendation) or setClass (S4 Class; I wouldn't use them for the reason described in the "Additional remarks" at the very end)

The "downsides"

  • You need to switch from S3 to S4 logic

  • This implies that you need to write a bit more code than you might be used to (generic method definitions, method definitions and possibly your own class definitions; see example below). But this "buys" you and your code much more structure and makes it more robust.

  • It might also imply that you'll eventually dig deeper and deeper into the world of Object-Oriented Programming or Object-Oriented Design. While I personally consider this to be a good thing (my personal rule of thumb: the more complex/distributed your application, the better off you are using OOP), some would consider these approaches to be atypical for R (I strongly disagree, as R does have superb OO features that are maintained by the Core Team) or "unsuited" for R (this might be true depending on how much you rely on "non-OOP" packages/code). If you're willing to go that way, you might want to familiarize yourself with the SOLID principles of Object-Oriented Design. You also might want to check out the following books: Clean Coder and The Pragmatic Programmer.

  • If computational efficiency (e.g. when estimating statistical models) is really critical, using S4 methods and S4 Reference Classes might slow you down a bit. After all, there's more code involved compared to S3. But I'd recommend testing the impact of this from case to case via system.time() and/or microbenchmark::microbenchmark() instead of picking "ideological" sides (S3 vs. S4).
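
A rough way to measure the dispatch overhead for your own case is sketched below with system.time(). The class and function names are invented for the comparison, and the timings will differ between machines and R versions, so treat the result as an indication only.

```r
library(methods)

# S3 version (hypothetical names).
areaS3 <- function(shape) UseMethod("areaS3")
areaS3.square <- function(shape) shape$side^2

# S4 version of the same computation.
setClass("Square", representation(side = "numeric"))
setGeneric("areaS4", function(shape) standardGeneric("areaS4"))
setMethod("areaS4", "Square", function(shape) shape@side^2)

s3_obj <- structure(list(side = 2), class = "square")
s4_obj <- new("Square", side = 2)

# Call each variant many times so that the dispatch cost becomes visible.
n <- 10000
time_s3 <- system.time(for (i in seq_len(n)) areaS3(s3_obj))
time_s4 <- system.time(for (i in seq_len(n)) areaS4(s4_obj))

# Elapsed seconds per variant.
c(S3 = time_s3[["elapsed"]], S4 = time_s4[["elapsed"]])
```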


Example

Initial function

Let's suppose you're in department A and someone in your team started out by creating a function called foo():

foo <- function(x, y) {
    x + y
}
foo(x=10, y=20)

First change request

You would like to be able to extend it without breaking "old" code that relies on foo().

Now, I think we all agree that this can be quite hard to do.

You either need to explicitly modify the source code of foo() (each time running the risk that you break something that already used to work; this violates the "O" in SOLID: the Open/Closed Principle) or you need to come up with alternative names such as foo1, foo2 etc. (really hard to keep track of which function is doing what).

foo <- function(x, y, type=c("old", "new")) {
    type <- match.arg(type, choices=c("old", "new")) 
    if (type == "old") {
        x + y
    } else if (type == "new") {
        x * y    
    }
}
foo(x=10, y=20)
[1] 30
foo(x=10, y=20, type="new")
[1] 200

foo1 <- function(x, y) {
    x * y
}
foo1(x=10, y=20)
[1] 200

Let's see how S4 methods and multiple dispatch can really help us out here.

Generic method

You need to start out by turning foo() into a generic method.

setGeneric(
    name="foo",
    signature=c("x", "y", ".ctx", ".ns"),
    def=function(x, y, ..., .ctx, .ns) {
        standardGeneric("foo")
    }
)

In simplified words: a generic method itself doesn't do anything yet. It's simply a precondition for being able to specify "actual" methods for its signature arguments that do something useful.
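
The following sketch shows that behaviour on a throwaway generic (fooDemo is a made-up name, chosen so that it does not interfere with foo): the generic exists, but calling it fails until a matching method has been registered.

```r
library(methods)

# Hypothetical generic with no methods registered yet.
setGeneric("fooDemo", function(x, y) standardGeneric("fooDemo"))

isGeneric("fooDemo")  # the generic exists ...

# ... but calling it fails until a matching method is defined.
before <- tryCatch(fooDemo(1, 2), error = function(e) "no method found")

setMethod("fooDemo", signature(x = "numeric", y = "numeric"),
    function(x, y) x + y)

# Now the dispatcher finds a method for numeric/numeric.
after <- fooDemo(1, 2)
```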

Signature arguments

The degree of flexibility with respect to the original problem is directly linked to the number of signature arguments that you declare (signature=c("x", "y", ".ctx", ".ns")): the more signature arguments, the more flexibility you have, but the more complex your code might get as well (with respect to how much code you have to write).

Again, in simplified words: signature arguments (and their classes) are used by the method dispatcher to retrieve the correct method that's doing the actual work.

Think of the method dispatcher as being like the clerk in a ski rental business: you present him with an arbitrarily large set of signature information (i.e. information that "clearly distinguishes you from others": your age, height, shoe size and skill level) and he uses that information to provide you with the right equipment to hit the slopes. Think of R's method dispatcher as being the clerk that has access to the storage room of the ski rental. But instead of ski equipment it will return methods.

Notice that we said that our "old" arguments x and y are from now on supposed to be signature arguments while there are also two new arguments: .ctx and .ns. I'll get to these in a minute. It's those arguments that will provide us with the flexibility that we're after.

Initial method definition

We now define a "variant" (a method) of the generic method for the following "signature scenario":

  1. x is numeric
  2. y is numeric
  3. .ctx will just not be provided when calling the method and is thus missing
  4. .ns will just not be provided when calling the method and is thus missing

Think of it as registering your signature information against specific equipment at the ski rental. Once you have done that and ask for your equipment, the only thing the clerk has to do is to go to the storage room and look up which equipment is linked to your personal information.

setMethod(
    f="foo", 
    signature=signature(x="numeric", y="numeric", .ctx="missing", .ns="missing"), 
    definition=function(x, y, ..., .ctx, .ns) {
        x + y
    }
)

When we call foo with this "signature scenario" (asking for the method that we registered for this scenario), the method dispatcher knows exactly which actual method it needs to get out of the storage room:

foo(x=10, y=20)
[1] 30
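
If you want to see what the dispatcher would pick without actually calling the generic, the methods package offers existsMethod(), selectMethod() and showMethods(). The sketch below uses a reduced stand-in generic (fooSketch, a made-up name with only three signature arguments) so that it can be run on its own.

```r
library(methods)

# Hypothetical stand-in for the generic above, reduced to three arguments.
setGeneric("fooSketch", signature = c("x", "y", ".ctx"),
    def = function(x, y, ..., .ctx) standardGeneric("fooSketch"))

setMethod("fooSketch",
    signature(x = "numeric", y = "numeric", .ctx = "missing"),
    function(x, y, ..., .ctx) x + y)

# Is a method registered for exactly this signature?
existsMethod("fooSketch", signature("numeric", "numeric", "missing"))

# Retrieve the method the dispatcher would pick and call it directly.
picked <- selectMethod("fooSketch", signature("numeric", "numeric", "missing"))
picked(10, 20)

# List everything that is registered for the generic.
showMethods("fooSketch")
```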

First update

Now someone from department B comes along, looks at foo(), likes it but decides that foo() needs to be updated (x * y instead of x + y) if it is to be used in his department.

That's when .ctx (short for context) comes into play: it's an argument by which we are able to distinguish application contexts.

Defining a class that represents the new application context

setRefClass("ApplicationContextDepartmentB")

When calling foo(), we'll provide it with an instance of this class (.ctx=new("ApplicationContextDepartmentB")).

Defining a new method for the new application context

Notice how we register signature argument .ctx to our new class ApplicationContextDepartmentB:

setMethod(
    f="foo", 
    signature=signature(x="numeric", y="numeric", 
        .ctx="ApplicationContextDepartmentB", .ns="missing"), 
    definition=function(x, y, ..., .ctx, .ns) {
        out <- x * y
        attributes(out)$description <- "I'm different from the original foo()"
        return(out)
    }
)

That way, the method dispatcher knows exactly that it should return the "new" method instead of the "old" method when we call foo() like this:

foo(x=1, y=10, .ctx=new("ApplicationContextDepartmentB"))
[1] 10
attr(,"description")
[1] "I'm different from the original foo()"

The "old" method is not affected at all:

foo(x=1, y=10)
[1] 11

Second update

Suppose that someone from department C comes along and suggests yet another "configuration" or version for foo(). You can easily provide that without breaking anything that you've implemented for departments A and B so far by following the same routine as for department B.

But we'll even take it one step further here: we'll define two additional classes that let us distinguish different "namespaces" (that's where .ns comes into play).

Think of namespaces as a way of distinguishing different runtime scenarios for a specific method for a specific application context (e.g. "testing" and "productive mode").

Defining the classes

setRefClass("ApplicationContextDepartmentC")
setRefClass("TestNamespace")
setRefClass("ProductionNamespace")

Defining a new method for the new application context and a "test" scenario

Notice how we register signature arguments .ctx to our new class ApplicationContextDepartmentC and .ns to our new class TestNamespace:

setMethod(
    f="foo", 
    signature=signature(x="character", y="numeric", 
        .ctx="ApplicationContextDepartmentC", .ns="TestNamespace"), 
    definition=function(x, y, ..., .ctx, .ns) {
        data.frame(x, y, test.ok=rep(TRUE, length(x)))
    }
)

Again, the method dispatcher will look up the correct method when calling foo() like this:

foo(x=letters[1:5], y=11:15, .ctx=new("ApplicationContextDepartmentC"), 
    .ns=new("TestNamespace"))
  x  y test.ok
1 a 11    TRUE
2 b 12    TRUE
3 c 13    TRUE
4 d 14    TRUE
5 e 15    TRUE

Defining a new method for the new application context and a "productive" scenario

setMethod(
    f="foo", 
    signature=signature(x="character", y="numeric", 
        .ctx="ApplicationContextDepartmentC", .ns="ProductionNamespace"), 
    definition=function(x, y, ..., .ctx, .ns) {
        data.frame(x, y)
    }
)

We tell the method dispatcher that we now want the method registered for this scenario or namespace like this:

foo(x=letters[1:5], y=11:15, .ctx=new("ApplicationContextDepartmentC"), 
    .ns=new("ProductionNamespace"))

  x  y
1 a 11
2 b 12
3 c 13
4 d 14
5 e 15

Notice that you're free to use the classes TestNamespace and ProductionNamespace anywhere you'd like. These classes are not bound to ApplicationContextDepartmentC in any way, so you can, for example, also use them for all your other application scenarios.

Additional remarks for method definitions

Something that's often quite useful is to start out with a method that accepts ANY classes for its signature arguments and define more restrictive methods as your software evolves:

setMethod(
    f="foo", 
    signature=signature(x="ANY", y="ANY", .ctx="missing", .ns="missing"), 
    definition=function(x, y, ..., .ctx, .ns) {
        message("Value of x:")
        print(x)
        message("Value of y:")
        print(y)
    }
)
foo(x="Hello World!", y=rep(TRUE, 3))
Value of x:
[1] "Hello World!"
Value of y:
[1] TRUE TRUE TRUE
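
To see how a more restrictive method later takes precedence over the ANY fallback, here is a small sketch on a made-up generic called describe (a single signature argument is used to keep it short):

```r
library(methods)

# Hypothetical generic with a catch-all method and one specific method.
setGeneric("describe", function(x) standardGeneric("describe"))

setMethod("describe", "ANY", function(x) "something")
setMethod("describe", "numeric", function(x) "a number")

describe(42)       # the specific numeric method wins
describe("text")   # falls back to the ANY method
```

Existing callers that relied on the fallback keep working for every class that has no dedicated method.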

Additional remarks for class definitions

I prefer S4 Reference Classes over S4 Classes because of the self-referencing capabilities of S4 Reference Classes:

setRefClass(
    Class="A", 
    fields=list(
        x1="numeric",
        x2="logical"
    ),
    methods=list(
        getX1=function() {
            .self$x1
        },
        getX2=function() {
            .self$x2
        },
        setX1=function(x) {
            .self$x1 <- x
        },
        setX2=function(x) {
            .self$field("x2", x)
        },
        addX1AndX2=function() {
            .self$getX1() + .self$getX2()
        }
    )
)
x <- new("A", x1=10, x2=TRUE)
x$getX1()
[1] 10
x$getX2()
[1] TRUE
x$addX1AndX2()
[1] 11

S4 Classes don't have that feature.

Subsequent modifications of field values:

x$setX1(100)
x$addX1AndX2()
[1] 101
x$x1 <- 1000
x$addX1AndX2()
[1] 1001
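
One consequence of this that is worth keeping in mind for reproducibility: Reference Class objects are modified in place, whereas plain S4 objects are copied on assignment. The sketch below uses two made-up classes (ValueBox and RefBox) to show the difference.

```r
library(methods)

# Hypothetical classes: one plain S4 class, one Reference Class.
setClass("ValueBox", representation(x = "numeric"))
RefBox <- setRefClass("RefBox", fields = list(x = "numeric"))

# S4 objects are copied on assignment: changing the copy leaves the original.
v1 <- new("ValueBox", x = 1)
v2 <- v1
v2@x <- 99
v1@x  # still 1

# Reference Class objects are shared: both names point to the same object.
r1 <- RefBox$new(x = 1)
r2 <- r1
r2$x <- 99
r1$x  # now 99 as well

# An explicit copy() is needed to get an independent object.
r3 <- r1$copy()
r3$x <- 5
r1$x  # unchanged by the modification of r3
```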

Additional remarks for documenting methods and classes

I strongly recommend using packages roxygen2 and devtools to document your methods and classes. You possibly might also want to look into package roxygen3.

Documenting generic methods with roxygen2:

#' Foo
#'
#' This method takes \code{x} and \code{y} and adds them.
#' 
#' Some details here
#' 
#' @param x \strong{Signature argument}.
#' @param y \strong{Signature argument}.
#' @param ... Further arguments to be passed to subsequent functions.
#' @param .ctx \strong{Signature argument}.
#'      Application context.
#' @param .ns \strong{Signature argument}.
#'      Application namespace. Usually used to distinguish different context 
#'      versions or configurations.
#' @author Janko Thyson \email{john.doe@@something.com}
#' @references \url{http://www.something.com/}
#' @example inst/examples/foo.R
#' @docType methods
#' @rdname foo-methods
#' @export

setGeneric(
    name="foo",
    signature=c("x", "y", ".ctx", ".ns"),
    def=function(x, y, ..., .ctx, .ns) {
        standardGeneric("foo")
    }
)

Documenting methods with roxygen2:

#' @param x \code{\link{character}}. Character vector.
#' @param y \code{\link{numeric}}. Numerical vector.  
#' @param .ctx \code{\link{ApplicationContextDepartmentC}}. 
#' @param .ns \code{\link{ProductionNamespace}}.  
#' @return \code{\link{data.frame}}. Some data frame.
#' @rdname foo-methods
#' @aliases foo,character,numeric,ApplicationContextDepartmentC,ProductionNamespace-method
#' @export

setMethod(
    f="foo", 
    signature=signature(x="character", y="numeric", 
        .ctx="ApplicationContextDepartmentC", .ns="ProductionNamespace"), 
    definition=function(x, y, ..., .ctx, .ns) {
        data.frame(x, y)
    }
)