R: Passing a data frame by reference

Date: 2022-08-22 22:55:05

R has pass-by-value semantics, which minimizes accidental side effects (a good thing). However, when code is organized into many functions/methods for reusability, readability and maintainability, and that code needs to push large data structures, e.g., big data frames, through a series of transformations/operations, pass-by-value semantics lead to a lot of copying of data and much heap thrashing (a bad thing). For example, a data frame that takes 50 MB on the heap and is passed as a function parameter will be copied at least as many times as the function call depth N, so the heap size at the bottom of the call stack will be N*50 MB. If the functions return a transformed/modified data frame from deep in the call chain, the copying goes up by another N.

The SO question "What is the best way to avoid passing a data frame around?" touches on this topic, but it is phrased in a way that avoids directly asking the pass-by-reference question, and the winning answer basically says, "yes, pass-by-value is how R works". That's not actually 100% accurate. R environments enable pass-by-reference semantics, and OO frameworks such as proto use this capability extensively. For example, when a proto object is passed as a function argument, its "magic wrapper" is passed by value, but to the R developer the semantics are pass-by-reference.
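To make the point about environments concrete, here is a minimal sketch in base R. The names `holder` and `add_total` are made up for illustration; the only assumption is that the data frame lives inside an environment rather than being passed directly.

```r
# Hypothetical helper: updates the data frame stored in `env`.
# Nothing is returned; the caller sees the change through the environment.
add_total <- function(env) {
  env$df$total <- env$df$a + env$df$b
  invisible(NULL)
}

holder <- new.env()
holder$df <- data.frame(a = 1:3, b = 4:6)

add_total(holder)
holder$df$total  # 5 7 9
```

The environment itself is what gets passed around, so every function in the call chain works on the same `holder$df` binding instead of on its own copy.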

It seems that passing a big data frame by reference would be a common problem and I'm wondering how others have approached it and whether there are any libraries that enable this. In my searching I have not discovered one.

If nothing is available, my approach would be to create a proto object that wraps a data frame. I would appreciate pointers about the syntactic sugar that should be added to this object to make it useful, e.g., overloading the $ and [[ operators, as well as any gotchas I should look out for. I'm not an R expert.
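As a rough illustration of what such a wrapper might look like without proto, here is a sketch using a plain S3 class around an environment. The class name `df_ref`, the helper `.store`, and the function `double_a` are all hypothetical, and only `$` and `[[` (plus their replacement forms) are covered.

```r
# Constructor for a hypothetical "df_ref" class: a list holding an environment
df_ref <- function(df) {
  stopifnot(is.data.frame(df))
  store <- new.env(parent = emptyenv())
  store$df <- df
  structure(list(store = store), class = "df_ref")
}

# Internal accessor that bypasses the overloaded `$`
.store <- function(x) unclass(x)$store

# Read access is forwarded to the wrapped data frame
`$.df_ref`  <- function(x, name) .store(x)$df[[name]]
`[[.df_ref` <- function(x, i, ...) .store(x)$df[[i]]

# Write access updates the data frame inside the shared environment
`$<-.df_ref` <- function(x, name, value) {
  store <- .store(x)
  store$df[[name]] <- value
  x
}
`[[<-.df_ref` <- function(x, i, value) {
  store <- .store(x)
  store$df[[i]] <- value
  x
}

# The callee changes the wrapped data frame; nothing is returned
double_a <- function(ref) {
  ref$a2 <- ref$a * 2
  invisible(NULL)
}

ref <- df_ref(data.frame(a = 1:3))
double_a(ref)
ref$a2  # 2 4 6
```

One gotcha with this design: functions such as `nrow()`, `print()` or `[` know nothing about `df_ref` unless methods are written for them, and every copy of the wrapper shares the same underlying data frame.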

Bonus points for a type-agnostic pass-by-reference solution that integrates nicely with R, though my needs are exclusively with data frames.

1 answer

#1


27  

The premise of the question is (partly) incorrect. R works as pass-by-promise, and the repeated copying you outline happens only when further assignments and alterations to the data frame are made as the promise is passed on. So the number of copies will not be N*size where N is the stack depth, but rather where N is the number of levels at which assignments are made. You are correct, however, that environments can be useful. I see on following the link that you have already found the 'proto' package. There is also a relatively recent introduction of "reference classes", sometimes referred to as "R5", where S3 is the original class system of S that was copied in R, and S4 is the more recent class system that seems mostly to support Bioconductor package development.
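A small sketch of this behaviour, assuming an R build with memory profiling enabled so that `tracemem()` is available (the code checks `capabilities("profmem")` and skips the tracing otherwise). The function names `read_only` and `modify` are illustrative only.

```r
df <- data.frame(x = runif(1e5))

# Only reads its argument, so no copy should be triggered
read_only <- function(d) nrow(d)

# Assigns into its argument, which is the point where R duplicates it
modify <- function(d) {
  d$y <- 1
  d
}

traced <- capabilities("profmem")
if (traced) tracemem(df)

n   <- read_only(df)  # no tracemem message expected here
out <- modify(df)     # tracemem, if enabled, should report a copy here

if (traced) untracemem(df)

names(df)   # the caller's data frame is unchanged
names(out)  # the modified copy carries the new column
```

Passing `df` through any number of functions that only read it costs nothing extra; the duplication is tied to the levels that assign into it.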

Here is a link to an example by Steve Lianoglou (in a thread discussing the merits of reference classes) of embedding an environment inside an S4 object to avoid the copying costs:

https://stat.ethz.ch/pipermail/r-help/2011-September/289987.html
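The general idea of an environment inside an S4 object can be sketched as follows. This is not the code from the linked post; the class `FrameHolder`, its slot `store`, and the generic `addColumn` are names invented here for illustration.

```r
library(methods)

# S4 class whose only slot is an environment that holds the data frame
setClass("FrameHolder", representation(store = "environment"))

# Hypothetical constructor
FrameHolder <- function(df) {
  store <- new.env(parent = emptyenv())
  store$df <- df
  new("FrameHolder", store = store)
}

setGeneric("addColumn", function(obj, name, value) standardGeneric("addColumn"))

# The method changes the data frame inside the environment, so the
# caller's object reflects the change without being reassigned
setMethod("addColumn", "FrameHolder", function(obj, name, value) {
  store <- obj@store
  store$df[[name]] <- value
  invisible(obj)
})

h <- FrameHolder(data.frame(a = 1:3))
addColumn(h, "b", c("x", "y", "z"))
names(h@store$df)  # "a" "b"
```

The S4 object is still passed by value, but since its slot is an environment, all copies of the object point at the same stored data frame.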

Matthew Dowle's 'data.table' package creates a new class of data object whose access semantics using "[" are different from those of regular R data.frames, and which really does work as pass-by-reference. It has superior speed of access and processing. It can also fall back on data frame semantics, since such objects now inherit from the 'data.frame' class.
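A short sketch of the by-reference behaviour, assuming the data.table package is installed (the code is skipped otherwise). It uses the package's `:=` operator inside `[`; the function name `add_double` is made up for this example.

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  dt <- data.table(x = 1:3)

  # `:=` adds the column to the caller's table in place; no reassignment needed
  add_double <- function(d) {
    d[, y := x * 2L]
    invisible(d)
  }

  add_double(dt)
  names(dt)  # "x" "y"
}
```

Because `dt` still inherits from 'data.frame', it can also be handed to code that expects an ordinary data frame.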

You may also want to investigate Hesterberg's dataframe package.
