Can you "tie" or provide an alternative implementation of data in R?

Time: 2022-01-20 21:09:02

In Perl (and probably other languages), you can "tie" a variable to replace its built-in behavior with user-defined behavior. For example, a hash table can be tied with custom "fetch" and "store" subroutines which query BerkeleyDB, so that the data is persistent and not limited by RAM, but still looks and acts like a regular hash to Perl.

Is something similar possible with R? In particular, I was thinking, since a data.frame looks much like a table in a relational db, that if a data.frame were tied to something like SQLite, it would enable R to handle very large data frames (I've stuffed 100GB+ into SQLite) without any code changes.


1 solution

#1


2  

As the comments point out, a handful of packages have already been built on this idea (or a similar one).

data.table and dplyr are exceptionally good at handling and querying very large data.frames. If the data.frame is actually >100GB, I would recommend data.table, which seems to outperform dplyr in the limit nrow->Inf. Both have excellent support on * should you need it.

However, to actually answer your question (and to be useful to future readers of this question): yes, it is possible to overload a function in R to provide alternative behavior. It is actually very easy with the S3 dispatch system. I recommend this resource to learn more.

I'll give you the condensed version: If you have an object of class "myclass", you can write a function f.myclass to do what you want.


Then you define the generic function f:

f <- function(obj, ...) UseMethod("f")

When you call f(obj), the function that UseMethod will call depends on the class of obj.


If obj is of class "myclass", then f.myclass will be called on obj.
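As a minimal sketch of the generic-plus-method pattern just described, the snippet below defines a generic and two methods. The names `describe`, `describe.myclass`, and `describe.default` are hypothetical and chosen only for illustration; they are not part of the original answer.

```r
# Generic: dispatches on the class of its first argument.
describe <- function(obj, ...) UseMethod("describe")

# Method picked when the object's class vector contains "myclass".
describe.myclass <- function(obj, ...) "an object of class myclass"

# Fallback used when no class-specific method is found.
describe.default <- function(obj, ...) "some other object"

x <- structure(list(), class = "myclass")
describe(x)   # dispatches to describe.myclass
describe(42)  # falls back to describe.default
```

The `.default` method is optional, but without it a call on an object with no matching method stops with an error.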


If the function you want to redefine is already a generic, say plot, then you can simply define plot.myclass, which will be used when you call plot on a "myclass" object. Since the generic function already exists, there is no need to redefine it.

To change the class of an object, you can use class<-. More commonly, you add the new class to the existing ones rather than replacing them, so that the behavior you don't want to change is preserved.

Here's a silly example.


> print.myclass <- function(x, ...) {
+     print("Hello!")
+ }

> df <- data.frame(a=1:3)
> class(df)
[1] "data.frame"
> df #equivalent to print(df)
  a
1 1
2 2
3 3

> class(df) <- append(class(df), "myclass")
> class(df)
[1] "data.frame" "myclass"   

> class(df) <- "myclass"
> class(df)
[1] "myclass"
> df
[1] "Hello!"
> str(df) # checking the structure of df: the data is still there of course
List of 1
 $ a: int [1:3] 1 2 3
 - attr(*, "row.names")= int [1:3] 1 2 3
 - attr(*, "class")= chr "myclass"

There are some subtleties, like which function is called if there are several classes, in what order, etc. I refer you to a thorough explanation of the S3 system.
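One of those subtleties can be shown briefly: S3 dispatch walks the class vector from left to right and uses the first method it finds. The sketch below assumes a hypothetical `print.myclass` method (slightly different from the one above) and compares appending the new class with prepending it.

```r
# Hypothetical print method for "myclass"; returns its argument invisibly,
# as print methods conventionally do.
print.myclass <- function(x, ...) {
  cat("<myclass with", length(x), "column(s)>\n")
  invisible(x)
}

df <- data.frame(a = 1:3)

# Appended: "data.frame" comes first, so print.data.frame still wins.
appended <- df
class(appended) <- c(class(df), "myclass")

# Prepended: "myclass" comes first, so print.myclass is used,
# while other data.frame methods remain available as a fallback.
prepended <- df
class(prepended) <- c("myclass", class(df))

out_appended <- capture.output(print(appended))
out_prepended <- capture.output(print(prepended))
```

Prepending is therefore the usual choice when the new class should override a method but otherwise keep behaving like a data.frame.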


That's how you would redefine the behavior of functions: write them as f.myclass methods, then create objects of class "myclass".

Alternatively, you could redefine f.targetclass directly, which changes the behavior for every object of that class. For example, again with print and data.frame:

> print.data.frame <- function(x, ...) {
+     print(paste("data.frame with columns:", paste(names(x), collapse = ", ")))
+ } # less silly example!
> df <- data.frame(a=1:3, b=4:6)
> df
[1] "data.frame with columns: a, b"
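Coming back to the original "tie" question, the same mechanism can be applied to the accessor functions themselves, since `[[` and `[[<-` are generics too. The following is only a rough sketch under stated assumptions: the class name `tied_store` and its constructor are hypothetical, and a plain environment stands in for the external store. A real SQLite-backed version would replace the `get`/`assign` calls with database queries.

```r
# Hypothetical constructor: the environment stands in for an external store.
tied_store <- function() {
  structure(list(backend = new.env()), class = "tied_store")
}

# Read access: s[["key"]] fetches from the backend instead of from the list.
`[[.tied_store` <- function(x, i, ...) {
  backend <- unclass(x)$backend
  if (exists(i, envir = backend, inherits = FALSE)) {
    get(i, envir = backend, inherits = FALSE)
  } else {
    NULL
  }
}

# Write access: s[["key"]] <- value stores into the backend.
`[[<-.tied_store` <- function(x, i, value) {
  assign(i, value, envir = unclass(x)$backend)
  x
}

s <- tied_store()
s[["answer"]] <- 42
s[["answer"]]   # read back through the custom accessor
```

Code that only uses `[[` on the object does not need to know where the data lives, which is the essence of a tied variable. Covering the whole data.frame interface (`[`, `$`, `nrow`, and so on) would take considerably more methods.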
