What can you do with data.frame that you can't do with data.table?

Date: 2022-07-20 21:05:30

I just started using R, and came across data.table. I found it brilliant.

A very naive question: can I ignore data.frame and use only data.table, so that I don't confuse the syntax of the two?

1 solution

#1


From the data.table FAQ

FAQ 1.8 OK, I'm starting to see what data.table is about, but why didn't you enhance data.frame in R? Why does it have to be a new package?

As FAQ 1.1 highlights, j in [.data.table is fundamentally different from j in [.data.frame. Even something as simple as DF[,1] would break existing code in many packages and user code. This is by design, and we want it to work this way for more complicated syntax to work. There are other differences, too (see FAQ 2.17).

Furthermore, data.table inherits from data.frame. It is a data.frame, too. A data.table can be passed to any package that only accepts data.frame and that package can use [.data.frame syntax on the data.table.
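The inheritance is easy to check for yourself. Below is a minimal sketch, assuming the data.table package is installed; `n_cols` is a made-up helper standing in for any function that only knows about data.frame.

```r
library(data.table)

# Hypothetical sample data, for illustration only
DT <- data.table(x = 1:3, y = c("a", "b", "c"))

# A data.table carries both classes, so it passes data.frame checks
dt_class <- class(DT)
is_df <- is.data.frame(DT)

# Made-up helper that only accepts a data.frame and knows nothing about data.table
n_cols <- function(df) {
  stopifnot(is.data.frame(df))
  ncol(df)
}

n <- n_cols(DT)   # accepted, because DT is a data.frame too
```

So anything that tests `is.data.frame()` or dispatches on the data.frame class will accept a data.table as it is.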

We have proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0:

unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements to the way the hash code is generated in unique.c.

A second proposal was to use memcpy in duplicate.c, which is much faster than a for loop in C. This would improve the way that R copies data internally (on some measures by 13 times). The thread on r-devel is here: http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html

FAQ 2.17 What are the smaller syntax differences between data.frame and data.table?

  • DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column
  • DT[3,] == DT[3], but DF[,3] == DF[3] (somewhat confusingly)
  • For this reason we say the comma is optional in DT, but not optional in DF
  • DT[[3]] == DF[,3] == DF[[3]] (each returns the 3rd column as a vector)
  • DT[i,] where i is a single integer returns a single row, just like DF[i,], but unlike a matrix single row subset which returns a vector.
  • DT[,j,with=FALSE] where j is a single integer returns a one column data.table, unlike DF[,j] which returns a vector by default
  • DT[,"colA",with=FALSE][[1]] == DF[,"colA"].
  • DT[,colA] == DF[,"colA"]
  • DT[,list(colA)] == DF[,"colA",drop=FALSE]
  • DT[NA] returns 1 row of NA, but DF[NA] returns a copy of DF containing NA throughout.
  • The symbol NA is of type logical in R, and is therefore recycled by [.data.frame. The intention was probably DF[NA_integer_]. [.data.table does this automatically for convenience.
  • DT[c(TRUE,NA,FALSE)] treats the NA as FALSE, but DF[c(TRUE,NA,FALSE)] returns NA rows for each NA
  • DT[ColA==ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA==ColB,]
  • data.frame(list(1:2,"k",1:4)) creates 3 columns, data.table creates one list column.
  • check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.
  • stringsAsFactors is by default TRUE in data.frame (before R 4.0.0; the default has been FALSE since then) but FALSE in data.table, for efficiency.
  • Since a global string cache was added to R, character items are pointers to the single cached string, and there is no longer a performance benefit in converting to factor.
  • Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.
  • In [.data.frame we very often set drop=FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column data.frame. In [.data.table we took the opportunity to make it consistent and drop drop.
  • When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
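Several of the differences listed above can be seen side by side in a short session. This is only a sketch: it assumes the data.table package is installed, and the sample data and column names (`colA`, `colB`, `colC`) are invented for illustration.

```r
library(data.table)

# Hypothetical sample data; column names are chosen for illustration only
DF <- data.frame(colA = c(1, NA, 3), colB = c(1, 2, 4), colC = c("x", "y", "z"))
DT <- as.data.table(DF)

# A single index selects a row in data.table but a column in data.frame
dt_row3 <- DT[3]   # one-row data.table
df_col3 <- DF[3]   # one-column data.frame (colC)

# [[ ]] and a bare column name in j both give a plain vector
same_vec <- identical(DT[[3]], DF[[3]]) && identical(DT[, colA], DF[, "colA"])

# with = FALSE keeps the result as a one-column data.table
dt_one_col <- DT[, "colA", with = FALSE]

# A logical NA in i is treated as FALSE by data.table,
# but produces an all-NA row in data.frame
n_dt <- nrow(DT[c(TRUE, NA, FALSE)])
n_df <- nrow(DF[c(TRUE, NA, FALSE), ])

# colA == colB evaluates to c(TRUE, NA, FALSE) here, so data.frame
# needs the explicit is.na() guards to get the same single row
dt_match <- DT[colA == colB]
df_match <- DF[!is.na(DF$colA) & !is.na(DF$colB) & DF$colA == DF$colB, ]

# The constructors treat a list argument differently
ncol_df <- ncol(data.frame(list(1:2, "k", 1:4)))   # three columns
ncol_dt <- ncol(data.table(list(1:2, "k", 1:4)))   # one list column
```

Printing each object at the console is the quickest way to see the shapes; the point is only that the same-looking index can mean rows in one class and columns in the other.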

Small caveat

There may be cases where a package uses code that falls down when given a data.table rather than a plain data.frame. However, data.table is actively maintained to avoid such problems, so any that do arise tend to be fixed promptly.

For example

  • base::unname(DT) now works again, as needed by plyr::melt(). Thanks to Christoph Jaeckel for reporting. Test added.
  • An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2 without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added. ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2 doesn't invoke ITime's as.character method. Converting ITime to POSIXct for ggplot2 is one approach.
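For the ITime workaround mentioned in the last item, here is one possible sketch of the conversion. It assumes data.table is installed and relies only on ITime being stored as integer seconds since midnight; the anchor date (1970-01-01, UTC) is an arbitrary choice of mine, not something the changelog prescribes.

```r
library(data.table)

# Hypothetical sample: three times of day stored as ITime
tm <- as.ITime(c("09:00:00", "12:30:00", "18:45:00"))

# ITime is stored as integer seconds since midnight
secs <- as.integer(tm)

# Attach the times to an arbitrary date (an assumption) to get POSIXct values
origin <- as.POSIXct("1970-01-01 00:00:00", tz = "UTC")
tm_ct <- origin + secs

# Clock-time labels, which a plotting axis can then format
labels <- format(tm_ct, "%H:%M:%S", tz = "UTC")
```

The resulting `tm_ct` column can be mapped to a ggplot2 axis like any other POSIXct column.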
