What can you do with data.frame that you can't do with data.table?

Time: 2022-01-20 21:09:14

I just started using R, and came across data.table. I found it brilliant.

A very naive question: can I ignore data.frame and use only data.table, to avoid confusing the syntax of the two?

1 solution

#1


55  

From the data.table FAQ

FAQ 1.8 OK, I'm starting to see what data.table is about, but why didn't you enhance data.frame in R? Why does it have to be a new package?

As FAQ 1.1 highlights, j in [.data.table is fundamentally different from j in [.data.frame. Even something as simple as DF[,1] would break existing code in many packages and user code. This is by design, and we want it to work this way for more complicated syntax to work. There are other differences, too (see FAQ 2.17).

Furthermore, data.table inherits from data.frame. It is a data.frame, too. A data.table can be passed to any package that only accepts data.frame and that package can use [.data.frame syntax on the data.table.
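
To make the inheritance point concrete, here is a minimal sketch. The function column_means is a hypothetical stand-in for code from a package that knows nothing about data.table; it is not part of any real package.

```r
library(data.table)

DT <- data.table(id = 1:3, value = c(2.5, 4.0, 6.5))

# A data.table carries both classes, so data.frame checks succeed.
class(DT)            # "data.table" "data.frame"
is.data.frame(DT)    # TRUE

# Hypothetical stand-in for a function that only expects a data.frame
# and treats it as a list of columns.
column_means <- function(df) {
  stopifnot(is.data.frame(df))
  sapply(df, mean)
}

column_means(DT)
```

The sketch deliberately avoids `[` inside column_means: a function defined at the top level of a script is treated as data.table-aware, so it would not demonstrate the [.data.frame fallback that code inside an unaware package gets.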

We have proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0 :

unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements to the way the hash code is generated in unique.c.

A second proposal was to use memcpy in duplicate.c, which is much faster than a for loop in C. This would improve the way that R copies data internally (on some measures by 13 times). The thread on r-devel is here : http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html.

FAQ 2.17 What are the smaller syntax differences between data.frame and data.table?

  • DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column

  • DT[3,] == DT[3], but DF[,3] == DF[3] (somewhat confusingly)

  • For this reason we say the comma is optional in DT, but not optional in DF

  • DT[[3]] == DF[3] == DF[[3]]

  • DT[i,] where i is a single integer returns a single row, just like DF[i,], but unlike a matrix single row subset which returns a vector.

  • DT[,j,with=FALSE] where j is a single integer returns a one column data.table, unlike DF[,j] which returns a vector by default

  • DT[,"colA",with=FALSE][[1]] == DF[,"colA"].

  • DT[,colA] == DF[,"colA"]

  • DT[,list(colA)] == DF[,"colA",drop=FALSE]

  • DT[NA] returns 1 row of NA, but DF[NA] returns a copy of DF containing NA throughout.

  • The symbol NA is type logical in R, and is therefore recycled by [.data.frame. The intention was probably DF[NA_integer_]. [.data.table does this automatically for convenience.

  • DT[c(TRUE,NA,FALSE)] treats the NA as FALSE, but DF[c(TRUE,NA,FALSE)] returns NA rows for each NA

  • DT[ColA==ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA==ColB,]

  • data.frame(list(1:2,"k",1:4)) creates 3 columns, data.table creates one list column.

  • check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.

  • stringsAsFactors is by default TRUE in data.frame (in R versions before 4.0.0; the default has been FALSE since then) but FALSE in data.table, for efficiency.

  • Since a global string cache was added to R, character items are pointers to the single cached string and there is no longer a performance benefit in converting to factor.

  • Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.

  • In [.data.frame we very often set drop=FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column data.frame. In [.data.table we took the opportunity to make it consistent and drop drop.

  • When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works
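
A few of the differences listed above can be checked directly. This is a small sketch on made-up data (the column names colA, colB and colC are only for illustration), run at the top level of a script where data.table syntax is active:

```r
library(data.table)

DF <- data.frame(colA = c(1, NA, 3), colB = c(1, 2, 4), colC = c("x", "y", "z"))
DT <- as.data.table(DF)

# Single index: a row in data.table, a column in data.frame.
DT[3]    # 3rd row, as a one-row data.table
DF[3]    # 3rd column, as a one-column data.frame

# Double brackets extract the same column vector from both.
identical(DT[[3]], DF[[3]])

# Selecting a column by bare name vs. by string.
DT[, colA]                    # plain vector, like DF[, "colA"]
DT[, list(colA)]              # one-column data.table, like DF[, "colA", drop = FALSE]
DT[, "colA", with = FALSE]    # one-column data.table

# Logical NA in i: treated as FALSE by data.table, kept as an NA row by data.frame.
nrow(DT[c(TRUE, NA, FALSE)])      # 1
nrow(DF[c(TRUE, NA, FALSE), ])    # 2

# Comparisons involving NA: only rows where the test is TRUE are returned.
nrow(DT[colA == colB])            # 1
nrow(DF[!is.na(DF$colA) & !is.na(DF$colB) & DF$colA == DF$colB, ])    # 1
```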


Small caveat

There may be cases where a package uses code that fails when given a data.table. However, data.table is actively maintained to avoid such problems, so any that do arise are likely to be fixed promptly.
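
If you do hit such a case before a fix lands, one possible workaround (my suggestion, not something the FAQ prescribes) is to hand the package a plain data.frame copy:

```r
library(data.table)

DT <- data.table(x = 1:3, y = c("a", "b", "c"))

# as.data.frame() returns a copy without the data.table class.
DF <- as.data.frame(DT)
class(DF)    # "data.frame"

# The original is untouched and still works with data.table syntax.
class(DT)    # "data.table" "data.frame"
```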

For example

  • base::unname(DT) now works again, as needed by plyr::melt(). Thanks to Christoph Jaeckel for reporting. Test added.

  • An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2 without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added. ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2 doesn't invoke ITime's as.character method. Converting ITime to POSIXct for ggplot2 is one approach.
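
The ITime-to-POSIXct conversion mentioned in the last item could look roughly like this. It is a hand-rolled sketch that relies only on ITime being stored as integer seconds since midnight; the anchor date 2020-01-01 and the UTC time zone are arbitrary assumptions.

```r
library(data.table)

it <- as.ITime(c("00:00:30", "12:30:00"))

# ITime stores whole seconds since midnight as an integer.
secs <- as.integer(it)    # 30 45000

# Attach the times to an arbitrary (assumed) date to get POSIXct values,
# which plotting code can label as clock times.
base_day <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC")
stamps <- base_day + secs

format(stamps, "%H:%M:%S", tz = "UTC")    # "00:00:30" "12:30:00"
```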
