I just started using R, and came across data.table. I found it brilliant.
A very naive question: Can I ignore data.frame to use data.table to avoid syntax confusion between two packages?
1 Answer
#1
55
From the data.table FAQ
FAQ 1.8 OK, I'm starting to see what data.table is about, but why didn't you enhance data.frame in R? Why does it have to be a new package?
As FAQ 1.1 highlights, `j` in `[.data.table` is fundamentally different from `j` in `[.data.frame`. Had data.frame itself been changed, even something as simple as `DF[,1]` would break existing code in many packages and user code. This is by design, and we want it to work this way for more complicated syntax to work. There are other differences, too (see FAQ 2.17).
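To make the difference in `j` concrete, here is a minimal sketch. It is not taken from the FAQ; the table and its columns `x` and `g` are made up for illustration, and the grouped call with `by` is just one example of the "more complicated syntax" that this design allows.

```r
library(data.table)

# Hypothetical example data (not from the FAQ)
DF <- data.frame(x = c(1, 2, 3), g = c("a", "b", "a"))
DT <- as.data.table(DF)

# In a data.table, j is an expression evaluated inside the table,
# so column names can be used like variables.
total_dt <- DT[, sum(x)]

# The same mechanism lets j be computed per group with `by`.
by_group <- DT[, list(total = sum(x)), by = g]

# In a data.frame, j only selects columns; any computation happens outside `[`.
total_df <- sum(DF[, "x"])

print(total_dt)
print(by_group)
```

Writing `DF[, sum(x)]` on the data.frame would instead look for an object named `x` outside the table, which is why the two `[` methods cannot share one definition of `j`.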
Furthermore, data.table inherits from data.frame. It is a data.frame, too. A data.table can be passed to any package that only accepts data.frame, and that package can use `[.data.frame` syntax on the data.table.
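The inheritance can be checked from the R prompt. The following is a small sketch with made-up data; `lm()` from the base `stats` package stands in for "a package that only accepts data.frame".

```r
library(data.table)

# Hypothetical example data
DT <- data.table(id = 1:3, value = c(10, 20, 30))

# The class vector lists data.table first, then data.frame.
cls <- class(DT)
print(cls)                 # "data.table" "data.frame"
print(is.data.frame(DT))   # TRUE

# lm() knows nothing about data.table, but accepts DT as its `data` argument.
fit <- lm(value ~ id, data = DT)
print(coef(fit))
```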
We have proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0:
`unique()` and `match()` are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements to the way the hash code is generated in unique.c.
A second proposal was to use `memcpy` in duplicate.c, which is much faster than a for loop in C. This would improve the way that R copies data internally (on some measures by 13 times). The thread on r-devel is here: http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html
2.17 What are the smaller syntax differences between data.frame and data.table?
- `DT[3]` refers to the 3rd row, but `DF[3]` refers to the 3rd column
- `DT[3,] == DT[3]`, but `DF[,3] == DF[3]` (somewhat confusingly)
- For this reason we say the comma is optional in DT, but not optional in DF
- `DT[[3]] == DF[3] == DF[[3]]`
- `DT[i,]` where i is a single integer returns a single row, just like `DF[i,]`, but unlike a matrix single-row subset, which returns a vector.
- `DT[,j,with=FALSE]` where j is a single integer returns a one-column data.table, unlike `DF[,j]`, which returns a vector by default
- `DT[,"colA",with=FALSE][[1]] == DF[,"colA"]`
- `DT[,colA] == DF[,"colA"]`
- `DT[,list(colA)] == DF[,"colA",drop=FALSE]`
- `DT[NA]` returns 1 row of NA, but `DF[NA]` returns a copy of DF containing NA throughout.
- The symbol `NA` is type logical in R, and is therefore recycled by `[.data.frame`. Intention was probably `DF[NA_integer_]`. `[.data.table` does this automatically for convenience.
- `DT[c(TRUE,NA,FALSE)]` treats the NA as FALSE, but `DF[c(TRUE,NA,FALSE)]` returns NA rows for each NA
- `DT[ColA==ColB]` is simpler than `DF[!is.na(ColA) & !is.na(ColB) & ColA==ColB,]`
- `data.frame(list(1:2,"k",1:4))` creates 3 columns, data.table creates one list column.
- `check.names` is by default TRUE in data.frame but FALSE in data.table, for convenience.
- `stringsAsFactors` is by default TRUE in data.frame (in R versions before 4.0.0; the data.frame default has been FALSE since then) but FALSE in data.table, for efficiency.
- Since a global string cache was added to R, character items are pointers to the single cached string, and there is no longer a performance benefit in converting to factor.
- Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.
- In `[.data.frame` we very often set `drop=FALSE`. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single-column data.frame. In `[.data.table` we took the opportunity to make it consistent and drop `drop`.
- When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
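Several of the differences listed above can be checked directly. The sketch below is an illustration rather than part of the FAQ: the tables and the column names `colA`, `colB` and `colC` are made up, and on the data.frame side the row selections are written with an explicit comma (`DF[keep, ]`), since the comma is not optional there.

```r
library(data.table)

# Hypothetical example data (not from the FAQ)
DF <- data.frame(colA = c(1, NA, 3), colB = c(1, 2, 4), colC = c("x", "y", "z"))
DT <- as.data.table(DF)

# A single index means a row in data.table but a column in data.frame.
row3 <- DT[3]   # one-row data.table
col3 <- DF[3]   # one-column data.frame holding colC

# Three ways to get at a column of a data.table.
vecA    <- DT[, colA]                    # plain vector
oneCol  <- DT[, list(colA)]              # one-column data.table
oneCol2 <- DT[, "colA", with = FALSE]    # one-column data.table

# NA in a logical i: treated as FALSE by data.table, kept as an NA row by data.frame.
keep    <- c(TRUE, NA, FALSE)
dt_rows <- DT[keep]      # 1 row
df_rows <- DF[keep, ]    # 2 rows, the second one all NA

# A comparison that yields NA: data.table does not return that row.
eq_dt <- DT[colA == colB]
eq_df <- DF[!is.na(DF$colA) & !is.na(DF$colB) & DF$colA == DF$colB, ]

# check.names defaults: TRUE for data.frame, FALSE for data.table.
nm_df <- names(data.frame(`a b` = 1))   # "a.b"
nm_dt <- names(data.table(`a b` = 1))   # "a b"
```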
Small caveat
There may be cases where a package uses code that fails when given a data.table. However, data.table is constantly maintained to avoid such problems, so any that do arise should be fixed promptly.
For example, see this question and prompt response.
From the NEWS for v 1.8.2:
- base::unname(DT) now works again, as needed by plyr::melt(). Thanks to Christoph Jaeckel for reporting. Test added.
- An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2 without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added. ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2 doesn't invoke ITime's as.character method. Converting ITime to POSIXct for ggplot2 is one approach.
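One way to do the ITime to POSIXct conversion mentioned in that NEWS item is sketched below. It is only an illustration under stated assumptions: the times, the anchor date 2012-01-01 and the UTC time zone are arbitrary, and the conversion goes through the integer seconds since midnight that an ITime stores.

```r
library(data.table)

# Hypothetical times of day
times <- as.ITime(c("09:00:00", "12:30:00"))

# An ITime is stored as integer seconds since midnight.
secs <- as.integer(times)

# Attach the times to an arbitrary date (an assumption made for plotting only).
day_start <- as.POSIXct("2012-01-01 00:00:00", tz = "UTC")
stamps <- day_start + secs

print(format(stamps, "%H:%M:%S"))
```

The resulting `stamps` vector is an ordinary POSIXct column, which plotting packages can label as date-times.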