UPDATE: This problem no longer applies to data.table versions 1.8.0 and higher. From the NEWS file:
character columns are now allowed in keys and are preferred to factor. data.table() and setkey() no longer coerce character to factor. Factors are still supported. Implements FR#1493, FR#1224 and (partially) FR#951.
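The NEWS entry can be checked with a small sketch like the one below, assuming data.table 1.8.0 or later is installed. The names `DT` and `lookup` and the sample values are made up for illustration and do not come from the original post.

```r
library(data.table)

# Build a table whose key columns are character, not factor
DT <- data.table(Region = c("US", "US", "EU"),
                 Cat    = c("A", "B", "A"),
                 value  = c(1.5, 2.5, 3.5))
setkey(DT, Region, Cat)

# setkey() should leave the character columns as character
col_classes <- sapply(DT, class)
print(col_classes)

# A keyed join against a table that also holds character columns
lookup <- data.table(Region = "US", Cat = "A")
res <- DT[lookup]
print(res)
```

If the key columns still come back as factor, the installed version is probably older than 1.8.0.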
Original question
I am trying to join two data.tables. However, whether the join succeeds depends on the classes of the columns I use to match the data.tables. More precisely, it seems that the columns must not be of class "character". I don't quite understand the reason, but I'm sure I'm missing something obvious here. Any help is really appreciated.
Here is an example:
#Objective: Select all rows from DT for which Region=="US", Year >= 5 & Year<=8, Cat="A"
library(data.table)
#Set-up data.table DT
DT <- data.table(Year=1:20, value=rnorm(20), Region=c(rep("US", 10), rep("EU", 10)), Cat=c(rep("A", 7), rep("B", 7), rep("C", 6)))
setkey(DT, Region, Cat, Year)
#Set-up data.table int_DT to join with DT
years <- 5:8
df <- data.frame(Region=c("US", "EU"), Categ=c("A", "B"))
int_DT <- J(cbind(df[1, ], years))
#Join them: Works like a charm!
DT[int_DT]
#Let's assume that for any reason the columns in df are of class "character"
df$Region <- as.character(df$Region)
df$Categ <- as.character(df$Categ)
#Rebuild int_DT
int_DT <- J(cbind(df[1, ], years))
DT[int_DT]
#Error in `[.data.table`(DT, int_DT) :
# unsorted column Region of i is not internally type integer.
#OK, maybe the problem is that the column classes in DT are factors, so change those:
DT[, Cat:=as.character(Cat)]
DT[, Region:=as.character(Region)]
DT[int_DT]
#Error in `[.data.table`(DT, int_DT) :
# When i is a data.table, x must be sorted to avoid a vector scan of x per row of i
It still doesn't work. Why? What is the restriction? What am I missing? Additional information: I'm using data.table 1.6.6 and R version 2.13.2 (2011-09-30) on Platform: x86_64-pc-linux-gnu (64-bit).
1 Answer
You don't need a join operation to get your desired results. You said: 'Objective: Select all rows from DT for which Region=="US", Year >= 5 & Year<=8, Cat="A"'
DT[Region=="US" & Year>=5 & Year <= 8 & Categ=="A"]
Year value Region Categ
[1,] 5 -0.18631697 US A
[2,] 6 1.40059083 US A
[3,] 7 0.01848557 US A
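For reference, the same filter can be written against the column names used in the question (`Cat` rather than `Categ`). This is only a sketch: it assumes a reasonably recent data.table in which `%between%` is available, and `set.seed()` is added here purely so the random values are reproducible.

```r
library(data.table)

set.seed(1)  # added for reproducibility, not in the original post
DT <- data.table(Year   = 1:20,
                 value  = rnorm(20),
                 Region = c(rep("US", 10), rep("EU", 10)),
                 Cat    = c(rep("A", 7), rep("B", 7), rep("C", 6)))

# Plain logical subsetting (a vector scan); no key or join is needed.
# %between% is inclusive on both ends by default.
sel <- DT[Region == "US" & Cat == "A" & Year %between% c(5, 8)]
print(sel)
```

Only years 5 to 7 come back, because in this data category "A" covers years 1 to 7 only.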
But to answer your question about column classes: I managed to get this code to work, which essentially mirrors your code above:
> setkey(DT, Region, Categ, Year)
> df <- data.frame(Region=c("US", "EU"), Categ=c("A", "B"))
> dt2 <- data.table(data.frame(df[1, ], Year=5:8))
Warning message:
In data.frame(df[1, ], Year = 5:8) :
row names were found from a short variable and have been discarded
> dt1[dt2]
Region Categ Year value
[1,] US A 5 -0.5565422
[2,] US A 6 -0.1805841
[3,] US A 7 1.4474403
[4,] US A 8 NA
The same, with columns of class character:
df$Region <- as.character(df$Region)
df$Categ <- as.character(df$Categ)
#Rebuild int_DT
dt2 <- J(cbind(df[1, ], Year=5:8))
Warning message:
In data.frame(..., check.names = FALSE) :
row names were found from a short variable and have been discarded
setkey(dt2, Region)
dt1[dt2]
Region Year value Categ Categ.1 Year.1
US 1 1.20152558 A A 5
US 2 1.89391079 A A 5
US 3 -1.76022634 A A 5
US 4 0.92454680 A A 5
US 5 -0.55654217 A A 5
...
snip
...
US 9 0.67936243 B A 8
US 10 -0.09355764 B A 8
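In this last output every US row is repeated for each row of dt2, which suggests that the join matched on Region alone once dt2 was keyed only on Region. In newer data.table versions the join columns can be named explicitly instead. The sketch below assumes a version that supports the `on=` argument (1.9.6 or later) and rebuilds the question's data with `set.seed()` added for reproducibility; `int_DT` is constructed directly with `data.table()` rather than with `J(cbind(...))`.

```r
library(data.table)

set.seed(1)  # added for reproducibility, not in the original post
DT <- data.table(Year   = 1:20,
                 value  = rnorm(20),
                 Region = c(rep("US", 10), rep("EU", 10)),
                 Cat    = c(rep("A", 7), rep("B", 7), rep("C", 6)))

# Character columns in the lookup data, as in the failing case
df <- data.frame(Region = c("US", "EU"), Categ = c("A", "B"),
                 stringsAsFactors = FALSE)

# One row per requested year; the length-1 values are recycled
int_DT <- data.table(Region = df$Region[1], Cat = df$Categ[1], Year = 5:8)

# Name the join columns explicitly so all three are matched
res <- DT[int_DT, on = c("Region", "Cat", "Year")]
print(res)
```

Year 8 has no match in category "A", so its value is NA, as in the first join output above.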