如何最好地将data.table的一列与同一data.table的另一列连接?

时间:2022-11-08 01:13:35

My data

I have a data.table DT with the current (F0YR) and the next (F1YR) fiscal year-end (FYE) encoded as integers. Since every next FYE will eventually become a current FYE, the integer will be both in the column F1YR and F0YR. Also, my data contains monthly observations so the same FYE will be in the data set multiple times:

我有一个data.table DT,当前(F0YR)和下一个(F1YR)会计年度末(FYE)编码为整数。由于每个下一个FYE最终将成为当前FYE,因此整数将在F1YR和F0YR列中。此外,我的数据包含每月观察结果,因此相同的FYE将多次出现在数据集中:

library(data.table)
DT <- data.table(ID     = rep(c("A", "B"), each=9),
                 MONTH  = rep(100L:108L, times=2),
                 F0YR   = rep(c(1L, 4L, 7L), each=3, times=2),
                 F1YR   = rep(c(4L, 7L, 9L), each=3, times=2),
                 value  = c(rep(1:5, each=3), 6, 6, 7),
                 key    = "ID,F0YR")
DT
      ID MONTH F0YR F1YR value
 [1,]  A   100    1    4     1
 [2,]  A   101    1    4     1
 [3,]  A   102    1    4     1
 [4,]  A   103    4    7     2
 [5,]  A   104    4    7     2
 [6,]  A   105    4    7     2
 [7,]  A   106    7    9     3
 [8,]  A   107    7    9     3
 [9,]  A   108    7    9     3
[10,]  B   100    1    4     4
[11,]  B   101    1    4     4
...

What I want to do

For every ID and F1YR combination, I want to get the value for the ID and F0YR combination. As an example: Company A had a value of 2 for FOYR==4. Now, I want an additional column for all combinations with ID=="A" and F1YR==4 which is set to 2, next to the already existent value of 1.

对于每个ID和F1YR组合,我想获得ID和F0YR组合的值。例如:对于FOYR == 4,公司A的值为2。现在,我希望为ID ==“A”和F1YR == 4的所有组合添加一个附加列,该列设置为2,在已存在的值1旁边。

What I tried

intDT <- DT[CJ(unique(ID), unique(F0YR)), list(ID, F0YR, valueNew = value), mult="last"]
setkey(intDT, ID, F0YR)
setkey(DT, ID, F1YR)
DT <- intDT[DT]
setnames(DT, c("F0YR.1", "F0YR"), c("F0YR", "F1YR"))
DT
      ID F1YR valueNew MONTH F0YR value
 [1,]  A    4        2   100    1     1
 [2,]  A    4        2   101    1     1
 [3,]  A    4        2   102    1     1
 [4,]  A    7        3   103    4     2
 [5,]  A    7        3   104    4     2
 [6,]  A    7        3   105    4     2
 [7,]  A    9       NA   106    7     3
 [8,]  A    9       NA   107    7     3
 [9,]  A    9       NA   108    7     3
[10,]  B    4        5   100    1     4
[11,]  B    4        5   101    1     4
...

(Note that I use mult="last" here because, although the values should only change with F0YR or F1YR changes, sometimes they don't and this is just my tie breaker).

(注意,我在这里使用mult =“last”,因为虽然值只会随着F0YR或F1YR的变化而改变,但有时它们不会改变,这只是我的打破)。

What I want

This looks improvable. First of all, I have to make a copy of my DT. Second, since I join basically the same data.table, all the column names have the same name and I have to rename them. I thought that a self join would be the way forward, but I tried and tried and couldn't get a nice solution. I have the hope that there is something easy out there which I just don't see...Does anyone have a clue? Or is my data set up in such a way that it is actually hard (maybe because I have monthly observations, but want to join only quarterly or yearly changing values).

这看起来很容易。首先,我必须复制我的DT。其次,因为我基本上加入了相同的data.table,所有列名都有相同的名称,我必须重命名它们。我认为自我加入将是前进的方向,但我尝试并尝试过,但无法得到一个好的解决方案。我希望有一些简单的东西,我只是看不到......有没有人有线索?或者我的数据设置方式实际上很难(可能是因为我有月度观察,但只想加入季度或年度变化值)。

1 个解决方案

#1


6  

In use cases like this, the mantra "aggregate first, then join with that" often helps. So, starting with your DT, and using v1.8.1 :

在像这样的用例中,口头禅“首先聚合,然后加入”通常会有所帮助。因此,从您的DT开始,并使用v1.8.1:

> agg = DT[,last(value),by=list(ID,F0YR)]
> agg
   ID F0YR V1
1:  A    1  1
2:  A    4  2
3:  A    7  3
4:  B    1  4
5:  B    4  5
6:  B    7  7

I called it agg because I couldn't think of a better name. In this case you wanted last which isn't really an aggregate as such, but you know what I mean.

我称之为agg,因为我想不出一个更好的名字。在这种情况下,你想要的最后一个并不是真正的聚合,但你知道我的意思。

Then update DT by reference by group. Here we're grouping by i.

然后按组引用更新DT。我们在这里分组。

setkey(DT,ID,F1YR)
DT[agg,newcol:=V1]
    ID MONTH F0YR F1YR value newcol
 1:  A   100    1    4     1      2
 2:  A   101    1    4     1      2
 3:  A   102    1    4     1      2
 4:  A   103    4    7     2      3
 5:  A   104    4    7     2      3
 6:  A   105    4    7     2      3
 7:  A   106    7    9     3     NA
 8:  A   107    7    9     3     NA
 9:  A   108    7    9     3     NA
10:  B   100    1    4     4      5
11:  B   101    1    4     4      5
12:  B   102    1    4     4      5
13:  B   103    4    7     5      7
14:  B   104    4    7     5      7
15:  B   105    4    7     5      7
16:  B   106    7    9     6     NA
17:  B   107    7    9     6     NA
18:  B   108    7    9     7     NA

Is that right? Not sure I fully followed. Those ops should be very fast, without any copies, and should scale to large data. At least, that's the intention.

是对的吗?不确定我是否完全遵循。那些操作应该非常快,没有任何副本,并且应该扩展到大数据。至少,这是意图。

#1


6  

In use cases like this, the mantra "aggregate first, then join with that" often helps. So, starting with your DT, and using v1.8.1 :

在像这样的用例中,口头禅“首先聚合,然后加入”通常会有所帮助。因此,从您的DT开始,并使用v1.8.1:

> agg = DT[,last(value),by=list(ID,F0YR)]
> agg
   ID F0YR V1
1:  A    1  1
2:  A    4  2
3:  A    7  3
4:  B    1  4
5:  B    4  5
6:  B    7  7

I called it agg because I couldn't think of a better name. In this case you wanted last which isn't really an aggregate as such, but you know what I mean.

我称之为agg,因为我想不出一个更好的名字。在这种情况下,你想要的最后一个并不是真正的聚合,但你知道我的意思。

Then update DT by reference by group. Here we're grouping by i.

然后按组引用更新DT。我们在这里分组。

setkey(DT,ID,F1YR)
DT[agg,newcol:=V1]
    ID MONTH F0YR F1YR value newcol
 1:  A   100    1    4     1      2
 2:  A   101    1    4     1      2
 3:  A   102    1    4     1      2
 4:  A   103    4    7     2      3
 5:  A   104    4    7     2      3
 6:  A   105    4    7     2      3
 7:  A   106    7    9     3     NA
 8:  A   107    7    9     3     NA
 9:  A   108    7    9     3     NA
10:  B   100    1    4     4      5
11:  B   101    1    4     4      5
12:  B   102    1    4     4      5
13:  B   103    4    7     5      7
14:  B   104    4    7     5      7
15:  B   105    4    7     5      7
16:  B   106    7    9     6     NA
17:  B   107    7    9     6     NA
18:  B   108    7    9     7     NA

Is that right? Not sure I fully followed. Those ops should be very fast, without any copies, and should scale to large data. At least, that's the intention.

是对的吗?不确定我是否完全遵循。那些操作应该非常快,没有任何副本,并且应该扩展到大数据。至少,这是意图。