table with data (its a data.table object) that looks like the following :
带数据的表(它是一个数据。),如下所示:
date stock_id logret
1: 2011-01-01 1 0.001
2: 2011-01-02 1 0.003
3: 2011-01-03 1 0.005
4: 2011-01-04 1 0.007
5: 2011-01-05 1 0.009
6: 2011-01-06 1 0.011
7: 2011-01-01 2 0.013
8: 2011-01-02 2 0.015
9: 2011-01-03 2 0.017
10: 2011-01-04 2 0.019
11: 2011-01-05 2 0.021
12: 2011-01-06 2 0.023
13: 2011-01-01 3 0.025
14: 2011-01-02 3 0.027
15: 2011-01-03 3 0.029
16: 2011-01-04 3 0.031
17: 2011-01-05 3 0.033
18: 2011-01-06 3 0.035
The above can be created as :
可以将上述内容创建为:
DT = data.table(
date=rep(as.Date('2011-01-01')+0:5,3) ,
stock_id=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3),
logret=seq(0.001, by=0.002, len=18));
setkeyv(DT,c('stock_id','date'))
Of course the real table is larger with many more stock_ids and dates. The aim to to reshape this data table such that I can run a regression of all stockid log_returns with their corresponding log_returns with a lag of 1 day (or prior traded day in case of weekends).
当然,实际的表更大,包含更多的stock_id和日期。目的是重塑这个数据表,以便我可以运行一个对所有stockid log_returns的回归,其相应的log_returns的滞后时间为1天(如果是周末的话,则是提前交易的一天)。
The final results would look like :
最终结果如下:
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
....
16: 2011-01-04 3 0.031 0.029
17: 2011-01-05 3 0.033 0.031
18: 2011-01-06 3 0.035 0.033
I'm finding this data structure really tricky to build without mixing up my stockid.
我发现这种数据结构在构建时非常棘手,而且不会混淆我的stockid。
3 个解决方案
#1
21
Just some additional notes due to Alex's comment. The reason you have difficulties understanding what's going on here is that a lot of things are done within one line. So it's always a good idea to break things down.
由于艾利克斯的评论,我只做了一些补充说明。你很难理解这里发生了什么,因为很多事情都是在一行内完成的。所以把事情分解是一个好主意。
What do we actually want? We want a new column lagret
and the syntax to add a new column in data.table is the following:
我们到底想要什么?我们需要一个新的列lagret和在数据中添加新列的语法。表如下:
DT[, lagret := xxx]
where xxx
has to be filled up with whatever you want to have in column lagret
. So if we just want a new column that gives us the rows, we could just call
xxx必须填满你想要的东西在列lagret。如果我们想要一个新的列给我们行,我们可以调用
DT[, lagret := seq(from=1, to=nrow(DT))]
Here, we actually want the lagged value of logret
, but we have to consider that there are many stocks in here. That's why we do a self-join, i.e. we join the data.table DT
with itself by the columns stock_id
and date
, but since we want the previous value of each stock, we use date-1
. Note that we have to set the keys first to do such a join:
这里,我们需要的是logret的滞后值,但我们需要考虑这里有很多股票。这就是为什么我们进行自连接,也就是我们加入数据。表DT由stock_id和date列组成,但是由于我们想要每个股票的前一个值,所以我们使用date-1。请注意,我们必须先设置键来完成这样的连接:
setkeyv(DT,c('stock_id','date'))
DT[list(stock_id,date-1)]
stock_id date logret
1: 1 2010-12-31 NA
2: 1 2011-01-01 0.001
3: 1 2011-01-02 0.003
4: 1 2011-01-03 0.005
5: 1 2011-01-04 0.007
6: 1 2011-01-05 0.009
...
As you can see, we now have what we want. logret
is now lagged by one period. But we actually want that in a new column lagret
in DT
, so we just get that column by calling [[3L]] (this means nothing else then get me the third column) and name this new column lagret
:
正如你所看到的,我们现在有了我们想要的。logret现在落后了一个时期。但是我们实际上想要在一个新的列lagret in DT中,所以我们只要调用[[[3L]]就能得到这个列(这意味着什么都没有,然后我得到第三列)并命名这个新的列lagret:
DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-04 1 0.007 0.005
5: 2011-01-05 1 0.009 0.007
...
This is already the correct solution. In this simple case, we do not need roll=TRUE
because there are no gaps in the dates. However, in a more realistic example (as mentioned above, for instance when we have weekends), there might be gaps. So let's make such a realistic example by just deleting two days in the DT
for the first stock:
这已经是正确的解决方案了。在这个简单的例子中,我们不需要roll=TRUE,因为日期中没有间隔。然而,在一个更现实的示例中(如上面提到的,例如当我们有周末时),可能会出现空白。我们来做一个现实的例子,在DT中删除第一支股票的两天:
DT <- DT[-c(4, 5)]
setkeyv(DT,c('stock_id','date'))
DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-06 1 0.011 NA
5: 2011-01-01 2 0.013 NA
...
As you can see, the problem is now that we don't have a value for the 6th of January. That's why we use roll=TRUE
:
正如你所看到的,问题是现在我们没有1月6日的价值。这就是为什么我们使用roll=TRUE:
DT[,lagret:=DT[list(stock_id,date-1),logret,roll=TRUE][[3L]]]
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-06 1 0.011 0.005
5: 2011-01-01 2 0.013 NA
...
Just have a look on the documentation on how roll=TRUE
works exactly. In a nutshell: If it can't find the previous value (here logret
for the 5th of January), it just takes the last available one (here from the 3rd of January).
看看关于roll=TRUE如何准确工作的文档。简单地说:如果它找不到以前的值(这里是1月5日的logret),它只需要最后一个可用的值(从1月3日开始)。
#2
4
Update:
In the current development version of data.table, v1.9.5, shift()
is implemented #965, which takes two types at the moment type = "lag"
(default) and type = "lead"
. See ?shift
for more on usage.
在当前开发版本的数据中。表v1.9.5、shift()实现了#965,其中包含两种类型:type = "lag"(默认)和type = "lead"。看到了吗?
With this, we can simply do:
有了这个,我们可以做的很简单:
# type="lag" may be omitted, as it is the default.
require(data.table) ## 1.9.5+
DT[, lagret := shift(logret, 1L, type="lag"), by=stock_id]
# date stock_id logret lagret
# 1: 2011-01-01 1 0.001 NA
# 2: 2011-01-02 1 0.003 0.001
# 3: 2011-01-03 1 0.005 0.003
# 4: 2011-01-04 1 0.007 0.005
# 5: 2011-01-05 1 0.009 0.007
# 6: 2011-01-06 1 0.011 0.009
# 7: 2011-01-01 2 0.013 NA
# 8: 2011-01-02 2 0.015 0.013
# 9: 2011-01-03 2 0.017 0.015
# 10: 2011-01-04 2 0.019 0.017
# 11: 2011-01-05 2 0.021 0.019
# 12: 2011-01-06 2 0.023 0.021
# 13: 2011-01-01 3 0.025 NA
# 14: 2011-01-02 3 0.027 0.025
# 15: 2011-01-03 3 0.029 0.027
# 16: 2011-01-04 3 0.031 0.029
# 17: 2011-01-05 3 0.033 0.031
# 18: 2011-01-06 3 0.035 0.033
#3
2
Thanks to Matthew Dowle's advice, I was able to use the following :
多亏了马修·道尔的建议,我才得以使用以下内容:
DT[,lagret:=DT[list(stock_id,date-1),logret,roll=TRUE][[3L]]]
Results are :
结果是:
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-04 1 0.007 0.005
5: 2011-01-05 1 0.009 0.007
6: 2011-01-06 1 0.011 0.009
7: 2011-01-01 2 0.013 NA
8: 2011-01-02 2 0.015 0.013
9: 2011-01-03 2 0.017 0.015
10: 2011-01-04 2 0.019 0.017
11: 2011-01-05 2 0.021 0.019
12: 2011-01-06 2 0.023 0.021
13: 2011-01-01 3 0.025 NA
14: 2011-01-02 3 0.027 0.025
15: 2011-01-03 3 0.029 0.027
16: 2011-01-04 3 0.031 0.029
17: 2011-01-05 3 0.033 0.031
18: 2011-01-06 3 0.035 0.033
#1
21
Just some additional notes due to Alex's comment. The reason you have difficulties understanding what's going on here is that a lot of things are done within one line. So it's always a good idea to break things down.
由于艾利克斯的评论,我只做了一些补充说明。你很难理解这里发生了什么,因为很多事情都是在一行内完成的。所以把事情分解是一个好主意。
What do we actually want? We want a new column lagret
and the syntax to add a new column in data.table is the following:
我们到底想要什么?我们需要一个新的列lagret和在数据中添加新列的语法。表如下:
DT[, lagret := xxx]
where xxx
has to be filled up with whatever you want to have in column lagret
. So if we just want a new column that gives us the rows, we could just call
xxx必须填满你想要的东西在列lagret。如果我们想要一个新的列给我们行,我们可以调用
DT[, lagret := seq(from=1, to=nrow(DT))]
Here, we actually want the lagged value of logret
, but we have to consider that there are many stocks in here. That's why we do a self-join, i.e. we join the data.table DT
with itself by the columns stock_id
and date
, but since we want the previous value of each stock, we use date-1
. Note that we have to set the keys first to do such a join:
这里,我们需要的是logret的滞后值,但我们需要考虑这里有很多股票。这就是为什么我们进行自连接,也就是我们加入数据。表DT由stock_id和date列组成,但是由于我们想要每个股票的前一个值,所以我们使用date-1。请注意,我们必须先设置键来完成这样的连接:
setkeyv(DT,c('stock_id','date'))
DT[list(stock_id,date-1)]
stock_id date logret
1: 1 2010-12-31 NA
2: 1 2011-01-01 0.001
3: 1 2011-01-02 0.003
4: 1 2011-01-03 0.005
5: 1 2011-01-04 0.007
6: 1 2011-01-05 0.009
...
As you can see, we now have what we want. logret
is now lagged by one period. But we actually want that in a new column lagret
in DT
, so we just get that column by calling [[3L]] (this means nothing else then get me the third column) and name this new column lagret
:
正如你所看到的,我们现在有了我们想要的。logret现在落后了一个时期。但是我们实际上想要在一个新的列lagret in DT中,所以我们只要调用[[[3L]]就能得到这个列(这意味着什么都没有,然后我得到第三列)并命名这个新的列lagret:
DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-04 1 0.007 0.005
5: 2011-01-05 1 0.009 0.007
...
This is already the correct solution. In this simple case, we do not need roll=TRUE
because there are no gaps in the dates. However, in a more realistic example (as mentioned above, for instance when we have weekends), there might be gaps. So let's make such a realistic example by just deleting two days in the DT
for the first stock:
这已经是正确的解决方案了。在这个简单的例子中,我们不需要roll=TRUE,因为日期中没有间隔。然而,在一个更现实的示例中(如上面提到的,例如当我们有周末时),可能会出现空白。我们来做一个现实的例子,在DT中删除第一支股票的两天:
DT <- DT[-c(4, 5)]
setkeyv(DT,c('stock_id','date'))
DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-06 1 0.011 NA
5: 2011-01-01 2 0.013 NA
...
As you can see, the problem is now that we don't have a value for the 6th of January. That's why we use roll=TRUE
:
正如你所看到的,问题是现在我们没有1月6日的价值。这就是为什么我们使用roll=TRUE:
DT[,lagret:=DT[list(stock_id,date-1),logret,roll=TRUE][[3L]]]
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-06 1 0.011 0.005
5: 2011-01-01 2 0.013 NA
...
Just have a look on the documentation on how roll=TRUE
works exactly. In a nutshell: If it can't find the previous value (here logret
for the 5th of January), it just takes the last available one (here from the 3rd of January).
看看关于roll=TRUE如何准确工作的文档。简单地说:如果它找不到以前的值(这里是1月5日的logret),它只需要最后一个可用的值(从1月3日开始)。
#2
4
Update:
In the current development version of data.table, v1.9.5, shift()
is implemented #965, which takes two types at the moment type = "lag"
(default) and type = "lead"
. See ?shift
for more on usage.
在当前开发版本的数据中。表v1.9.5、shift()实现了#965,其中包含两种类型:type = "lag"(默认)和type = "lead"。看到了吗?
With this, we can simply do:
有了这个,我们可以做的很简单:
# type="lag" may be omitted, as it is the default.
require(data.table) ## 1.9.5+
DT[, lagret := shift(logret, 1L, type="lag"), by=stock_id]
# date stock_id logret lagret
# 1: 2011-01-01 1 0.001 NA
# 2: 2011-01-02 1 0.003 0.001
# 3: 2011-01-03 1 0.005 0.003
# 4: 2011-01-04 1 0.007 0.005
# 5: 2011-01-05 1 0.009 0.007
# 6: 2011-01-06 1 0.011 0.009
# 7: 2011-01-01 2 0.013 NA
# 8: 2011-01-02 2 0.015 0.013
# 9: 2011-01-03 2 0.017 0.015
# 10: 2011-01-04 2 0.019 0.017
# 11: 2011-01-05 2 0.021 0.019
# 12: 2011-01-06 2 0.023 0.021
# 13: 2011-01-01 3 0.025 NA
# 14: 2011-01-02 3 0.027 0.025
# 15: 2011-01-03 3 0.029 0.027
# 16: 2011-01-04 3 0.031 0.029
# 17: 2011-01-05 3 0.033 0.031
# 18: 2011-01-06 3 0.035 0.033
#3
2
Thanks to Matthew Dowle's advice, I was able to use the following :
多亏了马修·道尔的建议,我才得以使用以下内容:
DT[,lagret:=DT[list(stock_id,date-1),logret,roll=TRUE][[3L]]]
Results are :
结果是:
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-04 1 0.007 0.005
5: 2011-01-05 1 0.009 0.007
6: 2011-01-06 1 0.011 0.009
7: 2011-01-01 2 0.013 NA
8: 2011-01-02 2 0.015 0.013
9: 2011-01-03 2 0.017 0.015
10: 2011-01-04 2 0.019 0.017
11: 2011-01-05 2 0.021 0.019
12: 2011-01-06 2 0.023 0.021
13: 2011-01-01 3 0.025 NA
14: 2011-01-02 3 0.027 0.025
15: 2011-01-03 3 0.029 0.027
16: 2011-01-04 3 0.031 0.029
17: 2011-01-05 3 0.033 0.031
18: 2011-01-06 3 0.035 0.033