I have started using data.table for a large population model. So far, I have been impressed because using the data.table structure decreases my simulation run times by about 30%. I am trying to further optimize my code and have included a simplified example. My two questions are:
- Is it possible to use the := operator with this code?
- Would using the := operator be quicker (although, if I am able to answer my first question, I should be able to answer my second question myself)?
I am using R version 3.1.2 on a machine running Windows 7 with data.table version 1.9.4.
Here is my reproducible example:
library(data.table)
## Create example table and set initial conditions
nYears = 10
exampleTable = data.table(Site = paste("Site", 1:3))
exampleTable[ , growthRate := c(1.1, 1.2, 1.3), ]
exampleTable[ , c(paste("popYears", 0:nYears, sep = "")) := 0, ]
exampleTable[ , "popYears0" := c(10, 12, 13)] # set the initial population size
for(yearIndex in 0:(nYears - 1)){
  exampleTable[[paste("popYears", yearIndex + 1, sep = "")]] <-
    exampleTable[[paste("popYears", yearIndex, sep = "")]] *
    exampleTable[, growthRate]
}
I am trying to do something like:
for(yearIndex in 0:(nYears - 1)){
  exampleTable[ , paste("popYears", yearIndex + 1, sep = "") :=
                  paste("popYears", yearIndex, sep = "") * growthRate, ]
}
However, this does not work because the result of paste() is treated as a plain character string rather than as a column inside the data.table, for example:
exampleTable[ , paste("popYears", yearIndex + 1, sep = "")]
# [1] "popYears10"
I have looked through the data.table documentation. Section 2.9 of the FAQ uses cat, but this produces a null output.
exampleTable[ , cat(paste("popYears", yearIndex + 1, sep = ""))]
# [1] popYears10NULL
Also, I tried searching Google and rseek.org, but didn't find anything. If I am missing an obvious search term, I would appreciate a search tip. I have always found searching for R operators to be hard because search engines don't like symbols (e.g., ":=") and "R" can be vague.
2 solutions
#1
10
## Start with 1st three columns of example data
dt <- exampleTable[,1:3,with=FALSE]
## Run for 1st five years
nYears <- 5
for(ii in seq_len(nYears)-1) {
  y0 <- as.symbol(paste0("popYears", ii))
  y1 <- paste0("popYears", ii+1)
  dt[, (y1) := eval(y0)*growthRate]
}
## Check that it worked
dt
# Site growthRate popYears0 popYears1 popYears2 popYears3 popYears4 popYears5
#1: Site 1 1.1 10 11.0 12.10 13.310 14.6410 16.10510
#2: Site 2 1.2 12 14.4 17.28 20.736 24.8832 29.85984
#3: Site 3 1.3 13 16.9 21.97 28.561 37.1293 48.26809
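As a minimal sketch of an equivalent way to write this loop, the source column can be looked up by name with get() instead of building a symbol and calling eval(). The table dt2 and the names prevCol and nextCol are illustrative and not part of the original answer.

```r
library(data.table)

# Rebuild the starting table so the sketch is self-contained
dt2 <- data.table(Site = paste("Site", 1:3),
                  growthRate = c(1.1, 1.2, 1.3),
                  popYears0 = c(10, 12, 13))

nYears <- 5
for (ii in seq_len(nYears) - 1) {
  prevCol <- paste0("popYears", ii)      # name of the column to read
  nextCol <- paste0("popYears", ii + 1)  # name of the column to create
  # (nextCol) makes := use the string stored in nextCol as the column name;
  # get(prevCol) fetches the existing column whose name is stored in prevCol
  dt2[, (nextCol) := get(prevCol) * growthRate]
}

dt2
```

The parentheses on the left-hand side are what distinguish "a variable holding a column name" from "a column literally called nextCol".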
Edit:
Because the possibility of speeding this up using set() keeps coming up in the comments, I'll throw this additional option out there.
nYears <- 5
## Things that only need to be calculated once can be taken out of the loop
r <- dt[["growthRate"]]
yy <- paste0("popYears", seq_len(nYears+1)-1)
## A loop using set() and data.table's nice compact syntax
for(ii in seq_len(nYears)) {
  set(dt, , yy[ii+1], r*dt[[yy[ii]]])
}
## Check results
dt
# Site growthRate popYears0 popYears1 popYears2 popYears3 popYears4 popYears5
#1: Site 1 1.1 10 11.0 12.10 13.310 14.6410 16.10510
#2: Site 2 1.2 12 14.4 17.28 20.736 24.8832 29.85984
#3: Site 3 1.3 13 16.9 21.97 28.561 37.1293 48.26809
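Because each year only multiplies the previous year by a constant rate, the loop result can be checked against the closed form popYears0 * growthRate^t. The following is a rough sketch of such a check, assuming a constant growth rate per site; dt3, closedForm and loopResult are illustrative names.

```r
library(data.table)

# Self-contained copy of the starting table
dt3 <- data.table(Site = paste("Site", 1:3),
                  growthRate = c(1.1, 1.2, 1.3),
                  popYears0 = c(10, 12, 13))

nYears <- 5
r  <- dt3[["growthRate"]]
yy <- paste0("popYears", 0:nYears)

# Same set() loop as above, with the arguments named for readability
for (ii in seq_len(nYears)) {
  set(dt3, j = yy[ii + 1], value = r * dt3[[yy[ii]]])
}

# Closed form: one column per year t, one row per site
closedForm <- sapply(0:nYears, function(t) dt3[["popYears0"]] * r^t)

# Population columns from the loop as a plain matrix without dimnames
loopResult <- as.matrix(dt3[, yy, with = FALSE])
dimnames(loopResult) <- NULL

# Stops with an error if the two disagree beyond floating-point tolerance
stopifnot(isTRUE(all.equal(loopResult, closedForm)))
```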
#2
-1
Struggling with column names is a strong indicator that the wide format is probably not the best choice for the given problem. Therefore, I suggest doing the computations in long form and finally reshaping the result from long to wide format.
nYears = 10
params = data.table(Site = paste("Site", 1:3),
                    growthRate = c(1.1, 1.2, 1.3),
                    pop = c(10, 12, 13))
long <- params[CJ(Site = Site, Year = 0:nYears), on = "Site"][
  , growth := cumprod(shift(growthRate, fill = 1)), by = Site][
  , pop := pop * growth][]
dcast(long, Site + growthRate ~ sprintf("popYears%02i", Year), value.var = "pop")
     Site growthRate popYears 0 popYears 1 popYears 2 popYears 3 popYears 4 popYears 5 popYears 6 popYears 7 popYears 8 popYears 9 popYears10
1: Site 1        1.1         10       11.0      12.10     13.310    14.6410   16.10510   17.71561   19.48717   21.43589   23.57948   25.93742
2: Site 2        1.2         12       14.4      17.28     20.736    24.8832   29.85984   35.83181   42.99817   51.59780   61.91736   74.30084
3: Site 3        1.3         13       16.9      21.97     28.561    37.1293   48.26809   62.74852   81.57307  106.04499  137.85849  179.21604
Explanation
First, the parameters are expanded to cover 11 years (including year 0) using the cross join function CJ() and a subsequent right join on Site:
params[CJ(Site = Site, Year = 0:nYears), on = "Site"]
      Site growthRate pop Year
 1: Site 1        1.1  10    0
 2: Site 1        1.1  10    1
 3: Site 1        1.1  10    2
 4: Site 1        1.1  10    3
 5: Site 1        1.1  10    4
 6: Site 1        1.1  10    5
 7: Site 1        1.1  10    6
 8: Site 1        1.1  10    7
 9: Site 1        1.1  10    8
10: Site 1        1.1  10    9
11: Site 1        1.1  10   10
12: Site 2        1.2  12    0
13: Site 2        1.2  12    1
14: Site 2        1.2  12    2
15: Site 2        1.2  12    3
16: Site 2        1.2  12    4
17: Site 2        1.2  12    5
18: Site 2        1.2  12    6
19: Site 2        1.2  12    7
20: Site 2        1.2  12    8
21: Site 2        1.2  12    9
22: Site 2        1.2  12   10
23: Site 3        1.3  13    0
24: Site 3        1.3  13    1
25: Site 3        1.3  13    2
26: Site 3        1.3  13    3
27: Site 3        1.3  13    4
28: Site 3        1.3  13    5
29: Site 3        1.3  13    6
30: Site 3        1.3  13    7
31: Site 3        1.3  13    8
32: Site 3        1.3  13    9
33: Site 3        1.3  13   10
      Site growthRate pop Year
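To see the expansion step on its own, here is a small sketch with two sites and three years; smallParams and grid are illustrative names, and the sizes are chosen only to keep the result short.

```r
library(data.table)

smallParams <- data.table(Site = c("Site 1", "Site 2"),
                          growthRate = c(1.1, 1.2),
                          pop = c(10, 12))

# CJ() builds every combination of its arguments: 2 sites x 3 years = 6 rows
grid <- CJ(Site = smallParams$Site, Year = 0:2)

# Joining on Site keeps every row of grid and copies that site's
# parameters onto each of its years
expanded <- smallParams[grid, on = "Site"]
expanded
```

Each site's parameters are repeated once per year, which is the shape the grouped cumprod() step then works on.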
Then the growth is computed from the shifted growth rates using the cumulative product function cumprod() separately for each Site. The shift is required to skip the initial year for each Site. Then the population is computed by multiplying by the initial population.
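The effect of the shift is easiest to see on a single site. This is a minimal sketch for one site with rate 1.1 and initial population 10 over years 0 to 3; the variable names are illustrative.

```r
library(data.table)

rate <- rep(1.1, 4)               # the site's growth rate repeated for years 0..3

# Lag by one position and fill the gap with 1, so year 0 gets a neutral factor
shifted <- shift(rate, fill = 1)  # 1.0 1.1 1.1 1.1

# Running product gives the cumulative growth factor for each year
growth <- cumprod(shifted)        # 1.000 1.100 1.210 1.331

# Scale by the initial population
pop <- 10 * growth                # 10.00 11.00 12.10 13.31
pop
```

Without the shift, cumprod(rate) would start at 1.1, so year 0 would already show one year of growth.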
Finally, the data.table is reshaped from long to wide format using dcast(). The column headers are created on-the-fly using sprintf() to ensure the correct order of columns.
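The reason the padding matters can be sketched with plain character sorting: unpadded year labels sort as text, so year 10 lands before year 2. The vectors below are illustrative, and method = "radix" is used only to make the ordering independent of the locale.

```r
years <- c(0L, 2L, 10L)

plain  <- paste0("popYears", years)        # "popYears0"  "popYears2"  "popYears10"
padded <- sprintf("popYears%02i", years)   # "popYears00" "popYears02" "popYears10"

# Text sorting compares character by character, so "10" sorts before "2"
sortedPlain  <- sort(plain,  method = "radix")
# Zero-padded labels have equal width, so text order matches numeric order
sortedPadded <- sort(padded, method = "radix")

sortedPlain
sortedPadded
```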