Using := with paste() in data.table

Time: 2021-08-09 20:10:48

I have started using data.table for a large population model. So far, I have been impressed because using the data.table structure decreases my simulation run times by about 30%. I am trying to further optimize my code and have included a simplified example. My two questions are:

  1. Is it possible to use the := operator with this code?
  2. Would using the := operator be quicker (although, if I can answer my first question, I should be able to answer my second one myself)?

I am using R version 3.1.2 on a machine running Windows 7 with data.table version 1.9.4.

Here is my reproducible example:

library(data.table)

## Create  example table and set initial conditions
nYears = 10
exampleTable = data.table(Site = paste("Site", 1:3))
exampleTable[ , growthRate := c(1.1, 1.2, 1.3), ]
exampleTable[ , c(paste("popYears", 0:nYears, sep = "")) := 0, ]

exampleTable[ , "popYears0" := c(10, 12, 13)] # set the initial population size

for(yearIndex in 0:(nYears - 1)){
    exampleTable[[paste("popYears", yearIndex + 1, sep = "")]] <- 
    exampleTable[[paste("popYears", yearIndex, sep = "")]] * 
    exampleTable[, growthRate]
}

I am trying to do something like:

for(yearIndex in 0:(nYears - 1)){
    exampleTable[ , paste("popYears", yearIndex + 1, sep = "") := 
    paste("popYears", yearIndex, sep = "") * growthRate, ] 
}

However, this does not work because the paste does not work with the data.table, for example:

exampleTable[ , paste("popYears", yearIndex + 1, sep = "")]
# [1] "popYears10"

I have looked through the data.table documentation. Section 2.9 of the FAQ uses cat, but this produces a null output.

exampleTable[ , cat(paste("popYears", yearIndex + 1, sep = ""))]
# [1] popYears10NULL

Also, I tried searching Google and rseek.org, but didn't find anything. If I am missing an obvious search term, I would appreciate a search tip. I have always found searching for R operators hard because search engines don't handle symbols well (e.g., ":=") and "R" is an ambiguous search term.

2 answers

#1


10  

## Start with 1st three columns of example data
dt <- exampleTable[,1:3,with=FALSE]

## Run for 1st five years
nYears <- 5
for(ii in seq_len(nYears)-1) {
    y0 <- as.symbol(paste0("popYears", ii))
    y1 <- paste0("popYears", ii+1)
    dt[, (y1) := eval(y0)*growthRate]
}

## Check that it worked
dt
#     Site growthRate popYears0 popYears1 popYears2 popYears3 popYears4 popYears5
#1: Site 1        1.1        10      11.0     12.10    13.310   14.6410  16.10510
#2: Site 2        1.2        12      14.4     17.28    20.736   24.8832  29.85984
#3: Site 3        1.3        13      16.9     21.97    28.561   37.1293  48.26809
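
The loop above relies on two things: putting the left-hand side in parentheses, `(y1) :=`, so that data.table uses the string stored in y1 as the column name, and turning the right-hand name into a symbol so that it is evaluated as a column. As a minimal sketch, `get()` can stand in for `eval(as.symbol())` when looking a column up by name. This is a swapped-in variant rather than the code from the answer, and the table `dt2` is a made-up stand-in holding the same values as the question's example:

```r
library(data.table)

# Hypothetical stand-in for the example data (same values as in the question)
dt2 <- data.table(Site = paste("Site", 1:3),
                  growthRate = c(1.1, 1.2, 1.3),
                  popYears0 = c(10, 12, 13))

nYears <- 5
for (ii in seq_len(nYears) - 1) {
    y0 <- paste0("popYears", ii)       # name of the existing column
    y1 <- paste0("popYears", ii + 1)   # name of the column to create
    # (y1) on the left: use the string stored in y1 as the new column name
    # get(y0) on the right: look up the column whose name is stored in y0
    dt2[, (y1) := get(y0) * growthRate]
}

dt2[["popYears5"]]
```

Whether `get()` or `eval()` reads better is largely a matter of taste; both avoid hard-coding the column names inside the loop.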

Edit:

Because the possibility of speeding this up using set() keeps coming up in the comments, I'll throw this additional option out there.

nYears <- 5

## Things that only need to be calculated once can be taken out of the loop
r <- dt[["growthRate"]]
yy <- paste0("popYears", seq_len(nYears+1)-1)

## A loop using set() and data.table's nice compact syntax
for(ii in seq_len(nYears)) {
    set(dt, , yy[ii+1], r*dt[[yy[ii]]])
}

## Check results
dt
#     Site growthRate popYears0 popYears1 popYears2 popYears3 popYears4 popYears5
#1: Site 1        1.1        10      11.0     12.10    13.310   14.6410  16.10510
#2: Site 2        1.2        12      14.4     17.28    20.736   24.8832  29.85984
#3: Site 3        1.3        13      16.9     21.97    28.561   37.1293  48.26809
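
Because each year's population is the previous year's population times a constant growth rate, the value in year t should equal the initial population times growthRate^t. The following is a minimal sketch of that sanity check. It rebuilds the table so that it runs on its own, and the names `chk` and `expected` are made up for illustration:

```r
library(data.table)

# Hypothetical stand-in for the example data
chk <- data.table(Site = paste("Site", 1:3),
                  growthRate = c(1.1, 1.2, 1.3),
                  popYears0 = c(10, 12, 13))

nYears <- 5
r  <- chk[["growthRate"]]
yy <- paste0("popYears", 0:nYears)

# Same idea as the set() loop above: fill each year from the previous one
for (ii in seq_len(nYears)) {
    set(chk, j = yy[ii + 1], value = r * chk[[yy[ii]]])
}

# Closed form: pop_t = pop_0 * r^t
expected <- chk[["popYears0"]] * r^nYears
all.equal(chk[["popYears5"]], expected)
```

Using `all.equal()` rather than `==` tolerates the small floating-point differences between repeated multiplication and a single power.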

#2


-1  

Struggling with column names is a strong indicator that the wide format is probably not the best choice for the given problem. Therefore, I suggest doing the computations in long format and only reshaping the result from long to wide format at the end.

nYears = 10
params = data.table(Site = paste("Site", 1:3),
                    growthRate = c(1.1, 1.2, 1.3), 
                    pop = c(10, 12, 13))
long <- params[CJ(Site = Site, Year = 0:nYears), on = "Site"][
  , growth := cumprod(shift(growthRate, fill = 1)), by = Site][
    , pop := pop * growth][]
dcast(long, Site + growthRate ~ sprintf("popYears%02i", Year), value.var = "pop")
     Site growthRate popYears00 popYears01 popYears02 popYears03 popYears04 popYears05 popYears06 popYears07 popYears08 popYears09 popYears10
1: Site 1        1.1         10       11.0      12.10     13.310    14.6410   16.10510   17.71561   19.48717   21.43589   23.57948   25.93742
2: Site 2        1.2         12       14.4      17.28     20.736    24.8832   29.85984   35.83181   42.99817   51.59780   61.91736   74.30084
3: Site 3        1.3         13       16.9      21.97     28.561    37.1293   48.26809   62.74852   81.57307  106.04499  137.85849  179.21604

Explanation

First, the parameters are expanded to cover 11 years (including year 0) using the cross join function CJ() and a subsequent right join on Site:

params[CJ(Site = Site, Year = 0:nYears), on = "Site"]
       Site growthRate pop Year
 1: Site 1        1.1  10    0
 2: Site 1        1.1  10    1
 3: Site 1        1.1  10    2
 4: Site 1        1.1  10    3
 5: Site 1        1.1  10    4
 6: Site 1        1.1  10    5
 7: Site 1        1.1  10    6
 8: Site 1        1.1  10    7
 9: Site 1        1.1  10    8
10: Site 1        1.1  10    9
11: Site 1        1.1  10   10
12: Site 2        1.2  12    0
13: Site 2        1.2  12    1
14: Site 2        1.2  12    2
15: Site 2        1.2  12    3
16: Site 2        1.2  12    4
17: Site 2        1.2  12    5
18: Site 2        1.2  12    6
19: Site 2        1.2  12    7
20: Site 2        1.2  12    8
21: Site 2        1.2  12    9
22: Site 2        1.2  12   10
23: Site 3        1.3  13    0
24: Site 3        1.3  13    1
25: Site 3        1.3  13    2
26: Site 3        1.3  13    3
27: Site 3        1.3  13    4
28: Site 3        1.3  13    5
29: Site 3        1.3  13    6
30: Site 3        1.3  13    7
31: Site 3        1.3  13    8
32: Site 3        1.3  13    9
33: Site 3        1.3  13   10
      Site growthRate pop Year
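
The same expansion can be sketched on a smaller, made-up table to make the two steps visible separately. The names `small`, `grid` and `expanded` are hypothetical and only serve this illustration:

```r
library(data.table)

small <- data.table(Site = c("A", "B"), growthRate = c(1.1, 1.2))

# Step 1: CJ() builds every Site/Year combination (2 sites x 3 years)
grid <- CJ(Site = small$Site, Year = 0:2)

# Step 2: joining on Site returns one row per row of grid,
# with the parameters of the matching site repeated for each year
expanded <- small[grid, on = "Site"]
nrow(expanded)
```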

Then the growth factor is computed from the shifted growth rates using the cumulative product function cumprod(), separately for each Site. The shift is required so that the initial year of each Site gets a factor of 1 instead of the growth rate. The population is then computed by multiplying the growth factor by the initial population.
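
To see what the shift and the cumulative product do, here is a minimal sketch on a single site's values, outside of any data.table. The variable names are made up for illustration:

```r
library(data.table)

rate    <- rep(1.1, 4)             # one site's growth rate for years 0..3
shifted <- shift(rate, fill = 1)   # 1.0 1.1 1.1 1.1: year 0 gets a factor of 1
growth  <- cumprod(shifted)        # cumulative growth factor, i.e. 1.1^0, 1.1^1, ...
10 * growth                        # population per year, starting from 10
```

Without the shift, the cumulative product would already apply the growth rate in year 0, so every year would be off by one growth step.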

Finally, the data.table is reshaped from long to wide format using dcast(). The column headers are created on-the-fly using sprintf() to ensure the correct order of columns.
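
The reason the padded names matter is that the new columns are ordered as text, not as numbers. A minimal sketch of the difference, using `method = "radix"` so that the comparison does not depend on the locale (the names `plain` and `padded` are made up for illustration):

```r
plain  <- paste0("popYears", 0:10)         # popYears0 ... popYears10
padded <- sprintf("popYears%02i", 0:10)    # popYears00 ... popYears10

# As text, "popYears10" sorts before "popYears2"
sort(plain, method = "radix")[1:3]

# With two-digit padding, text order and numeric order agree
sort(padded, method = "radix")[1:3]
```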
