在数据子集上应用函数的最有效方法(ddply的替代方法)

时间:2021-12-19 09:51:53

I have quite substantial data set where value of different objects is reported over time. Furthermore, value itself can be measured multiple times per year. I'am just interested in taking plain average of the valuations for one object in a given year. My problem is that, due to the size of the data it takes quite a while to apply function on chosen subsets. Are there any more efficient ways to do that? I read somewhere that using data.table should speed up the process but with my toy example that is not really the case.

我有相当多的数据集,其中随着时间报告不同对象的值。此外,价值本身每年可以多次测量。我只想对给定年份中某个对象的平均估值进行评分。我的问题是,由于数据的大小,在所选子集上应用函数需要相当长的时间。有没有更有效的方法来做到这一点?我在某处读到使用data.table应该加快这个过程,但是我的玩具示例并非如此。

Toy example (+ benchmarking):

玩具示例(+基准测试):

library(data.table)
library(dplyr)

time_taken_df = c()
time_taken_dt = c()

test_data <- data.frame(id = round(runif(1000, 1,10), 0),
                        Value = round(runif(1000, 10, 50), 0),
                        Value_Year = round(runif(1000, 1999, 2010), 0))

for (i in 1:100){

  #Data Frame
  test_data <- as.data.frame(test_data)

  start_time_df <- Sys.time()

  test_data <- test_data %>%
    ddply(.(id, Value_Year), mutate, new_val = mean(Value))

  end_time_df <- Sys.time()

  #Data Table
  test_data <- as.data.table(test_data)

  start_time_dt <- Sys.time()

  test_data <- test_data %>%
    ddply(.(id, Value_Year), mutate, new_val = mean(Value))

  end_time_dt <- Sys.time()

  #Results
  time_taken_df[i] <- end_time_df - start_time_df
  time_taken_dt[i] <- end_time_dt - start_time_dt
}


mean(time_taken_df)
mean(time_taken_dt)

Any suggestions how to achieve faster performance are welcome!

欢迎任何有关如何实现更快性能的建议!

Note:

  • I reduced actual valuation date to valuation year, to increase clarity of the example.

    我将实际估值日期减少到估值年份,以提高示例的清晰度。

  • Desired output is a data.frame, since there are other characteristics that are later used in the analysis.

    期望的输出是data.frame,因为后来在分析中使用了其他特征。

1 个解决方案

#1


0  

As it was pointed out in the comments by Imo and user3293236 using data.table significantly improves the performance. Using:

正如Imo和user3293236在评论中指出的那样,使用data.table显着提高了性能。使用:

setDT(test_data)[, myAvg := mean(Value), by=.(id, Value_Year)]

or simply if test_data is already a data.table:

或者只是如果test_data已经是data.table:

data_table %>%
    [, myAvg := mean(Value), by=.(id, Value_Year)]

Comparison between dplyr and data.table

dplyr和data.table之间的比较

mean(time_taken_df)
[1] 1.357766 

mean(time_taken_dt) 
[1] 0.003700418

#1


0  

As it was pointed out in the comments by Imo and user3293236 using data.table significantly improves the performance. Using:

正如Imo和user3293236在评论中指出的那样,使用data.table显着提高了性能。使用:

setDT(test_data)[, myAvg := mean(Value), by=.(id, Value_Year)]

or simply if test_data is already a data.table:

或者只是如果test_data已经是data.table:

data_table %>%
    [, myAvg := mean(Value), by=.(id, Value_Year)]

Comparison between dplyr and data.table

dplyr和data.table之间的比较

mean(time_taken_df)
[1] 1.357766 

mean(time_taken_dt) 
[1] 0.003700418