Count the number of times a condition is true within each group

Time: 2021-10-17 23:58:31

I'm using a simulated dataset with many groups (more than 2 million). For each group, I want to count the total number of observations and the number of observations above a threshold (here, 2).

Creating a flag variable first makes this much faster with dplyr, and also faster with data.table, though by a smaller margin.

Why does this happen? How does it work in the background in each case?

Check my examples below.

Simulated dataset

# create an example dataset
set.seed(318)

N = 3000000 # number of rows

dt = data.frame(id = sample(1:5000000, N, replace = TRUE),
                value = runif(N, 0, 10))

Using dplyr

library(dplyr)

# calculate summary variables for each group
t = proc.time()
dt2 = dt %>% group_by(id) %>% summarise(N = n(),
                                        N2 = sum(value > 2))
proc.time() - t

# user  system elapsed
# 51.70    0.06   52.11


# calculate summary variables for each group after creating a flag variable
t = proc.time()
dt2 = dt %>% mutate(flag = ifelse(value > 2, 1, 0)) %>%
  group_by(id) %>% summarise(N = n(),
                             N2 = sum(flag))
proc.time() - t

# user  system elapsed
# 3.40    0.16    3.55

Using data.table

library(data.table)

# set as data table
dt2 = setDT(dt, key = "id")


# calculate summary variables for each group
t = proc.time()
dt3 = dt2[, .(N = .N,
              N2 = sum(value > 2)), by = id]
proc.time() - t

# user  system elapsed 
# 1.93    0.00    1.94 


# calculate summary variables for each group after creating a flag variable
t = proc.time()
dt3 = dt2[, flag := ifelse(value > 2, 1, 0)][, .(N = .N,
                                                 N2 = sum(flag)), by = id]
proc.time() - t

# user  system elapsed 
# 0.33    0.04    0.39 

1 Answer

The issue with dplyr is that sum is called on an expression rather than on a plain column, combined with a very large number of IDs/groups. From what Arun says in the comments, I suspect the issue with data.table is similar.

Consider the code below, which I reduced to the bare minimum needed to illustrate the problem. dplyr is slow when summing an expression, even if the expression involves only the identity function, so the performance issue has nothing to do with the greater-than comparison operator. In contrast, dplyr is fast when summing a vector. An even greater performance gain comes from reducing the number of IDs/groups from one million to ten.

The reason is that hybrid evaluation, i.e., evaluation in C++, works only if sum is used with a vector. With an expression as the argument, the evaluation is done in R, which adds overhead for each group. The details are in the linked vignette. From the profile of the code, it seems that the overhead mainly comes from the tryCatch error-handling function.

##########################
### many different IDs ###
##########################

df <- data.frame(id = 1:1e6, value = runif(1e6))

# sum with expression as argument
system.time(df %>% group_by(id) %>% summarise(sum(identity(value))))
#    user  system elapsed
#  80.492   0.368  83.251

# sum with vector as argument
system.time(df %>% group_by(id) %>% summarise(sum(value)))
#    user  system elapsed
#   1.264   0.004   1.279


#########################
### few different IDs ###
#########################

df$id <- rep(1:10, each = 1e5)

# sum with expression as argument
system.time(df %>% group_by(id) %>% summarise(sum(identity(value))))
#    user  system elapsed
#   0.088   0.000   0.093

# sum with vector as argument
system.time(df %>% group_by(id) %>% summarise(sum(value)))
#    user  system elapsed
#   0.072   0.004   0.077


#################
### profiling ###
#################

df <- data.frame(id = 1:1e6, value = runif(1e6))

profvis::profvis({ df %>% group_by(id) %>% summarise(sum(identity(value))) })

Code profile: [profvis output image not preserved]
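The same principle can be seen outside dplyr. The following is a minimal base R sketch, not part of the original answer: it contrasts calling sum once per group from R (via tapply) with building the flag once and aggregating every group in a single call (via rowsum, which does the grouped summation in compiled code). The dataset sizes and the variable names per_group, flag and one_call are arbitrary choices for illustration, and no timings are claimed here.

```r
# Small illustrative dataset (sizes are arbitrary, chosen to run quickly)
set.seed(1)
n <- 1e5
id <- sample(1:5e4, n, replace = TRUE)
value <- runif(n, 0, 10)

# Approach 1: one R-level call to sum() per group
per_group <- tapply(value > 2, id, sum)

# Approach 2: build the flag once, then aggregate all groups in one call;
# rowsum() returns a one-column matrix with one row per id, sorted by id
flag <- as.integer(value > 2)
one_call <- rowsum(flag, id)

# Both approaches yield the same per-group counts
stopifnot(all(as.vector(per_group) == one_call[, 1]))
```

Wrapping each approach in system.time() on your own machine should show whether the per-group call overhead matters at your data size.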