Vectorization / data.table - improving the efficiency of a for loop on a DF with 12M records

Date: 2023-01-28 12:32:01

I need to assign a group ID to every row of a data frame that holds 20k groups, 12M rows in total.

To solve this problem I wrote a for loop, but it is clearly inefficient and I am sure this task can easily be vectorized. However, I am struggling to understand how to write this instruction in a vectorized fashion.

The problem is as follows. I have an auxiliary_table with 3 columns: ID, start_row, and end_row.
start_row is the row index of the first element in my_df belonging to ID x;
end_row is the row index of the last element in my_df belonging to ID x.

The vectorized instruction should do the following:

Consider an auxiliary_table like the following:

auxiliary_table <- data.frame(ID = c(1,2,3,4), start_row = c(1,4,8,13), end_row = c(3,7,12,14))

And consider a DF like the following:

my_df <- data.frame(Var_a = c(1,2,3,1,2,3,4,6,4,3,1,2,1,1))

We need to associate the ID based on the start_row and end_row index information contained in the auxiliary_table.

The solution_df is:

solution_df <- data.frame(my_df, ID = c(1,1,1,2,2,2,2,3,3,3,3,3,4,4))

I asked for a vectorization of the for loop, but I am also open to other approaches, for example data.table solutions.

I hope I was clear and presented my question correctly.

2 Answers

#1


The auxiliary_table is essentially a run-length encoding of the ID column. Therefore, I suggest trying the inverse.rle() function on an appropriately transformed auxiliary_table:

1. dplyr

library(dplyr)
my_df %>%
  mutate(ID = auxiliary_table %>% 
           transmute(lengths = end_row - start_row + 1L, values = ID) %>% 
           inverse.rle())
   Var_a ID
1      1  1
2      2  1
3      3  1
4      1  2
5      2  2
6      3  2
7      4  2
8      6  3
9      4  3
10     3  3
11     1  3
12     2  3
13     1  4
14     1  4
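
To see why auxiliary_table can be read as a run-length encoding, it may help to go the other way: applying base R's rle() to the expected ID column gives back one (length, value) pair per group, and inverse.rle() undoes it. This is only an illustrative sketch on the example data; the variable names ids, encoded and decoded are made up for the illustration.

```r
# Expected ID column from the question's solution_df
ids <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4)

# Run-length encode: one entry per consecutive run of equal values
encoded <- rle(ids)
encoded$lengths  # rows per group, i.e. end_row - start_row + 1
encoded$values   # the group IDs

# inverse.rle() expands the (lengths, values) pairs back into the full column
decoded <- inverse.rle(encoded)
```

The lengths here match end_row - start_row + 1 for each row of auxiliary_table, which is exactly the transformation used in the dplyr code above.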

2. data.table

This adds the ID column without copying my_df.

library(data.table)
setDT(my_df)[, ID := inverse.rle(setDT(auxiliary_table)[
  , .(lengths = end_row - start_row + 1L, values = ID)])][]

Depending on the size of auxiliary_table the code below might be somewhat more efficient because it transforms auxiliary_table in place:

setDT(my_df)[, ID := inverse.rle(setDT(auxiliary_table)[
  , lengths := end_row - start_row + 1L][
    , c("end_row", "start_row") := NULL][
      , setnames(.SD, "ID", "values")])][]
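
Both variants come down to repeating each ID once per row of its range, so the same idea can be sketched in base R with rep(). The sketch below assumes that the ranges in auxiliary_table are sorted, contiguous, and cover every row of my_df. That holds for the example data but has not been verified for the real 12M-row data, so the stopifnot() checks are added here as a guard; they are not part of the original answer.

```r
auxiliary_table <- data.frame(ID = c(1, 2, 3, 4),
                              start_row = c(1, 4, 8, 13),
                              end_row = c(3, 7, 12, 14))
my_df <- data.frame(Var_a = c(1, 2, 3, 1, 2, 3, 4, 6, 4, 3, 1, 2, 1, 1))

# Number of rows covered by each ID (the "lengths" of the run-length encoding)
run_lengths <- auxiliary_table$end_row - auxiliary_table$start_row + 1
n_groups <- nrow(auxiliary_table)

# Assumption: the ranges tile rows 1..nrow(my_df) with no gaps or overlaps
stopifnot(
  auxiliary_table$start_row[1] == 1,
  all(auxiliary_table$end_row[-n_groups] + 1 == auxiliary_table$start_row[-1]),
  sum(run_lengths) == nrow(my_df)
)

# Repeat each ID once per row in its range
my_df$ID <- rep(auxiliary_table$ID, times = run_lengths)
```

If any of the checks fail on the real data, the ranges have gaps or overlaps, and expanding run lengths would shift IDs onto the wrong rows.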

#2


I have written a user-defined function and applied it to each row of auxiliary_table. See if this helps:

auxiliary_table <- data.frame(ID = c(1,2,3,4), start_row = c(1,4,8,13), end_row = c(3,7,12,14))
my_df <- data.frame(Var_a = c(1,2,3,1,2,3,4,6,4,3,1,2,1,1))
solution_df <- data.frame(my_df, ID=c(1,1,1,2,2,2,2,3,3,3,3,3,4,4))

aux_to_df <- function(aux_row){
  # 1,2,3 can be replaced by column names
  value = aux_row[1]
  start_row = aux_row[2]
  end_row = aux_row[3]

  my_df[start_row:end_row, "ID"] <<- value # <<- means assigning to global out of scope variable
}

apply(auxiliary_table, 1, aux_to_df)
my_df
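
Note that apply() here still performs one R-level assignment per group, which means 20k iterations on the real data. As a loop-free alternative in base R, findInterval() can look up, for every row index, the last start_row that is less than or equal to it. This is a sketch, not part of the answer above: it assumes start_row is sorted in increasing order and begins at 1, and any row that falls in a gap between two ranges would silently receive the preceding ID.

```r
auxiliary_table <- data.frame(ID = c(1, 2, 3, 4),
                              start_row = c(1, 4, 8, 13),
                              end_row = c(3, 7, 12, 14))
my_df <- data.frame(Var_a = c(1, 2, 3, 1, 2, 3, 4, 6, 4, 3, 1, 2, 1, 1))

# For each row index, find the position of the last start_row <= that index
# (assumes start_row is sorted ascending)
group_pos <- findInterval(seq_len(nrow(my_df)), auxiliary_table$start_row)

# Map those positions back to the group IDs
my_df$ID <- auxiliary_table$ID[group_pos]
```

Unlike the function above, this version does not modify a global variable through <<-, so it can be wrapped in a function that simply returns the new column.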
