I need to assign a group ID to every row of a data frame: there are about 20k groups, amounting to 12M rows in total.
To solve this problem I wrote a for loop, but it is clearly inefficient, and I am sure this task can easily be vectorized. However, I am struggling to understand how to write this instruction in a vectorized fashion.
The problem is the following: I have an auxiliary_table with 3 columns: ID, start_row, end_row.
start_row is the row index of the first element in my_df belonging to ID x;
end_row is the row index of the last element in my_df belonging to ID x.
The vectorized instruction should do the following:
Consider an auxiliary_table like the following:
auxiliary_table <- data.frame(ID = c(1,2,3,4), start_row = c(1,4,8,13), end_row = c(3,7,12,14))
Consider a data frame like the following:
my_df <- data.frame(Var_a = c(1,2,3,1,2,3,4,6,4,3,1,2,1,1))
We need to assign an ID to each row of my_df based on the start_row and end_row index information contained in auxiliary_table.
The expected solution_df is:
solution_df <- data.frame(my_df, ID = c(1,1,1,2,2,2,2,3,3,3,3,3,4,4))
I asked for a vectorization of the for loop, but I am also open to other approaches, for example data.table solutions.
I hope I was clear and presented my question correctly.
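The loop itself is not shown in the question. As a point of reference, a minimal sketch of the kind of row-range loop described above might look like this; it is a hypothetical reconstruction built on the example tables, and loop_df is a name introduced here for illustration only:

```r
# Hypothetical loop-based baseline (not the asker's actual code)
auxiliary_table <- data.frame(ID = c(1, 2, 3, 4),
                              start_row = c(1, 4, 8, 13),
                              end_row = c(3, 7, 12, 14))
my_df <- data.frame(Var_a = c(1, 2, 3, 1, 2, 3, 4, 6, 4, 3, 1, 2, 1, 1))

loop_df <- my_df
loop_df$ID <- NA_real_

# One iteration per group: fill the ID for rows start_row..end_row
for (i in seq_len(nrow(auxiliary_table))) {
  rows <- auxiliary_table$start_row[i]:auxiliary_table$end_row[i]
  loop_df$ID[rows] <- auxiliary_table$ID[i]
}
loop_df
```

With 20k groups this performs 20k separate subset assignments, which is the overhead a vectorized version would avoid.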
2 Solutions
#1
The auxiliary_table is essentially a run-length encoding. Therefore, I suggest trying the inverse.rle() function with an appropriately transformed auxiliary_table:
1. dplyr
library(dplyr)
my_df %>%
  mutate(ID = auxiliary_table %>%
           transmute(lengths = end_row - start_row + 1L, values = ID) %>%
           inverse.rle())
   Var_a ID
1      1  1
2      2  1
3      3  1
4      1  2
5      2  2
6      3  2
7      4  2
8      6  3
9      4  3
10     3  3
11     1  3
12     2  3
13     1  4
14     1  4
2. data.table
This adds the ID column without copying my_df.
library(data.table)
setDT(my_df)[, ID := inverse.rle(setDT(auxiliary_table)[
, .(lengths = end_row - start_row + 1L, values = ID)])][]
Depending on the size of auxiliary_table, the code below might be somewhat more efficient because it transforms auxiliary_table in place:
setDT(my_df)[, ID := inverse.rle(
  setDT(auxiliary_table)[, lengths := end_row - start_row + 1L][
    , c("end_row", "start_row") := NULL][
    , setnames(.SD, "ID", "values")])][]
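Both variants above come down to repeating each ID once per row of its range. A minimal base R sketch of that idea follows. It assumes, as in the example, that the ranges are sorted, contiguous, and together cover every row of my_df. base_df and run_lengths are names introduced here for illustration:

```r
auxiliary_table <- data.frame(ID = c(1, 2, 3, 4),
                              start_row = c(1, 4, 8, 13),
                              end_row = c(3, 7, 12, 14))
my_df <- data.frame(Var_a = c(1, 2, 3, 1, 2, 3, 4, 6, 4, 3, 1, 2, 1, 1))

# Number of rows covered by each ID (the "lengths" of the run-length encoding)
run_lengths <- auxiliary_table$end_row - auxiliary_table$start_row + 1

# Assumption check: the ranges must account for every row exactly once
stopifnot(sum(run_lengths) == nrow(my_df))

# Expand each ID by its run length in a single vectorized call
base_df <- my_df
base_df$ID <- rep(auxiliary_table$ID, times = run_lengths)
base_df
```

If the ranges leave gaps or overlap, the length check fails and a different lookup is needed.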
#2
I have written a user-defined function and applied it to each row of auxiliary_table. See if this helps:
auxiliary_table <- data.frame(ID = c(1,2,3,4), start_row = c(1,4,8,13), end_row = c(3,7,12,14))
my_df <- data.frame(Var_a = c(1,2,3,1,2,3,4,6,4,3,1,2,1,1))
solution_df <- data.frame(my_df, ID=c(1,1,1,2,2,2,2,3,3,3,3,3,4,4))
aux_to_df <- function(aux_row) {
  # Positions 1, 2, 3 can be replaced by column names
  value <- aux_row[1]
  start_row <- aux_row[2]
  end_row <- aux_row[3]
  # <<- assigns to my_df in the enclosing (global) environment
  my_df[start_row:end_row, "ID"] <<- value
}
# apply() is called only for its side effect on my_df
invisible(apply(auxiliary_table, 1, aux_to_df))
my_df
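The row-wise apply() above still calls an R function once per group. As an alternative sketch, which is not part of this answer, the lookup can be done in one vectorized step with base R's findInterval(). The sketch assumes start_row is sorted in increasing order and that the ranges do not overlap. assign_ids is a hypothetical helper name. Rows that fall outside every range get NA:

```r
# Hypothetical helper: map row indices 1..n_rows to IDs via interval lookup
assign_ids <- function(n_rows, aux) {
  row_idx <- seq_len(n_rows)
  # Index of the last range whose start_row is <= the row index (0 if none)
  pos <- findInterval(row_idx, aux$start_row)
  pos[pos == 0L] <- NA
  id <- aux$ID[pos]
  # Rows past the end of their candidate range lie in a gap
  id[which(row_idx > aux$end_row[pos])] <- NA
  id
}

auxiliary_table <- data.frame(ID = c(1, 2, 3, 4),
                              start_row = c(1, 4, 8, 13),
                              end_row = c(3, 7, 12, 14))
my_df <- data.frame(Var_a = c(1, 2, 3, 1, 2, 3, 4, 6, 4, 3, 1, 2, 1, 1))

interval_df <- my_df
interval_df$ID <- assign_ids(nrow(my_df), auxiliary_table)

# Example with a gap: row 4 belongs to no range
gap_ids <- assign_ids(6, data.frame(ID = c(10, 20),
                                    start_row = c(1, 5),
                                    end_row = c(3, 6)))
```

Unlike the run-length approach, this does not require the ranges to cover every row, at the cost of one extra comparison per row.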