Avoiding a loop to get row sums in R, where each row's sum starts and stops at different columns

I am relatively new to R, coming from Stata. I have a data frame with 100+ columns and thousands of rows. Each row has a start value, a stop value, and 100+ columns of numerical values. The goal is to sum each row from the column that corresponds to the start value through the column that corresponds to the stop value. This is straightforward to do in a loop like this (the data frame is df, start is the start column, stop is the stop column):

for(i in 1:nrow(df)) {
    df$out[i] <- rowSums(df[i,df$start[i]:df$stop[i]])
}

This works great, but it is taking 15 minutes or so. Does anyone have any suggestions on a faster way to do this?

2 solutions

#1

If all of your values are of the same type, you typically want to work with matrices. Here is a solution in matrix form:

rows <- 10^3
cols <- 10^2
start <- sample(1:cols, rows, replace=TRUE)
end <- pmin(cols, start + sample(1:(cols/2), rows, replace=TRUE))

# first 2 cols of matrix are start and end, the rest are
# random data

mx <- matrix(c(start, end, runif(rows * cols)), nrow=rows)

# use `apply` to apply a function to each row; here the
# function drops the first two values (start and end) and
# sums the remaining values from the position given in the
# start column to the position given in the end column

apply(mx, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))

# df version

df <- as.data.frame(mx)  
df$out <- apply(df, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))

You can convert your data.frame to a matrix with as.matrix. You can also run apply directly on your data.frame as shown, which should still be reasonably fast. The real problem with your code is that you are modifying a data frame nrow times, and modifying data frames is very slow. With apply you avoid that by generating the whole answer (the $out column) first and then attaching it to your data frame, so the data frame is modified just once.
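
The "compute everything first, assign once" idea can be sketched as follows. This is only an illustration on made-up data: the value columns v1..v6 are hypothetical, and start and stop are assumed to be positions within those value columns (not positions in the whole data frame).

```r
# Hypothetical example data: `start` and `stop` are positions within
# the value columns v1..v6 (an assumption about the layout).
set.seed(1)
df <- data.frame(start = c(1, 2, 3), stop = c(3, 5, 6))
vals <- matrix(round(runif(3 * 6) * 10), nrow = 3,
               dimnames = list(NULL, paste0("v", 1:6)))
df <- cbind(df, vals)

# Convert the numeric columns to a matrix once, outside any loop.
m <- as.matrix(df[, -(1:2)])

# Compute every row's sum into a plain numeric vector ...
out <- vapply(seq_len(nrow(m)),
              function(i) sum(m[i, df$start[i]:df$stop[i]]),
              numeric(1))

# ... and only then modify the data frame, a single time.
df$out <- out
```

Here vapply plays the same role as apply in the answer; the point is that the per-row work happens on a matrix and a vector, and the data frame is touched only in the last line.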

#2

You can do this using some algebra (if you have a sufficient amount of memory):

DF <- data.frame(start=3:7, end=4:8)
DF <- cbind(DF, matrix(1:50, nrow=5, ncol=10))

#  start end 1  2  3  4  5  6  7  8  9 10
#1     3   4 1  6 11 16 21 26 31 36 41 46
#2     4   5 2  7 12 17 22 27 32 37 42 47
#3     5   6 3  8 13 18 23 28 33 38 43 48
#4     6   7 4  9 14 19 24 29 34 39 44 49
#5     7   8 5 10 15 20 25 30 35 40 45 50

take <- outer(seq_len(ncol(DF)-2)+2, DF$start-1, ">") &
        outer(seq_len(ncol(DF)-2)+2, DF$end+1, "<")

diag(as.matrix(DF[,-(1:2)]) %*% take)
#[1]  7 19 31 43 55
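
A variation on the same idea, offered as a sketch rather than as part of the original answer: the matrix product above builds a full nrow-by-nrow result and keeps only its diagonal, which is where the memory goes. Building the mask with the same shape as the data and multiplying elementwise should give the same sums without that intermediate. As in the example above, start and end are treated as column positions in DF itself, hence the offset of 2; the names m, idx, mask and res are introduced here for illustration.

```r
DF <- data.frame(start = 3:7, end = 4:8)
DF <- cbind(DF, matrix(1:50, nrow = 5, ncol = 10))

m <- as.matrix(DF[, -(1:2)])

# Position in DF of every cell of m (value columns start at column 3).
idx <- col(m) + 2

# TRUE where a cell lies between its row's start and end columns.
# DF$start and DF$end have one entry per row, so they recycle down
# each column of idx and line up with the rows.
mask <- idx >= DF$start & idx <= DF$end

# Zero out the cells outside the range and sum each row.
res <- rowSums(m * mask)
res
```

On the small example this reproduces the sums shown above (7, 19, 31, 43, 55).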
