将函数应用于R中的距离矩阵

This question came today in the manipulatr mailing list.

今天这个问题来自于操纵邮件列表。

http://groups.google.com/group/manipulatr/browse_thread/thread/fbab76945f7cba3f

I am rephrasing.

我在改写。

Given a distance matrix (calculated with dist) apply a function to the rows of the distance matrix.

给定距离矩阵(用dist计算)将函数应用于距离矩阵的行。

Code:

library(plyr)
N <- 100
a <- data.frame(b=1:N,c=runif(N))
d <- dist(a,diag=T,upper=T)
sumd <- adply(as.matrix(d),1,sum)

The problem is that to apply the function by row you have to store the whole matrix (instead of just the lower triangular part. So it uses too much memory for large matrices. It fails in my computer for matrices of dimensions ~ 10000.

问题是要按行应用函数,你必须存储整个矩阵(而不仅仅是下三角形部分。因此它对大型矩阵使用太多内存。在我的计算机中,对于大小为10000的矩阵,它会失败。

Any ideas?

3 个解决方案

#1

First of all, for anyone who hasn't seen this yet, I strongly recommend reading this article on the r-wiki about code optimization.

首先,对于尚未见过这一点的人,我强烈建议您阅读r-wiki上有关代码优化的文章。

Here's another version without using ifelse (that's a relatively slow function):

这是不使用ifelse的另一个版本(这是一个相对较慢的函数):

noeq.2 <- function(i, j, N) {
    i <- i-1
    j <- j-1
    x <- i*(N-1) - (i-1)*((i-1) + 1)/2 + j - i
    x2 <- j*(N-1) - (j-1)*((j-1) + 1)/2 + i - j
    idx <- i < j
    x[!idx] <- x2[!idx]
    x[i==j] <- 0
    x
}

And timings on my laptop:

和我的笔记本电脑上的时间:

> N <- 1000
> system.time(sapply(1:N, function(i) sapply(1:N, function(j) noeq(i, j, N))))
   user  system elapsed 
  51.31    0.10   52.06 
> system.time(sapply(1:N, function(j) noeq.1(1:N, j, N)))
   user  system elapsed 
   2.47    0.02    2.67 
> system.time(sapply(1:N, function(j) noeq.2(1:N, j, N)))
   user  system elapsed 
   0.88    0.01    1.12

And lapply is faster than sapply:

lapply比sapply更快:

> system.time(do.call("rbind",lapply(1:N, function(j) noeq.2(1:N, j, N))))
   user  system elapsed 
   0.67    0.00    0.67

#2

This is a vectorized version of the function noeq (either argument i or j):

这是函数noeq的矢量化版本(参数i或j):

noeq.1 <- function(i, j, N) {
    i <- i-1
    j <- j-1
    ifelse(i < j,
           i*(N-1) - ((i-1)*i)/2 + j - i,
           j*(N-1) - ((j-1)*j)/2 + i - j) * ifelse(i == j, 0, 1)
}   

> N <- 4
> sapply(1:N, function(i) sapply(1:N, function(j) noeq(i, j, N)))
     [,1] [,2] [,3] [,4]
[1,]    0    1    2    3
[2,]    1    0    4    5
[3,]    2    4    0    6
[4,]    3    5    6    0
> sapply(1:N, function(i) noeq.1(i, 1:N, N))
     [,1] [,2] [,3] [,4]
[1,]    0    1    2    3
[2,]    1    0    4    5
[3,]    2    4    0    6
[4,]    3    5    6    0

Timings are done on a 2.4 GHz Intel Core 2 Duo (Mac OS 10.6.1):

计时是在2.4 GHz Intel Core 2 Duo(Mac OS 10.6.1)上完成的:

> N <- 1000
> system.time(sapply(1:N, function(j) noeq.1(1:N, j, N)))
   user  system elapsed 
  0.676   0.061   0.738 
> system.time(sapply(1:N, function(i) sapply(1:N, function(j) noeq(i, j, N))))
   user  system elapsed 
 14.359   0.032  14.410

#3

My solution is to get the indexes of the distance vector, given a row and the size of the matrix. I got this from codeguru

我的解决方案是获得距离向量的索引,给定一行和矩阵的大小。我从codeguru得到了这个

int Trag_noeq(int row, int col, int N)
{
   //assert(row != col);    //You can add this in if you like
   if (row<col)
      return row*(N-1) - (row-1)*((row-1) + 1)/2 + col - row - 1;
   else if (col<row)
      return col*(N-1) - (col-1)*((col-1) + 1)/2 + row - col - 1;
   else
      return -1;
}

After translating to R, assuming indexes start at 1, and assuming a lower tri instead of upper tri matrix I got.
EDIT: Using the vectorized version contributed by rcs

在转换为R之后,假设索引从1开始,并且假设我得到了较低的三元而不是上三元矩阵。编辑:使用由rcs提供的矢量化版本

noeq.1 <- function(i, j, N) {
    i <- i-1
    j <- j-1
    ix <- ifelse(i < j,
                 i*(N-1) - (i-1)*((i-1) + 1)/2 + j - i,
                 j*(N-1) - (j-1)*((j-1) + 1)/2 + i - j) * ifelse(i == j, 0, 1)
    ix
}

## To get the indexes of the row, the following one liner works:

getrow <- function(z, N) noeq.1(z, 1:N, N)

## to get the row sums

getsum <- function(d, f=sum) {
    N <- attr(d, "Size")
    sapply(1:N, function(i) {
        if (i%%100==0) print(i)
        f(d[getrow(i,N)])
    })
}

So, with the example:

所以,举个例子:

sumd2 <- getsum(d)

This was much slower than as.matrix for small matrices before vectorizing. But just about 3x as slow after vectorizing. In a Intel Core2Duo 2ghz applying the sum by row of the size 10000 matrix took just over 100s. The as.matrix method fails. Thanks rcs!

这在矢量化之前比小矩阵的as.matrix慢得多。但是在矢量化之后只有大约3倍的速度。在Intel Core2Duo 2ghz中,应用10000行矩阵的总和只需要超过100秒。 as.matrix方法失败。谢谢rcs!

#1