快速计算特殊相关距离矩阵

I would like to build a distance matrix using Pearson correlation distance. I first tried the scipy.spatial.distance.pdist(df,'correlation') which is very fast for my 5000 rows * 20 features dataset.

我想用皮尔逊相关距离建立一个距离矩阵。我首先尝试了scipy.spatial.distance.pdist(df，'correlation')，它对于我的5000行* 20特性数据集来说非常快。

Since I want to build a recommender, I wanted to slightly change the distance, only considering features which are distinct for NaN for both users. Indeed, scipy.spatial.distance.pdist(df,'correlation') output NaN when it meets any feature whose value is float('nan').

因为我想建立一个推荐人，所以我想稍微改变一下距离，只考虑对于NaN和NaN来说是不同的特性。的确，scipy.spatial.distance.pdist(df，'correlation')在遇到任何值为float(' NaN ')的特性时，会输出NaN。

Here is my code, df being my 5000*20 pandas DataFrame

这是我的代码，df是我的5000*20熊猫数据存储器

dist_mat = []
d = df.shape[1]
for i,row_i in enumerate(df.itertuples()):
    for j,row_j in enumerate(df.itertuples()):
        if i<j:
            print(i,j)
            ind = [False if (math.isnan(row_i[t+1]) or math.isnan(row_j[t+1])) else True for t in range(d)]
            dist_mat.append(scipy.spatial.distance.correlation([row_i[t] for t in ind],[row_j[t] for t in ind]))

This code works but it is ashtoningly slow compared to the scipy.spatial.distance.pdist(df,'correlation') one. My question is: how can I improve my code so it runs a lot faster? Or where can I find a library which calculates correlation between two vectors which only take in consideration features which appears in both of them?

这个代码可以工作，但是与scipy.spatial.distance.pdist(df，'correlation')代码相比，它的运行速度非常慢。我的问题是:如何改进代码，使其运行得更快?或者我在哪里可以找到一个库来计算两个向量之间的相关性，而这两个向量只考虑两个向量中出现的特征?

Thank you for your answers.

谢谢你的回答。

1 个解决方案

#1

I think you need to do this with Cython, here is an example:

我认为你需要用Cython来做这个，这里有一个例子:

#cython: boundscheck=False, wraparound=False, cdivision=True

import numpy as np

cdef extern from "math.h":
    bint isnan(double x)
    double sqrt(double x)

def pair_correlation(double[:, ::1] x):
    cdef double[:, ::] res = np.empty((x.shape[0], x.shape[0]))
    cdef double u, v
    cdef int i, j, k, count
    cdef double du, dv, d, n, r
    cdef double sum_u, sum_v, sum_u2, sum_v2, sum_uv

    for i in range(x.shape[0]):
        for j in range(i, x.shape[0]):
            sum_u = sum_v = sum_u2 = sum_v2 = sum_uv = 0.0
            count = 0            
            for k in range(x.shape[1]):
                u = x[i, k]
                v = x[j, k]
                if u == u and v == v:
                    sum_u += u
                    sum_v += v
                    sum_u2 += u*u
                    sum_v2 += v*v
                    sum_uv += u*v
                    count += 1
            if count == 0:
                res[i, j] = res[j, i] = -9999
                continue

            um = sum_u / count
            vm = sum_v / count
            n = sum_uv - sum_u * vm - sum_v * um + um * vm * count
            du = sqrt(sum_u2 - 2 * sum_u * um + um * um * count) 
            dv = sqrt(sum_v2 - 2 * sum_v * vm + vm * vm * count)
            r = 1 - n / (du * dv)
            res[i, j] = res[j, i] = r
    return res.base

To check the output without NAN:

不使用NAN检查输出:

import numpy as np
from scipy.spatial.distance import pdist, squareform, correlation
x = np.random.rand(2000, 20)
np.allclose(pair_correlation(x), squareform(pdist(x, "correlation")))

To check the output with NAN:

用NAN检查输出:

x = np.random.rand(2000, 20)
x[x < 0.3] = np.nan
r = pair_correlation(x)

i, j = 200, 60 # change this
mask = ~(np.isnan(x[i]) | np.isnan(x[j]))
u = x[i, mask]
v = x[j, mask]
assert abs(correlation(u, v) - r[i, j]) < 1e-12

#1