I would like to build a distance matrix using Pearson correlation distance. I first tried the scipy.spatial.distance.pdist(df,'correlation')
which is very fast for my 5000 rows * 20 features dataset.
我想用皮尔逊相关距离建立一个距离矩阵。我首先尝试了scipy.spatial.distance.pdist(df,'correlation'),它对于我的5000行* 20特性数据集来说非常快。
Since I want to build a recommender, I wanted to slightly change the distance, only considering features which are distinct for NaN for both users. Indeed, scipy.spatial.distance.pdist(df,'correlation')
output NaN when it meets any feature whose value is float('nan').
因为我想建立一个推荐人,所以我想稍微改变一下距离,只考虑对于NaN和NaN来说是不同的特性。的确,scipy.spatial.distance.pdist(df,'correlation')在遇到任何值为float(' NaN ')的特性时,会输出NaN。
Here is my code, df being my 5000*20 pandas DataFrame
这是我的代码,df是我的5000*20熊猫数据存储器
dist_mat = []
d = df.shape[1]
for i,row_i in enumerate(df.itertuples()):
for j,row_j in enumerate(df.itertuples()):
if i<j:
print(i,j)
ind = [False if (math.isnan(row_i[t+1]) or math.isnan(row_j[t+1])) else True for t in range(d)]
dist_mat.append(scipy.spatial.distance.correlation([row_i[t] for t in ind],[row_j[t] for t in ind]))
This code works but it is ashtoningly slow compared to the scipy.spatial.distance.pdist(df,'correlation')
one. My question is: how can I improve my code so it runs a lot faster? Or where can I find a library which calculates correlation between two vectors which only take in consideration features which appears in both of them?
这个代码可以工作,但是与scipy.spatial.distance.pdist(df,'correlation')代码相比,它的运行速度非常慢。我的问题是:如何改进代码,使其运行得更快?或者我在哪里可以找到一个库来计算两个向量之间的相关性,而这两个向量只考虑两个向量中出现的特征?
Thank you for your answers.
谢谢你的回答。
1 个解决方案
#1
1
I think you need to do this with Cython, here is an example:
我认为你需要用Cython来做这个,这里有一个例子:
#cython: boundscheck=False, wraparound=False, cdivision=True
import numpy as np
cdef extern from "math.h":
bint isnan(double x)
double sqrt(double x)
def pair_correlation(double[:, ::1] x):
cdef double[:, ::] res = np.empty((x.shape[0], x.shape[0]))
cdef double u, v
cdef int i, j, k, count
cdef double du, dv, d, n, r
cdef double sum_u, sum_v, sum_u2, sum_v2, sum_uv
for i in range(x.shape[0]):
for j in range(i, x.shape[0]):
sum_u = sum_v = sum_u2 = sum_v2 = sum_uv = 0.0
count = 0
for k in range(x.shape[1]):
u = x[i, k]
v = x[j, k]
if u == u and v == v:
sum_u += u
sum_v += v
sum_u2 += u*u
sum_v2 += v*v
sum_uv += u*v
count += 1
if count == 0:
res[i, j] = res[j, i] = -9999
continue
um = sum_u / count
vm = sum_v / count
n = sum_uv - sum_u * vm - sum_v * um + um * vm * count
du = sqrt(sum_u2 - 2 * sum_u * um + um * um * count)
dv = sqrt(sum_v2 - 2 * sum_v * vm + vm * vm * count)
r = 1 - n / (du * dv)
res[i, j] = res[j, i] = r
return res.base
To check the output without NAN:
不使用NAN检查输出:
import numpy as np
from scipy.spatial.distance import pdist, squareform, correlation
x = np.random.rand(2000, 20)
np.allclose(pair_correlation(x), squareform(pdist(x, "correlation")))
To check the output with NAN:
用NAN检查输出:
x = np.random.rand(2000, 20)
x[x < 0.3] = np.nan
r = pair_correlation(x)
i, j = 200, 60 # change this
mask = ~(np.isnan(x[i]) | np.isnan(x[j]))
u = x[i, mask]
v = x[j, mask]
assert abs(correlation(u, v) - r[i, j]) < 1e-12
#1
1
I think you need to do this with Cython, here is an example:
我认为你需要用Cython来做这个,这里有一个例子:
#cython: boundscheck=False, wraparound=False, cdivision=True
import numpy as np
cdef extern from "math.h":
bint isnan(double x)
double sqrt(double x)
def pair_correlation(double[:, ::1] x):
cdef double[:, ::] res = np.empty((x.shape[0], x.shape[0]))
cdef double u, v
cdef int i, j, k, count
cdef double du, dv, d, n, r
cdef double sum_u, sum_v, sum_u2, sum_v2, sum_uv
for i in range(x.shape[0]):
for j in range(i, x.shape[0]):
sum_u = sum_v = sum_u2 = sum_v2 = sum_uv = 0.0
count = 0
for k in range(x.shape[1]):
u = x[i, k]
v = x[j, k]
if u == u and v == v:
sum_u += u
sum_v += v
sum_u2 += u*u
sum_v2 += v*v
sum_uv += u*v
count += 1
if count == 0:
res[i, j] = res[j, i] = -9999
continue
um = sum_u / count
vm = sum_v / count
n = sum_uv - sum_u * vm - sum_v * um + um * vm * count
du = sqrt(sum_u2 - 2 * sum_u * um + um * um * count)
dv = sqrt(sum_v2 - 2 * sum_v * vm + vm * vm * count)
r = 1 - n / (du * dv)
res[i, j] = res[j, i] = r
return res.base
To check the output without NAN:
不使用NAN检查输出:
import numpy as np
from scipy.spatial.distance import pdist, squareform, correlation
x = np.random.rand(2000, 20)
np.allclose(pair_correlation(x), squareform(pdist(x, "correlation")))
To check the output with NAN:
用NAN检查输出:
x = np.random.rand(2000, 20)
x[x < 0.3] = np.nan
r = pair_correlation(x)
i, j = 200, 60 # change this
mask = ~(np.isnan(x[i]) | np.isnan(x[j]))
u = x[i, mask]
v = x[j, mask]
assert abs(correlation(u, v) - r[i, j]) < 1e-12