I was using the function below to calculate both the Pearson coefficients and the P values starting from two dataframes, but I am not confident in the P value results: it seems that too many negative correlations come out significant.
Is there a more elegant way (ideally a one-liner) to calculate the P values along with the Pearson coefficients?
These two answers (pandas.DataFrame corrwith() method, and correlation matrix of one dataframe with another) provide elegant solutions, but the P value calculation is missing.
Here is the code:
import numpy as np
import pandas as pd
from scipy import stats

def pearson_cross_map(df1, df2):
    """Correlate each Mvar with each Nvar.

    Parameters
    ----------
    df1 : dataframe1
        Shape Mobs x Mvar.
    df2 : dataframe2
        Shape Nobs x Nvar.

    Returns
    -------
    DFcoeff : dataframe, Mvar x Nvar, where each element is a Pearson
        correlation coefficient.
    DFpval : dataframe, Mvar x Nvar, where each element is a P value
        (one-tailed).
    """
    # Keep only the observations (rows) shared by both dataframes,
    # drop all-zero columns and sort rows and columns by label
    intersection = df1.index.intersection(df2.index).tolist()
    df1 = df1.apply(pd.to_numeric, errors='coerce')
    df1 = df1.loc[intersection]
    df1 = df1.loc[:, (df1 != 0).any(axis=0)].sort_index().sort_index(axis=1)
    df2 = df2.apply(pd.to_numeric, errors='coerce')
    df2 = df2.loc[intersection]
    df2 = df2.loc[:, (df2 != 0).any(axis=0)].sort_index().sort_index(axis=1)
    # Pearson coefficients: variables in rows, observations in columns
    x = df1.T.values
    y = df2.T.values
    mu_x = x.mean(1)
    mu_y = y.mean(1)
    n = x.shape[1]
    s_x = x.std(1, ddof=n - 1)  # ddof = n - 1 makes this the root of the sum of squared deviations
    s_y = y.std(1, ddof=n - 1)
    cov = np.dot(x, y.T) - n * np.dot(mu_x[:, np.newaxis], mu_y[np.newaxis, :])
    DFcoeff = pd.DataFrame(cov / np.dot(s_x[:, np.newaxis], s_y[np.newaxis, :]))
    DFcoeff.index = df1.columns.tolist()
    DFcoeff.columns = df2.columns.tolist()
    # P values from the t-statistic, using the one-tailed CDF
    n = len(intersection)
    r = DFcoeff
    t = r * np.sqrt((n - 2) / (1 - r * r))
    DFpval = pd.DataFrame(stats.t.cdf(t, n - 2))
    DFpval.index = df1.columns.tolist()
    DFpval.columns = df2.columns.tolist()
    return DFcoeff, DFpval
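As a side note on my suspicion: the last step uses the one-tailed CDF of the t-distribution, so every strongly negative t maps to a small p-value. A minimal sketch of my own (not part of the function) comparing it with the conventional two-sided p-value:
import numpy as np
from scipy import stats

n = 10
r = -0.6                                     # an illustrative negative correlation
t = r * np.sqrt((n - 2) / (1 - r * r))       # same t-statistic as in the function
p_one_tailed = stats.t.cdf(t, n - 2)         # what pearson_cross_map returns
p_two_sided = 2 * stats.t.sf(abs(t), n - 2)  # conventional two-sided p value
print(p_one_tailed, p_two_sided)             # for negative r the one-tailed value is exactly half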
Thank you!
2 Solutions
#1
You require Pearson correlation testing and not just correlation calculation. Hence, use the scipy.stats.pearsonr method, which returns the estimated Pearson coefficient together with the two-tailed p-value.
Since the method requires Series inputs, consider iterating through each column of both dataframes to fill pre-allocated matrices, then cast the results to dataframes with the needed columns and index:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
df1 = pd.DataFrame(np.random.rand(10, 5), columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])
df2 = pd.DataFrame(np.random.rand(10, 5), columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])
# Pre-allocate the coefficient and p value matrices, then fill them pair by pair
coeffmat = np.zeros((df1.shape[1], df2.shape[1]))
pvalmat = np.zeros((df1.shape[1], df2.shape[1]))
for i in range(df1.shape[1]):
    for j in range(df2.shape[1]):
        corrtest = pearsonr(df1[df1.columns[i]], df2[df2.columns[j]])
        coeffmat[i, j] = corrtest[0]
        pvalmat[i, j] = corrtest[1]
dfcoeff = pd.DataFrame(coeffmat, columns=df2.columns, index=df1.columns)
print(dfcoeff)
# Col1 Col2 Col3 Col4 Col5
# Col1 -0.791083 0.459101 -0.488463 -0.289265 0.494897
# Col2 0.059446 -0.395072 0.310900 0.297532 0.201669
# Col3 -0.062592 0.391469 -0.450600 -0.136554 0.299579
# Col4 -0.470203 0.797971 -0.193561 -0.338896 -0.244132
# Col5 -0.057848 -0.037053 0.042798 0.176966 -0.157344
dfpvals = pd.DataFrame(pvalmat, columns=df2.columns, index=df1.columns)
print(dfpvals)
# Col1 Col2 Col3 Col4 Col5
# Col1 0.006421 0.181967 0.152007 0.417574 0.145871
# Col2 0.870421 0.258506 0.381919 0.403770 0.576357
# Col3 0.863615 0.263268 0.191245 0.706796 0.400385
# Col4 0.170260 0.005666 0.592096 0.338101 0.496668
# Col5 0.873881 0.919058 0.906551 0.624783 0.664206
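As a follow-up (my own sketch, not from the answer above): when the dataframes have many columns, the double loop can be replaced by computing the cross-correlation matrix directly and deriving the two-sided p-values from the same t-statistic used in the question. pearson_with_pvalues is just an illustrative name, and both dataframes are assumed to share the same rows:
import numpy as np
import pandas as pd
from scipy import stats

def pearson_with_pvalues(df1, df2):
    # Column-wise z-scores; the cross-correlation matrix is then Z1.T @ Z2 / n
    n = len(df1)
    z1 = (df1 - df1.mean()) / df1.std(ddof=0)
    z2 = (df2 - df2.mean()) / df2.std(ddof=0)
    coeff = pd.DataFrame(z1.to_numpy().T @ z2.to_numpy() / n,
                         index=df1.columns, columns=df2.columns)
    # Same t-statistic as in the question, then a two-sided p value
    t = coeff * np.sqrt((n - 2) / (1 - coeff ** 2))
    pvals = pd.DataFrame(2 * stats.t.sf(np.abs(t.to_numpy()), n - 2),
                         index=df1.columns, columns=df2.columns)
    return coeff, pvals

dfcoeff2, dfpvals2 = pearson_with_pvalues(df1, df2)
For the random df1/df2 above, this should agree with the pearsonr loop up to floating-point error.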
#2
You could compare this with bootstrap significance (i.e. if you randomly shuffle one series, what is the probability of getting the same or a greater correlation). This is not the same as Pearson's p-value, since the latter is derived under the assumption that your data are normally distributed, so you may get a somewhat different result if that is not the case.
import numpy as np

bootstrapLen = 1000
leng = 10000
X, Y = [np.random.randn(leng) for _ in [1, 2]]

# np.correlate of 1-D arrays gives sum(X*Y); /leng approximates Pearson r since X, Y are ~standard normal
correlation = np.correlate(X, Y) / leng
# Null distribution: the same statistic after randomly permuting one series
bootstrap = [abs(np.correlate(X, Y[np.random.permutation(leng)]) / leng)
             for _ in range(bootstrapLen)]
bootstrap = np.sort(np.ravel(bootstrap))
# Fraction of shuffled correlations that fall below the observed one
significance = np.searchsorted(bootstrap, abs(correlation)) / bootstrapLen
print("correlation is {} with significance {}".format(correlation, significance))