I need to get the correlation between the columns of two DataFrames. Both have the same columns, but the correlation is not working, probably because of index alignment.
I don't really care about the index of the DataFrames; I just want to correlate the values in the cells, treating each column as a random distribution.
I'm not sure whether it's my pandas or my math skills that are lacking, but I don't understand the purpose of the alignment in this case.
Here is my code:
import pandas as pd

def correlation(indv1, indv2):
    # Keep only the int and float columns of each individual
    frame1 = pd.DataFrame(indv1).select_dtypes(include=['float64', 'int64'])
    frame2 = pd.DataFrame(indv2).select_dtypes(include=['float64', 'int64'])
    # corrwith aligns the two frames on their index before correlating
    result = frame1.corrwith(frame2)
    return result.sum()
Here is what I've tried:
- aligning the DataFrames with DataFrame.align, but I'm not sure how to do it
- reindexing the DataFrames with DataFrame.reindex, but it also generates NaN from alignment
- using DataFrame.reset_index, but it creates another column with the old indexes
Here is a sample that is going wrong:
test1 = pd.Series(np.random.random(3), index=[0, 1, 2])
test2 = pd.Series(np.random.random(3), index=[3, 4, 5])
print(correlation(test1, test2))
If you print the result array of the correlation function, it shows NaN.
Here is what I want to do (per column):
with X being a value from the cell, and mu and sigma being the mean and standard deviation of the column.
1 Answer
You're neglecting the index of summation. The terms being summed are (Xi - muX)(Yi - muY), so each Xi is paired with one specific Yi. It definitely matters how they are aligned.
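To make the role of the index explicit, here is the sample Pearson correlation written out in full. This assumes that Pearson correlation is the quantity the question is after; the symbols n and r are my own notation, not from the question.

```latex
r_{XY} = \frac{\sum_{i=1}^{n} (X_i - \mu_X)(Y_i - \mu_Y)}
              {\sqrt{\sum_{i=1}^{n} (X_i - \mu_X)^2}\;\sqrt{\sum_{i=1}^{n} (Y_i - \mu_Y)^2}}
```

The same i appears on both X and Y in the numerator, so the result depends on which Y value is paired with which X value. pandas decides that pairing by index label, not by position.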
If you don't care to align the indices but want to correlate on their existing order and you know that the lengths are the same, try this instead:
def correlation(indv1, indv2):
    # Keep only the int and float columns of each individual
    frame1 = pd.DataFrame(indv1).select_dtypes(include=['float64', 'int64'])
    frame2 = pd.DataFrame(indv2).select_dtypes(include=['float64', 'int64'])
    # Changed: relabel frame2's rows with frame1's index so rows pair up by position
    result = frame1.corrwith(frame2.set_index(frame1.index))
    return result.sum()
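To see why the index matters here, the following is a minimal sketch with two small hand-written frames (the values and the names a, b, misaligned and realigned are made up for illustration). It compares corrwith on disjoint row labels with corrwith after relabeling.

```python
import numpy as np
import pandas as pd

# Two single-column frames whose row labels do not overlap (illustrative values).
a = pd.DataFrame({0: [0.1, 0.5, 0.9]}, index=[0, 1, 2])
b = pd.DataFrame({0: [0.2, 0.4, 0.7]}, index=[3, 4, 5])

# corrwith pairs rows by label; no label is shared, so there is nothing to pair.
misaligned = a.corrwith(b)

# Overwriting b's row labels with a's pairs the rows by position instead.
realigned = a.corrwith(b.set_index(a.index))

print(misaligned)  # NaN for the single column
print(realigned)   # a finite correlation for the single column
```

With disjoint labels no rows are left to correlate, which is where the NaN comes from.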
Demo
np.random.seed([3, 1415])
test1 = pd.Series(np.random.random(3), index=[0, 1, 2])
test2 = pd.Series(np.random.random(3), index=[3, 4, 5])
print(correlation(test1, test2))
-0.719774418655
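An alternative sketch, in case relabeling one frame with the other's index feels indirect: drop the row labels from both frames with reset_index(drop=True), which avoids the extra column of old indexes that a plain reset_index creates. The function name positional_correlation, the length check and the use of include="number" (broader than the original float64/int64 filter) are my own assumptions, not part of the code above.

```python
import numpy as np
import pandas as pd

def positional_correlation(indv1, indv2):
    """Sum of per-column Pearson correlations, pairing rows by position."""
    # Keep only numeric columns, then discard the row labels entirely.
    frame1 = pd.DataFrame(indv1).select_dtypes(include="number").reset_index(drop=True)
    frame2 = pd.DataFrame(indv2).select_dtypes(include="number").reset_index(drop=True)
    if len(frame1) != len(frame2):
        raise ValueError("both inputs must have the same number of rows")
    return frame1.corrwith(frame2).sum()

rng = np.random.default_rng(0)
test1 = pd.Series(rng.random(5), index=[0, 1, 2, 3, 4])
test2 = pd.Series(rng.random(5), index=[5, 6, 7, 8, 9])

result = positional_correlation(test1, test2)
# Cross-check against NumPy, which ignores labels and works purely by position.
expected = np.corrcoef(test1.to_numpy(), test2.to_numpy())[0, 1]
print(result, expected)
```

Both approaches treat the rows as ordered samples, so they only make sense when the two inputs have the same length and the row order is meaningful.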