Pandas: correlation between two DataFrames without alignment

Time: 2021-07-21 22:54:31

I need to compute the correlation between the columns of two DataFrames. Both have the same columns, but the correlation does not work, probably because of index alignment.

I don't really care about the index of the DataFrames; I just want to correlate the values in the cells, treating each column as a random distribution.

I'm not sure whether it's my pandas or my math skills that are lacking, but I don't understand what purpose the alignment serves in this case.

Here is my code:

import numpy as np
import pandas as pd

def correlation(indv1, indv2):
    # Keep only the int and float columns of each individual
    frame1 = pd.DataFrame(indv1).select_dtypes(include=['float64', 'int64'])
    frame2 = pd.DataFrame(indv2).select_dtypes(include=['float64', 'int64'])
    result = frame1.corrwith(frame2)
    return result.sum()

Here is what I've tried:

  • aligning the DataFrames with DataFrame.align, but I'm not sure how to use it
  • reindexing the DataFrames with DataFrame.reindex, but that also produces NaN because of alignment
  • using DataFrame.reset_index, but it adds another column holding the old index
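
Regarding the last attempt: reset_index accepts a drop=True argument that discards the old index instead of turning it into a column. The following is only a sketch built on the sample Series from this question; it assumes both inputs have the same number of rows.

```python
import numpy as np
import pandas as pd

test1 = pd.Series(np.random.random(3), index=[0, 1, 2])
test2 = pd.Series(np.random.random(3), index=[3, 4, 5])

# drop=True throws the old labels away and installs a fresh 0..n-1 index
frame1 = pd.DataFrame(test1).reset_index(drop=True)
frame2 = pd.DataFrame(test2).reset_index(drop=True)

print(list(frame2.columns))     # only the original data column, no "index" column
print(frame1.corrwith(frame2))  # rows are now paired by position
```

With both frames sharing the same 0..n-1 index, corrwith pairs the rows in their existing order.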

Here is a sample that is going wrong:

test1 = pd.Series(np.random.random(3), index=[0, 1, 2])
test2 = pd.Series(np.random.random(3), index=[3, 4, 5])
print(correlation(test1, test2))

If you print the result array inside the correlation function, it shows NaN.
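
A small sketch of where the NaN comes from, assuming the same sample Series: pandas pairs values by index label before correlating, and these two indices share no label, so there is nothing left to pair.

```python
import numpy as np
import pandas as pd

test1 = pd.Series(np.random.random(3), index=[0, 1, 2])
test2 = pd.Series(np.random.random(3), index=[3, 4, 5])

# Keep only the labels that appear in both Series
left, right = test1.align(test2, join="inner")
print(len(left), len(right))  # 0 0: no label is shared

# With no overlapping labels the correlation is undefined
print(test1.corr(test2))      # nan
```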

Here is what I want to compute (per column):

[Image: formula for the per-column correlation]

where X is a cell value, and mu and sigma are the mean and standard deviation of the column.

1 answer

#1


You're neglecting the index of the summation. The terms being summed are (Xi - muX)(Yi - muY), so each Xi is paired with one specific Yi. How the values are aligned definitely matters.
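
To make the role of that pairing concrete, here is a minimal NumPy sketch. The pearson helper and the toy arrays are illustrative assumptions, not part of the original code. The same values give a different correlation once one side is reordered.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation, pairing x[i] with y[i] by position."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

r_original = pearson(x, y)         # y in its original order
r_reordered = pearson(x, y[::-1])  # same values, reversed order

print(r_original)   # 0.6
print(r_reordered)  # -0.6
```

Each column's mean and standard deviation are unchanged by the reordering; only the pairing differs, and the sign of the result flips with it.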

If you don't need to align the indices and just want to correlate the values in their existing order, and you know that both inputs have the same length, try this instead:

def correlation(indv1, indv2):
    # Keep only the int and float columns of each individual
    frame1 = pd.DataFrame(indv1).select_dtypes(include=['float64', 'int64'])
    frame2 = pd.DataFrame(indv2).select_dtypes(include=['float64', 'int64'])
    # Part I changed                /--------------------\
    result = frame1.corrwith(frame2.set_index(frame1.index))
    return result.sum()

Demo

np.random.seed([3, 1415])
test1 = pd.Series(np.random.random(3), index=[0, 1, 2])
test2 = pd.Series(np.random.random(3), index=[3, 4, 5])
print(correlation(test1, test2))

-0.719774418655
