Comparing two pandas DataFrames for differences

Date: 2021-07-14 22:57:58

I've got a script that updates 5-10 columns' worth of data, but sometimes the starting CSV is identical to the ending CSV. In that case, instead of writing an identical CSV file, I want the script to do nothing.

How can I compare two dataframes to check if they're the same or not?

csvdata = pandas.read_csv('csvfile.csv')
csvdata_old = csvdata

# ... do stuff with csvdata dataframe

if csvdata_old != csvdata:
    csvdata.to_csv('csvfile.csv', index=False)

Any ideas?

7 answers

#1


35  

You also need to be careful to create a copy of the DataFrame; otherwise csvdata_old will change along with csvdata, since both names point to the same object:

csvdata_old = csvdata.copy()
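
The aliasing problem is easy to reproduce. The following is a minimal sketch with a made-up two-column frame (the data and the names alias and snapshot are assumptions for illustration); it shows that plain assignment only binds a second name to the same object, while .copy() gives an independent snapshot:

```python
import pandas as pd

# Hypothetical stand-in for the contents of the CSV file
csvdata = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

alias = csvdata            # second name for the same object
snapshot = csvdata.copy()  # independent copy of the data

csvdata.loc[0, "a"] = 100  # modify the original in place

print(alias is csvdata)      # the alias is the very same object
print(alias.loc[0, "a"])     # so it sees the modification
print(snapshot.loc[0, "a"])  # the copy still holds the old value
```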

To check whether they are equal, you can use assert_frame_equal as in this answer:

from pandas.testing import assert_frame_equal
assert_frame_equal(csvdata, csvdata_old)

You can wrap this in a function with something like:

def frames_equal(df1, df2):
    """Return True if the two DataFrames pass assert_frame_equal."""
    try:
        assert_frame_equal(df1, df2)
        return True
    except Exception:  # apparently AssertionError doesn't catch all cases
        return False

There was discussion of a better way...

#2


9  

Not sure if this existed at the time the question was posted, but pandas now has a built-in function to test equality between two dataframes: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.equals.html.
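
Applied to the original question, equals can guard the write. This is only a sketch: the helper name update_csv_if_changed, the transform callback, and the temporary file are assumptions made for illustration, not part of the question's script.

```python
import os
import tempfile

import pandas as pd


def update_csv_if_changed(path, transform):
    """Apply `transform` to the CSV at `path`; rewrite the file only if the data changed.

    Returns True if the file was rewritten, False otherwise.
    """
    original = pd.read_csv(path)
    updated = transform(original.copy())  # work on a copy so `original` stays intact
    if updated.equals(original):
        return False
    updated.to_csv(path, index=False)
    return True


# Usage with a throwaway file
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "csvfile.csv")
pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_csv(csv_path, index=False)

unchanged = update_csv_if_changed(csv_path, lambda df: df)
changed = update_csv_if_changed(csv_path, lambda df: df.assign(a=df["a"] * 2))
print(unchanged, changed)
```

Returning a boolean keeps the decision visible to the caller, which can be handy for logging whether a run actually touched the file.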

#3


7  

Check using df_1.equals(df_2), which returns True or False. Details below:

In [45]: import numpy as np

In [46]: import pandas as pd

In [47]: np.random.seed(5)

In [48]: df_1= pd.DataFrame(np.random.randn(3,3))

In [49]: df_1
Out[49]: 
          0         1         2
0  0.441227 -0.330870  2.430771
1 -0.252092  0.109610  1.582481
2 -0.909232 -0.591637  0.187603

In [50]: np.random.seed(5)

In [51]: df_2= pd.DataFrame(np.random.randn(3,3))

In [52]: df_2
Out[52]: 
          0         1         2
0  0.441227 -0.330870  2.430771
1 -0.252092  0.109610  1.582481
2 -0.909232 -0.591637  0.187603

In [53]: df_1.equals(df_2)
Out[53]: True


In [54]: df_3= pd.DataFrame(np.random.randn(3,3))

In [55]: df_3
Out[55]: 
          0         1         2
0 -0.329870 -1.192765 -0.204877
1 -0.358829  0.603472 -1.664789
2 -0.700179  1.151391  1.857331

In [56]: df_1.equals(df_3)
Out[56]: False
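
One detail worth knowing when choosing between equals and ==: equals treats NaN values in the same position as equal, whereas an element-wise == does not. A small sketch with made-up data (the frame and variable names are assumptions):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"x": [1.0, np.nan]})
right = pd.DataFrame({"x": [1.0, np.nan]})

elementwise = left == right          # NaN == NaN evaluates to False element-wise
same_by_equals = left.equals(right)  # equals treats NaNs in the same position as equal

print(elementwise["x"].tolist())
print(same_by_equals)
```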

#4


5  

This compares the values of two DataFrames. Note that the number of rows and columns must be the same in both tables.

import numpy as np

comparison_array = table.values == expected_table.values
print(comparison_array)
# [[ True  True  True]
#  [ True False  True]]

if not comparison_array.all():
    print("Not the same")

# Return the positions of the False values
np.where(~comparison_array)
# (array([1]), array([1]))

You could then use this index information to return the value that does not match between the tables. Since the indices are zero-based, (1, 1) refers to the second element of the second row, which is correct.
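
As a rough illustration of that last step, the positions from np.where can be turned into a list of mismatching cells. The two tables below are invented for the example, and the tuple layout (row, column label, actual value, expected value) is just one possible choice:

```python
import numpy as np
import pandas as pd

# Hypothetical tables that differ in exactly one cell
table = pd.DataFrame({"a": [1, 4], "b": [2, 5], "c": [3, 6]})
expected_table = pd.DataFrame({"a": [1, 4], "b": [2, 9], "c": [3, 6]})

comparison_array = table.values == expected_table.values
rows, cols = np.where(~comparison_array)  # positions where the values differ

mismatches = [
    (
        int(r),
        table.columns[int(c)],
        table.iat[int(r), int(c)],
        expected_table.iat[int(r), int(c)],
    )
    for r, c in zip(rows, cols)
]
print(mismatches)
```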

#5


4  

A more accurate comparison should check for index names separately, because DataFrame.equals does not test for that. All the other properties (index values (single/multiindex), values, columns, dtypes) are checked by it correctly.

df1 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'name'])
df1 = df1.set_index('name')
df2 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'another_name'])
df2 = df2.set_index('another_name')

df1.equals(df2)
True

df1.index.names == df2.index.names
False

Note: using index.names instead of index.name makes it work for multi-indexed dataframes as well.
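
The two checks can be combined into one helper. This is a sketch only; the function name frames_identical is an assumption, not a pandas API:

```python
import pandas as pd


def frames_identical(left, right):
    """True only if `equals` passes and the index names also match."""
    return left.equals(right) and list(left.index.names) == list(right.index.names)


df1 = pd.DataFrame([[1, "a"], [2, "b"]], columns=["num", "name"]).set_index("name")
df2 = pd.DataFrame([[1, "a"], [2, "b"]], columns=["num", "another_name"]).set_index("another_name")

print(df1.equals(df2))                    # index names are ignored here
print(frames_identical(df1, df2))         # the stricter check notices the difference
print(frames_identical(df1, df1.copy()))  # a plain copy keeps the index name
```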

#6


3  

Not sure if this is helpful or not, but I whipped together this quick python method for returning just the differences between two dataframes that both have the same columns and shape.

def get_different_rows(source_df, new_df):
    """Returns just the rows from the new dataframe that differ from the source dataframe"""
    merged_df = source_df.merge(new_df, indicator=True, how='outer')
    changed_rows_df = merged_df[merged_df['_merge'] == 'right_only']
    return changed_rows_df.drop('_merge', axis=1)
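
A short usage sketch, with the function repeated so the snippet runs on its own. The sample frames are made up; because merge is called without an explicit key, it joins on all shared columns, so a row counts as changed when no identical row exists in the source frame:

```python
import pandas as pd


def get_different_rows(source_df, new_df):
    """Return the rows of new_df that have no exact match in source_df."""
    merged_df = source_df.merge(new_df, indicator=True, how="outer")
    changed_rows_df = merged_df[merged_df["_merge"] == "right_only"]
    return changed_rows_df.drop("_merge", axis=1)


# Hypothetical data: only the row with id 2 differs
source_df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
new_df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "B", "c"]})

diff = get_different_rows(source_df, new_df)
print(diff)
```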

#7


1  

In my case, I had a weird error: even though the indices, column names, and values were the same, the DataFrames didn't match. I tracked it down to the data types; it seems pandas can sometimes use different dtypes, which leads to such problems.

For example:

param2 = pd.DataFrame({'a': [1]})
param1 = pd.DataFrame({'a': [1], 'b': [2], 'c': [2], 'step': ['alpha']})

In my case, checking param1.dtypes and param2.dtypes showed that 'a' was of type object in param1 and of type int64 in param2. If you then do some manipulation using a combination of param1 and param2, other internal parameters of the resulting DataFrame will deviate from the default ones.

So after the final DataFrame is generated, even though the printed values are the same, final_df1.equals(final_df2) may return False, because small internal parameters such as Axis 1, ObjectBlock, and IntBlock may not be the same.

An easy way to get around this and compare the values is to use final_df1 == final_df2.

However, this does an element-by-element comparison and returns a DataFrame of booleans, so it won't work directly in an assert statement, for example in pytest.

TL;DR

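What works well is (final_df1 == final_df2).all().all(): the first .all() reduces each column and the second reduces across columns. The built-in all(final_df1 == final_df2) is not equivalent, because iterating over a DataFrame yields its column labels rather than its values.

This does an element-by-element comparison while ignoring the internal parameters that are not important for the comparison.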

TL;DR2

If your values and indices are the same but final_df1.equals(final_df2) returns False, you can inspect final_df1._data and final_df2._data (an internal attribute) to check the remaining elements of the DataFrames.
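
The dtype effect described above can be reproduced with a small sketch. The frames here are built directly with different dtypes, which is an assumption for illustration and not how the original frames were produced:

```python
import pandas as pd

final_df1 = pd.DataFrame({"a": [1, 2]})                # int64 column
final_df2 = pd.DataFrame({"a": [1, 2]}, dtype=object)  # same values stored as object

print(final_df1.dtypes["a"], final_df2.dtypes["a"])  # the dtypes differ

strictly_equal = final_df1.equals(final_df2)  # equals also compares dtypes
values_match = bool((final_df1 == final_df2).all().all())  # element-wise values only

print(strictly_equal)
print(values_match)
```

Comparing the dtypes first is usually the quickest way to explain an unexpected False from equals.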
