Pandas DataFrames with NaNs equality comparison

In the context of unit testing some functions, I'm trying to establish the equality of 2 DataFrames using python pandas:

ipdb> expect
                            1   2
2012-01-01 00:00:00+00:00 NaN   3
2013-05-14 12:00:00+00:00   3 NaN

ipdb> df
identifier                  1   2
timestamp
2012-01-01 00:00:00+00:00 NaN   3
2013-05-14 12:00:00+00:00   3 NaN

ipdb> df[1][0]
nan

ipdb> df[1][0], expect[1][0]
(nan, nan)

ipdb> df[1][0] == expect[1][0]
False

ipdb> df[1][1] == expect[1][1]
True

ipdb> type(df[1][0])
<type 'numpy.float64'>

ipdb> type(expect[1][0])
<type 'numpy.float64'>

ipdb> (list(df[1]), list(expect[1]))
([nan, 3.0], [nan, 3.0])

ipdb> df1, df2 = (list(df[1]), list(expect[1])) ;; df1 == df2
False

Given that I'm trying to test all of expect against all of df, including NaN positions, what am I doing wrong?

What is the simplest way to compare equality of Series/DataFrames including NaNs?

5 solutions

#1

You can use assert_frame_equal with check_names=False (so as not to check the index/columns names), which will raise an AssertionError if the frames are not equal:

In [11]: from pandas.testing import assert_frame_equal

In [12]: assert_frame_equal(df, expected, check_names=False)

You can wrap this in a function with something like:

def frames_equal(df, expected):
    # Return True if the frames match (NaNs in the same positions count as equal).
    try:
        assert_frame_equal(df, expected, check_names=False)
        return True
    except AssertionError:
        return False

In more recent pandas this functionality has been added as .equals:

df.equals(expected)
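
As a small illustration of the difference between the approaches, here is a sketch using made-up frames rather than the asker's data. It assumes a current pandas release, where assert_frame_equal is importable from pandas.testing:

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical frames with NaNs in matching positions.
df = pd.DataFrame({"a": [np.nan, 3.0], "b": [3.0, np.nan]})
expected = df.copy()

# Element-wise == treats NaN as unequal to NaN, so this is False.
elementwise_all_equal = bool((df == expected).all().all())

# .equals treats NaNs in the same location as equal.
frames_equal = df.equals(expected)

# assert_frame_equal raises AssertionError on a mismatch and returns None otherwise.
assert_frame_equal(df, expected, check_names=False)

print(elementwise_all_equal, frames_equal)
```

In a unit test, assert_frame_equal is usually the more convenient of the two because its error message reports where the frames differ, while .equals only gives a single bool.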

#2

One of the properties of NaN is that NaN != NaN is True.

Check out this answer for a nice way to do this using numexpr.

(a == b) | ((a != a) & (b != b))

says this (in pseudocode):

a == b or (isnan(a) and isnan(b))

So, either a equals b, or both a and b are NaN.
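
The expression above can be wrapped into a helper that returns a single bool. This is only a sketch: the name nan_aware_equal and the sample frames are made up for illustration, and it assumes both frames hold float data and carry identical index and column labels (pandas raises a ValueError when comparing differently labeled frames with ==).

```python
import numpy as np
import pandas as pd

def nan_aware_equal(a, b):
    """Return True if a and b match cell by cell, counting NaN as equal to NaN."""
    # a != a is True only where a is NaN, so the second term marks shared NaNs.
    mask = (a == b) | ((a != a) & (b != b))
    return bool(mask.values.all())

left = pd.DataFrame([[np.nan, 1.0], [2.0, np.nan]])
same = left.copy()
different = pd.DataFrame([[np.nan, 1.0], [np.nan, np.nan]])

print(nan_aware_equal(left, same))
print(nan_aware_equal(left, different))
```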

If you have small frames then assert_frame_equal will be okay. However, for large frames (10M rows) assert_frame_equal is pretty much useless: it took so long that I had to interrupt it.

In [1]: df = DataFrame(np.random.rand(int(1e7), 15))

In [2]: df = df[df > 0.5]

In [3]: df2 = df.copy()

In [4]: df
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 14
dtypes: float64(15)

In [5]: timeit (df == df2) | ((df != df) & (df2 != df2))
1 loops, best of 3: 598 ms per loop

Timing the (presumably) desired single bool indicating whether the two DataFrames are equal:

In [9]: timeit ((df == df2) | ((df != df) & (df2 != df2))).values.all()
1 loops, best of 3: 687 ms per loop

#3

Like @PhillipCloud's answer, but written out more fully:

In [26]: df1 = DataFrame([[np.nan,1],[2,np.nan]])

In [27]: df2 = df1.copy()

They really are equivalent

In [28]: result = df1 == df2

In [29]: result[pd.isnull(df1) & pd.isnull(df2)] = True

In [30]: result
Out[30]: 
      0     1
0  True  True
1  True  True

A NaN in df2 that doesn't exist in df1:

In [31]: df2 = DataFrame([[np.nan,1],[np.nan,np.nan]])

In [32]: result = df1 == df2

In [33]: result[pd.isnull(df1) & pd.isnull(df2)] = True

In [34]: result
Out[34]: 
       0     1
0   True  True
1  False  True

You can also fill with a value you know not to be in the frame:

In [38]: df1.fillna(-999) == df2.fillna(-999)
Out[38]: 
       0     1
0   True  True
1  False  True

#4

df.fillna(0) == df2.fillna(0)

You can use fillna(); see the pandas documentation for details.

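from pandas import DataFrame

# create a DataFrame with NaNs (the first row has no 'c', so that cell becomes NaN)
df = DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])
df2 = df.copy()

# the NaN cell compares unequal to itself, so this is not all True
print(df == df2)

# after filling the NaNs, every cell compares equal
print(df.fillna(0) == df2.fillna(0))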
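
One caveat with this approach is that the fill value must not already occur in the data. The sketch below uses two made-up frames to show how fillna(0) can report a false match when one frame holds a real 0 where the other holds a NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical frames: one holds a real 0, the other a NaN in the same cell.
with_zero = pd.DataFrame({"a": [0.0, 1.0]})
with_nan = pd.DataFrame({"a": [np.nan, 1.0]})

# Filling with 0 makes the two frames look identical even though they are not.
looks_equal = bool((with_zero.fillna(0) == with_nan.fillna(0)).all().all())

# .equals keeps NaN distinct from 0.
really_equal = with_zero.equals(with_nan)

print(looks_equal, really_equal)
```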

#5

Any equality comparison using == with np.nan is False; even np.nan == np.nan is False.

The simple approach is df1.fillna('NULL') == df2.fillna('NULL'), provided 'NULL' is not a value in the original data.

To be safe, do the following:

Example a) Compare two dataframes with NaN values

bools = (df1 == df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = True
assert bools.all().all()

Example b) Filter rows in df1 that do not match with df2

bools = (df1 != df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = False
df_outlier = df1[bools.any(axis=1)]

(Note: this is wrong, because it also clears cells where both values are non-null but different - bools[pd.isnull(df1) == pd.isnull(df2)] = False)
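
Example b can be sketched as a small self-contained function. The name mismatched_rows and the sample frames are made up for illustration, and the sketch assumes a row counts as not matching as soon as one of its cells differs, which is why it uses any(axis=1):

```python
import numpy as np
import pandas as pd

def mismatched_rows(df1, df2):
    """Return the rows of df1 that differ from df2 in at least one column."""
    differs = df1 != df2
    # Cells that are NaN in both frames are treated as matching.
    differs[pd.isnull(df1) & pd.isnull(df2)] = False
    return df1[differs.any(axis=1)]

df1 = pd.DataFrame([[np.nan, 1.0], [2.0, np.nan], [5.0, 6.0]])
df2 = pd.DataFrame([[np.nan, 1.0], [np.nan, np.nan], [5.0, 7.0]])

# Row 0 matches (shared NaN); row 1 has a NaN only in df2; row 2 has 6 vs 7.
print(mismatched_rows(df1, df2))
```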