How to find the set difference between two Pandas DataFrames

Time: 2021-08-20 04:06:00

I'd like to check the difference between two DataFrame columns. I tried using the command:

np.setdiff1d(train.columns, train_1.columns)

which results in an empty array:

array([], dtype=object)

However, the two DataFrames have different numbers of columns:

len(train.columns), len(train_1.columns)  # (51, 56)

which means that the two DataFrames are obviously different.

What is wrong here?

1 solution

#1


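The result is correct; however, setdiff1d is order-dependent. It only returns elements of the first input array that do not occur in the second array. An empty result therefore means that every column of train also appears in train_1; the extra columns of train_1 are never reported.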
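
To make the asymmetry concrete, here is a minimal sketch. The column names are made up for illustration (they are not the asker's real columns); the second list is simply a superset of the first, as in the question.

```python
import numpy as np

# Hypothetical column names standing in for train.columns and train_1.columns.
train_cols = ['a', 'b', 'c']
train_1_cols = ['a', 'b', 'c', 'd', 'e']

# Elements of the first array that are missing from the second: none here,
# because every name in train_cols also occurs in train_1_cols.
only_in_train = np.setdiff1d(train_cols, train_1_cols)

# Swapping the arguments reveals the extra names in the second list.
only_in_train_1 = np.setdiff1d(train_1_cols, train_cols)

print(only_in_train)    # []
print(only_in_train_1)  # ['d' 'e']
```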

If you do not care which of the DataFrames has the extra columns, you can use setxor1d. It returns "the unique values that are in only one (not both) of the input arrays"; see the documentation.

import numpy

colsA = ['a', 'b', 'c', 'd']
colsB = ['b','c']

c = numpy.setxor1d(colsA, colsB)

This returns an array containing 'a' and 'd'.

If you want to use setdiff1d you need to check for differences both ways:

import numpy as np

# columns in train.columns that are not in train_1.columns
c1 = np.setdiff1d(train.columns, train_1.columns)

# columns in train_1.columns that are not in train.columns
c2 = np.setdiff1d(train_1.columns, train.columns)
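
As an alternative that is not part of the answer above, the column Index objects in pandas have their own set methods, difference and symmetric_difference, which give the same information without going through NumPy. The sketch below uses two small empty DataFrames as hypothetical stand-ins for train and train_1.

```python
import pandas as pd

# Hypothetical stand-ins for train and train_1; only the column labels matter.
train = pd.DataFrame(columns=['a', 'b', 'c'])
train_1 = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e'])

# One-directional differences, analogous to the two setdiff1d calls.
only_in_train = train.columns.difference(train_1.columns)
only_in_train_1 = train_1.columns.difference(train.columns)

# Labels that appear in exactly one of the two frames, analogous to setxor1d.
in_exactly_one = train.columns.symmetric_difference(train_1.columns)

print(list(only_in_train))    # []
print(list(only_in_train_1))  # ['d', 'e']
print(list(in_exactly_one))   # ['d', 'e']
```

Both approaches report the same labels; the pandas methods return an Index rather than a NumPy array, which can be convenient if the result is used to select columns afterwards.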
