I'd like to check the difference between two DataFrame columns. I tried using the command:
np.setdiff1d(train.columns, train_1.columns)
which results in an empty array:
array([], dtype=object)
However, the two DataFrames have different numbers of columns:
len(train.columns), len(train_1.columns)
(51, 56)
which means that the two DataFrames are obviously different.
What is wrong here?
1 Answer
The results are correct; however, setdiff1d is order dependent. It only checks for elements in the first input array that do not occur in the second array. Here, every column of train also appears in train_1, so the result is empty even though train_1 has additional columns.
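To illustrate the order dependence, here is a minimal sketch. The column names and the variables small and large are made up for illustration; they stand in for train.columns and train_1.columns.

```python
import numpy as np

# Hypothetical column names: every entry of `small` also appears in `large`.
small = np.array(["id", "age", "income"])
large = np.array(["id", "age", "income", "city", "score"])

# Elements of `small` that are missing from `large`: none.
only_in_small = np.setdiff1d(small, large)

# Elements of `large` that are missing from `small`: the extra columns.
only_in_large = np.setdiff1d(large, small)

print(only_in_small)  # []
print(only_in_large)  # ['city' 'score']
```

Swapping the arguments is what reveals the extra columns, which is presumably what happens with train and train_1.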
If you do not care which of the DataFrames has the unique columns, you can use setxor1d. It returns "the unique values that are in only one (not both) of the input arrays"; see the documentation.
import numpy as np
colsA = ['a', 'b', 'c', 'd']
colsB = ['b', 'c']
# symmetric difference: values found in exactly one of the two inputs
c = np.setxor1d(colsA, colsB)
This returns an array containing 'a' and 'd'.
If you want to use setdiff1d, you need to check for differences both ways:
# columns in train.columns that are not in train_1.columns
c1 = np.setdiff1d(train.columns, train_1.columns)
# columns in train_1.columns that are not in train.columns
c2 = np.setdiff1d(train_1.columns, train.columns)
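The two-way check can be wrapped in a small helper. This is only a sketch: column_diff is a hypothetical name, not a NumPy or pandas function, and plain lists stand in for the DataFrame columns.

```python
import numpy as np

def column_diff(left_cols, right_cols):
    """Return (only_in_left, only_in_right) as sorted arrays of unique names."""
    only_in_left = np.setdiff1d(left_cols, right_cols)
    only_in_right = np.setdiff1d(right_cols, left_cols)
    return only_in_left, only_in_right

# Hypothetical column lists standing in for train.columns and train_1.columns.
cols_a = ["a", "b", "c", "d"]
cols_b = ["b", "c", "e"]

only_a, only_b = column_diff(cols_a, cols_b)
print(only_a)  # ['a' 'd']
print(only_b)  # ['e']

# Taken together, the two directions cover the same names as setxor1d.
assert set(only_a) | set(only_b) == set(np.setxor1d(cols_a, cols_b))
```

Unlike setxor1d alone, this keeps track of which side each unmatched column came from.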