How to check whether two data frames are equal [duplicate]

Time: 2021-05-17 22:52:01

Say I have large datasets in R and I just want to know whether two of them are the same. I use this often when I'm experimenting with different algorithms to achieve the same result. For example, say we have the following datasets:

df1 <- data.frame(num = 1:5, let = letters[1:5])
df2 <- df1
df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])
df4 <- df3

So this is what I do to compare them:

table(x == y, useNA = 'ifany')

Which works great when the datasets have no NAs:

> table(df1 == df2, useNA = 'ifany')
TRUE 
  10 

But not so much when they have NAs:

> table(df3 == df4, useNA = 'ifany')
TRUE <NA> 
  11    1 

In the example, it's easy to dismiss the NA as not a problem since we know that both data frames are equal. The problem is that NA == <anything> yields NA, so whenever one of the datasets has an NA, the result at that position is always NA, no matter what the other dataset has there.
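To make the NA issue concrete, here is a minimal sketch. The data frames df_a and df_b are hypothetical (they are not part of the question) and differ only in the cell where df_a holds an NA:

```r
# Hypothetical data: the second row differs, but df_a has NA there
df_a <- data.frame(num = c(1, NA, 3))
df_b <- data.frame(num = c(1, 2, 3))

NA == 2    # NA, not FALSE
NA == NA   # also NA

# Element-wise comparison gives a logical matrix with NA in the mismatched cell
cmp <- df_a == df_b
table(cmp, useNA = "ifany")   # only TRUE and <NA> counts, no FALSE

# So the table alone cannot tell a real difference from a shared NA
any(!cmp, na.rm = TRUE)   # FALSE: no cell is reported as different
identical(df_a, df_b)     # FALSE: the data frames do differ
```

The last two lines show why the table() approach can mislead: the only differing cell is reported as NA rather than FALSE.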

So using table() to compare datasets doesn't seem ideal to me. How can I better check if two data frames are identical?

P.S.: Note this is not a duplicate of "R - comparing several datasets", "Comparing 2 datasets in R" or "Compare datasets in R".

2 answers

#1


37  

Look up all.equal. It has some riders but it might work for you.

all.equal(df3,df4)
# [1] TRUE
all.equal(df2,df1)
# [1] TRUE
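Two of the caveats are worth spelling out. When the objects differ, all.equal() returns a character vector describing the differences rather than FALSE, so it is usually wrapped in isTRUE() before being used in a condition. It also compares numeric values up to a small tolerance. A short sketch, reusing df1 and df3 from the question (the data frames a and b are hypothetical, added only for illustration):

```r
df1 <- data.frame(num = 1:5, let = letters[1:5])
df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])

# When objects differ, all.equal() returns a description, not FALSE
res <- all.equal(df1, df3)
is.character(res)   # TRUE
isTRUE(res)         # FALSE, which is safe to use inside if()

# Numeric values are compared up to a tolerance
a <- data.frame(x = 1)
b <- data.frame(x = 1 + 1e-10)   # hypothetical tiny floating-point difference
isTRUE(all.equal(a, b))   # TRUE: within the default tolerance
identical(a, b)           # FALSE: the values are not exactly the same
```

Whether the tolerance is a help or a hazard depends on the use case: it is convenient when comparing results of different algorithms that may differ by rounding error, and unwanted when exact equality is required.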

#2


13  

As Metrics pointed out, one could also use identical() to compare the datasets. The difference between this approach and that of Codoremifa is that identical() will just yield TRUE or FALSE, depending on whether the objects being compared are identical or not, whereas all.equal() will return either TRUE or hints about the differences between the objects. For instance, consider the following:

> identical(df1, df3)
[1] FALSE

> all.equal(df1, df3)
[1] "Attributes: < Component 2: Numeric: lengths (5, 6) differ >"                                
[2] "Component 1: Numeric: lengths (5, 6) differ"                                                
[3] "Component 2: Lengths: 5, 6"                                                                 
[4] "Component 2: Attributes: < Component 2: Lengths (5, 6) differ (string compare on first 5) >"
[5] "Component 2: Lengths (5, 6) differ (string compare on first 5)"   

Moreover, from what I've tested, identical() seems to run much faster than all.equal().
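One rough way to check the speed claim on your own machine is to time both calls with system.time(). This is only a sketch: the data frames big1 and big2 and the size n are hypothetical choices, and the timings will vary with the data, the R version and the hardware.

```r
set.seed(1)
n <- 1e5   # hypothetical size, pick something close to your real data
big1 <- data.frame(num = rnorm(n),
                   let = sample(letters, n, replace = TRUE),
                   stringsAsFactors = FALSE)
big2 <- big1
big2$num <- big2$num + 0   # force a fresh copy of the numeric column

# Time each comparison and keep its result
t_identical <- system.time(r1 <- identical(big1, big2))
t_all_equal <- system.time(r2 <- isTRUE(all.equal(big1, big2)))

r1; r2                       # both comparisons agree that the data are equal
t_identical["elapsed"]       # elapsed seconds for identical()
t_all_equal["elapsed"]       # elapsed seconds for all.equal()
```

A single system.time() call is a coarse measurement, so repeating the comparison several times gives a steadier picture.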
