This question already has an answer here:
- regarding matrix comparison in R 1 answer
Say I have large datasets in R and I just want to know whether two of them are the same. I do this often when I'm experimenting with different algorithms that should produce the same result. For example, say we have the following datasets:
df1 <- data.frame(num = 1:5, let = letters[1:5])
df2 <- df1
df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])
df4 <- df3
So this is what I do to compare them:
table(x == y, useNA = 'ifany')
Which works great when the datasets have no NAs:
> table(df1 == df2, useNA = 'ifany')
TRUE
10
But not so much when they have NAs:
> table(df3 == df4, useNA = 'ifany')
TRUE <NA>
11 1
In the example, it's easy to dismiss the NA as not a problem, since we know that both data frames are equal. The problem is that NA == <anything> yields NA, so whenever one of the datasets has an NA, it doesn't matter what the other one has in that same position: the result is always going to be NA.
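To make the NA behaviour concrete, here is a minimal sketch. The helper `same_cells()` is hypothetical (it is not part of base R or of the original post) and assumes both data frames have the same dimensions and column types; it treats two NAs in the same position as equal and an NA against a non-NA value as different.

```r
# Comparing NA with anything, including another NA, gives NA
NA == 1    # NA
NA == NA   # NA

# Hypothetical helper: cell-by-cell equality for two same-shaped data frames,
# counting NA vs NA as equal and NA vs value as different.
same_cells <- function(x, y) {
  eq <- x == y                      # logical matrix, NA wherever either side is NA
  both_na <- is.na(x) & is.na(y)    # positions where both sides are NA
  eq[is.na(eq)] <- FALSE            # any comparison involving an NA starts as a mismatch
  eq[both_na] <- TRUE               # ...unless both sides are NA
  eq
}

df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])
df4 <- df3
table(same_cells(df3, df4))         # every cell is now TRUE, with no NA bucket
```

With this, a one-sided NA shows up as FALSE instead of being hidden in the NA count.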
So using table() to compare datasets doesn't seem ideal to me. How can I better check if two data frames are identical?
P.S.: Note this is not a duplicate of R - comparing several datasets, Comparing 2 datasets in R or Compare datasets in R
2 solutions
#1
37
Look up all.equal(). It comes with some caveats, but it might work for you.
all.equal(df3,df4)
# [1] TRUE
all.equal(df2,df1)
# [1] TRUE
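Two of the caveats worth knowing, sketched with made-up values: all.equal() compares numeric values up to a small tolerance rather than exactly, and when the objects differ it returns a character vector describing the differences instead of FALSE, so it is usually wrapped in isTRUE() when used in a condition.

```r
# Illustrative data only: 0.1 + 0.2 is not exactly 0.3 in floating point
df_a <- data.frame(v = c(0.1 + 0.2, 1))
df_b <- data.frame(v = c(0.3, 1))

all.equal(df_a, df_b)   # TRUE: the difference is within the default tolerance
identical(df_a, df_b)   # FALSE: the values are not bit-for-bit the same

# all.equal() returns a description, not FALSE, when objects differ,
# so wrap it in isTRUE() before using it in if()
if (isTRUE(all.equal(1, 2))) {
  message("equal")
} else {
  message("different")
}
```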
#2
13
As Metrics pointed out, one could also use identical() to compare the datasets. The difference between this approach and that of Codoremifa is that identical() will just yield TRUE or FALSE, depending on whether the objects being compared are identical or not, whereas all.equal() will either return TRUE or hints about the differences between the objects. For instance, consider the following:
> identical(df1, df3)
[1] FALSE
> all.equal(df1, df3)
[1] "Attributes: < Component 2: Numeric: lengths (5, 6) differ >"
[2] "Component 1: Numeric: lengths (5, 6) differ"
[3] "Component 2: Lengths: 5, 6"
[4] "Component 2: Attributes: < Component 2: Lengths (5, 6) differ (string compare on first 5) >"
[5] "Component 2: Lengths (5, 6) differ (string compare on first 5)"
Moreover, from what I've tested, identical() seems to run much faster than all.equal().
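A rough way to check the speed difference on your own machine is sketched below. The data frames `big1` and `big2` and their sizes are assumptions for illustration; timings will vary, so no numbers are quoted here.

```r
set.seed(1)
n <- 1e5

# Illustrative data: two equal data frames built from separate vectors,
# in case comparing an object with itself would be short-circuited
big1 <- data.frame(num = rnorm(n), id = seq_len(n))
big2 <- data.frame(num = big1$num + 0, id = big1$id + 0L)

# Repeat each comparison a few times so the elapsed time is measurable
system.time(for (i in 1:50) identical(big1, big2))
system.time(for (i in 1:50) all.equal(big1, big2))
```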