Find unique values from a list

Date: 2022-05-18 04:44:54

Suppose you have a list of values

x <- list(a = c(1, 2, 3), b = c(2, 3, 4), c = c(4, 5, 6))

I would like to find the unique values from all list elements combined. So far, the following code has done the trick:

unique(unlist(x))

Does anyone know a more efficient way? I have a hefty list with a lot of values and would appreciate any speed-up.

1 solution

#1


This solution, suggested by Marek, is the best answer to the original question. See below for a discussion of the other approaches and why Marek's is the most useful.

> unique(unlist(x, use.names = FALSE))
[1] 1 2 3 4 5 6

Discussion

A faster solution is to compute unique() on the components of your x first and then do a final unique() on those results. This only works if the components of the list have the same number of unique values, as they do in both examples below.

First your version, then my double unique approach:

> unique(unlist(x))
[1] 1 2 3 4 5 6
> unique.default(sapply(x, unique))
[1] 1 2 3 4 5 6

We have to call unique.default as there is a matrix method for unique that keeps one margin fixed; this is fine as a matrix can be treated as a vector.
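A small sketch (with a hypothetical matrix m) of why the method dispatch matters here: the matrix method of unique() deduplicates rows, while unique.default() treats the matrix as a plain vector.

```r
## Hypothetical 2x2 matrix with two identical rows
m <- matrix(c(1, 1, 2, 2), ncol = 2)

unique(m)          # matrix method: drops the duplicate row, returns a 1 x 2 matrix
unique.default(m)  # default method: treats m as a vector, returns c(1, 2)
```

Since sapply() returns a numeric matrix in the examples above, calling plain unique() on it would deduplicate rows rather than values, which is why the explicit unique.default() call is needed.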

Marek, in the comments to this answer, notes that the slow speed of the unlist approach is potentially due to the names on the list. Marek's solution is to make use of the use.names argument to unlist, which, when set to FALSE, results in a faster solution than the double-unique version above. For the simple x of Roman's post we get:

> unique(unlist(x, use.names = FALSE))
[1] 1 2 3 4 5 6

Marek's solution will work even when the number of unique elements differs between components.

Here is a larger example with some timings of all three methods:

## Create a large list (1000 components of length 1000 each)
DF <- as.list(data.frame(matrix(sample(1:10, 1000*1000, replace = TRUE), 
                                ncol = 1000)))

Here are the results for the three approaches using DF:

> ## Do the three approaches give the same result:
> all.equal(unique.default(sapply(DF, unique)), unique(unlist(DF)))
[1] TRUE
> all.equal(unique(unlist(DF, use.names = FALSE)), unique(unlist(DF)))
[1] TRUE
> ## Timing Roman's original:
> system.time(replicate(10, unique(unlist(DF))))
   user  system elapsed 
  12.884   0.077  12.966
> ## Timing double unique version:
> system.time(replicate(10, unique.default(sapply(DF, unique))))
   user  system elapsed 
  0.648   0.000   0.653
> ## timing of Marek's solution:
> system.time(replicate(10, unique(unlist(DF, use.names = FALSE))))
   user  system elapsed 
  0.510   0.000   0.512

This shows that the double-unique approach (applying unique() to the individual components and then a final unique() to those smaller sets of unique values) is a lot quicker than Roman's original, but this speed-up is purely due to the names on the list DF. If we tell unlist() not to use the names, Marek's solution is marginally quicker than the double unique for this problem. As Marek's solution uses the correct tool properly, and is quicker than the work-around, it is the preferred solution.

The big gotcha with the double unique approach is that it will only work if, as in the two examples here, each component of the input list (DF or x) has the same number of unique values. In such cases sapply simplifies the result to a matrix which allows us to apply unique.default. If the components of the input list have differing numbers of unique values, the double unique solution will fail.
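A quick sketch of that failure mode, using a hypothetical list y whose components have differing numbers of unique values:

```r
## Components with 3 and 1 unique values respectively
y <- list(a = c(1, 2, 3), b = c(2, 2, 2))

## sapply() cannot simplify to a matrix here, so it returns a list, and
## unique.default() on that list removes duplicate list elements rather
## than combining the values:
unique.default(sapply(y, unique))

## Marek's approach is unaffected:
unique(unlist(y, use.names = FALSE))  # 1 2 3
```

This is why the double-unique trick should only be used when every component is known to have the same number of unique values.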
