Fastest way to find unique values in an array

I'm trying to find the fastest way to get the unique values in an array while excluding 0 from the result.

Right now I have two solutions:

result1 = setxor(0, dataArray(1:end,1)); % This gives the correct solution
result2 = unique(dataArray(1:end,1)); % This solution is faster but doesn't give the same result as result1

dataArray is equivalent to:

dataArray = [0 0; 0 2; 0 4; 0 6; 1 0; 1 2; 1 4; 1 6; 2 0; 2 2; 2 4; 2 6]; % This is a small array, but in my case there are usually over 10 000 rows.

So in this case, result1 is equal to [1; 2] and result2 is equal to [0; 1; 2]. The unique function is faster, but I don't want 0 to be included. Is there a way to use unique without treating 0 as a unique value? Is there another alternative?

EDIT

I wanted to time the various solutions.

clc
dataArray = floor(10*rand(10e3,10));
dataArray(mod(dataArray(:,1),3)==0)=0;
% Initial
tic
for ii = 1:10000
   FCT1 = setxor(0, dataArray(:,1));
end
toc
% My solution
tic
for ii = 1:10000
   FCT2 = unique(dataArray(dataArray(:,1)>0,1));
end
toc
% Pursuit solution
tic
for ii = 1:10000
   FCT3 = unique(dataArray(:, 1));
   FCT3(FCT3==0) = [];
end
toc
% Pursuit solution with chappjc comment
tic
for ii = 1:10000
   FCT32 = unique(dataArray(:, 1));
   FCT32 = FCT32(FCT32~=0);
end
toc
% chappjc solution
tic
for ii = 1:10000
   FCT4 = setdiff(unique(dataArray(:,1)),0);
end
toc
% chappjc 2nd solution
tic
for ii = 1:10000
   FCT5 = find(accumarray(dataArray(:,1)+1,1))-1;
   FCT5 = FCT5(FCT5>0);
end
toc

And the results:

Elapsed time is 5.153571 seconds. % FCT1 Initial
Elapsed time is 3.837637 seconds. % FCT2 My solution
Elapsed time is 3.464652 seconds. % FCT3 Pursuit solution
Elapsed time is 3.414338 seconds. % FCT32 Pursuit solution with chappjc comment
Elapsed time is 4.097164 seconds. % FCT4 chappjc solution
Elapsed time is 0.936623 seconds. % FCT5 chappjc 2nd solution

However, the solutions based on sparse and accumarray only work with integer-valued data. They won't work with non-integer doubles.
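As an aside, the tic/toc loops above could also be replaced with timeit, which handles warm-up and repetition itself. The following is only a sketch under assumptions: it assumes a MATLAB release that ships timeit (R2013b or later), and the names col, fUnique and fAccum are hypothetical helpers that are not part of the original benchmark.

```matlab
% Sketch: time two of the candidates with timeit instead of tic/toc loops.
% The test data mirrors the setup above; timings will vary by machine.
dataArray = floor(10*rand(10e3,10));
dataArray(mod(dataArray(:,1),3)==0) = 0;
col = dataArray(:,1);

% Hypothetical helper handles (names are illustrative only).
fUnique = @() nonzeros(unique(col));                    % sort-based
fAccum  = @() find(accumarray(nonzeros(col)+1,1)) - 1;  % index-based, integer values only

tUnique = timeit(fUnique);
tAccum  = timeit(fAccum);
fprintf('unique:     %.6f s\n', tUnique);
fprintf('accumarray: %.6f s\n', tAccum);

% On integer-valued input both approaches should return the same values.
assert(isequal(fUnique(), fAccum()));
```

The final assert is a sanity check that the two candidates agree before their timings are compared.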

4 answers

#1

Here's a wacky suggestion with accumarray, demonstrated using Floris' test data:

a = floor(10*rand(100000, 1)); a(mod(a,3)==0)=0;
result = find(accumarray(nonzeros(a(:,1))+1,1))-1;

Thanks to Luis Mendo for pointing out that with nonzeros, it is not necessary to perform result = result(result>0)!

Note that this solution requires integer-valued data (not necessarily an integer data type, just values with no fractional part). Comparing floating-point values for equality, as unique does, is perilous. See here and here.
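To illustrate the floating-point caveat, here is a minimal sketch. It assumes a MATLAB release that provides uniquetol (R2015a or later); the variables x, exactVals and tolVals are made up for this example and do not come from the question.

```matlab
% Sketch: exact equality on doubles can split values that "look" equal.
x = [0.1+0.2; 0.3; 0; 0.3];        % 0.1+0.2 is not exactly 0.3 in double precision
nz = x(x ~= 0);                    % drop the zeros first

exactVals = unique(nz);            % exact comparison keeps both near-0.3 values
tolVals   = uniquetol(nz);         % default tolerance merges them

fprintf('unique:    %d values\n', numel(exactVals));
fprintf('uniquetol: %d values\n', numel(tolVals));
```

With exact comparison the two representations of "0.3" are reported separately, while the tolerance-based call treats them as one value.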

Original suggestion: Combine unique with setdiff:

result = setdiff(unique(a(:,1)),0)

Or remove with logical indexing after unique:

result = unique(a(:,1));
result = result(result>0);

I generally prefer not to assign [] as in (result(result==0)=[];) since it gets very inefficient for large data sets.

Removing zeros after unique should be faster since it then operates on less data (unless every element is unique, or a/dataArray is very short).

#2

Just to add to the general clamor - here are three different methods. They all give the same answer, but slightly different timings:

a = floor(10*rand(100000, 1));
a(mod(a,3)==0)=0;
tic
b1 = unique(a(:,1));
b1(b1==0) = [];
toc
tic
b2 = find(sparse(a(:,1)+1, 1, 1)) - 1;
b2(b2==0)=[];
toc
tic
b3 = setxor(0, a(:, 1), 'rows');
toc

display(b1)
display(b2)
display(b3)

On my machine, the timings (for an array of 100000 elements) were as follows:

0.0087 s  - for unique
0.0142 s  - for find(sparse)
0.0302 s  - for setxor

I always like sparse for a problem like this - you get the count of elements at the same time as their unique values.
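A small sketch of what "getting the count at the same time" could look like. The sample vector and the names vals, n and keep are illustrative assumptions, not taken from the benchmark above, and the input is assumed to be a column of non-negative integer values.

```matlab
% Sketch: sparse accumulates duplicate indices, so each nonzero entry
% of the sparse vector holds the number of occurrences of one value.
a = [0; 2; 2; 5; 0; 5; 5; 1];            % small integer-valued example

[idx, ~, n] = find(sparse(a + 1, 1, 1)); % idx = value + 1, n = occurrence count
vals = idx - 1;                          % shift back to the original values

keep = vals ~= 0;                        % drop 0 from the result
vals = vals(keep);
n    = n(keep);

disp([vals n])                           % one row per value: [value count]
```

Here the three-output form of find returns the row indices and the accumulated counts in one call, so no second pass over the data is needed.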

EDIT per @chappjc's suggestion: I added a fourth option.

b4 = find(accumarray(a(:,1)+1,1)) - 1; % subtract 1 from the indices, not from the counts
b4(b4==0) = [];

Time:

0.0029 s, THREE TIMES FASTER THAN UNIQUE

Ladies and gentlemen, we have a winner.

AFTERWORD: the index-based methods (sparse and accumarray) only work with integer-valued inputs (although they can be of double type). This seemed OK based on the input array given in the question, but it doesn't work for non-integer-valued inputs. Of course, unique is a tricky concept when you have doubles: numbers that "look" the same may be represented differently. You might consider rounding the input array (sanitizing the data) to make sure this is not a problem. For example, if you did

a = 0.001 * double(int32(a * 1000)); % int32() rounds to the nearest integer

You would round all values to no more than 3 decimal places, and because you went "via an int" you can be sure that you don't end up with values that are "very subtly different" (say in the 8th digit or beyond). Of course, in that case you could also do

a = round(a * 1000);
mina = min(a(:));
b = find(accumarray(a - mina + 1, 1)) + mina - 1;
b = 0.001 * b(b ~= 0);

This is "fairly robust" for non-integer values. In the above case it handles values with up to three decimal places; if you need more, the space requirements will eventually get too large and this method will be slower than unique, which in fact has to sort the data.

#3

Why not remove the zeros as a second step?

result2 = unique(.....);
result2 = result2(result2~=0); % index with the logical mask instead of overwriting with it

#4

I also found another way to do it:

result2 = unique(dataArray(dataArray(:,1)>0,1));
