查找两个数组之间的(multiset)差异

时间:2021-11-05 12:03:29

Given arrays (say row vectors) A and B, how do I find an array C such that merging B and C will give A?

给定数组(比如行向量)A和B,我如何找到一个数组C,使得合并B和C将给出A?

For example, given

例如,给定

A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 3, 5, 5];

then

C = multiset_diff(A, B) % Should be [4, 6, 4, 3, 1, 5]

(the order of the result does not matter here).

(结果的顺序在这里无关紧要)。

For the same A, if B = [2, 4, 5], then the result should be [6, 4, 3, 3, 1, 5, 5].

对于相同的A,如果B = [2,4,5],则结果应为[6,4,3,3,1,5,5]。

(Since there were two 4s in A and one 4 in B, the result C should have 2 - 1 = 1 4 in it. Similarly for the other values.)

(由于A中有两个4,B中有一个4,结果C中应该有2 - 1 = 1 4。其他值类似。)

PS: Note that setdiff would remove all instances of 2, 3, and 5, whereas here they need to be removed just however many times they appear in B.

PS:请注意,setdiff将删除2,3和5的所有实例,而在这里它们需要被删除,但是它们多次出现在B中。


Performance: I ran some quick-n-dirty benchmarks locally, here are the results for future reference:

性能:我在本地运行了一些快速的基准测试,以下是未来参考的结果:

  • @heigele's nested loop method performs best for small lengths of A (say upto N = 50 or so elements). It does 3x better for small (N=20) As, and 1.5x better for medium-sized (N=50) As, compared to the next best method - which is:

    @ heigele的嵌套循环方法对于小长度的A(最多N = 50左右的元素)表现最佳。对于小型(N = 20)As,它的性能提高3倍,对于中型(N = 50)As,它与下一个最佳方法相比提高了1.5倍 - 这是:

  • @obchardon's histc-based method. This is the one performs the best when A's size N starts to be 100 and above. For eg., this does 3x better than the above nested loop method when N = 200.

    @ obchardon的基于histc的方法。当A的尺寸N开始为100或更高时,这是最好的。例如,当N = 200时,这比上面的嵌套循环方法好3倍。

@matt's for+find method does comparably to the histc method for small N, but quickly degrades in performance for larger N (which makes sense since the entire C == B(x) comparison is run every iteration).

@ matt的for + find方法与小N的histc方法相当,但是对于较大的N,性能会迅速下降(这是有道理的,因为每次迭代都会运行整个C == B(x)比较)。

(The other methods are either several times slower or invalid at the time of writing.)

(其他方法在写作时要慢几倍或无效。)

5 个解决方案

#1


3  

Still another approach using the histc function:

使用histc函数的另一种方法:

A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 3, 5, 5];

uA  = unique(A);
hca = histc(A,uA); 
hcb = histc(B,uA);
res = repelem(uA,hca-hcb)

We simply calculate the number of repeated elements for each vectors according to the unique value of vector A, then we use repelem to create the result.

我们根据向量A的唯一值简单地计算每个向量的重复元素的数量,然后我们使用repelem来创建结果。

This solution do not preserve the initial order but it don't seems to be a problem for you.

此解决方案不保留初始订单,但对您来说似乎不是问题。

I use histc for Octave compatibility, but this function is deprecated so you can also use histcounts

我使用histc进行Octave兼容性,但不推荐使用此函数,因此您也可以使用histcounts

#2


4  

Here's a vectorized way. Memory-inefficient, mostly for fun:

这是一种矢量化的方式。内存效率低下,主要是为了娱乐:

tA = sum(triu(bsxfun(@eq, A, A.')), 1);
tB = sum(triu(bsxfun(@eq, B, B.')), 1);
result = setdiff([A; tA].', [B; tB].', 'rows', 'stable');
result = result(:,1).';

The idea is to make each entry unique by tagging it with an occurrence number. The vectors become 2-column matrices, setdiff is applied with the 'rows' option, and then the tags are removed from the result.

我们的想法是通过使用事件编号标记每个条目来使其唯一。向量成为2列矩阵,setdiff应用'rows'选项,然后从结果中删除标记。

#3


4  

I'm not a fan of loops, but for random perturbations of A this was the best I came up with.

我不是循环的粉丝,但对于A的随机扰动,这是我想出的最好的。

C = A;
for x = 1:numel(B)
C(find(C == B(x), 1, 'first')) = [];
end

I was curious about looking at the affect of different orders of A on a solution approach so I setup a test like this:

我很想知道不同的A顺序对解决方案的影响,所以我设置了这样的测试:

Ctruth = [1 3 3 4 5 5 6];
for testNumber = 1:100
    Atest = A(randperm(numel(A)));
    C = myFunction(Atest,B);
    C = sort(C);
    assert(all(C==Ctruth));
end

#4


3  

You can use the second output of ismember to find the indexes where elements of B are in A, and diff to remove duplicates:

您可以使用ismember的第二个输出来查找B的元素在A中的索引,并使用diff来删除重复项:

This answer assumes that B is already sorted. If that is not the case, B has to be sorted before executing above solution.

这个答案假设B已经排序。如果不是这种情况,则必须在执行上述解决方案之前对B进行排序。

For the first example:

对于第一个例子:

A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 3, 5, 5];
%B = sort(B); Sort if B is not sorted.
[~,col] = ismember(B,A);
indx = find(diff(col)==0);
col(indx+1) = col(indx)+1;
A(col) = [];
C = A;

>>C

4     6     4     3     1     5

For the second example:

对于第二个例子:

A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 4, 5, 5];
%B = sort(B); Sort if B is not sorted.
[~,col] = ismember(B,A);
indx = find(diff(col)==0);
col(indx+1) = col(indx)+1;
A(col) = [];
C = A;
>>C

6     4     3     3     1     5

#5


3  

Strongly inspired by Matt, but on my machine 40% faster:

受到Matt的强烈启发,但在我的机器上快了40%:

function A = multiDiff(A,B)
for j = 1:numel(B)
    for i = 1:numel(A)
        if A(i) == B(j)
            A(i) = [];
            break;
        end
    end
end
end

#1


3  

Still another approach using the histc function:

使用histc函数的另一种方法:

A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 3, 5, 5];

uA  = unique(A);
hca = histc(A,uA); 
hcb = histc(B,uA);
res = repelem(uA,hca-hcb)

We simply calculate the number of repeated elements for each vectors according to the unique value of vector A, then we use repelem to create the result.

我们根据向量A的唯一值简单地计算每个向量的重复元素的数量,然后我们使用repelem来创建结果。

This solution do not preserve the initial order but it don't seems to be a problem for you.

此解决方案不保留初始订单,但对您来说似乎不是问题。

I use histc for Octave compatibility, but this function is deprecated so you can also use histcounts

我使用histc进行Octave兼容性,但不推荐使用此函数,因此您也可以使用histcounts

#2


4  

Here's a vectorized way. Memory-inefficient, mostly for fun:

这是一种矢量化的方式。内存效率低下,主要是为了娱乐:

tA = sum(triu(bsxfun(@eq, A, A.')), 1);
tB = sum(triu(bsxfun(@eq, B, B.')), 1);
result = setdiff([A; tA].', [B; tB].', 'rows', 'stable');
result = result(:,1).';

The idea is to make each entry unique by tagging it with an occurrence number. The vectors become 2-column matrices, setdiff is applied with the 'rows' option, and then the tags are removed from the result.

我们的想法是通过使用事件编号标记每个条目来使其唯一。向量成为2列矩阵,setdiff应用'rows'选项,然后从结果中删除标记。

#3


4  

I'm not a fan of loops, but for random perturbations of A this was the best I came up with.

我不是循环的粉丝,但对于A的随机扰动,这是我想出的最好的。

C = A;
for x = 1:numel(B)
C(find(C == B(x), 1, 'first')) = [];
end

I was curious about looking at the affect of different orders of A on a solution approach so I setup a test like this:

我很想知道不同的A顺序对解决方案的影响,所以我设置了这样的测试:

Ctruth = [1 3 3 4 5 5 6];
for testNumber = 1:100
    Atest = A(randperm(numel(A)));
    C = myFunction(Atest,B);
    C = sort(C);
    assert(all(C==Ctruth));
end

#4


3  

You can use the second output of ismember to find the indexes where elements of B are in A, and diff to remove duplicates:

您可以使用ismember的第二个输出来查找B的元素在A中的索引,并使用diff来删除重复项:

This answer assumes that B is already sorted. If that is not the case, B has to be sorted before executing above solution.

这个答案假设B已经排序。如果不是这种情况,则必须在执行上述解决方案之前对B进行排序。

For the first example:

对于第一个例子:

A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 3, 5, 5];
%B = sort(B); Sort if B is not sorted.
[~,col] = ismember(B,A);
indx = find(diff(col)==0);
col(indx+1) = col(indx)+1;
A(col) = [];
C = A;

>>C

4     6     4     3     1     5

For the second example:

对于第二个例子:

A = [2, 4, 6, 4, 3, 3, 1, 5, 5, 5];
B = [2, 4, 5, 5];
%B = sort(B); Sort if B is not sorted.
[~,col] = ismember(B,A);
indx = find(diff(col)==0);
col(indx+1) = col(indx)+1;
A(col) = [];
C = A;
>>C

6     4     3     3     1     5

#5


3  

Strongly inspired by Matt, but on my machine 40% faster:

受到Matt的强烈启发,但在我的机器上快了40%:

function A = multiDiff(A,B)
for j = 1:numel(B)
    for i = 1:numel(A)
        if A(i) == B(j)
            A(i) = [];
            break;
        end
    end
end
end