找到数组中最常见的条目

You are given a 32-bit unsigned integer array with length up to 2³², with the property that more than half of the entries in the array are equal to N, for some 32-bit unsigned integer N. Find N looking at each number in the array only once and using at most 2 kB of memory.

您将获得一个32位无符号整数数组,其长度最多为232,并且对于某些32位无符号整数N,该数组中超过一半条目的属性等于N.找到N查看每个数字该阵列只有一次,最多使用2 kB内存。

Your solution must be deterministic, and guaranteed to find N.

您的解决方案必须是确定性的,并保证找到N.

8 个解决方案

#1

Keep one integer for each bit, and increment this collection appropriately for each integer in the array.

为每个位保留一个整数,并为数组中的每个整数适当增加此集合。

At the end, some of the bits will have a count higher than half the length of the array - those bits determine N. Of course, the count will be higher than the number of times N occurred, but that doesn't matter. The important thing is that any bit which isn't part of N cannot occur more than half the times (because N has over half the entries) and any bit which is part of N must occur more than half the times (because it will occur every time N occurs, and any extras).

最后,一些位的计数将高于数组长度的一半 - 这些位确定N.当然,计数将高于N发生的次数,但这并不重要。重要的是,任何不属于N的位都不能超过一半的次数(因为N超过一半的条目)并且任何属于N的位必须出现超过一半的次数(因为它会发生)每次N发生,以及任何额外的)。

(No code at the moment - about to lose net access. Hopefully the above is clear enough though.)

(目前没有代码 - 即将失去网络访问权限。希望以上内容足够明确。)

#2

Boyer and Moore's "Linear Time Majority Vote Algorithm" - go down the array maintaining your current guess at the answer.

Boyer和Moore的“线性时间多数投票算法” - 沿阵列向下,保持您当前的猜测答案。

#3

You can do this with only two variables.

你只需要两个变量就可以做到这一点。

public uint MostCommon(UInt32[] numberList)
{
    uint suspect = 0;
    int suspicionStrength = -1; 
    foreach (uint number in numberList)
    {
        if (number==suspect)
        {
            suspicionStrength++;
        }
        else
        {
            suspicionStrength--;
        }

        if (suspicionStrength<=0)
        {
            suspect = number;
        }
    }
    return suspect;
}

Make the first number the suspect number, and continue looping through the list. If the number matches, increase the suspicion strength by one; if it doesn't match, lower the suspicion strength by one. If the suspicion strength hits 0 the current number becomes the suspect number. This will not work to find the most common number, only a number that is more than 50% of the group. Resist the urge to add a check if suspicionStrength is greater than half the list length - it will always result in more total comparisons.

将第一个数字设为可疑号码,然后继续循环浏览列表。如果数字匹配,将怀疑强度提高一个;如果不匹配,请将怀疑强度降低一个。如果怀疑强度达到0,则当前数字成为可疑数字。这不能找到最常见的数字,只能找到超过该组50%的数字。如果suspicionStrength大于列表长度的一半,则拒绝添加检查的冲动 - 它总是会导致更多的总比较。

P.S. I have not tested this code - use it at your own peril.

附:我没有测试过这段代码 - 使用它自担风险。

#4

Pseudo code (notepad C++ :-)) for Jon's algorithm:

Jon的算法的伪代码(记事本C ++ :-)):

int lNumbers = (size_of(arrNumbers)/size_of(arrNumbers[0]);

for (int i = 0; i < lNumbers; i++)
  for (int bi = 0; bi < 32; bi++)
    arrBits[i] = arrBits[i] + (arrNumbers[i] & (1 << bi)) == (1 << bi) ? 1 : 0;

int N = 0;

for (int bc = 0; bc < 32; bc++)
  if (arrBits[bc] > lNumbers/2)
    N = N | (1 << bc);

#5

Notice that if the sequence a0, a1, . . . , an−1 contains a leader, then after removing a pair of elements of different values, the remaining sequence still has the same leader. Indeed, if we remove two different elements then only one of them could be the leader. The leader in the new sequence occurs more than n/2 − 1 = (n−2)/2 times. Consequently, it is still the leader of the new sequence of n − 2 elements.

请注意,如果序列a0,a1,... 。。 ,an-1包含一个领导者,然后在删除一对不同值的元素后,剩下的序列仍然具有相同的领导者。实际上,如果我们删除两个不同的元素,那么其中只有一个可能是领导者。新序列中的领导者发生超过n / 2 - 1 =(n-2)/ 2次。因此,它仍然是新的n - 2元素序列的领导者。

Here is a Python implementation, with O(n) time complexity:

这是一个Python实现,具有O(n)时间复杂度:

def goldenLeader(A):
    n = len(A)
    size = 0
    for k in xrange(n):
        if (size == 0):
            size += 1
            value = A[k]
        else:
            if (value != A[k]):
                size -= 1
            else:
                size += 1
    candidate = -1
    if (size > 0):
        candidate = value
    leader = -1
    count = 0
    for k in xrange(n):
        if (A[k] == candidate):
            count += 1
    if (count > n // 2):
        leader = candidate
    return leader

#6

This is a standard problem in streaming algorithms (where you have a huge (potentially infinite) stream of data) and you have to calculate some statistics from this stream, passing through this stream once.

这是流式算法中的标准问题(您有一个巨大的(可能是无限的)数据流),您必须从此流计算一些统计信息,并通过此流一次。

Clearly you can approach it with hashing or sorting, but with potentially infinite stream you clearly run out of memory. So you have to do something smart here.

显然,您可以使用散列或排序来处理它,但是如果有可能无限的流,则显然会耗尽内存。所以你必须在这里做一些聪明的事情。

The majority element is the element that occurs more than half of the size of the array. This means that the majority element occurs more than all other elements combined or if you count the number of times, majority element appears, and subtract the number of all other elements, you will get a positive number.

多数元素是发生超过数组大小一半的元素。这意味着多数元素的组合比所有其他元素组合的多,或者如果计算多数元素的次数,并且减去所有其他元素的数量,您将得到一个正数。

So if you count the number of some element, and subtract the number of all other elements and get the number 0 - then your original element can't be a majority element. This if the basis for a correct algorithm:

因此,如果计算某个元素的数量,并减去所有其他元素的数量并得到数字0 - 那么您的原始元素不能是多数元素。这个如果是正确算法的基础:

Have two variables, counter and possible element. Iterate the stream, if the counter is 0 - your overwrite the possible element and initialize the counter, if the number is the same as possible element - increase the counter, otherwise decrease it. Python code:

有两个变量,计数器和可能的元素。迭代流,如果计数器为0 - 你覆盖可能的元素并初始化计数器,如果数字与可能的元素相同 - 增加计数器,否则减少它。 Python代码:

def majority_element(arr):
    counter, possible_element = 0, None
    for i in arr:
        if counter == 0:
            possible_element, counter = i, 1
        elif i == possible_element:
            counter += 1
        else:
            counter -= 1

    return possible_element

It is clear to see that the algorithm is O(n) with a very small constant before O(n) (like 3). Also it looks like the space complexity is O(1), because we have only three variable initialized. The problem is that one of these variables is a counter which potentially can grow up to n (when the array consists of the same numbers). And to store the number n you need O(log (n)) space. So from theoretical point of view it is O(n) time and O(log(n)) space. From practical, you can fit 2^128 number in a longint and this number of elements in the array is unimaginably huge.

很明显,算法是O(n),在O(n)之前具有非常小的常数(如3)。它看起来像空间复杂度是O(1),因为我们只有三个变量初始化。问题是这些变量之一是一个可能长到n的计数器(当数组由相同的数字组成时)。并且要存储数字n,您需要O(log(n))空间。因此从理论的角度来看,它是O(n)时间和O(log(n))空间。从实际情况来看,你可以在longint中拟合2 ^ 128个数字,并且数组中的这些元素数量难以想象。

Also note that the algorithm works only if there is a majority element. If such element does not exist it will still return some number, which will surely be wrong. (it is easy to modify the algorithm to tell whether the majority element exists)

另请注意,该算法仅在存在多数元素时才有效。如果这样的元素不存在,它仍然会返回一些数字,这肯定是错误的。 (很容易修改算法来判断多数元素是否存在)

History channel: this algorithm was invented somewhere in 1982 by Boyer, Moore and called Boyer–Moore majority vote algorithm.

历史频道:这个算法是1982年由Boyer,Moore发明的,并称为Boyer-Moore多数投票算法。

#7

I have recollections of this algorithm, which might or might not follow the 2K rule. It might need to be rewritten with stacks and the like to avoid breaking the memory limits due to function calls, but this might be unneeded since it only ever has a logarithmic number of such calls. Anyhow, I have vague recollections from college or a recursive solution to this which involved divide and conquer, the secret being that when you divide the groups in half, at least one of the halves still has more than half of its values equal to the max. The basic rule when dividing is that you return two candidate top values, one of which is the top value and one of which is some other value (that may or may not be 2nd place). I forget the algorithm itself.

我有这个算法的回忆,它可能会或可能不会遵循2K规则。它可能需要用堆栈等重写,以避免因函数调用而破坏内存限制,但这可能是不必要的,因为它只有这种调用的对数。无论如何,我对大学的模糊回忆或对此的递归解决方案涉及分而治之,秘诀是当你将组分成两半时,至少有一半的一半仍然有一半以上的值等于最大值。除法时的基本规则是返回两个候选顶值,其中一个是最高值,另一个是其他值(可能是也可能不是第二位)。我忘了算法本身。

#8

Proof of correctness for buti-oxa / Jason Hernandez's answer, assuming Jason's answer is the same as buti-oxa's answer and both work the way the algorithm described should work:

buti-oxa / Jason Hernandez答案的正确性证明,假设Jason的答案与buti-oxa的答案相同,并且两者的工作方式与所描述的算法应该有效:

We define adjusted suspicion strength as being equal to suspicion strength if top value is selected or -suspicion strength if top value is not selected. Every time you pick the right number, the current adjusted suspicion strength increases by 1. Each time you pick a wrong number, it either drops by 1 or increases by 1, depending on if the wrong number is currently selected. So, the minimum possible ending adjusted suspicion strength is equal to number-of[top values] - number-of[other values]

如果选择最高值,我们将调整后的怀疑强度定义为等于怀疑强度,或者如果未选择最高值,则将调整后的怀疑强度定义为-suspicion strength。每次选择正确的数字时,当前调整的怀疑强度会增加1.每次选错号码时,它会下降1或增加1,具体取决于当前是否选择了错误的数字。因此,最小可能的结束调整的怀疑强度等于[最高值]的数量 - [其他值]的数量

#1