Remove the smallest non-unique value from a vector

Time: 2022-10-27 04:44:23

I have an unsorted vector of doubles (actually objects with a double member, which is the value used in this case). From this vector I need to remove the smallest non-unique value. However, it is not guaranteed that a non-unique value exists. It is allowed to sort the range.

As always, I started by looking for a std::algorithm and found std::unique. My first idea was to use it in combination with std::sort to move all non-unique values to the end of the vector and then run min_element over those non-unique values. However, std::unique leaves the elements past the returned iterator in an unspecified state, and indeed I lost all non-POD members.
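
For illustration, here is a minimal sketch of that first idea and of where it breaks down (the function name is made up for this example):

#include <algorithm>
#include <vector>

// Sketch of the attempted approach: after sort + unique, only the prefix
// [begin, ret) is guaranteed to hold the de-duplicated values; the elements
// in [ret, end()) are left with unspecified values, so running min_element
// over that tail does not yield the smallest non-unique value.
void attempted_approach(std::vector<double>& v)
{
    std::sort(v.begin(), v.end());
    auto ret = std::unique(v.begin(), v.end());
    // Intended: auto smallest_dup = *std::min_element(ret, v.end());
    // but the values in [ret, v.end()) are unspecified, so this is not valid.
    (void)ret;
}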

Does anyone have a suggestion for how to do this efficiently? Efficiency is important since the code is used in the bottleneck of the program (which is already a bit too slow).

5 Answers

#1


4  

(I'm adding an additional answer, since (1) the focus of the first answer was using ready-made STL components, and (2) Howard Hinnant has since raised some interesting points.)

Thanks to Howard Hinnant for the principle of benchmarking the different approaches (as well as a very unique solution)! It has led to some stuff I personally find interesting (and don't entirely understand).

The execution of the benchmark, however, was somewhat problematic, IMHO.

The question stated that the problem was

... objects with a double member which is used in this case ... It is important to do it efficiently since the code is used in the bottleneck of the program

The test, however:

  • Performed the operation on ints, which is an advantage for the sorting-based mechanisms; while both comparisons and hashes are more expensive for doubles than for ints, the number of comparisons is Theta(n log(n)), while the number of hashes is O(n).

  • Took the body of my main function, and wrapped it up in a function (rather than a class object), and did not use a pool allocator. Frankly, I consider this a flaw that renders the results meaningless, as it basically established the well known fact that dynamic allocations + needless re-initializations of large containers, are expensive.

  • Relied on the fact that the sorting algorithms could just return the vector on which they operated (which could not be done for the original problem). In the following I let this point slide, as the problem for a vector of doubles is interesting in itself, but the OP should note that this might change things even more.


So, in order to deal with the second issue, I initially used the probing-based hash-table from my own gcc libstdc++ pb_ds extension. This by itself reduced the running time of solution #1 below that of solution #2 (sort + adjacent_find), but it still remained more expensive than #3 (make_heap).

In order to reduce this further, I used the most degenerate form of "hash table" that seemed relevant.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

template<typename T, class Hash=std::hash<T>>
class smallest_dup_remover
{
public:
    explicit smallest_dup_remover(std::size_t max_size) 
    {
        while(m_mask < max_size)
            m_mask *= 2;

        m_status.resize(m_mask);
        m_vals.resize(m_mask);

        --m_mask;
    }

    void operator()(std::vector<T> &vals)
    {
        std::fill(std::begin(m_status), std::end(m_status), 0);
        bool has = false;
        T min_;
        std::vector<T> spillover;
        spillover.reserve(vals.size());
        for(auto v: vals)
        {
            const std::size_t pos = m_hash(v) & m_mask;
            char &status = m_status[pos];
            switch(status)
            {
            case 0:
                status = 1;
                m_vals[pos] = v;
                break;
            case 1:
                if(m_vals[pos] == v)
                {
                    status = 2;
                    min_ = has? std::min(min_, v): v;
                    has = true;
                }
                else
                    spillover.push_back(v);
                break;
            case 2:
               if(m_vals[pos] != v)
                    spillover.push_back(v);
            }
        }
        std::sort(std::begin(spillover), std::end(spillover));
        auto it = std::adjacent_find(std::begin(spillover), std::end(spillover));
        if(has && it == std::end(spillover))
            remove_min(vals, min_);
        else if(has && it != std::end(spillover))
            remove_min(vals, std::min(min_, *it));
        else if(!has && it != std::end(spillover))
            remove_min(vals, *it);
    }

private:
    void remove_min(std::vector<T> &vals, T t)
    {
        vals.erase(std::find(vals.begin(), vals.end(), t)); 
    }

private:
    size_t m_mask = 1;
    std::vector<char> m_status;
    std::vector<T> m_vals;
    Hash m_hash;
};

The data structure contains three vectors:

  • a "status" vector, containing codes for 0, 1, and "many"

  • a "values" vector, containing "hashed values"

  • a "spillover" vector, for collisions

Objects with a "many" status are compared on the fly to track the minimum. Collision objects (that is, those that have collided with other objects) are pushed to the "spillover" vector. The spillover vector is scrutinized for the lowest duplicate using the method from #2. This is compared to the lowest value found among the "many" values.
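
A minimal usage sketch for the class above (the values are made up; the point is that the object is constructed once with the maximum expected size and then reused, so the per-call allocations criticized earlier are avoided):

#include <vector>

// assumes smallest_dup_remover from the listing above is in scope
int main()
{
    std::vector<double> vals{3.5, 1.25, 7.0, 1.25, 3.5, 9.0};

    smallest_dup_remover<double> remover(vals.size()); // size the tables once
    remover(vals); // erases one occurrence of 1.25, the smallest duplicate

    return 0;
}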


Here is the code for a benchmark that retests this benchmark, and here is the code producing the following graphs.

(Reminder #1 is hash-based, #2 is quicksort-based, and #3 is heap-based.)

Starting with the test performed by Howard Hinnant before (values are generated randomly from a range of size 1.5 times the length of the vector), here are the results:

[benchmark graph: the Hinnant test]

So indeed his excellent heap-based algorithm performs best for this case, but it looks very different than before. In particular, the hash-based algorithm is not as abysmal once its memory allocations are dealt with.

However, suppose we change the range to be completely random. Then the results change to this:

[benchmark graph: completely random values]

In this case, the hash-based solution works best, the sort-based one next, and the heap-based solution works worst.

In order to verify the reason, here are two more tests.

Here is a test with completely random values + two zero values (that is, the lowest duplicate is zero):

[benchmark graph: random values plus two zeros]

and finally, here is the case where all the values are generated from 100 possible values (irrespective of the length):

[benchmark graph: values drawn from 100 possibilities]

What's happening is as follows. The heap-based solution is the most distribution-dependent of the three. The MakeHeap algorithm is linear time, and if a duplicate is encountered almost immediately after that, it turns out to be a linear algorithm (but without any hashing). Conversely, at the other extreme, there are no duplicates at all; essentially, the algorithm then becomes heapsort. The inferiority of heapsort to quicksort is both well understood theoretically and well verified in practice.

So, the heap-based algorithm is actually a surprising and nice algorithm. It does have high variance, though, and that might be a reason to avoid it in practice.


Some observations:

  • The graphs don't seem to make sense: where is the n log(n) behavior, at least for solution #2?

  • Why does the Hinnant test behave so similarly to the random + low-repetition tests? With a 1.5X range, and given that this is very similar to bootstrap resampling, with the known ~37% repetition result, I just don't see it.

  • As Howard Hinnant noted, it really depends on the distribution. However, the situation is very far from the previous benchmark.

  • Some practical points:

    • OP, you might want to re-time this for your original problem, using the true distributions and the overhead of twice copying your original vector of structs to and from the sorting solutions.

    • I've thought a lot about how to parallelize this problem, without coming up with anything interesting. One way to do so (possibly raising more questions than answers) is to run Howard Hinnant's solution on one thread and a different one on another, and use the first result found. Given that it is so much faster for some distributions and so much slower for others, it might cover your bases.

#2


7  

You can do this in (expected) linear time.

  • use an unordered_map to count the elements. This is (expected) linear in the number of values.

  • Find the smallest item among non-uniques using a naive loop.

Here is a possible implementation:

#include <unordered_map>
#include <iostream>
#include <vector>

using namespace std;

int main()
{
    const vector<double> elems{1, 3.2, 3.2, 2};
    unordered_map<double, size_t> c;
    for(const double &d: elems)
        ++c[d];
    bool has = false;
    double min_ = 0; // initialized so the final print is well-defined even when no duplicate exists
    for(const auto &e: c)
        if(e.second > 1)
        {
            min_ = has? min(e.first, min_): e.first;
            has = true;
        }
    cout << boolalpha << has << " " << min_ << endl;
    return 0;
}

Edit: As Howard Hinnant and Lightness Races in Orbit have pointed out, this involves both allocations and hashes. It will therefore be linear, but with a relatively large constant factor. Other, sorting-based solutions might work better for small sizes. When/if profiling, it's important to use a good allocator, e.g., Google's tcmalloc.

#3


6  

Well, if you can sort the range then this is easy. Sort it in ascending order, then just iterate through until you encounter two equivalent, adjacent elements. Done.

Something like this:

template <typename T>
T findSmallestNonunique(std::vector<T> v)
{
   std::sort(std::begin(v), std::end(v));
   auto it = std::adjacent_find(std::begin(v), std::end(v));
   if (it == std::end(v))
      throw std::runtime_error("No such element found");
   return *it;
}

Here's a demonstration:

#include <vector>
#include <algorithm>
#include <stdexcept>
#include <iostream>

template <typename Container>
typename Container::value_type findSmallestNonunique(Container c)
{
   std::sort(std::begin(c), std::end(c));
   auto it = std::adjacent_find(std::begin(c), std::end(c));

   if (it == std::end(c))
      throw std::runtime_error("No such element found");

   return *it;
}

int main(int argc, char** argv)
{
    std::vector<int> v;
    for (int i = 1; i < argc; i++)
        v.push_back(std::stoi(argv[i]));

    std::cout << findSmallestNonunique(v) << std::endl;
}

// g++ -std=c++14 -O2 -Wall -pedantic -pthread main.cpp \
// && ./a.out 1 2 2 3 4 5 5 6 7 \
// && ./a.out 5 2 8 3 9 3 0 1 4 \
// && ./a.out 5 8 9 2 0 1 3 4 7
// 
// 2
// 3
// terminate called after throwing an instance of 'std::runtime_error'
//   what():  No such element found

Note that here I'm not performing the search in-place, but I could do so by taking the container by reference. (This relies on you being allowed to sort the original input.)
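
Since the question is actually about objects with a double member, here is a hedged sketch of the same sort + adjacent_find idea adapted to such a type, taking the container by reference and erasing in place (the struct and its member name are invented for this example):

#include <algorithm>
#include <vector>

struct Item { double key; /* other members */ }; // illustrative type

// Sort by the double member and erase the smallest element whose key occurs
// more than once; do nothing if all keys are unique.
void eraseSmallestNonuniqueByKey(std::vector<Item>& v)
{
    std::sort(v.begin(), v.end(),
              [](const Item& a, const Item& b) { return a.key < b.key; });
    auto it = std::adjacent_find(v.begin(), v.end(),
              [](const Item& a, const Item& b) { return a.key == b.key; });
    if (it != v.end())
        v.erase(it);
}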

Due to the sort operation, this could be as "bad" as O(N×log(N)), but it's simple and easy to maintain, and does not require any allocations/copies (except for a single copy of the entire dataset which, as stated above, you may trivially completely avoid). You may want to use something else if your input is large or you are expecting to have a failed match in a majority of cases. As always: profile!

#4


6  

Well, here's an algorithm that actually removes the smallest non-unique item (as opposed to just printing it).

template <typename Container>
void
removeSmallestNonunique(Container& c)
{
    using value_type = typename Container::value_type;
    if (c.size() > 1)
    {
        // Build a min-heap, then pop elements to the back of the range in
        // ascending order; the first time two consecutively popped values
        // compare equal, that value is the smallest non-unique one, and one
        // occurrence of it is erased.
        std::make_heap(c.begin(), c.end(), std::greater<value_type>{});
        std::pop_heap(c.begin(), c.end(), std::greater<value_type>{});
        for (auto e = std::prev(c.end()); e != c.begin(); --e)
        {
            std::pop_heap(c.begin(), e, std::greater<value_type>{});
            if (*e == e[-1])    // previously popped value equals the newly popped one
            {
                c.erase(e);
                break;
            }
        }
    }
}

I chose this algorithm mainly because Lightness Races in Orbit didn't. I don't know whether this will be faster than sort/adjacent_find or not. And the answer almost certainly depends on the input.

For example if there are no duplicates, then this algorithm is definitely slower than sort/adjacent_find. If the input is very, very large, and the minimum non-unique is likely to be early in the sorted range, this algorithm is likely to be faster than sort/adjacent_find.

And everything I've said above is just guessing. I'm quitting prior to performing the needed measurements on statistically likely inputs to the actual problem.

Perhaps Omid can include this in his test and provide a summary answer. :-)

7 hours later ... Timings

I took Omid's code, corrected a minor mistake in it, corrected the other two algorithms to actually erase an element, and changed the test harness to vary the size and the number of duplicates more widely.

Here is the code I tested using clang / libc++ at -O3:

#include <unordered_map>
#include <iostream>
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>
#include <cassert>

template <typename Container>
void
erase_using_hashTable(Container& vec)
{
    using T = typename Container::value_type;
    std::unordered_map<T, int> c;
    for (const auto& elem : vec){
        ++c[elem];
    }
    bool has = false;
    T min_;
    for (const auto& e : c)
    {
        if (e.second > 1)
        {
            min_ = has ? std::min(e.first, min_) : e.first;
            has = true;
        }
    }
    if (has)
        vec.erase(std::find(vec.begin(), vec.end(), min_));
}

template <typename Container>
void 
eraseSmallestNonunique(Container& c)
{
   std::sort(std::begin(c), std::end(c));
   auto it = std::adjacent_find(std::begin(c), std::end(c));

   if (it != std::end(c))
       c.erase(it);
}

template <typename Container>
void
removeSmallestNonunique(Container& c)
{
    using value_type = typename Container::value_type;
    if (c.size() > 1)
    {
        std::make_heap(c.begin(), c.end(), std::greater<value_type>{});
        std::pop_heap(c.begin(), c.end(), std::greater<value_type>{});
        for (auto e = std::prev(c.end()); e != c.begin(); --e)
        {
            std::pop_heap(c.begin(), e, std::greater<value_type>{});
            if (*e == e[-1])
            {
                c.erase(e);
                break;
            }
        }
    }
}

template<typename iterator>
iterator partition_and_find_smallest_duplicate(iterator begin, iterator end)
{
    using std::swap;
    if (begin == end)
        return end; // empty sequence

    // The range begin,end is split in four partitions:
    // 1. equal to the pivot
    // 2. smaller than the pivot
    // 3. unclassified
    // 4. greater than the pivot

    // pick pivot (TODO: randomize pivot?)
    iterator pivot = begin;
    iterator first = next(begin);
    iterator last = end;

    while (first != last) {
        if (*first > *pivot) {
            --last;
            swap(*first, *last);
        } else if (*first < *pivot) {
            ++first;
        } else {
            ++pivot;
            swap(*pivot, *first);
            ++first;
        }
    }

    // look for duplicates in the elements smaller than the pivot
    auto res = partition_and_find_smallest_duplicate(next(pivot), first);
    if (res != first)
        return res;

    // if we have more than just one equal to the pivot, it is the smallest duplicate
    if (pivot != begin)
        return pivot;

    // neither, look for duplicates in the elements greater than the pivot
    return partition_and_find_smallest_duplicate(last, end);
}

template<typename container>
void remove_smallest_duplicate(container& c)
{
    using std::swap;
    auto it = partition_and_find_smallest_duplicate(c.begin(), c.end());
    if (it != c.end())
    {
        swap(*it, c.back());
        c.pop_back();
    }
}

int  main()
{
    const int MaxArraySize = 5000000;
    const int minArraySize = 5;
    const int numberOfTests = 3;

    //std::ofstream file;
    //file.open("test.txt");
    std::mt19937 generator;

    for (int t = minArraySize; t <= MaxArraySize; t *= 10)
    {
        const int range = 3*t/2;
        std::uniform_int_distribution<int> distribution(0,range);

        std::cout << "Array size = " << t << "  range = " << range << '\n';

        std::chrono::duration<double> avg{},avg2{}, avg3{}, avg4{};
        for (int n = 0; n < numberOfTests; n++)
        {
            std::vector<int> save_vec;
            save_vec.reserve(t);
            for (int i = 0; i < t; i++){  // fill the array with random elements
                save_vec.push_back(distribution(generator));
            }
            //method1
            auto vec = save_vec;
            auto start = std::chrono::steady_clock::now();
            erase_using_hashTable(vec);
            auto end = std::chrono::steady_clock::now();
            avg += end - start;
            auto answer1 = vec;
            std::sort(answer1.begin(), answer1.end());

            //method2
            vec = save_vec;
            start = std::chrono::steady_clock::now();
            eraseSmallestNonunique(vec);
            end = std::chrono::steady_clock::now();
            avg2 += end - start;
            auto answer2 = vec;
            std::sort(answer2.begin(), answer2.end());
            assert(answer2 == answer1);

            //method3
            vec = save_vec;
            start = std::chrono::steady_clock::now();
            removeSmallestNonunique(vec);
            end = std::chrono::steady_clock::now();
            avg3 += end - start;
            auto answer3 = vec;
            std::sort(answer3.begin(), answer3.end());
            assert(answer3 == answer2);

            //method4
            vec = save_vec;
            start = std::chrono::steady_clock::now();
            remove_smallest_duplicate(vec);
            end = std::chrono::steady_clock::now();
            avg4 += end - start;
            auto answer4 = vec;
            std::sort(answer4.begin(), answer4.end());
            assert(answer4 == answer3);
        }
        //file << avg/numberOfTests <<" "<<avg2/numberOfTests<<'\n';
        //file << "__\n";
        std::cout <<   "Method1 : " << (avg  / numberOfTests).count() << 's'
                  << "\nMethod2 : " << (avg2 / numberOfTests).count() << 's'
                  << "\nMethod3 : " << (avg3 / numberOfTests).count() << 's'
                  << "\nMethod4 : " << (avg4 / numberOfTests).count() << 's'
                  << "\n\n";
    }

}

And here are my results:

Array size = 5  range = 7
Method1 : 8.61967e-06s
Method2 : 1.49667e-07s
Method3 : 2.69e-07s
Method4 : 2.47667e-07s

Array size = 50  range = 75
Method1 : 2.0749e-05s
Method2 : 1.404e-06s
Method3 : 9.23e-07s
Method4 : 8.37e-07s

Array size = 500  range = 750
Method1 : 0.000163868s
Method2 : 1.6899e-05s
Method3 : 4.39767e-06s
Method4 : 3.78733e-06s

Array size = 5000  range = 7500
Method1 : 0.00124788s
Method2 : 0.000258637s
Method3 : 3.32683e-05s
Method4 : 4.70797e-05s

Array size = 50000  range = 75000
Method1 : 0.0131954s
Method2 : 0.00344415s
Method3 : 0.000346838s
Method4 : 0.000183092s

Array size = 500000  range = 750000
Method1 : 0.25375s
Method2 : 0.0400779s
Method3 : 0.00331022s
Method4 : 0.00343761s

Array size = 5000000  range = 7500000
Method1 : 3.82532s
Method2 : 0.466848s
Method3 : 0.0426554s
Method4 : 0.0278986s

Update

I've updated results above with Ulrich Eckhardt's algorithm. His algorithm is quite competitive. Nice job Ulrich!

I should warn the readers of this answer that Ulrich's algorithm is vulnerable to the "quick-sort O(N^2) problem" wherein for specific inputs the algorithm can degenerate badly. The general algorithm is fixable, and Ulrich is obviously aware of the vulnerability as evidenced by this comment:

// pick pivot (TODO: randomize pivot?)

That is one defense against the O(N^2) problem, and there are others, such as detecting unreasonable recursion/iteration and switching to another algorithm mid-stream (such as Method 3 or Method 2). As written Method 4 is badly impacted when given an in-order sequence, and is disastrous when given a reverse-order sequence. On my platform Method 3 is also sub-optimal to Method 2 for these cases, though not nearly as bad as Method 4.
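
As a sketch of that first defense, one could swap a randomly chosen element to the front of the range before partitioning, so that the existing "pivot = begin" logic effectively sees a random pivot (this helper is not part of Ulrich's code):

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <random>

// Swap a random element of [begin, end) to the front, so a partition that
// always picks *begin as the pivot gets a randomized pivot.
template <typename Iterator>
void randomize_pivot(Iterator begin, Iterator end)
{
    static std::mt19937 gen{std::random_device{}()};
    const auto n = std::distance(begin, end);
    if (n > 1)
    {
        std::uniform_int_distribution<std::ptrdiff_t> dist(0, n - 1);
        std::iter_swap(begin, std::next(begin, dist(gen)));
    }
}
// e.g. call randomize_pivot(begin, end) at the top of
// partition_and_find_smallest_duplicate before choosing the pivot.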

Finding the ideal technique for combating the O(N^2) problem for quick-sort-like algorithms is somewhat of a black-art, but well worth the time. I would definitely consider Method 4 as a valuable tool in the toolbox.

#5


5  

Firstly, concerning the task of removing the element, the hardest thing is to find it; actually removing it is easy (swap with the last element and then pop_back()). I'll therefore only address the finding. Also, you mention that sorting the sequence is acceptable, but I take that to mean that not just sorting but any kind of reordering is acceptable.

Take a look at the quicksort algorithm. It picks a random element and then partitions the sequence to its left and right. If you write the partitioning so that it distinguishes not just "less" and "not less" but also "equal", you can partially sort the sequence and locate the smallest duplicate as you go.

The following steps should do the job:

  • First, pick a random pivot and partition the sequence. At the same time, you can detect whether the pivot is a duplicate or not. Note that if you find a duplicate here, you can discard(!) anything greater, so you don't even have to invest storage capacity and bandwidth into both partitions.

  • Then, recurse to the sequence of smaller elements.

  • If there was a set of duplicates in the smaller partition, those are your solution.

  • If the first pivot had duplicates, that is your solution.

  • Else, recurse to the larger elements looking for duplicates.

The advantage over regular sorting is that you don't sort the whole sequence if you find duplicates in the lower numbers. On a sequence of unique numbers, you will fully sort them though. Compared with the suggested counting of elements using a hashmap, this indeed has a higher asymptotic complexity. Whether it performs better depends on your implementation and input data though.

Note that this requires that the elements can be sorted and compared. You mention that you use double values, which are notoriously bad to sort when you have NaNs in there. I could imagine that the hashing algorithms in the standard containers work with NaNs though, so that's one more point in favour of counting with a hash map.
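
If NaNs can actually occur in the data, one hedged way to keep the sorting-based approaches well-defined is to move the NaNs out of the range first, since comparing through operator< with NaN breaks the strict weak ordering std::sort requires (this is a sketch, not part of the original answer):

#include <algorithm>
#include <cmath>
#include <vector>

// Move all NaNs to the back and return the start of the NaN tail; the
// prefix [v.begin(), it) can then be sorted or partitioned safely. Note
// that NaN != NaN, so under operator== a NaN never counts as a duplicate.
std::vector<double>::iterator exclude_nans(std::vector<double>& v)
{
    return std::partition(v.begin(), v.end(),
                          [](double d) { return !std::isnan(d); });
}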

The following code implements the above algorithm. It uses one recursive function to partition the input and to find duplicates, called from a second function that then finally removes the duplicate:

#include <vector>
#include <algorithm>
#include <iostream>

template<typename iterator>
iterator partition_and_find_smallest_duplicate(iterator begin, iterator end)
{
    using std::swap;

    std::cout << "find_duplicate(";
    for (iterator it=begin; it!=end; ++it)
        std::cout << *it << ", ";
    std::cout << ")\n";
    if (begin == end)
        return end; // empty sequence

    // The range begin,end is split in four partitions:
    // 1. equal to the pivot
    // 2. smaller than the pivot
    // 3. unclassified
    // 4. greater than the pivot

    // pick pivot (TODO: randomize pivot?)
    iterator pivot = begin;
    std::cout << "picking pivot: " << *pivot << '\n';

    iterator first = next(begin);
    iterator last = end;

    while (first != last) {
        if (*first > *pivot) {
            --last;
            swap(*first, *last);
        } else if (*first < *pivot) {
            ++first;
        } else {
            ++pivot;
            swap(*pivot, *first);
            ++first;
            std::cout << "found duplicate of pivot\n";
        }
    }

    // look for duplicates in the elements smaller than the pivot
    auto res = partition_and_find_smallest_duplicate(next(pivot), first);
    if (res != first)
        return res;

    // if we have more than just one equal to the pivot, it is the smallest duplicate
    if (pivot != begin)
        return pivot;

    // neither, look for duplicates in the elements greater than the pivot
    return partition_and_find_smallest_duplicate(last, end);
}

template<typename container>
void remove_smallest_duplicate(container& c)
{
    using std::swap;
    auto it = partition_and_find_smallest_duplicate(c.begin(), c.end());
    if (it != c.end())
    {
        std::cout << "removing duplicate: " << *it << std::endl;

        // swap with the last element before popping
        // to avoid copying the elements in between
        swap(*it, c.back());
        c.pop_back();
    }
}

int main()
{
    std::vector<int> data = {66, 3, 11, 7, 75, 62, 62, 52, 9, 24, 58, 72, 37, 2, 9, 28, 15, 58, 3, 60, 2, 14};

    remove_smallest_duplicate(data);
}
