Why does .NET GroupBy get much slower when the number of buckets increases?

Date: 2021-02-04 00:58:32

Given this simple piece of code and a 10 million element array of random numbers:

static int Main(string[] args)
{
    int size = 10000000;
    int num = 10; // increase num to reduce the number of buckets
    int numOfBuckets = size / num;
    int[] ar = new int[size];
    Random r = new Random(); // initialize with random numbers
    for (int i = 0; i < size; i++)
        ar[i] = r.Next(size);

    var s = new Stopwatch();
    s.Start();
    var group = ar.GroupBy(i => i / num);
    var l = group.Count();
    s.Stop();

    Console.WriteLine(s.ElapsedMilliseconds);
    Console.ReadLine();
    return 0;
}

I did some performance testing on grouping: when the number of buckets is 10k the estimated execution time is 0.7s, for 100k buckets it is 2s, and for 1m buckets it is 7.5s.

I wonder why that is. I imagine that if GroupBy is implemented using a hash table, there might be a problem with collisions. For example, initially the hash table is prepared to work for, say, 1000 groups, and then when the number of groups grows it needs to increase its size and do the rehashing. If that were the case, I could write my own grouping where I would initialize the hash table with the expected number of buckets; I did that, but it was only slightly faster.
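
For reference, a minimal sketch of that pre-sized grouping idea (a reconstruction for illustration, not the exact code I used), reusing ar, num and numOfBuckets from the snippet above:

var groups = new Dictionary<int, List<int>>(numOfBuckets); // capacity known up front
foreach (var item in ar)
{
    int key = item / num;
    List<int> bucket;
    if (!groups.TryGetValue(key, out bucket))
        groups.Add(key, bucket = new List<int>());
    bucket.Add(item);
}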

So my question is: why does the number of buckets influence GroupBy performance that much?

EDIT: running under release mode changes the results to 0.55s, 1.6s, and 6.5s respectively.

I also changed group.ToArray() to the piece of code below, just to force execution of the grouping:

foreach (var g in group)
    array[g.Key] = 1;  

where array is initialized before the timer with the appropriate size; the results stayed almost the same.

EDIT2: You can see the working code from mellamokb here: pastebin.com/tJUYUhGL

5 Answers

#1 (13 votes)

I'm pretty certain this is showing the effects of memory locality (various levels of caching) and also object allocation.

To verify this, I took three steps:

  • Improve the benchmarking to avoid unnecessary parts and to garbage collect between tests

  • Remove the LINQ part by populating a Dictionary (which is effectively what GroupBy does behind the scenes)

  • Remove even Dictionary<,> and show the same trend for plain arrays.

In order to show this for arrays, I needed to increase the input size, but it does show the same kind of growth.

Here's a short but complete program which can be used to test both the dictionary and the array side - just flip which line is commented out in the middle:

using System;
using System.Collections.Generic;
using System.Diagnostics;

class Test
{    
    const int Size = 100000000;
    const int Iterations = 3;

    static void Main()
    {
        int[] input = new int[Size];
        // Use the same seed for repeatability
        var rng = new Random(0);
        for (int i = 0; i < Size; i++)
        {
            input[i] = rng.Next(Size);
        }

        // Switch to PopulateArray to change which method is tested
        Func<int[], int, TimeSpan> test = PopulateDictionary;

        for (int buckets = 10; buckets <= Size; buckets *= 10)
        {
            TimeSpan total = TimeSpan.Zero;
            for (int i = 0; i < Iterations; i++)
            {
                // Switch which line is commented to change the test
                // total += PopulateDictionary(input, buckets);
                total += PopulateArray(input, buckets);
                GC.Collect();
                GC.WaitForPendingFinalizers();
            }
            Console.WriteLine("{0,9}: {1,7}ms", buckets, (long) total.TotalMilliseconds);
        }
    }

    static TimeSpan PopulateDictionary(int[] input, int buckets)
    {
        int divisor = input.Length / buckets;
        var dictionary = new Dictionary<int, int>(buckets);
        var stopwatch = Stopwatch.StartNew();
        foreach (var item in input)
        {
            int key = item / divisor;
            int count;
            dictionary.TryGetValue(key, out count);
            count++;
            dictionary[key] = count;
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    static TimeSpan PopulateArray(int[] input, int buckets)
    {
        int[] output = new int[buckets];
        int divisor = input.Length / buckets;
        var stopwatch = Stopwatch.StartNew();
        foreach (var item in input)
        {
            int key = item / divisor;
            output[key]++;
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}

Results on my machine:

PopulateDictionary:

       10:   10500ms
      100:   10556ms
     1000:   10557ms
    10000:   11303ms
   100000:   15262ms
  1000000:   54037ms
 10000000:   64236ms // Why is this slower? See later.
100000000:   56753ms 

PopulateArray:

       10:    1298ms
      100:    1287ms
     1000:    1290ms
    10000:    1286ms
   100000:    1357ms
  1000000:    2717ms
 10000000:    5940ms
100000000:    7870ms

An earlier version of PopulateDictionary used an Int32Holder class, and created one for each bucket (when the lookup in the dictionary failed). This was faster when there was a small number of buckets (presumably because we were only going through the dictionary lookup path once per iteration instead of twice) but got significantly slower, and ended up running out of memory. This would contribute to fragmented memory access as well, of course. Note that PopulateDictionary specifies the capacity to start with, to avoid effects of data copying within the test.
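
Roughly, that earlier variant looked something like this (a reconstruction for illustration rather than the exact code; the class and method names are made up):

class Int32Holder
{
    public int Value;
}

static TimeSpan PopulateDictionaryWithHolder(int[] input, int buckets)
{
    int divisor = input.Length / buckets;
    var dictionary = new Dictionary<int, Int32Holder>(buckets);
    var stopwatch = Stopwatch.StartNew();
    foreach (var item in input)
    {
        int key = item / divisor;
        Int32Holder holder;
        if (!dictionary.TryGetValue(key, out holder))
        {
            // One extra object allocation per bucket
            dictionary.Add(key, holder = new Int32Holder());
        }
        // For an existing key this is a single dictionary lookup per iteration;
        // the count lives inside the holder object
        holder.Value++;
    }
    stopwatch.Stop();
    return stopwatch.Elapsed;
}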

The aim of using the PopulateArray method is to remove as much framework code as possible, leaving less to the imagination. I haven't yet tried using an array of a custom struct (with various different struct sizes) but that may be something you'd like to try too.

EDIT: I can reproduce the oddity of 10000000 being slower than 100000000 at will, regardless of test ordering. I don't understand why yet. It may well be specific to the exact processor and cache I'm using...

--EDIT--

The reason why 10000000 is slower than the 100000000 results has to do with the way hashing works. A few more tests explain this.

First off, let's look at the operations. There's Dictionary.FindEntry, which is used in the [] indexing and in Dictionary.TryGetValue, and there's Dictionary.Insert, which is used in the [] indexing and in Dictionary.Add. If we just did a FindEntry, the timings would go up as we expect:

static TimeSpan PopulateDictionary1(int[] input, int buckets)
{
    int divisor = input.Length / buckets;
    var dictionary = new Dictionary<int, int>(buckets);
    var stopwatch = Stopwatch.StartNew();
    foreach (var item in input)
    {
        int key = item / divisor;
        int count;
        dictionary.TryGetValue(key, out count);
    }
    stopwatch.Stop();
    return stopwatch.Elapsed;
}

This implementation doesn't have to deal with hash collisions (because there are none), which makes it behave as we expect. Once we start dealing with collisions, the timings start to degrade. If we have as many buckets as elements, there are obviously fewer collisions... To be exact, we can figure out exactly how many collisions there are by doing:

static TimeSpan PopulateDictionary(int[] input, int buckets)
{
    int divisor = input.Length / buckets;
    int c1, c2;
    c1 = c2 = 0;
    var dictionary = new Dictionary<int, int>(buckets);
    var stopwatch = Stopwatch.StartNew();
    foreach (var item in input)
    {
        int key = item / divisor;
        int count;
        if (!dictionary.TryGetValue(key, out count))
        {
            dictionary.Add(key, 1);
            ++c1;
        }
        else
        {
            count++;
            dictionary[key] = count;
            ++c2;
        }
    }
    stopwatch.Stop();
    Console.WriteLine("{0}:{1}", c1, c2);
    return stopwatch.Elapsed;
}

The result is something like this:

10:99999990
       10:    4683ms
100:99999900
      100:    4946ms
1000:99999000
     1000:    4732ms
10000:99990000
    10000:    4964ms
100000:99900000
   100000:    7033ms
1000000:99000000
  1000000:   22038ms
9999538:90000462       <<-
 10000000:   26104ms
63196841:36803159      <<-
100000000:   25045ms

Note the value of '36803159'. This answers the question of why the last result is faster than the one before it: it simply has to do fewer operations -- and since caching fails anyway, that factor doesn't make a difference anymore.

#2 (5 votes)

for 10k buckets the estimated execution time is 0.7s, for 100k buckets it is 2s, for 1m buckets it is 7.5s

This is an important pattern to recognize when you profile code. It is one of the standard size vs execution time relationships in software algorithms. Just from seeing the behavior, you can tell a lot about the way the algorithm was implemented. And the other way around of course, from the algorithm you can predict the expected execution time. A relationship that's annotated in the Big Oh notation.

The speediest code you can get is amortized O(1); execution time barely increases when you double the size of the problem. The Dictionary<> class behaves that way, as John demonstrated. The increase in time as the problem set gets large is the "amortized" part, a side-effect of Dictionary having to perform linear O(n) searches in buckets that keep getting bigger.

A very common pattern is O(n). That tells you that there is a single for() loop in the algorithm that iterates over the collection. O(n^2) tells you there are two nested for() loops. O(n^3) has three, etcetera.

What you got is the one in between, O(log n). It is the standard complexity of a divide-and-conquer algorithm. In other words, each pass splits the problem in two, continuing with the smaller set. Very common, you see it back in sorting algorithms. Binary search is the one you find back in your text book. Note how log₂(10) ≈ 3.3, very close to the increment you see in your test. Perf starts to tank a bit for very large sets due to the poor locality of reference, a cpu cache problem that's always associated with O(log n) algorithms.

The one thing that John's answer demonstrates is that his guess cannot be correct: GroupBy() certainly does not use a Dictionary<>. And that is not possible by design, since Dictionary<> cannot provide an ordered collection, whereas GroupBy() must be ordered; it says so in the MSDN Library:

The IGrouping objects are yielded in an order based on the order of the elements in source that produced the first key of each IGrouping. Elements in a grouping are yielded in the order they appear in source.

Not having to maintain order is what makes Dictionary<> fast. Keeping order always costs O(log n), a binary tree in your text book.

Long story short, if you don't actually care about order, and you surely would not for random numbers, then you don't want to use GroupBy(). You want to use a Dictionary<>.
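
As a rough sketch (not part of the original answer), counting bucket sizes with a Dictionary<> instead of GroupBy() could look like this, reusing ar, num and numOfBuckets from the question:

var counts = new Dictionary<int, int>(numOfBuckets);
foreach (var item in ar)
{
    int key = item / num;
    int current;
    counts.TryGetValue(key, out current); // leaves current at 0 for a new key
    counts[key] = current + 1;
}
Console.WriteLine(counts.Count); // same number of groups as group.Count() in the question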

#3 (3 votes)

There are (at least) two influence factors: First, a hash table lookup only takes O(1) if you have a perfect hash function, which does not exist. Thus, you have hash collisions.

I guess more important, though, are caching effects. Modern CPUs have large caches, so for the smaller bucket counts the hash table itself might fit into the cache. As the hash table is frequently accessed, this might have a strong influence on the performance. If there are more buckets, more accesses to RAM might be necessary, and those are slow compared to a cache hit.
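
A rough back-of-the-envelope check of that argument (my own numbers; the per-entry sizes are an approximation, not the exact CLR layout):

// Approximate memory footprint of a Dictionary<int, int> used as the hash table.
// Assumes roughly 16 bytes per entry (hash code, next index, key, value) plus a
// 4-byte bucket slot per entry; real overhead varies by runtime.
static long ApproxHashTableBytes(int buckets)
{
    return buckets * (16L + 4L);
}

// ApproxHashTableBytes(100000)  ->  ~2 MB (may still fit in a large L2/L3 cache)
// ApproxHashTableBytes(1000000) -> ~20 MB (likely exceeds the cache, so more RAM accesses)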

#4 (2 votes)

There are a few factors at work here.

Hashes and groupings

The way grouping works is by creating a hash table. Each individual group then supports an 'add' operation, which adds an element to that group's list. To put it bluntly, it's like a Dictionary<Key, List<Value>>.

Hash tables are always overallocated. If you add an element to the hash, it checks whether there is enough capacity, and if not, recreates the hash table with a larger capacity (to be exact: new capacity = count * 2, where count is the number of groups). However, a larger capacity means that the bucket indexes are no longer correct, which means you have to re-build the entries in the hash table. The Resize() method in Lookup<Key, Value> does this.

The 'groups' themselves work like a List<T>. These too are overallocated, but are easier to reallocate. To be precise: the data is simply copied (with Array.Copy in Array.Resize) and a new element is added. Since there's no re-hashing or calculation involved, this is quite a fast operation.

The initial capacity of a grouping is 7. This means, for 10 elements you need to reallocate 1 time, for 100 elements 4 times, for 1000 elements 8 times, and so on. Because you have to re-hash more elements each time, your code gets a bit slower each time the number of buckets grows.
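
To make that arithmetic concrete, a small sketch (assuming the capacity simply doubles from 7, as described above) that reproduces the 1/4/8 reallocation counts:

// Number of reallocations needed to grow a group from its initial capacity of 7
// so it can hold n elements, assuming the capacity doubles on each reallocation.
static int ReallocationsNeeded(int n)
{
    int capacity = 7;
    int reallocations = 0;
    while (capacity < n)
    {
        capacity *= 2;
        reallocations++;
    }
    return reallocations;
}

// ReallocationsNeeded(10)   == 1
// ReallocationsNeeded(100)  == 4
// ReallocationsNeeded(1000) == 8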

I think these overallocations are the largest contributors to the small growth in the timings as the number of buckets grows. The easiest way to test this theory is to do no overallocations at all (test 1), and simply put counters in an array. The result is shown below in the code for FixArrayTest (or, if you like, FixBucketTest, which is closer to how groupings work). As you can see, the timings for # buckets = 10...10000 are the same, which is correct according to this theory.

Cache and random

Caching and random number generators aren't friends.

Our little test also shows that when the number of buckets grows above a certain threshold, memory comes into play. On my computer this is at an array size of roughly 4 MB (4 * number of buckets). Because the data is random, random chunks of RAM will be loaded into and evicted from the cache, which is a slow process. This also explains the large jump in the timings. To see this in action, change the random numbers to a sequence (called 'test 2'), and - because the data pages can now be cached - the speed will remain the same overall.

Note that hash tables overallocate, so you will hit that threshold before you have a million entries in your grouping.

Test code

static void Main(string[] args)
{
    int size = 10000000;
    int[] ar = new int[size];

    //random number init with numbers [0,size-1]
    var r = new Random();
    for (var i = 0; i < size; i++)
    {
        ar[i] = r.Next(0, size);
        //ar[i] = i; // Test 2 -> uncomment to see the effects of caching more clearly
    }

    Console.WriteLine("Fixed dictionary:");
    for (var numBuckets = 10; numBuckets <= 1000000; numBuckets *= 10)
    {
        var num = (size / numBuckets);
        var timing = 0L;
        for (var i = 0; i < 5; i++)
        {
            timing += FixBucketTest(ar, num);
            //timing += FixArrayTest(ar, num); // test 1
        }
        var avg = ((float)timing) / 5.0f;

        Console.WriteLine("Avg Time: " + avg + " ms for " + numBuckets);
    }

    Console.WriteLine("Fixed array:");
    for (var numBuckets = 10; numBuckets <= 1000000; numBuckets *= 10)
    {
        var num = (size / numBuckets);
        var timing = 0L;
        for (var i = 0; i < 5; i++)
        {
            timing += FixArrayTest(ar, num); // test 1
        }
        var avg = ((float)timing) / 5.0f;

        Console.WriteLine("Avg Time: " + avg + " ms for " + numBuckets);
    }
}

static long FixBucketTest(int[] ar, int num)
{
    // This test shows that timings will not grow for the smaller numbers of buckets if you don't have to re-allocate
    System.Diagnostics.Stopwatch s = new Stopwatch();
    s.Start();
    var grouping = new Dictionary<int, List<int>>(ar.Length / num + 1); // exactly the right size
    foreach (var item in ar)
    {
        int idx = item / num;
        List<int> ll;
        if (!grouping.TryGetValue(idx, out ll))
        {
            grouping.Add(idx, ll = new List<int>());
        }
        //ll.Add(item); //-> this would complete a 'grouper'; however, we don't want the overallocator of List to kick in
    }
    s.Stop();
    return s.ElapsedMilliseconds;
}

// Test with arrays
static long FixArrayTest(int[] ar, int num)
{
    System.Diagnostics.Stopwatch s = new Stopwatch();
    s.Start();

    int[] buf = new int[(ar.Length / num + 1) * 10];
    foreach (var item in ar)
    {
        int code = (item & 0x7FFFFFFF) % buf.Length;
        buf[code]++;
    }

    s.Stop();
    return s.ElapsedMilliseconds;
}

#5 (0 votes)

When executing bigger calculations, less physical memory is available on the computer, and counting the buckets will be slower with less memory; as you expand the buckets, your available memory will decrease.

Try something like the following:

int size = 2500000; // 10000000 divided by 4
int num = 10;       // bucket width, as in the question
int[] ar = new int[size];

// random number init with numbers [0, size-1]
var r = new Random();
for (int i = 0; i < size; i++)
    ar[i] = r.Next(size);

System.Diagnostics.Stopwatch s = new Stopwatch();
s.Start();

for (int i = 0; i < 4; i++)
{
    // the number of expected buckets is size / num
    var group = ar.GroupBy(x => x / num);
    var l = group.ToArray();
}

s.Stop();

calculating 4 times with lower numbers.
