从链表中有效地选择一组随机元素。

Say I have a linked list of numbers of length N. N is very large and I don’t know in advance the exact value of N.

假设我有一个长度为N的链表，我不知道N的准确值。

How can I most efficiently write a function that will return k completely random numbers from the list?

我怎样才能最有效地写出一个函数，使k从列表中返回完全随机数?

6 个解决方案

#1

There's a very nice and efficient algorithm for this using a method called reservoir sampling.

有一种很好的算法可以使用一种叫做油藏采样的方法。

Let me start by giving you its history:

让我先介绍一下它的历史:

Knuth calls this Algorithm R on p. 144 of his 1997 edition of Seminumerical Algorithms (volume 2 of The Art of Computer Programming), and provides some code for it there. Knuth attributes the algorithm to Alan G. Waterman. Despite a lengthy search, I haven't been able to find Waterman's original document, if it exists, which may be why you'll most often see Knuth quoted as the source of this algorithm.

Knuth将这一算法称为他1997年版的半圆算法(计算机编程艺术的第二卷)的第144页，并为它提供了一些代码。Knuth将该算法归功于Alan G. Waterman。尽管进行了长时间的搜索，但我还没有找到Waterman的原始文档，如果它存在的话，这可能就是为什么你会经常看到Knuth引用这个算法的源代码。

McLeod and Bellhouse, 1983 (1) provide a more thorough discussion than Knuth as well as the first published proof (that I'm aware of) that the algorithm works.

McLeod和Bellhouse, 1983(1)提供了比Knuth更深入的讨论，以及第一个发布的证明(我知道)该算法是有效的。

Vitter 1985 (2) reviews Algorithm R and then presents an additional three algorithms which provide the same output, but with a twist. Rather than making a choice to include or skip each incoming element, his algorithm predetermines the number of incoming elements to be skipped. In his tests (which, admittedly, are out of date now) this decreased execution time dramatically by avoiding random number generation and comparisons on each in-coming number.

Vitter 1985(2)回顾算法R，然后提出另外三种算法，它们提供相同的输出，但有一个扭转。他的算法没有选择包含或跳过每个传入元素，而是预先确定要跳过的传入元素的数量。在他的测试中(不可否认，现在已经过时了)，通过避免随机数生成和对每个即将到来的数字进行比较，显著降低了执行时间。

In pseudocode the algorithm is:

伪代码的算法是:

Let R be the result array of size s
Let I be an input queue

> Fill the reservoir array
for j in the range [1,s]:
  R[j]=I.pop()

elements_seen=s
while I is not empty:
  elements_seen+=1
  j=random(1,elements_seen)       > This is inclusive
  if j<=s:
    R[j]=I.pop()
  else:
    I.pop()

Note that I've specifically written the code to avoid specifying the size of the input. That's one of the cool properties of this algorithm: you can run it without needing to know the size of the input beforehand and it still assures you that each element you encounter has an equal probability of ending up in R (that is, there is no bias). Furthermore, R contains a fair and representative sample of the elements the algorithm has considered at all times. This means you can use this as an online algorithm.

注意，我已经专门编写了代码，以避免指定输入的大小。这是这个算法的一个很酷的特性:你可以运行它，而无需事先知道输入的大小，而且它仍然保证你遇到的每个元素在R中都有相同的概率(也就是说，没有偏置)。此外，R包含了算法一直考虑的元素的一个公平且具有代表性的样本。这意味着您可以将其作为在线算法使用。

Why does this work?

为什么这个工作吗?

McLeod and Bellhouse (1983) provide a proof using the mathematics of combinations. It's pretty, but it would be a bit difficult to reconstruct it here. Therefore, I've generated an alternative proof which is easier to explain.

McLeod和Bellhouse(1983)提供了一种使用组合数学的证明。它很漂亮，但是在这里重建它会有点困难。因此，我生成了一个更容易解释的替代证明。

We proceed via proof by induction.

我们通过归纳法进行证明。

Say we want to generate a set of s elements and that we have already seen n>s elements.

假设我们想要生成一组s元素，我们已经看到了n个>的元素。

Let's assume that our current s elements have already each been chosen with probability s/n.

假设我们现在的s元素已经被选为s/n。

By the definition of the algorithm, we choose element n+1 with probability s/(n+1).

根据该算法的定义，我们选择元素n+1的概率为s/(n+1)。

Each element already part of our result set has a probability 1/s of being replaced.

我们的结果集的每个元素都有一个被替换的概率。

The probability that an element from the n-seen result set is replaced in the n+1-seen result set is therefore (1/s)*s/(n+1)=1/(n+1). Conversely, the probability that an element is not replaced is 1-1/(n+1)=n/(n+1).

在n+1的结果集中，n-seen结果集中的一个元素被替换的概率是(1/s)*s/(n+1)=1/(n+1)。相反，一个元素不被替换的概率是1-1/(n+1)=n/(n+1)。

Thus, the n+1-seen result set contains an element either if it was part of the n-seen result set and was not replaced---this probability is (s/n)*n/(n+1)=s/(n+1)---or if the element was chosen---with probability s/(n+1).

因此，n+1的结果集包含一个元素，如果它是n-seen结果集的一部分，并且没有被替换——这个概率是(s/n)*n/(n+1)=s/(n+1)——或者如果选择了元素——概率s/(n+1)。

The definition of the algorithm tells us that the first s elements are automatically included as the first n=s members of the result set. Therefore, the n-seen result set includes each element with s/n (=1) probability giving us the necessary base case for the induction.

该算法的定义告诉我们，第一个s元素被自动包含在结果集的第一个n=s成员中，因此，n-seen结果集包含了每个元素的s/n(=1)概率，给出了归纳的必要的基本情况。

References

引用

McLeod, A. Ian, and David R. Bellhouse. "A convenient algorithm for drawing a simple random sample." Journal of the Royal Statistical Society. Series C (Applied Statistics) 32.2 (1983): 182-184. (Link)

McLeod, A. Ian和David R. Bellhouse。“一个简单的随机抽样的简便算法。”皇家统计学会期刊。C系列(应用统计学)32.2(1983):182-184。(链接)
Vitter, Jeffrey S. "Random sampling with a reservoir." ACM Transactions on Mathematical Software (TOMS) 11.1 (1985): 37-57. (Link)

维特,Jeffrey S。“随机抽取一个水库。”ACM在数学软件上的交易(TOMS) 11.1(1985): 37-57。(链接)

#2

This is called a Reservoir Sampling problem. The simple solution is to assign a random number to each element of the list as you see it, then keep the top (or bottom) k elements as ordered by the random number.

这被称为油藏取样问题。简单的解决方案是将一个随机数分配给列表中的每个元素，然后按照随机数的顺序保持顶部(或底部)k个元素。

#3

I would suggest: First find your k random numbers. Sort them. Then traverse both the linked list and your random numbers once.

我建议:首先找到你的k个随机数。排序。然后遍历链表和随机数。

If you somehow don't know the length of your linked list (how?), then you could grab the first k into an array, then for node r, generate a random number in [0, r), and if that is less than k, replace the rth item of the array. (Not entirely convinced that doesn't bias...)

如果不知道链表的长度(如何?)，那么可以将第一个k捕获到一个数组中，然后对节点r，在[0,r)中生成一个随机数，如果它小于k，则替换数组的第rth项。(不完全相信这不是偏见…)

Other than that: "If I were you, I wouldn't be starting from here." Are you sure linked list is right for your problem? Is there not a better data structure, such as a good old flat array list.

除此之外:“如果我是你，我就不会从这里开始。”你确定链表对你的问题是正确的吗?是否没有更好的数据结构，比如一个好的旧的平面数组列表。

#4

If you don't know the length of the list, then you will have to traverse it complete to ensure random picks. The method I've used in this case is the one described by Tom Hawtin (54070). While traversing the list you keep k elements that form your random selection to that point. (Initially you just add the first k elements you encounter.) Then, with probability k/i, you replace a random element from your selection with the ith element of the list (i.e. the element you are at, at that moment).

如果您不知道列表的长度，那么您将必须遍历它以确保随机选择。我在这个案例中使用的方法是Tom Hawtin(54070)所描述的方法。当遍历列表时，你保留了k个元素，它们构成了你的随机选择。(一开始你只添加你遇到的第k个元素。)然后，使用概率k/i，将选择中的一个随机元素替换为列表的第i个元素(即此时的元素)。

It's easy to show that this gives a random selection. After seeing m elements (m > k), we have that each of the first m elements of the list are part of you random selection with a probability k/m. That this initially holds is trivial. Then for each element m+1, you put it in your selection (replacing a random element) with probability k/(m+1). You now need to show that all other elements also have probability k/(m+1) of being selected. We have that the probability is k/m * (k/(m+1)*(1-1/k) + (1-k/(m+1))) (i.e. probability that element was in the list times the probability that it is still there). With calculus you can straightforwardly show that this is equal to k/(m+1).

很容易看出这给出了随机选择。在看到m个元素(m > k)之后，我们得到了列表中的第一个m元素，它们都是随机选择的一部分，概率为k/m。最初的结论是微不足道的。然后对于每个元素m+1，你将它放入你的选择(替换一个随机元素)与概率k/(m+1)。您现在需要显示所有其他元素也有被选中的概率k/(m+1)。我们得到的概率是k/m *(k/(m+1)*(1-1/k) + (1-k/(m+1)))(即元素在列表中的概率乘以它仍然存在的概率)。用微积分你可以直接证明它等于k/(m+1)

#5

-3

Well, you do need to know what N is at runtime at least, even if this involves doing an extra pass over the list to count them. The simplest algorithm to do this is to just pick a random number in N and remove that item, repeated k times. Or, if it is permissible to return repeat numbers, don't remove the item.

你需要知道，至少在运行时N是多少，即使这需要在列表上多做一遍来计算它们。最简单的算法是在N中选择一个随机数，然后去掉这个项，重复k次。或者，如果允许返回重复数字，不要删除该项目。

Unless you have a VERY large N, and very stringent performance requirements, this algorithm runs with O(N*k) complexity, which should be acceptable.

除非你有一个非常大的N，并且非常严格的性能要求，这个算法运行的复杂度是O(N*k)，这是可以接受的。

Edit: Nevermind, Tom Hawtin's method is way better. Select the random numbers first, then traverse the list once. Same theoretical complexity, I think, but much better expected runtime.

编辑:Nevermind, Tom Hawtin的方法更好。首先选择随机数，然后遍历列表一次。我认为，同样的理论复杂性，但更好的预期运行时间。

#6

-3

Why can't you just do something like

你为什么不能做点什么呢?

List GetKRandomFromList(List input, int k)
  List ret = new List();
  for(i=0;i<k;i++)
    ret.Add(input[Math.Rand(0,input.Length)]);
  return ret;

I'm sure that you don't mean something that simple so can you specify further?

我敢肯定你的意思并不是那么简单，你能再详细说明一下吗?

#1