从PHP数组中有效地选择n个随机元素(不带shuffle)

时间:2021-04-05 21:22:29

I have the following code to pick $n elements from an array $array in PHP:

我有以下代码从PHP中的数组$数组中选择$ n元素:

shuffle($array);
$result = array_splice($array, 0, $n);

Given a large array but only a few elements (for example 5 out of 10000), this is relatively slow, so I would like to optimize it such that not all elements have to be shuffled. The values must be unique.

给定一个大数组但只有少数元素(例如10000个中的5个),这是相对较慢的,所以我想优化它,以便不是所有元素都必须被洗牌。值必须是唯一的。

I'm looking fo the most performant alternative. We can assume that $array has no duplicates and is 0-indexed.

我正在寻找最有效的选择。我们可以假设$ array没有重复,并且是0索引的。

5 个解决方案

#1


5  

$randomArray = [];
while (count($randomArray) < 5) {
  $randomKey = mt_rand(0, count($array)-1);
  $randomArray[$randomKey] = $array[$randomKey];
}

This will provide exactly 5 elements with no duplicates and very quickly. The keys will be preserved.

这将提供5个元素,没有重复,非常快。密钥将被保留。

Note: You'd have to make sure $array had 5 or more elements or add some sort of check to prevent an endless loop.

注意:您必须确保$ array有5个或更多元素或添加某种检查以防止无限循环。

#2


3  

This function performs a shuffle on only $n elements where $n is the number of random elements you want to pick. It will also work on associative arrays and sparse arrays. $array is the array to work on and $n is the number of random elements to retrieve.

此函数仅对$ n元素执行shuffle,其中$ n是要选择的随机元素的数量。它也适用于关联数组和稀疏数组。 $ array是要处理的数组,$ n是要检索的随机元素的数量。

If we define the $max_index as count($array) - 1 - $iteration.

如果我们将$ max_index定义为count($ array) - 1 - $ iteration。

It works by generating a random number between 0 and $max_index. Picking the key at that index, and replacing its index with the value at $max_index so that it can never be picked again, as $max_index will be one less at the next iteration and unreachable.

它的工作原理是生成0到$ max_index之间的随机数。在该索引处拾取密钥,并将其索引替换为$ max_index处的值,以便永远不会再次拾取它,因为$ max_index在下一次迭代时将少一个并且无法访问。

In summary this is the Richard Durstenfeld's Fisher-Yates shuffle but operating only on $n elements instead of the entire array.

总之,这是Richard Durstenfeld的Fisher-Yates shuffle,但只运行$ n元素而不是整个阵列。

function rand_pluck($array, $n) {
    $array_keys = array_keys($array);
    $array_length = count($array_keys);
    $max_index = $array_length -1;
    $iterations = min($n, $array_length);
    $random_array = array();
    while($iterations--) {
        $index = mt_rand(0, $max_index);
        $value = $array_keys[$index];
        $array_keys[$index] = $array_keys[$max_index];
        array_push($random_array, $array[$value]);
        $max_index--;
    }
    return $random_array;
}

#3


3  

The trick is to use a variation of shuffle or in other words a partial shuffle.

诀窍是使用shuffle的变体,或者换句话说是部分shuffle。

performance is not the only criterion, statistical efficiency, i.e unbiased sampling is as important (as the original shuffle solution is)

性能不是唯一的标准,统计效率,即无偏差抽样同样重要(正如原始的洗牌解决方案)

function random_pick( $a, $n ) 
{
  $N = count($a);
  $n = min($n, $N);
  $picked = array_fill(0, $n, 0); $backup = array_fill(0, $n, 0);
  // partially shuffle the array, and generate unbiased selection simultaneously
  // this is a variation on fisher-yates-knuth shuffle
  for ($i=0; $i<$n; $i++) // O(n) times
  { 
    $selected = mt_rand( 0, --$N ); // unbiased sampling N * N-1 * N-2 * .. * N-n+1
    $value = $a[ $selected ];
    $a[ $selected ] = $a[ $N ];
    $a[ $N ] = $value;
    $backup[ $i ] = $selected;
    $picked[ $i ] = $value;
  }
  // restore partially shuffled input array from backup
  // optional step, if needed it can be ignored, e.g $a is passed by value, hence copied
  for ($i=$n-1; $i>=0; $i--) // O(n) times
  { 
    $selected = $backup[ $i ];
    $value = $a[ $N ];
    $a[ $N ] = $a[ $selected ];
    $a[ $selected ] = $value;
    $N++;
  }
  return $picked;
}

NOTE the algorithm is strictly O(n) in both time and space, produces unbiased selections (it is a partial unbiased shuffling) and produces output which is proper array with consecutive keys (not needing extra array_values etc..)

注意,算法在时间和空间上都是严格的O(n),产生无偏的选择(它是部分无偏的混洗)并产生输出,这是具有连续键的正确数组(不需要额外的array_values等)。

Use example:

使用示例:

$randomly_picked = random_pick($my_array, 5);
// or if an associative array is used
$randomly_picked_keys = random_pick(array_keys($my_array), 5);
$randomly_picked = array_intersect_key($my_array, array_flip($randomly_picked_keys));

For further variations and extensions of shuffling for PHP:

有关PHP的改组的进一步变化和扩展:

  1. PHP - shuffle only part of an array
  2. PHP - 只是数组的一部分
  3. PHP shuffle with seed
  4. PHP用种子洗牌
  5. How can I take n elements at random from a Perl array?
  6. 如何从Perl数组中随机取出n个元素?

#4


2  

This will only show benifits for small n compared to an array shuffle, but you could

这只会显示小n与数组shuffle相比的好处,但你可以

  1. Choose a random index r n times, each time decreasing the limit by 1
  2. 选择一个随机索引r n次,每次将限制减少1
  3. Adjust for previously used indices
  4. 调整以前使用的指数
  5. Take value
  6. 取值
  7. Store used index
  8. 存储使用的索引

Pseudocode

伪代码

arr = []
used = []
for i = 0..n-1:
    r = rand 0..len-i
    d = 0
    for j = 0..used.length-1:
        if r >= used[j]:
            d += 1
    arr.append($array[r + d])
    used.append(r)
return arr

#5


2  

You could generate n-times a random number with mt_rand() and then fill these values in a new array. To go against the case where the same index gets returned twice we use the actual returned index to fill the new array and check always if the index exists in the new array, if so we use while to loop through it as long as we get a duplicate index. At the end we use array_values() to get a 0-indexed array.

您可以使用mt_rand()生成n次随机数,然后在新数组中填充这些值。为了反对两次返回相同索引的情况,我们使用实际返回的索引来填充新数组,并检查新数组中是否存在索引,如果是这样的话,我们使用while循环遍历它只要我们得到一个重复索引。最后,我们使用array_values()来获得0索引数组。

$count = count($array) - 1;
$new_array = array();
for($i = 0; $i < $n; $i++) {
    $index = mt_rand(0, $count);
    while(isset($new_array[$index])) {
        $index = mt_rand(0, $count);
    }

    $new_array[$index] = $array[$index];
}
$new_array = array_values($new_array);

#1


5  

$randomArray = [];
while (count($randomArray) < 5) {
  $randomKey = mt_rand(0, count($array)-1);
  $randomArray[$randomKey] = $array[$randomKey];
}

This will provide exactly 5 elements with no duplicates and very quickly. The keys will be preserved.

这将提供5个元素,没有重复,非常快。密钥将被保留。

Note: You'd have to make sure $array had 5 or more elements or add some sort of check to prevent an endless loop.

注意:您必须确保$ array有5个或更多元素或添加某种检查以防止无限循环。

#2


3  

This function performs a shuffle on only $n elements where $n is the number of random elements you want to pick. It will also work on associative arrays and sparse arrays. $array is the array to work on and $n is the number of random elements to retrieve.

此函数仅对$ n元素执行shuffle,其中$ n是要选择的随机元素的数量。它也适用于关联数组和稀疏数组。 $ array是要处理的数组,$ n是要检索的随机元素的数量。

If we define the $max_index as count($array) - 1 - $iteration.

如果我们将$ max_index定义为count($ array) - 1 - $ iteration。

It works by generating a random number between 0 and $max_index. Picking the key at that index, and replacing its index with the value at $max_index so that it can never be picked again, as $max_index will be one less at the next iteration and unreachable.

它的工作原理是生成0到$ max_index之间的随机数。在该索引处拾取密钥,并将其索引替换为$ max_index处的值,以便永远不会再次拾取它,因为$ max_index在下一次迭代时将少一个并且无法访问。

In summary this is the Richard Durstenfeld's Fisher-Yates shuffle but operating only on $n elements instead of the entire array.

总之,这是Richard Durstenfeld的Fisher-Yates shuffle,但只运行$ n元素而不是整个阵列。

function rand_pluck($array, $n) {
    $array_keys = array_keys($array);
    $array_length = count($array_keys);
    $max_index = $array_length -1;
    $iterations = min($n, $array_length);
    $random_array = array();
    while($iterations--) {
        $index = mt_rand(0, $max_index);
        $value = $array_keys[$index];
        $array_keys[$index] = $array_keys[$max_index];
        array_push($random_array, $array[$value]);
        $max_index--;
    }
    return $random_array;
}

#3


3  

The trick is to use a variation of shuffle or in other words a partial shuffle.

诀窍是使用shuffle的变体,或者换句话说是部分shuffle。

performance is not the only criterion, statistical efficiency, i.e unbiased sampling is as important (as the original shuffle solution is)

性能不是唯一的标准,统计效率,即无偏差抽样同样重要(正如原始的洗牌解决方案)

function random_pick( $a, $n ) 
{
  $N = count($a);
  $n = min($n, $N);
  $picked = array_fill(0, $n, 0); $backup = array_fill(0, $n, 0);
  // partially shuffle the array, and generate unbiased selection simultaneously
  // this is a variation on fisher-yates-knuth shuffle
  for ($i=0; $i<$n; $i++) // O(n) times
  { 
    $selected = mt_rand( 0, --$N ); // unbiased sampling N * N-1 * N-2 * .. * N-n+1
    $value = $a[ $selected ];
    $a[ $selected ] = $a[ $N ];
    $a[ $N ] = $value;
    $backup[ $i ] = $selected;
    $picked[ $i ] = $value;
  }
  // restore partially shuffled input array from backup
  // optional step, if needed it can be ignored, e.g $a is passed by value, hence copied
  for ($i=$n-1; $i>=0; $i--) // O(n) times
  { 
    $selected = $backup[ $i ];
    $value = $a[ $N ];
    $a[ $N ] = $a[ $selected ];
    $a[ $selected ] = $value;
    $N++;
  }
  return $picked;
}

NOTE the algorithm is strictly O(n) in both time and space, produces unbiased selections (it is a partial unbiased shuffling) and produces output which is proper array with consecutive keys (not needing extra array_values etc..)

注意,算法在时间和空间上都是严格的O(n),产生无偏的选择(它是部分无偏的混洗)并产生输出,这是具有连续键的正确数组(不需要额外的array_values等)。

Use example:

使用示例:

$randomly_picked = random_pick($my_array, 5);
// or if an associative array is used
$randomly_picked_keys = random_pick(array_keys($my_array), 5);
$randomly_picked = array_intersect_key($my_array, array_flip($randomly_picked_keys));

For further variations and extensions of shuffling for PHP:

有关PHP的改组的进一步变化和扩展:

  1. PHP - shuffle only part of an array
  2. PHP - 只是数组的一部分
  3. PHP shuffle with seed
  4. PHP用种子洗牌
  5. How can I take n elements at random from a Perl array?
  6. 如何从Perl数组中随机取出n个元素?

#4


2  

This will only show benifits for small n compared to an array shuffle, but you could

这只会显示小n与数组shuffle相比的好处,但你可以

  1. Choose a random index r n times, each time decreasing the limit by 1
  2. 选择一个随机索引r n次,每次将限制减少1
  3. Adjust for previously used indices
  4. 调整以前使用的指数
  5. Take value
  6. 取值
  7. Store used index
  8. 存储使用的索引

Pseudocode

伪代码

arr = []
used = []
for i = 0..n-1:
    r = rand 0..len-i
    d = 0
    for j = 0..used.length-1:
        if r >= used[j]:
            d += 1
    arr.append($array[r + d])
    used.append(r)
return arr

#5


2  

You could generate n-times a random number with mt_rand() and then fill these values in a new array. To go against the case where the same index gets returned twice we use the actual returned index to fill the new array and check always if the index exists in the new array, if so we use while to loop through it as long as we get a duplicate index. At the end we use array_values() to get a 0-indexed array.

您可以使用mt_rand()生成n次随机数,然后在新数组中填充这些值。为了反对两次返回相同索引的情况,我们使用实际返回的索引来填充新数组,并检查新数组中是否存在索引,如果是这样的话,我们使用while循环遍历它只要我们得到一个重复索引。最后,我们使用array_values()来获得0索引数组。

$count = count($array) - 1;
$new_array = array();
for($i = 0; $i < $n; $i++) {
    $index = mt_rand(0, $count);
    while(isset($new_array[$index])) {
        $index = mt_rand(0, $count);
    }

    $new_array[$index] = $array[$index];
}
$new_array = array_values($new_array);