检查数组中元素的更快方法？

This function does the same as exists does with hashes.

此函数与存在的哈希值相同。

I plan on use it a lot.

我计划使用它很多。

Can it be optimized in some way?

它能以某种方式优化吗？

my @a = qw/a b c d/;

my $ret = array_exists("b", @a);

sub array_exists {
    my ($var, @a) = @_;

    foreach my $e (@a) {
        if ($var eq $e) {
            return 1;
        }
    }
    return 0;
}

7 个解决方案

#1

You can use smart matching, available in Perl 5.10 and later:

您可以使用Perl 5.10及更高版本中提供的智能匹配：

if ("b" ~~ @a) {
    # "b" exists in @a
}

This should be much faster than a function call.

这应该比函数调用快得多。

#2

If you have to do this a lot on a fixed array, use a hash instead:

如果必须在固定数组上执行此操作，请使用哈希：

 my %hash = map { $_, 1 } @array;

 if( exists $hash{$key} ) { ... }

Some people reach for the smart match operator, but that's one of the features that we need to remove from Perl. You need to decide if this should match, where the array hold an array reference that has a hash reference with the key b:

有些人会选择智能匹配运算符，但这是我们需要从Perl中删除的功能之一。您需要确定它是否应该匹配，其中数组包含具有键b的哈希引用的数组引用：

use 5.010;

my @a = (
    qw(x y z),
    [ { 'b' => 1 } ],
    );

say 'Matches' if "b" ~~ @a; # This matches

Since the smart match is recursive, if keeps going down into data structures. I write about some of this in Rethinking smart matching.

由于智能匹配是递归的，因此如果继续进入数据结构。我在重新思考智能匹配时写了一些内容。

#3

I'd use List::MoreUtils::any.

我使用List :: MoreUtils :: any。

my $ret = any { $_ eq 'b' } @a;

#4

Since there are lots of similar questions on * where different "correct answers" return different results, I tried to compare them. This question seems to be a good place to share my little benchmark.

由于*上存在许多类似的问题，其中不同的“正确答案”会返回不同的结果，因此我尝试对它们进行比较。这个问题似乎是分享我的小基准的好地方。

For my tests I used a test set (@test_set) of 1,000 elements (strings) of length 10 where only one element ($search_value) matches a given string.

对于我的测试，我使用了长度为10的1,000个元素（字符串）的测试集（@test_set），其中只有一个元素（$ search_value）匹配给定的字符串。

I took the following statements to validate the existence of this element in a loop of 100,000 turns.

我采用以下语句来验证这个元素在100,000转的循环中是否存在。

_grep

_grep

grep( $_ eq $search_value, @test_set )

_hash

_hash

{ map { $_ => 1 } @test_set }->{ $search_value }

_hash_premapped

_hash_premapped

$mapping->{ $search_value }

uses a $mapping that is precalculated as $mapping = { map { $_ => 1 } @test_set } (which is included in the final measuring)
使用预先计算为$ mapping = {map {$ _ => 1} @test_set}的$映射（包含在最终测量中）

_regex

sub{ my $rx = join "|", map quotemeta, @test_set; $search_value =~ /^(?:$rx)$/ }

_regex_prejoined

$search_value =~ /^(?:$rx)$/

uses a regular expression $rx that is precalculated as $rx = join "|", map quotemeta, @test_set; (which is included in the final measuring)
使用预先计算为$ rx = join“|”的正则表达式$ rx，map quotemeta，@ test_set; （包括在最终测量中）

_manual_first

_manual_first

sub{ foreach ( @test_set ) { return 1 if( $_ eq $search_value ); } return 0; }

_first

_第一

first { $_ eq $search_value } @test_set

from List::Util (version 1.38)
来自List :: Util（版本1.38）

_smart

_聪明

$search_value ~~ @test_set

_any

_任何

any { $_ eq $search_value } @test_set

from List::MoreUtils (version 0.33)
来自List :: MoreUtils（版本0.33）

On my machine ( Ubuntu, 3.2.0-60-generic, x86_64, Perl v5.14.2 ) I got the following results. The shown values are seconds and returned by gettimeofday and tv_interval of Time::HiRes (version 1.9726).

在我的机器上（Ubuntu，3.2.0-60-generic，x86_64，Perl v5.14.2），我得到了以下结果。显示的值是秒，由gettimeofday和time :: HiRes（版本1.9726）的tv_interval返回。

Element $search_value is located at position 0 in array @test_set

Element $ search_value位于数组@test_set中的位置0

_hash_premapped:    0.056211
_smart:             0.060267
_manual_first:      0.064195
_first:             0.258953
_any:               0.292959
_regex_prejoined:   0.350076
_grep:              5.748364
_regex:             29.27262
_hash:              45.638838

Element $search_value is located at position 500 in array @test_set

元素$ search_value位于数组@test_set中的位置500

_hash_premapped:    0.056316
_regex_prejoined:   0.357595
_first:             2.337911
_smart:             2.80226
_manual_first:      3.34348
_any:               3.408409
_grep:              5.772233
_regex:             28.668455
_hash:              45.076083

Element $search_value is located at position 999 in array @test_set

元素$ search_value位于数组@test_set中的位置999

_hash_premapped:    0.054434
_regex_prejoined:   0.362615
_first:             4.383842
_smart:             5.536873
_grep:              5.962746
_any:               6.31152
_manual_first:      6.59063
_regex:             28.695459
_hash:              45.804386

Conclusion

结论

The fastest method to check the existence of an element in an array is using prepared hashes. You of course buy that by an proportional amount of memory consumption and it only makes sense if you search for elements in the set multiple times. If your task includes small amounts of data and only a single or a few searches, hashes can even be the worst solution. Not the same way fast, but a similar idea would be to use prepared regular expressions, which seem to have a smaller preparation time.

检查数组中元素是否存在的最快方法是使用准备好的哈希。你当然会通过一定比例的内存消耗购买它，只有你多次搜索集合中的元素才有意义。如果您的任务包含少量数据且只进行一次或几次搜索，则哈希甚至可能是最糟糕的解决方案。快速的方式不一样，但类似的想法是使用准备好的正则表达式，这似乎具有较小的准备时间。

In many cases, a prepared environment is no option.

在许多情况下，准备好的环境是不可取的。

Surprisingly List::Util::first has very good results, when it comes to the comparison of statements, that don't have a prepared environment. While having the search value at the beginning (which could be perhaps interpreted as the result in smaller sets, too) it is very close to the favourites ~~ and any (and could even be in the range of measurement inaccuracy). For items in the middle or at the end of my larger test set, first is definitely the fastest.

令人惊讶的是，List :: Util :: first在语句比较方面有非常好的结果，没有准备好的环境。虽然在开头有搜索值（也许可能被解释为较小集合的结果），但它非常接近收藏夹~~和任何（甚至可能在测量误差范围内）。对于较大测试集中间或末尾的项目，首先肯定是最快的。

#5

brian d foy suggested using a hash, which gives O(1) lookups, at the cost of slightly more expensive hash creation. There is a technique that Marc Jason Dominus describes in his book Higher Order Perl where by a hash is used to memoize (or cache) results of a sub for a given parameter. So for example, if findit(1000) always returns the same thing for the given parameter, there's no need to recalculate the result every time. The technique is implemented in the Memoize module (part of the Perl core).

brian d foy建议使用散列，它提供O（1）查找，代价是稍微更昂贵的散列创建。 Marc Jason Dominus在他的书“Higher Order Perl”中描述了一种技术，其中使用散列来记忆（或缓存）给定参数的子结果。因此，例如，如果findit（1000）总是为给定参数返回相同的内容，则无需每次都重新计算结果。该技术在Memoize模块（Perl核心的一部分）中实现。

Memoizing is not always a win. Sometimes the overhead of the memoized wrapper is greater than the cost of calculating a result. Sometimes a given parameter is unlikely to ever be checked more than once or a relatively few times. And sometimes it cannot be guaranteed that the result of a function for a given parameter will always be the same (ie, the cache can become stale). But if you have an expensive function with stable return values per parameter, memoization can be a big win.

记忆并不总是一个胜利。有时，memoized包装器的开销大于计算结果的开销。有时，给定参数不可能被多次检查或相对较少次检查。有时无法保证给定参数的函数结果总是相同（即缓存可能变得陈旧）。但是如果你有一个昂贵的函数，每个参数的返回值都很稳定，那么memoization就是一个很大的胜利。

Just as brian d foy's answer uses a hash, Memoize uses a hash internally. There is additional overhead in the Memoize implementation, but the benefit to using Memoize is that it doesn't require refactoring the original subroutine. You just use Memoize; and then memoize( 'expensive_function' );, provided it meets the criteria for benefitting from memoization.

就像brian d foy的答案使用哈希一样，Memoize在内部使用哈希。 Memoize实现中还有额外的开销，但使用Memoize的好处是它不需要重构原始子例程。你只需使用Memoize;然后memoize（'expensive_function'）;,只要它符合从memoization中受益的标准。

I took your original subroutine and converted it to work with integers (just for simplicity in testing). Then I added a second version that passed a reference to the original array rather than copying the array. With those two versions, I created two more subs that I memoized. I then benchmarked the four subs.

我使用了原始子程序并将其转换为整数（仅为了简化测试）。然后我添加了第二个版本，它传递了对原始数组的引用，而不是复制数组。有了这两个版本，我创建了两个我记忆的潜艇。然后我对四个潜艇进行了基准测试。

In benchmarking, I had to make some decisions. First, how many iterations to test. The more iterations we test, the more likely we are to have good cache hits for the memoized versions. Then I also had to decide how many items to put into the sample array. The more items, the less likely to have cache hits, but the more significant the savings when a cache hit occurs. I ultimately decided on an array to be searched containing 8000 elements, and chose to search through 24000 iterations. That means that on average there should be two cache hits per memoized call. (The first call with a given param will write to the cache, while the second and third calls will read from the cache, so two good hits on average).

在基准测试中，我不得不做出一些决定。首先，要测试多少次迭代。我们测试的迭代次数越多，我们就越有可能为memoized版本提供良好的缓存命中率。然后我还必须决定将多少项放入样本数组中。项目越多，缓存命中的可能性就越小，但缓存命中时的节省越多。我最终决定要搜索包含8000个元素的数组，并选择搜索24000次迭代。这意味着平均每个memoized调用应该有两次缓存命中。（使用给定参数的第一次调用将写入缓存，而第二次和第三次调用将从缓存中读取，因此平均有两次良好的命中）。

Here is the test code:

这是测试代码：

use warnings;
use strict;
use Memoize;
use Benchmark qw/cmpthese/;

my $n = 8000; # Elements in target array
my $count = 24000; # Test iterations.

my @a = ( 1 .. $n );
my @find = map { int(rand($n)) } 0 .. $count;
my ( $orx, $ormx, $opx, $opmx ) = ( 0, 0, 0, 0 );

memoize( 'orig_memo' );
memoize( 'opt_memo'  );

cmpthese( $count, {
    original  => sub{ my $ret =  original( $find[ $orx++  ],  @a ); },
    orig_memo => sub{ my $ret = orig_memo( $find[ $ormx++ ],  @a ); },
    optimized => sub{ my $ret = optimized( $find[ $opx++  ], \@a ); },
    opt_memo  => sub{ my $ret =  opt_memo( $find[ $opmx++ ], \@a ); }
} );

sub original {
    my ( $var, @a) = @_;
    foreach my $e ( @a ) {
        return 1 if $var == $e;
    }
    return 0;
}

sub orig_memo {
    my ( $var, @a ) = @_;
    foreach my $e ( @a ) {
        return 1 if $var == $e;
    }
    return 0;
}

sub optimized {
    my( $var, $aref ) = @_;
    foreach my $e ( @{$aref} ) {
        return 1 if $var == $e;
    }
    return 0;
}

sub opt_memo {
    my( $var, $aref ) = @_;
    foreach my $e ( @{$aref} ) {
        return 1 if $var == $e;
    }
    return 0;
}

And here are the results:

以下是结果：

             Rate orig_memo  original optimized  opt_memo
orig_memo   876/s        --      -10%      -83%      -94%
original    972/s       11%        --      -82%      -94%
optimized  5298/s      505%      445%        --      -66%
opt_memo  15385/s     1657%     1483%      190%        --

As you can see, the memoized version of your original function was actually slower. That's because so much of the cost of your original subroutine was spent in making copies of the 8000 element array, combined with the fact that there is additional call-stack and bookkeeping overhead with the memoized version.

如您所见，原始函数的memoized版本实际上更慢。这是因为你的原始子程序的大部分成本花费在制作8000元素数组的副本上，再加上memoized版本有额外的调用堆栈和簿记开销这一事实。

But once we pass an array reference instead of a copy, we remove the expense of passing the entire array around. Your speed jumps considerably. But the clear winner is the optimized (ie, passing array refs) version that we memoized (cached), at 1483% faster than your original function. With memoization the O(n) lookup only happens the first time a given parameter is checked. Subsequent lookups occur in O(1) time.

但是一旦我们传递数组引用而不是副本，我们就会消除传递整个数组的费用。你的速度大幅提升。但明显的赢家是我们记忆（缓存）的优化（即传递数组引用）版本，比原始函数快1483％。通过memoization，O（n）查找仅在第一次检查给定参数时发生。后续查找在O（1）时间内发生。

Now you would have to decide (by Benchmarking) whether memoization helps you. Certainly passing an array ref does. And if memoization doesn't help you, maybe brian's hash method is best. But in terms of not having to rewrite much code, memoization combined with passing an array ref may be an excellent alternative.

现在你必须决定（通过Benchmarking）memoization是否可以帮助你。当然传递一个数组引用。如果memoization对你没有帮助，也许brian的哈希方法是最好的。但是，在不必重写大量代码方面，memoization与传递数组ref相结合可能是一个很好的选择。

#6

Your current solution iterates through the array before it finds the element it is looking for. As such, it is a linear algorithm.

您当前的解决方案会在找到要查找的元素之前遍历数组。因此，它是线性算法。

If you sort the array first with a relational operator (>for numeric elements, gt for strings) you can use binary search to find the elements. It is a logarithmic algorithm, much faster than linear.

如果首先使用关系运算符（>表示数字元素，gt表示字符串）对数组进行排序，则可以使用二进制搜索来查找元素。它是一种对数算法，比线性快得多。

Of course, one must consider the penalty of sorting the array in the first place, which is a rather slow operation (n log n). If the contents of the array you are matching against change often, you must sort after every change, and it gets really slow. If the contents remain the same after you've initially sorted them, binary search ends up being practically faster.

当然，必须首先考虑对数组进行排序的惩罚，这是一个相当慢的操作（n log n）。如果您匹配的数组的内容经常更改，则必须在每次更改后进行排序，并且它变得非常慢。如果在您最初对它们进行排序后内容保持不变，则二进制搜索最终会更快。

#7

You can use grep:

你可以使用grep：

sub array_exists {
  my $val = shift;
  return grep { $val eq $_ } @_;
}

Surprisingly, it's not off too far in speed from List::MoreUtils' any(). It's faster if your item is at the end of the list by about 25% and slower by about 50% if your item is at the start of the list.

令人惊讶的是，它与List :: MoreUtils'any（）的速度并没有太大差距。如果您的商品位于列表的最后，如果您的商品位于列表的开头，则该商品在列表末尾的速度提高约25％，速度提高约50％。

You can also inline it if needed -- no need to shove it off into a subroutine. i.e.

如果需要，您也可以内联它 - 无需将其推送到子例程中。即

if ( grep { $needle eq $_ } @haystack ) {
  ### Do something
  ...
}

#1