I have a byte[4096]
and was wondering what the fastest way is to check if all values are zero?
Is there any way faster than doing:
byte[] b = new byte[4096];
b[4095] = 1;

for (int i = 0; i < b.length; i++)
    if (b[i] != 0)
        return false; // Not Empty
5 Answers
#1
60
I have rewritten this answer: I was originally summing all the bytes, but that is incorrect because Java has signed bytes, so I need to OR them instead. I have also changed the JVM warmup so that it is correct now.
Your best bet really is to simply loop over all values.
最好的办法就是简单地循环所有的值。
I suppose you have three major options available:
- OR all the elements together and check the result.
- Do branchless comparisons.
- Do comparisons with a branch.
I don't know how good the low-level performance of adding bytes in Java is, but I do know that branched comparisons benefit from the CPU's (low-level) branch predictor.
Therefore I expect the following to happen for:
byte[] array = new byte[4096];

for (byte b : array) {
    if (b != 0) {
        return false;
    }
}
- Relatively slow comparison in the first few iterations when the branch predictor is still seeding itself.
- Very fast branch comparisons due to branch prediction as every value should be zero anyway.
If it hits a non-zero value, the branch predictor will fail, slowing that comparison down, but at that point you are also at the end of your computation, as you want to return false either way. I think the cost of one failed branch prediction is an order of magnitude smaller than the cost of continuing to iterate over the array.
I furthermore believe that for (byte b : array) is fine here, as it should get compiled directly into indexed array iteration; as far as I know there is no such thing as a PrimitiveArrayIterator that would cause extra method calls (as when iterating over a list) until the code gets inlined.
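For reference, a minimal sketch of the indexed loop that the enhanced for over an array effectively desugars to (no iterator object is involved):

// roughly what `for (byte b : array)` means when array is an array type
for (int i = 0; i < array.length; i++) {
    byte b = array[i];
    if (b != 0) {
        return false;
    }
}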
Update
I wrote my own benchmarks which give some interesting results... Unfortunately I couldn't use any of the existing benchmark tools as they are pretty hard to get installed correctly.
I also decided to group options 1 and 2 together, as I think they are actually the same: with branchless code you usually OR everything (minus the condition) and then check the final result. The condition here is x > 0, and hence ORing in a zero is presumably a no-op.
The code:
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;
import java.util.stream.IntStream;

public class Benchmark {

    private void start() {
        //setup byte arrays
        List<byte[]> arrays = createByteArrays(700_000);

        //warmup and benchmark repeated
        arrays.forEach(this::byteArrayCheck12);
        benchmark(arrays, this::byteArrayCheck12, "byteArrayCheck12");

        arrays.forEach(this::byteArrayCheck3);
        benchmark(arrays, this::byteArrayCheck3, "byteArrayCheck3");

        arrays.forEach(this::byteArrayCheck4);
        benchmark(arrays, this::byteArrayCheck4, "byteArrayCheck4");

        arrays.forEach(this::byteArrayCheck5);
        benchmark(arrays, this::byteArrayCheck5, "byteArrayCheck5");
    }

    private void benchmark(final List<byte[]> arrays, final Consumer<byte[]> method, final String name) {
        long start = System.nanoTime();
        arrays.forEach(method);
        long end = System.nanoTime();
        double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
        System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
    }

    private List<byte[]> createByteArrays(final int amount) {
        Random random = new Random();
        List<byte[]> resultList = new ArrayList<>();
        for (int i = 0; i < amount; i++) {
            byte[] byteArray = new byte[4096];
            byteArray[random.nextInt(4096)] = 1;
            resultList.add(byteArray);
        }
        return resultList;
    }

    private boolean byteArrayCheck12(final byte[] array) {
        int sum = 0;
        for (byte b : array) {
            sum |= b;
        }
        return (sum == 0);
    }

    private boolean byteArrayCheck3(final byte[] array) {
        for (byte b : array) {
            if (b != 0) {
                return false;
            }
        }
        return true;
    }

    private boolean byteArrayCheck4(final byte[] array) {
        // note: unlike the checks above, this returns true when a non-zero byte is present
        return (IntStream.range(0, array.length).map(i -> array[i]).reduce(0, (a, b) -> a | b) != 0);
    }

    private boolean byteArrayCheck5(final byte[] array) {
        // note: like byteArrayCheck4, this returns true when a non-zero byte is present
        return IntStream.range(0, array.length).map(i -> array[i]).anyMatch(i -> i != 0);
    }

    public static void main(String[] args) {
        new Benchmark().start();
    }
}
The surprising results:
Benchmark: byteArrayCheck12 / iterations: 700000 / time per iteration: 50.18817142857143ns
Benchmark: byteArrayCheck3 / iterations: 700000 / time per iteration: 767.7371985714286ns
Benchmark: byteArrayCheck4 / iterations: 700000 / time per iteration: 21145.03219857143ns
Benchmark: byteArrayCheck5 / iterations: 700000 / time per iteration: 10376.119144285714ns
This shows that ORing is a whole lot faster than relying on the branch predictor, which is rather surprising, so I assume some low-level optimizations are being done.
As an extra I've included the stream variants, which I did not expect to be that fast anyhow.
Ran on a stock-clocked Intel i7-3770, 16GB 1600MHz RAM.
So I think the final answer is: it depends. It depends on how many times you are going to check the array consecutively. The "byteArrayCheck3" solution stays steadily at 700~800ns.
Follow-up update
Things actually took another interesting turn: it turns out the JIT was optimizing almost all of the calculations away because the resulting values were not being used at all.
Thus I have the following new benchmark method:
private void benchmark(final List<byte[]> arrays, final Predicate<byte[]> method, final String name) {
    long start = System.nanoTime();
    boolean someUnrelatedResult = false;
    for (byte[] array : arrays) {
        someUnrelatedResult |= method.test(array);
    }
    long end = System.nanoTime();
    double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
    System.out.println("Result: " + someUnrelatedResult);
    System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
}
This ensures that the result of the benchmarks cannot be optimized away. The major issue before was that byteArrayCheck12 was effectively being called as a void method: the JIT noticed that the (sum == 0) result was never used and optimized the entire method away.
Thus we have the following new results (result prints omitted for clarity):
Benchmark: byteArrayCheck12 / iterations: 700000 / time per iteration: 1370.6987942857143ns
Benchmark: byteArrayCheck3 / iterations: 700000 / time per iteration: 736.1096242857143ns
Benchmark: byteArrayCheck4 / iterations: 700000 / time per iteration: 20671.230327142857ns
Benchmark: byteArrayCheck5 / iterations: 700000 / time per iteration: 9845.388841428572ns
Hence we might think we can finally conclude that branch prediction wins. However, it could also be due to the early returns, since on average the offending byte will be in the middle of the byte array. So it is time for another method that does not return early:
private boolean byteArrayCheck3b(final byte[] array) {
    int hits = 0;
    for (byte b : array) {
        if (b != 0) {
            hits++;
        }
    }
    return (hits == 0);
}
In this way we still benefit from branch prediction, but we make sure that we cannot return early.
Which in turn gives us more interesting results again!
Benchmark: byteArrayCheck12 / iterations: 700000 / time per iteration: 1327.2817714285713ns
Benchmark: byteArrayCheck3 / iterations: 700000 / time per iteration: 753.31376ns
Benchmark: byteArrayCheck3b / iterations: 700000 / time per iteration: 1506.6772842857142ns
Benchmark: byteArrayCheck4 / iterations: 700000 / time per iteration: 21655.950115714284ns
Benchmark: byteArrayCheck5 / iterations: 700000 / time per iteration: 10608.70917857143ns
I think we can now finally conclude that the fastest way is to use both early return and branch prediction, followed by ORing, followed by pure branch prediction without early return. I suspect that all of those operations are highly optimized in native code.
Update: some additional benchmarking using long and int arrays
After seeing suggestions to use long[] and int[], I decided it was worth investigating. These attempts may not be fully in line with the original answer anymore, but they may still be interesting.
Firstly, I changed the benchmark method to use generics:
private <T> void benchmark(final List<T> arrays, final Predicate<T> method, final String name) {
    long start = System.nanoTime();
    boolean someUnrelatedResult = false;
    for (T array : arrays) {
        someUnrelatedResult |= method.test(array);
    }
    long end = System.nanoTime();
    double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
    System.out.println("Result: " + someUnrelatedResult);
    System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
}
Then, before running the benchmarks, I performed conversions from byte[] to long[] and int[] respectively; it was also necessary to set the maximum heap size to 10 GB.
List<long[]> longArrays = arrays.stream().map(byteArray -> {
    long[] longArray = new long[4096 / 8];
    ByteBuffer.wrap(byteArray).asLongBuffer().get(longArray);
    return longArray;
}).collect(Collectors.toList());
longArrays.forEach(this::byteArrayCheck8);
benchmark(longArrays, this::byteArrayCheck8, "byteArrayCheck8");

List<int[]> intArrays = arrays.stream().map(byteArray -> {
    int[] intArray = new int[4096 / 4];
    ByteBuffer.wrap(byteArray).asIntBuffer().get(intArray);
    return intArray;
}).collect(Collectors.toList());
intArrays.forEach(this::byteArrayCheck9);
benchmark(intArrays, this::byteArrayCheck9, "byteArrayCheck9");
private boolean byteArrayCheck8(final long[] array) {
    for (long l : array) {
        if (l != 0) {
            return false;
        }
    }
    return true;
}

private boolean byteArrayCheck9(final int[] array) {
    for (int i : array) {
        if (i != 0) {
            return false;
        }
    }
    return true;
}
Which gave the following results:
Benchmark: byteArrayCheck8 / iterations: 700000 / time per iteration: 259.8157614285714ns
Benchmark: byteArrayCheck9 / iterations: 700000 / time per iteration: 266.38013714285717ns
This path may be worth exploring if it is possible to get the bytes in such a format. However, when doing the conversions inside the benchmarked method, the times were around 2000 nanoseconds per iteration, so it is not worth it when you need to do the conversion yourself.
#2
6
This may not be the fastest or most memory-efficient solution, but it is a one-liner:
byte[] arr = randomByteArray();
assert Arrays.equals(arr, new byte[arr.length]);
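One caveat: assert only runs when assertions are enabled (the -ea JVM flag), and the comparison allocates a fresh zero-filled array of the same length on every call. A minimal sketch of the same idea as a plain boolean method (the name isAllZeros is illustrative, not from the original answer):

// same check as an ordinary method; note the zero array allocated per call
static boolean isAllZeros(final byte[] arr) {
    return java.util.Arrays.equals(arr, new byte[arr.length]);
}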
#3
3
For Java 8, you can simply use this:
public static boolean isEmpty(final byte[] data) {
    return IntStream.range(0, data.length).parallel().allMatch(i -> data[i] == 0);
}
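For a 4 KB array, the fork/join overhead of parallel() may well exceed the cost of the scan itself, so it may be worth comparing against a sequential variant; a sketch under that assumption (isEmptySequential is an illustrative name, not measured here):

// sequential counterpart for comparison; whether parallel() pays off
// for a 4096-byte array is an assumption to verify, not a measured result
public static boolean isEmptySequential(final byte[] data) {
    return IntStream.range(0, data.length).allMatch(i -> data[i] == 0);
}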
#4
0
I think that theoretically your way is the fastest way; in practice you might be able to make use of larger comparisons, as suggested by one of the commenters (a 1-byte comparison takes one instruction, but so does an 8-byte comparison on a 64-bit system).
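As an illustration of the "larger comparisons" idea without going through a ByteBuffer, here is a rough sketch that folds 8 bytes into one long and branches once per 8-byte block; it assumes the length is a multiple of 8 (true for 4096), the name isAllZeros8 is my own, and whether it actually beats the plain loop depends on what the JIT makes of it:

// rough sketch: OR 8 bytes into a long and test once per 8-byte block;
// assumes b.length is a multiple of 8 (holds for 4096)
static boolean isAllZeros8(final byte[] b) {
    for (int i = 0; i < b.length; i += 8) {
        long word = 0;
        for (int j = 0; j < 8; j++) {
            word |= b[i + j];
        }
        if (word != 0) {
            return false; // a non-zero byte was found in this block
        }
    }
    return true;
}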
Also, in languages closer to the hardware (C and its variants) you can make use of something called vectorization, where you perform a number of comparisons/additions simultaneously. It looks like Java still doesn't have native support for it, but based on this answer you might be able to get some use out of it.
Also, in line with the other comments, I would say that with a 4k buffer it is probably not worth the time trying to optimize it (unless it is being called very often).
#5
0
Someone suggested checking 4 or 8 bytes at a time. You actually can do this in Java:
LongBuffer longBuffer = ByteBuffer.wrap(b).asLongBuffer();
while (longBuffer.hasRemaining()) {
    if (longBuffer.get() != 0) {
        return false;
    }
}
return true;
Whether this is faster than checking byte values is uncertain, since there is so much potential for optimization.