Most efficient way to count occurrences?

Date: 2022-09-23 17:13:05

I've got an array of primitive bytes that can hold arbitrary values. I'm trying to count how many times each value occurs in the array in the most efficient/fastest way. Currently I'm using:

HashMap<Byte, Integer> dataCount = new HashMap<>();
for (byte b : data) dataCount.put(b, dataCount.getOrDefault(b, 0) + 1);

This one-liner takes ~500ms to process a byte[] of length 24883200. Using a regular for loop takes at least 600ms.

I've been thinking of constructing a set (since they only contain one of each element) then adding it to a HashMap using Collections.frequency(), but the methods to construct a Set from primitives require several other calls, so I'm guessing it's not as fast.

What would be the fastest way to accomplish counting of occurrences of each item?

I'm using Java 8 and I'd prefer to avoid using Apache Commons if possible.

2 answers

#1


If it's just bytes, use an array, don't use a map. You do have to use masking to deal with the signedness of bytes, but that's not a big deal.

int[] counts = new int[256];
for (byte b : data) {
    counts[b & 0xFF]++;
}

Arrays are just so massively compact and efficient that they're almost impossible to beat when you can use them.
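
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. The class name ByteHistogram, the method name countBytes and the sample data are made up for illustration; only the loop itself comes from the snippet above.

```java
public class ByteHistogram {

    /** Returns a 256-slot table; slot i holds the count of the unsigned byte value i. */
    static int[] countBytes(byte[] data) {
        int[] counts = new int[256];
        for (byte b : data) {
            counts[b & 0xFF]++; // the mask maps -128..127 onto 0..255
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative sample data, not taken from the question
        byte[] data = {1, 1, -1, 0, 127, -128, 1};
        int[] counts = countBytes(data);
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {
                // Casting the index back to byte recovers the signed value
                System.out.println((byte) i + " -> " + counts[i]);
            }
        }
    }
}
```

If you need the signed byte value back when reading the table, cast the index with (byte) i as in the print loop.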

阵列非常紧凑和高效,当你可以使用时几乎不可能击败它们。

#2


I would create an array instead of a HashMap, given that you know exactly how many counts you need to keep track of:

int[] counts = new int[256];
for (byte b : data) {
    counts[b & 0xff]++;
}

That way:

  • You never need to do any boxing of either the keys or the values

  • Nothing needs to take a hash code, check for equality, etc.

  • It's about as memory-efficient as it gets
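
If you want to convince yourself that the array version produces the same counts as the HashMap version from the question, a rough sketch like the following should do. The class and method names (CountComparison, countWithMap, countWithArray, sameCounts) and the seeded random input are assumptions made for this example, not part of the original post.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class CountComparison {

    /** The boxed approach from the question: every key and value is an object. */
    static Map<Byte, Integer> countWithMap(byte[] data) {
        Map<Byte, Integer> counts = new HashMap<>();
        for (byte b : data) {
            counts.put(b, counts.getOrDefault(b, 0) + 1);
        }
        return counts;
    }

    /** The primitive approach: one int slot per possible byte value. */
    static int[] countWithArray(byte[] data) {
        int[] counts = new int[256];
        for (byte b : data) {
            counts[b & 0xff]++;
        }
        return counts;
    }

    /** True if both structures hold exactly the same count for every byte value. */
    static boolean sameCounts(Map<Byte, Integer> map, int[] array) {
        int nonZero = 0;
        for (int i = 0; i < array.length; i++) {
            if (array[i] == 0) {
                continue;
            }
            nonZero++;
            Integer fromMap = map.get((byte) i); // index back to signed byte key
            if (fromMap == null || fromMap != array[i]) {
                return false;
            }
        }
        return nonZero == map.size();
    }

    public static void main(String[] args) {
        byte[] data = new byte[100_000];
        new Random(42).nextBytes(data); // fixed seed so runs are repeatable
        System.out.println(sameCounts(countWithMap(data), countWithArray(data))); // prints true
    }
}
```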

Note that the & 0xff is used to get a value in the range [0, 255] instead of [-128, 127], so it's suitable as the index into the array.
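
To see what the mask does to a negative byte, here is a small sketch. MaskDemo and toIndex are hypothetical names used only for this example; Byte.toUnsignedInt is a standard method available since Java 8 and gives the same result as the mask.

```java
public class MaskDemo {

    /** Maps a signed byte (-128..127) to an array index (0..255). */
    static int toIndex(byte b) {
        return b & 0xff;
    }

    public static void main(String[] args) {
        byte b = (byte) 200; // does not fit in a signed byte, so it is stored as -56

        System.out.println(b);                     // -56
        System.out.println(toIndex(b));            // 200
        System.out.println(Byte.toUnsignedInt(b)); // 200, same as the mask
    }
}
```

Without the mask, a negative byte would be used as a negative index and counts[b]++ would throw an ArrayIndexOutOfBoundsException.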
