Many books and tutorials say that the size of a hash table must be a prime to evenly distribute the keys in all the buckets. But Java's HashMap always uses a size that is a power of two. Shouldn't it be using a prime? What's better, a "prime" or a "power of two" as the hash table size?
5 Answers
#1
19
Using a power of two effectively masks out top bits of the hash code. Thus a poor-quality hash function might perform particularly badly in this scenario.
Java's HashMap mitigates this by mistrusting the object's hashCode() implementation and applying a second level of hashing to its result:
Applies a supplemental hash function to a given hashCode, which defends against poor quality hash functions. This is critical because HashMap uses power-of-two length hash tables, that otherwise encounter collisions for hashCodes that do not differ in lower bits.
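As a rough sketch of what that second level of hashing looks like (not a copy of the JDK source; recent HashMap versions use an equivalent of the spread step below, and the capacity is assumed to be a power of two):

class PowerOfTwoIndexing {
    // Fold the high bits of the hash into the low bits, roughly as
    // modern HashMap does, so the mask below does not discard them.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Bucket selection in a power-of-two table keeps only the low bits.
    static int indexFor(Object key, int capacity) { // capacity: a power of two
        int h = (key == null) ? 0 : spread(key.hashCode());
        return h & (capacity - 1);
    }
}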
If you have a good hash function, or do something similar to what HashMap does, it does not matter whether you use prime numbers etc. as the table size.
If, on the other hand, the hash function is of unknown or poor quality, then using a prime number would be a safer bet. It will, however, make dynamically sized tables trickier to implement, since all of a sudden you need to be able to produce prime numbers instead of just multiplying the size by a constant factor.
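To make the trade-off concrete, here is a small, hypothetical demo: keys whose hash codes differ only in their high bits all collide in a power-of-two table when no rehash is applied, while a prime number of buckets still spreads them out.

class PrimeVsPowerOfTwoDemo {
    public static void main(String[] args) {
        int powerOfTwo = 64;
        int prime = 61;
        java.util.Set<Integer> maskedBuckets = new java.util.HashSet<>();
        java.util.Set<Integer> primeBuckets = new java.util.HashSet<>();
        for (int i = 0; i < 1000; i++) {
            int poorHash = i << 16; // hash codes that differ only in the high bits
            maskedBuckets.add(poorHash & (powerOfTwo - 1));   // mask keeps the low bits: always 0
            primeBuckets.add(Math.floorMod(poorHash, prime)); // prime modulus still spreads them
        }
        System.out.println("power-of-two buckets used: " + maskedBuckets.size()); // 1
        System.out.println("prime buckets used:        " + primeBuckets.size());  // 61
    }
}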
#2
3
The standard HashMap implementation has a hash method which rehashes your object's hashcode to avoid that pitfall. The comment before the hash() method reads:
/**
* Retrieve object hash code and applies a supplemental hash function to the
* result hash, which defends against poor quality hash functions. This is
* critical because HashMap uses power-of-two length hash tables, that
* otherwise encounter collisions for hashCodes that do not differ
* in lower bits. Note: Null keys always map to hash 0, thus index 0.
*/
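For reference, the defensive rehash that this comment accompanied in JDK 6/7-era sources looked roughly like the following (shown as an illustration; check your JDK's source for the exact code):

// Supplemental hash as it appeared in older HashMap versions (illustrative).
// The shifts and XORs push information from the high bits down into the
// low bits, which are the only bits the power-of-two index mask keeps.
static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}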
#3
3
The only way to know whether a prime or a power of two is better is to benchmark it.
Many years ago, when writing an assembler whose performance depended strongly on symbol table lookup, I tested this using a large block of generated identifiers. Even with a naive mapping, I found that power-of-two, as expected, had a less even distribution and longer chains than a similarly sized prime number of buckets. It still ran faster, because of the speed of bucket selection by bit masking.
I strongly suspect the java.util developers would not have resorted to the extra hashing and power-of-two without benchmarking it against using a prime number of buckets. It is a really obvious thing to do when designing a hashed data structure.
For that reason, I'm sure the rehash and power-of-two size gives better performance for typical Java hash maps than a prime number of buckets.
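A minimal sketch of that kind of experiment, assuming generated identifier-like keys and looking only at chain lengths (raw speed would need a proper benchmark harness on top of this):

class ChainLengthExperiment {
    // Distribute `keys` generated identifiers over `buckets` buckets using raw hashCode values.
    static int[] chains(int buckets, int keys) {
        int[] counts = new int[buckets];
        for (int i = 0; i < keys; i++) {
            String ident = "sym" + i; // stand-in for a generated identifier
            counts[Math.floorMod(ident.hashCode(), buckets)]++;
        }
        return counts;
    }

    static int longest(int[] chains) {
        int max = 0;
        for (int c : chains) max = Math.max(max, c);
        return max;
    }

    public static void main(String[] args) {
        int keys = 100_000;
        System.out.println("longest chain, 1024 buckets (power of two): " + longest(chains(1024, keys)));
        System.out.println("longest chain, 1021 buckets (prime):        " + longest(chains(1021, keys)));
    }
}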
#4
0
From a performance/calculation-time point of view, with a power-of-two size the bucket index can be computed with just a bit mask, which is faster than the integer division that would otherwise be required.
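For example, when the size is a power of two, reducing a non-negative hash with a bit mask gives the same bucket as the modulo operation, without the division (a minimal sketch):

class MaskVsModulo {
    public static void main(String[] args) {
        int n = 16;                // a power-of-two table size
        int h = 123456789;         // some non-negative hash value
        int byMask = h & (n - 1);  // a single AND
        int byModulo = h % n;      // needs an integer division
        System.out.println(byMask + " == " + byModulo); // prints "5 == 5"
    }
}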
#5
0
You probably should use prime-sized hash tables if you use quadratic probing for collision resolution. With a prime-sized table, quadratic probing will hit half of the entries; with a non-prime size it can hit fewer, so you might not find a suitable place to store your entry even if your hash table is less than half full. Since Java hash maps don't use quadratic probing, there is no need to use a prime as the size.
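A small sketch of that claim: counting how many distinct slots the probe sequence (home + k*k) mod size can ever reach. With a prime size roughly half the slots are reachable; with a power-of-two size far fewer are.

class QuadraticProbeCoverage {
    // How many distinct buckets can the probe sequence (home + k*k) % size reach?
    // The home bucket only shifts the sequence, so probing from 0 is enough.
    static int reachable(int size) {
        java.util.Set<Integer> slots = new java.util.HashSet<>();
        for (int k = 0; k < size; k++) {
            slots.add((k * k) % size);
        }
        return slots.size();
    }

    public static void main(String[] args) {
        System.out.println("size 17 (prime):        " + reachable(17) + " of 17 slots"); // 9
        System.out.println("size 16 (power of two): " + reachable(16) + " of 16 slots"); // 4
    }
}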