What is the significance of the load factor in HashMap?

Date: 2021-05-29 21:42:13

HashMap has two important properties: size and load factor. I went through the Java documentation and it says 0.75f is the initial load factor. But I can't find the actual use of it.


Can someone describe what are the different scenarios where we need to set load factor and what are some sample ideal values for different cases?


6 solutions

#1


206  

The documentation explains it pretty well:


An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.


As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

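As a sketch of that last point, a map pre-sized from the expected entry count and load factor never needs to rehash. Here `withExpectedSize` is a hypothetical helper written for illustration, not a JDK method:

```java
import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    // Hypothetical helper: pick an initial capacity larger than
    // expectedEntries / loadFactor, so no rehash ever occurs while
    // filling the map up to expectedEntries.
    static <K, V> Map<K, V> withExpectedSize(int expectedEntries, float loadFactor) {
        int capacity = (int) Math.ceil(expectedEntries / loadFactor) + 1;
        return new HashMap<>(capacity, loadFactor);
    }

    public static void main(String[] args) {
        Map<String, Integer> m = withExpectedSize(1_000, 0.75f);
        for (int i = 0; i < 1_000; i++) {
            m.put("key" + i, i);
        }
        System.out.println(m.size()); // prints 1000
    }
}
```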

As with all performance optimizations, it is a good idea to avoid optimizing things prematurely (i.e. without hard data on where the bottlenecks are).


#2


113  

The default initial capacity of a HashMap is 16 and the default load factor is 0.75f (i.e. 75% of the current map size). The load factor determines at what fill level the HashMap's capacity should be doubled.

For example, the product of capacity and load factor is 16 * 0.75 = 12. This means that after storing the 12th key-value pair into the HashMap, its capacity becomes 32.
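The arithmetic above can be sketched as a small calculation (this just evaluates the formula with the default values; it does not inspect a real HashMap's internals):

```java
public class ThresholdDemo {
    public static void main(String[] args) {
        int capacity = 16;          // default initial capacity
        float loadFactor = 0.75f;   // default load factor
        // The resize threshold is capacity * loadFactor.
        int threshold = (int) (capacity * loadFactor);
        System.out.println(threshold);                         // prints 12
        // After resizing, the capacity doubles and so does the threshold.
        System.out.println((int) (capacity * 2 * loadFactor)); // prints 24
    }
}
```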

#3


25  

Actually, from my calculations, the "perfect" load factor is closer to log 2 (~0.693), although any load factor less than this will yield better performance. I think that .75 was probably pulled out of a hat.

Proof:


Chaining can be avoided, and branch prediction exploited, by predicting whether or not a bucket is empty. A bucket is probably empty if the probability of it being empty exceeds .5.

Let s represent the table size and n the number of keys added. Using the binomial distribution, the probability of a bucket being empty is:

P(0) = C(n, 0) * (1/s)^0 * (1 - 1/s)^(n - 0) = (1 - 1/s)^n

Thus, a bucket is probably empty if there are fewer than

log(2)/log(s/(s - 1)) keys

As s reaches infinity, if the number of keys added is such that P(0) = .5, then n/s rapidly approaches log(2):

lim (log(2)/log(s/(s - 1)))/s as s -> infinity = log(2) ~ 0.693...
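A quick numerical check of that limit, simply evaluating the formula above for growing s:

```java
public class LoadFactorLimit {
    public static void main(String[] args) {
        // n = log(2) / log(s / (s - 1)) is the key count at which
        // P(0) = .5; check that n/s approaches ln 2 ≈ 0.693 as s grows.
        for (int s : new int[] {16, 1_024, 1 << 20}) {
            double n = Math.log(2) / Math.log((double) s / (s - 1));
            System.out.printf("s = %7d   n/s = %.6f%n", s, n / s);
        }
        System.out.printf("ln 2 = %.6f%n", Math.log(2)); // 0.693147
    }
}
```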

#4


21  

What is load factor ?

The fraction of the capacity that must be exhausted before the HashMap increases its capacity.

Why load factor ?

The load factor is by default 0.75, so with the initial capacity of 16, 25% of the buckets will still be free before the capacity is increased; the resize then creates many new buckets, with rehashed entries pointing to them, right after the number of buckets grows.

Now why should you keep many free buckets, and what is the impact of free buckets on performance?

If you set the load factor to, say, 1.0, then something very interesting might happen.

Say you are adding an object x to your HashMap whose hashCode is 888, and in your HashMap the bucket representing that hashcode is free, so object x gets added to the bucket. Now say you add another object y whose hashCode is also 888: object y will certainly get added, but at the end of the bucket (because the buckets are nothing but a linked-list implementation storing key, value & next). This has a performance impact! Since object y is no longer at the head of the bucket, a lookup will no longer take O(1) time; it depends on how many items are in the same bucket. This is called a hash collision, by the way, and it happens even when your load factor is less than 1.
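The collision scenario above can be sketched with a deliberately bad key type. `BadKey` is a hypothetical class invented for this demo; its constant hashCode of 888 forces every instance into the same bucket:

```java
import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    // Hypothetical key whose hashCode is fixed at 888, so every
    // instance lands in the same bucket (a guaranteed hash collision).
    static final class BadKey {
        final String name;
        BadKey(String name) { this.name = name; }
        @Override public int hashCode() { return 888; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).name.equals(name);
        }
    }

    public static void main(String[] args) {
        Map<BadKey, String> m = new HashMap<>();
        m.put(new BadKey("x"), "first");
        m.put(new BadKey("y"), "second");
        // Both entries live in the same bucket; lookups still succeed,
        // but each get must walk the bucket's chain comparing keys.
        System.out.println(m.get(new BadKey("y"))); // prints second
        System.out.println(m.size());               // prints 2
    }
}
```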

Correlation between performance, hash collisions & load factor?

Lower load factor = more free buckets = fewer chances of collision = higher performance = higher space requirement.

Correct me if I am wrong somewhere.

#5


16  

From the documentation:


The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased


It really depends on your particular requirements; there's no "rule of thumb" for specifying an initial load factor.

#6


1  

I would pick a table size of n * 1.5, or n + (n >> 1); this gives a load factor of .66666~ without division, which is slow on most systems, especially on portable systems where there is no division in the hardware.
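The shift trick can be verified directly: `n + (n >> 1)` computes n * 1.5 using only an add and a shift, giving the ~0.666 load factor described above.

```java
public class ShiftSizing {
    public static void main(String[] args) {
        int n = 1000; // expected number of entries
        // n + (n >> 1) computes n * 1.5 with an add and a shift,
        // avoiding both division and floating-point arithmetic.
        int tableSize = n + (n >> 1);
        System.out.println(tableSize); // prints 1500
        // Resulting load factor is n / tableSize = 2/3 ≈ 0.666.
        System.out.println((double) n / tableSize);
    }
}
```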
