Some hash table schemes, such as cuckoo hashing or dynamic perfect hashing, rely on the existence of universal hash functions and the ability to take a collection of data exhibiting collisions and resolve those collisions by picking a new hash function from the family of universal hash functions.
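To make "picking a new hash function from the family" concrete, here is a minimal sketch for plain int keys. It assumes the classic Carter-Wegman family h(x) = ((a*x + b) mod p) mod m; the class and method names (UniversalIntHash, pickCollisionFree) are made up for illustration and do not come from any library.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

/** One member of the family h(x) = ((a*x + b) mod P) mod buckets (illustrative sketch). */
final class UniversalIntHash {
    // Prime 2^31 - 1; keys are assumed to lie in [0, P).
    private static final long P = 2147483647L;
    private final long a;      // drawn from [1, P - 1]
    private final long b;      // drawn from [0, P - 1]
    private final int buckets;

    UniversalIntHash(Random rng, int buckets) {
        this.a = 1 + rng.nextInt((int) (P - 1));
        this.b = rng.nextInt((int) P);
        this.buckets = buckets;
    }

    int hash(int key) {
        if (key < 0 || key >= P) {
            throw new IllegalArgumentException("key out of range");
        }
        // a * key + b stays below 2^63, so the long arithmetic cannot overflow.
        return (int) (((a * key + b) % P) % buckets);
    }

    /** True if two of the given keys land in the same bucket under h. */
    static boolean hasCollision(UniversalIntHash h, int[] keys) {
        Set<Integer> seen = new HashSet<>();
        for (int key : keys) {
            if (!seen.add(h.hash(key))) {
                return true;
            }
        }
        return false;
    }

    /**
     * Keeps drawing new family members until the keys no longer collide.
     * Assumes the keys are distinct and buckets is comfortably larger than keys.length.
     */
    static UniversalIntHash pickCollisionFree(int[] keys, int buckets, Random rng) {
        UniversalIntHash h = new UniversalIntHash(rng, buckets);
        while (hasCollision(h, keys)) {
            h = new UniversalIntHash(rng, buckets); // resolve collisions by re-drawing
        }
        return h;
    }
}
```

The re-draw only helps because a fresh (a, b) pair treats distinct keys differently; that is exactly what a fixed hashCode cannot offer for two objects that already share a hash code.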
A while ago I was trying to implement a hash table in Java backed by cuckoo hashing and ran into trouble: while all Java objects have a hashCode function, the value that hashCode returns is fixed for each object (unless, of course, the object changes). This means that unless the user provides an external family of universal hash functions, it's impossible to build a hash table that relies on universal hashing.
Initially I thought that I could get around this by applying a universal hash function to the objects' hashCodes directly, but this doesn't work: if two objects have the same hashCode, then any deterministic function you apply to those hash codes, even a randomly chosen hash function, will produce the same value and thus cause a collision.
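A small sketch of why this fails. The strings "Aa" and "BB" are distinct but share the hashCode 2112 under the documented String.hashCode formula, so any function of the hash code alone cannot tell them apart. The names SameHashCodeDemo, rehash, and alwaysCollide are made up for this example, and the multiply-add function merely stands in for "some randomly chosen hash function".

```java
import java.util.Random;

final class SameHashCodeDemo {
    /** A randomly parameterised function of an int hash code: (a*h + b) mod buckets. */
    static int rehash(int hashCode, long a, long b, int buckets) {
        return (int) Math.floorMod(a * hashCode + b, (long) buckets);
    }

    /**
     * Draws `trials` random (a, b) pairs and reports whether x and y
     * land in the same bucket every single time.
     */
    static boolean alwaysCollide(Object x, Object y, int trials, int buckets, Random rng) {
        for (int i = 0; i < trials; i++) {
            long a = rng.nextLong() | 1L; // arbitrary odd multiplier
            long b = rng.nextLong();
            if (rehash(x.hashCode(), a, b, buckets) != rehash(y.hashCode(), a, b, buckets)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode() == "BB".hashCode()); // the two hash codes are equal
        System.out.println(alwaysCollide("Aa", "BB", 1000, 16, new Random()));
    }
}
```

Both lines print true regardless of the random parameters: the randomness is applied after the information that distinguishes the two objects has already been thrown away.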
It seems like this would be detrimental to Java's design. It means that HashMap and other hash containers are completely prohibited from using tables based on universal hashing, even if the language designers think that such tables would be appropriate. It also makes it harder for third-party library designers to build hash tables of this sort.
My question is: is there a reason that Java opted to design hashCode without considering the possibility of hashing objects with multiple hash functions? I understand that many good hashing schemes like chained hashing or quadratic probing don't require it, but it seems as though the decision makes it hard to use certain classes of algorithms on Java objects.
2 Answers
#1
15
Simplicity. Java allows class designers to provide their own hashCode, which as you mention is good enough for "ordinary" hash tables, and can be hard enough to understand as it is.
Besides, when the Java Collections API was designed, having generic hash tables in the standard library was already a bold enough move. C has never had them. C++ had them in the STL as hash_set and hash_map, but those didn't make it into the standard. Only now, in C++0x, are hash tables being considered for standardization again.
#2
0
I think the normal hashCode method was created without the "malicious inputs" case in mind. Also, as written by larsmann, its contract is much easier to understand and implement than a universal hash function would be.
Here is an idea about what to do:
- Use a map implementation relying on external hash-functions (like the HashableEquivalenceRelation I presented some hours ago here)
- then use a universal family of such implementations (or an implementation which allows changing the parameter to switch to another member of the family).
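A rough sketch of how those two steps might look. Everything here is hypothetical: ExternalHasher and StringHashFamily are names invented for this example (they are not the HashableEquivalenceRelation mentioned above), and the seeded polynomial hash is only an illustrative parameterised family, not one proven to be universal.

```java
import java.util.Random;

/** Hypothetical: a hash function that lives outside the class being hashed. */
interface ExternalHasher<T> {
    int hash(T value);
}

/** Hypothetical family of string hashers, one member per (multiplier, offset) pair. */
final class StringHashFamily {
    private StringHashFamily() {}

    /** Returns the member with the given parameters. */
    static ExternalHasher<String> member(int multiplier, int offset) {
        return s -> {
            int h = offset;
            for (int i = 0; i < s.length(); i++) {
                h = h * multiplier + s.charAt(i); // reads the characters, not hashCode()
            }
            return h;
        };
    }

    /** Switches to another member by drawing fresh parameters. */
    static ExternalHasher<String> random(Random rng) {
        return member(rng.nextInt() | 1, rng.nextInt());
    }
}
```

Because each member looks at the object's contents rather than its hashCode, switching members can separate keys that collide under the built-in hash: member(31, 0) reproduces String.hashCode and maps "Aa" and "BB" to the same value, while member(33, 7) maps them to different values. A map built on this idea would take an ExternalHasher as a constructor argument and call random(...) again whenever it needs to rehash.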