对原始类型的好哈希函数。

时间:2021-02-01 05:27:38

I'm trying to figure out a good hash function for a std::pair of two primitive types. This is the way I have it implemented right now:

我正在尝试为std找到一个好的哈希函数::两个基本类型的对。这就是我现在实施的方法:

template<typename T, typename U>
std::size_t operator()(const std::pair<T,U> &rhs) const
{
    return stdext::hash_value<T>(rhs.first) ^ stdext::hash_value<U>(rhs.second);
}

It appears to work even if I have two pairs such as (1, 2) and (2, 1) (numbers flipped). They generate the same hash value but the values are still inserted successfully into the hash map. Any thoughts?

即使我有两对(1,2)和(2,1)(数字翻转),它似乎也能工作。它们生成相同的散列值,但是值仍然被成功插入到散列映射中。任何想法吗?

5 个解决方案

#1


5  

Generally speaking, hashing containers always have to handle this case (hash collisions). There are a couple methods they can use like chaining and probing, any of which will potentially hurt performance.

一般来说,哈希容器总是必须处理这种情况(散列冲突)。有几种方法可以使用,如链接和探测,任何一种都有可能损害性能。

Instead, I would suggest using boost::hash_combine to combine the hashes in such a way they swapping first and second doesn't generate the same hash.

相反,我建议使用boost:: hash_merge以这样的方式组合散列,这样它们就不会产生相同的散列。

#2


2  

Here's an implementation of hash_combine based on the docs from the current version of boost: (http://www.boost.org/doc/libs/1_53_0/doc/html/hash/reference.html#boost.hash_combine)

这里有一个基于当前boost版本的文档的hash_merge的实现:(http://www.boost.org/doc/libs/1_53_0/doc/html/hash/reference.html#boost.hash_combine)

template<typename T> void hash_combine(size_t & seed, T const& v) {
  seed ^= stdext::hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

You'd use it like this:

你可以这样使用:

template<typename T, typename U>
std::size_t operator()(const std::pair<T,U> &rhs) const   {
  size_t retval = stdext::hash_value<T>(rhs.first);
  hash_combine(retval, rhs.second);
  return retval;
}

I can't actually vouch for the logic behind this function, but it looks more solid than most of the options mentioned here.

我不能担保这个函数背后的逻辑,但它看起来比这里提到的大多数选项都要可靠。

#3


1  

Assuming that stdext::hash_value provides a good distribution of hash values for each of first and second, what you've done is fine if you don't expect a disproportionately high incidence of mirror image pairs (e.g. (1,2) and (2,1)). If your data set is such that you do expect a lot of those, you might consider tweaking the hash function to force them to be different. For example, inverting the first hash value:

假设stdext::hash_value提供了一个良好的散列值分布,如果您不期望异常高的镜像对(例如,1,2)和(2,1),那么您所做的就很好了。如果您的数据集是这样的,您确实期望有很多这样的数据集,那么您可以考虑调整散列函数来强制它们不同。例如,反转第一个哈希值:

return ~stdext::hash_value<T>(rhs.first) ^ stdext::hash_value<T>(rhs.second);

I mention this only because you expressed a concern about mirror image pairs. If your input is close to random in that regard, then ^ should be fine. Remember, the goal is to minimize collisions, not avoid them entirely.

我之所以提到这一点,是因为你表达了对镜像对的关注。如果你的输入是随机的在这方面,那么^应该没事的。记住,目标是最小化冲突,而不是完全避免冲突。

#4


0  

As others have noted, hash functions affect only the efficiency of hash tables, not correctness. Your function is only bad if mirror image pairs are frequent. Since in some applications this might be a problem, it would be reasonable to swap upper and lower half-words of one hash value and then xor. Many other schemes are possible, but this one is very fast and simple.

正如其他人所指出的,哈希函数只影响哈希表的效率,而不影响正确性。如果镜像对频繁,你的函数就不好了。由于在某些应用程序中,这可能是一个问题,所以将一个哈希值的上下半字替换为xor是合理的。许多其他的方案都是可行的,但是这个方法非常简单。

template<typename T, typename U>
std::size_t operator()(const std::pair<T,U> &rhs) const
{
    const int h = sizeof(size_t) * 8 / 2;
    size_t a = stdext::hash_value<T>(rhs.first);
    return ((a << h) | (a >> h)) ^ stdext::hash_value<U>(rhs.second);
}

#5


0  

Just for the fun of it, here's another approach that's simple and directly addresses the problem cases, specifically:

只是为了好玩,这是另一种简单直接处理问题的方法,具体来说:

  • if both first and second are the same, it returns their common hash value
  • 如果第一个和第二个是相同的,则返回它们的普通散列值。
  • otherwise, it XORs the values but:
    • to prevent h(a, b) colliding with h(b, a), it uses a < b to choose between h(a)^h(b) and h(a)^h(~b)
    • 防止h(a,b)碰撞与h(b),它使用一个< b h(a)^ h之间做出选择(b)和h(a)^ h(~ b)
  • 否则,它值xor但是:防止h(a,b)碰撞与h(b),它使用一个< b h(a)^ h之间做出选择(b)和h(a)^ h(~ b)

Implementation:

实现:

1 template<typename T, typename U>
2 std::size_t operator()(const std::pair<T,U> &rhs) const
3 {
4     std::size_t a = stdext::hash_value<T>(rhs.first);
5     return rhs.first == rhs.second ? a :
6            a ^ (rhs.first < rhs.second ? stdext::hash_value<U>(rhs.second)
7                                        : stdext::hash_value<U>(~rhs.second));
8 }

Seriously though, I'd recommend following Mark B's advice and using boost::hash_combine - less branching is likely to improve speed.

尽管如此,我还是建议遵循Mark B的建议并使用boost::hash_combine - less分支可能提高速度。

#1


5  

Generally speaking, hashing containers always have to handle this case (hash collisions). There are a couple methods they can use like chaining and probing, any of which will potentially hurt performance.

一般来说,哈希容器总是必须处理这种情况(散列冲突)。有几种方法可以使用,如链接和探测,任何一种都有可能损害性能。

Instead, I would suggest using boost::hash_combine to combine the hashes in such a way they swapping first and second doesn't generate the same hash.

相反,我建议使用boost:: hash_merge以这样的方式组合散列,这样它们就不会产生相同的散列。

#2


2  

Here's an implementation of hash_combine based on the docs from the current version of boost: (http://www.boost.org/doc/libs/1_53_0/doc/html/hash/reference.html#boost.hash_combine)

这里有一个基于当前boost版本的文档的hash_merge的实现:(http://www.boost.org/doc/libs/1_53_0/doc/html/hash/reference.html#boost.hash_combine)

template<typename T> void hash_combine(size_t & seed, T const& v) {
  seed ^= stdext::hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

You'd use it like this:

你可以这样使用:

template<typename T, typename U>
std::size_t operator()(const std::pair<T,U> &rhs) const   {
  size_t retval = stdext::hash_value<T>(rhs.first);
  hash_combine(retval, rhs.second);
  return retval;
}

I can't actually vouch for the logic behind this function, but it looks more solid than most of the options mentioned here.

我不能担保这个函数背后的逻辑,但它看起来比这里提到的大多数选项都要可靠。

#3


1  

Assuming that stdext::hash_value provides a good distribution of hash values for each of first and second, what you've done is fine if you don't expect a disproportionately high incidence of mirror image pairs (e.g. (1,2) and (2,1)). If your data set is such that you do expect a lot of those, you might consider tweaking the hash function to force them to be different. For example, inverting the first hash value:

假设stdext::hash_value提供了一个良好的散列值分布,如果您不期望异常高的镜像对(例如,1,2)和(2,1),那么您所做的就很好了。如果您的数据集是这样的,您确实期望有很多这样的数据集,那么您可以考虑调整散列函数来强制它们不同。例如,反转第一个哈希值:

return ~stdext::hash_value<T>(rhs.first) ^ stdext::hash_value<T>(rhs.second);

I mention this only because you expressed a concern about mirror image pairs. If your input is close to random in that regard, then ^ should be fine. Remember, the goal is to minimize collisions, not avoid them entirely.

我之所以提到这一点,是因为你表达了对镜像对的关注。如果你的输入是随机的在这方面,那么^应该没事的。记住,目标是最小化冲突,而不是完全避免冲突。

#4


0  

As others have noted, hash functions affect only the efficiency of hash tables, not correctness. Your function is only bad if mirror image pairs are frequent. Since in some applications this might be a problem, it would be reasonable to swap upper and lower half-words of one hash value and then xor. Many other schemes are possible, but this one is very fast and simple.

正如其他人所指出的,哈希函数只影响哈希表的效率,而不影响正确性。如果镜像对频繁,你的函数就不好了。由于在某些应用程序中,这可能是一个问题,所以将一个哈希值的上下半字替换为xor是合理的。许多其他的方案都是可行的,但是这个方法非常简单。

template<typename T, typename U>
std::size_t operator()(const std::pair<T,U> &rhs) const
{
    const int h = sizeof(size_t) * 8 / 2;
    size_t a = stdext::hash_value<T>(rhs.first);
    return ((a << h) | (a >> h)) ^ stdext::hash_value<U>(rhs.second);
}

#5


0  

Just for the fun of it, here's another approach that's simple and directly addresses the problem cases, specifically:

只是为了好玩,这是另一种简单直接处理问题的方法,具体来说:

  • if both first and second are the same, it returns their common hash value
  • 如果第一个和第二个是相同的,则返回它们的普通散列值。
  • otherwise, it XORs the values but:
    • to prevent h(a, b) colliding with h(b, a), it uses a < b to choose between h(a)^h(b) and h(a)^h(~b)
    • 防止h(a,b)碰撞与h(b),它使用一个< b h(a)^ h之间做出选择(b)和h(a)^ h(~ b)
  • 否则,它值xor但是:防止h(a,b)碰撞与h(b),它使用一个< b h(a)^ h之间做出选择(b)和h(a)^ h(~ b)

Implementation:

实现:

1 template<typename T, typename U>
2 std::size_t operator()(const std::pair<T,U> &rhs) const
3 {
4     std::size_t a = stdext::hash_value<T>(rhs.first);
5     return rhs.first == rhs.second ? a :
6            a ^ (rhs.first < rhs.second ? stdext::hash_value<U>(rhs.second)
7                                        : stdext::hash_value<U>(~rhs.second));
8 }

Seriously though, I'd recommend following Mark B's advice and using boost::hash_combine - less branching is likely to improve speed.

尽管如此,我还是建议遵循Mark B的建议并使用boost::hash_combine - less分支可能提高速度。