What is the default hash function used in C++ std::unordered_map?

Time: 2021-11-09 18:51:53

I am using

unordered_map<string, int>

and

unordered_map<int, int>

What hash function is used in each case, and what is the chance of a collision in each case? I will be inserting unique strings and unique ints as keys, respectively.

I am interested in the hash algorithm used for string and int keys, and in their collision statistics.

2 answers

#1


86  

The function object std::hash<> is used.
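As a minimal sketch of what that means in practice, the snippet below calls the default hasher directly. The helper names hash_string_key and hash_int_key are hypothetical and exist only for illustration; the returned values are implementation-specific.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <type_traits>
#include <unordered_map>

// The default hasher of unordered_map<Key, T> is std::hash<Key>.
static_assert(
    std::is_same<std::unordered_map<std::string, int>::hasher,
                 std::hash<std::string>>::value,
    "default hasher for string keys is std::hash<std::string>");
static_assert(
    std::is_same<std::unordered_map<int, int>::hasher,
                 std::hash<int>>::value,
    "default hasher for int keys is std::hash<int>");

// Hypothetical helper: hash a string key the way the map would.
std::size_t hash_string_key(const std::string& key) {
    return std::unordered_map<std::string, int>::hasher{}(key);
}

// Hypothetical helper: hash an int key the way the map would.
std::size_t hash_int_key(int key) {
    return std::hash<int>{}(key);
}
```

Calling hash_string_key("abc") gives the same value the container computes internally for that key, on whichever standard library the program is built against.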

Standard specializations exist for all built-in types, and some other standard library types such as std::string and std::thread. See the link for the full list.

For other types to be used in a std::unordered_map, you will have to specialize std::hash<> or create your own function object.
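Both options can be sketched as follows. Point, PointHash, and the mixing constant are assumptions made for illustration and come from neither the question nor GCC; the combiner is only one common way to merge two hash values.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>

// Hypothetical key type, used only for illustration.
struct Point {
    int x;
    int y;
    bool operator==(const Point& other) const {
        return x == other.x && y == other.y;
    }
};

// Option 1: specialize std::hash for the user-defined type.
namespace std {
template <>
struct hash<Point> {
    size_t operator()(const Point& p) const noexcept {
        const size_t h1 = hash<int>{}(p.x);
        const size_t h2 = hash<int>{}(p.y);
        // Illustrative combiner; any reasonable mixing step would do.
        return h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
    }
};
}  // namespace std

// Option 2: a standalone function object, passed as the third
// template argument of the container.
struct PointHash {
    std::size_t operator()(const Point& p) const noexcept {
        // Multiply one coordinate by a small prime before adding the other.
        return std::hash<int>{}(p.x) * 31u + std::hash<int>{}(p.y);
    }
};

using PointMapDefault = std::unordered_map<Point, int>;            // uses Option 1
using PointMapCustom = std::unordered_map<Point, int, PointHash>;  // uses Option 2
```

In both cases the key type also needs operator== (or a custom equality predicate), because the container compares keys that land in the same bucket.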

The chance of a collision is completely implementation-dependent. However, integers are confined to a fixed range, while strings can in theory be arbitrarily long, so I'd say collisions are much more likely with strings.
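Since the numbers depend on the implementation, one way to get collision statistics is to measure them for your own keys. The sketch below is an assumption about how one might do that, not something from the original answer: distinct_hash_values and colliding_elements are hypothetical helper names. It separates two things that are easy to conflate: two keys producing the same hash value, and two keys landing in the same bucket (which can happen even when their hash values differ).

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Number of distinct hash values produced for the given keys.
// If this is smaller than keys.size() for unique keys, some hash
// values collided.
template <typename Key>
std::size_t distinct_hash_values(const std::vector<Key>& keys) {
    std::unordered_set<std::size_t> seen;
    for (const Key& k : keys) {
        seen.insert(std::hash<Key>{}(k));
    }
    return seen.size();
}

// Number of elements that share a bucket with an element counted
// earlier, i.e. (elements) minus (non-empty buckets).
template <typename Map>
std::size_t colliding_elements(const Map& m) {
    std::size_t count = 0;
    for (std::size_t b = 0; b < m.bucket_count(); ++b) {
        const std::size_t n = m.bucket_size(b);
        if (n > 1) {
            count += n - 1;
        }
    }
    return count;
}
```

Running both helpers on a representative sample of your real keys gives figures for your compiler and library version; they should not be expected to carry over to another implementation.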

As for the implementation in GCC, the specializations for built-in types simply return the value itself, converted to size_t. Here's how they are defined in bits/functional_hash.h:

  /// Partial specializations for pointer types.
  template<typename _Tp>
    struct hash<_Tp*> : public __hash_base<size_t, _Tp*>
    {
      size_t
      operator()(_Tp* __p) const noexcept
      { return reinterpret_cast<size_t>(__p); }
    };

  // Explicit specializations for integer types.
#define _Cxx_hashtable_define_trivial_hash(_Tp)     \
  template<>                        \
    struct hash<_Tp> : public __hash_base<size_t, _Tp>  \
    {                                                   \
      size_t                                            \
      operator()(_Tp __val) const noexcept              \
      { return static_cast<size_t>(__val); }            \
    };

  /// Explicit specialization for bool.
  _Cxx_hashtable_define_trivial_hash(bool)

  /// Explicit specialization for char.
  _Cxx_hashtable_define_trivial_hash(char)

  /// ...

The specialization for std::string is defined as:

#ifndef _GLIBCXX_COMPATIBILITY_CXX0X
  /// std::hash specialization for string.
  template<>
    struct hash<string>
    : public __hash_base<size_t, string>
    {
      size_t
      operator()(const string& __s) const noexcept
      { return std::_Hash_impl::hash(__s.data(), __s.length()); }
    };

Some further searching leads us to:

struct _Hash_impl
{
  static size_t
  hash(const void* __ptr, size_t __clength,
       size_t __seed = static_cast<size_t>(0xc70f6907UL))
  { return _Hash_bytes(__ptr, __clength, __seed); }
  ...
};
...
// Hash function implementation for the nontrivial specialization.
// All of them are based on a primitive that hashes a pointer to a
// byte array. The actual hash algorithm is not guaranteed to stay
// the same from release to release -- it may be updated or tuned to
// improve hash quality or speed.
size_t
_Hash_bytes(const void* __ptr, size_t __len, size_t __seed);

_Hash_bytes is an external function from libstdc++. A bit more searching led me to this file, which states:

// This file defines Hash_bytes, a primitive used for defining hash
// functions. Based on public domain MurmurHashUnaligned2, by Austin
// Appleby.  http://murmurhash.googlepages.com/

So the default hashing algorithm GCC uses for strings is MurmurHashUnaligned2.

#2


2  

Although the hashing algorithm is implementation-dependent, I'll present the one GCC uses for C++11. @Avidan Borisov astutely discovered that the GCC hashing algorithm used for strings is "MurmurHashUnaligned2," by Austin Appleby. I did some searching and found a mirrored copy of GCC on GitHub. Therefore:

The GCC C++11 hashing functions used for unordered_map (a hash table template) and unordered_set (a hash set template) appear to be as follows.

  • Thanks to Avidan Borisov for his background research on which hash functions GCC uses for C++11; he found that GCC uses an implementation of "MurmurHashUnaligned2", by Austin Appleby (http://murmurhash.googlepages.com/).
  • In the file "gcc/libstdc++-v3/libsupc++/hash_bytes.cc", here (https://github.com/gcc-mirror/gcc/blob/master/libstdc++-v3/libsupc++/hash_bytes.cc), I found the implementations. Here's the one for a 32-bit size_t, for example (pulled 11 Aug 2017):

Code:

// Implementation of Murmur hash for 32-bit size_t.
size_t _Hash_bytes(const void* ptr, size_t len, size_t seed)
{
  const size_t m = 0x5bd1e995;
  size_t hash = seed ^ len;
  const char* buf = static_cast<const char*>(ptr);

  // Mix 4 bytes at a time into the hash.
  while (len >= 4)
  {
    size_t k = unaligned_load(buf);
    k *= m;
    k ^= k >> 24;
    k *= m;
    hash *= m;
    hash ^= k;
    buf += 4;
    len -= 4;
  }

  // Handle the last few bytes of the input array.
  switch (len)
  {
    case 3:
      hash ^= static_cast<unsigned char>(buf[2]) << 16;
      [[gnu::fallthrough]];
    case 2:
      hash ^= static_cast<unsigned char>(buf[1]) << 8;
      [[gnu::fallthrough]];
    case 1:
      hash ^= static_cast<unsigned char>(buf[0]);
      hash *= m;
  };

  // Do a few final mixes of the hash.
  hash ^= hash >> 13;
  hash *= m;
  hash ^= hash >> 15;
  return hash;
}
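The excerpt above calls unaligned_load, which is not shown, so it does not compile on its own. Below is a standalone sketch of the same mixing steps with fixed 32-bit arithmetic. The names load_u32, murmur2_32, and murmur2_32_string are hypothetical, the memcpy-based load is an assumed stand-in for unaligned_load, and the default seed is the 0xc70f6907 value from the _Hash_impl excerpt in the other answer. I have not verified its output against libstdc++, and it would only be expected to match on targets where size_t is 32 bits wide.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Assumed stand-in for unaligned_load: read 4 bytes in native byte
// order without requiring the pointer to be aligned.
inline std::uint32_t load_u32(const char* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Same steps as the 32-bit excerpt, using std::uint32_t so the
// arithmetic wraps at 32 bits regardless of the width of size_t.
std::uint32_t murmur2_32(const void* ptr, std::size_t len, std::uint32_t seed) {
    const std::uint32_t m = 0x5bd1e995u;
    std::uint32_t hash = seed ^ static_cast<std::uint32_t>(len);
    const char* buf = static_cast<const char*>(ptr);

    // Mix 4 bytes at a time into the hash.
    while (len >= 4) {
        std::uint32_t k = load_u32(buf);
        k *= m;
        k ^= k >> 24;
        k *= m;
        hash *= m;
        hash ^= k;
        buf += 4;
        len -= 4;
    }

    // Handle the last 0 to 3 bytes; equivalent to the fall-through
    // switch in the excerpt.
    if (len == 3) {
        hash ^= static_cast<std::uint32_t>(static_cast<unsigned char>(buf[2])) << 16;
    }
    if (len >= 2) {
        hash ^= static_cast<std::uint32_t>(static_cast<unsigned char>(buf[1])) << 8;
    }
    if (len >= 1) {
        hash ^= static_cast<unsigned char>(buf[0]);
        hash *= m;
    }

    // Final mixing steps.
    hash ^= hash >> 13;
    hash *= m;
    hash ^= hash >> 15;
    return hash;
}

// Convenience wrapper using the default seed from the _Hash_impl excerpt.
std::uint32_t murmur2_32_string(const std::string& s) {
    return murmur2_32(s.data(), s.length(), 0xc70f6907u);
}
```

Because the 4-byte load uses native byte order, the result for inputs of four or more bytes differs between little-endian and big-endian machines.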

For additional hashing functions, including djb2, and the 2 versions of the K&R hashing functions (one apparently terrible, one pretty good), see my other answer here: https://*.com/a/45641002/4561887.
