Hash for Python中的lambda函数

时间:2020-11-30 18:03:17

I'm trying to get the hash of a lambda function. Why do I get two values (8746164008739 and -9223363290690767077)? Why is the hash from the lambda function not always one value?

我正在尝试获取lambda函数的哈希值。为什么我得到两个值(8746164008739和-9223363290690767077)?为什么lambda函数的哈希值不总是一个值?

>>> fn = lambda: 1
>>> hash(fn)
-9223363290690767077
>>> fn = lambda: 1
>>> hash(fn)
8746164008739
>>> fn = lambda: 1
>>> hash(fn)
-9223363290690767077
>>> fn = lambda: 1
>>> hash(fn)
8746164008739
>>> fn = lambda: 1
>>> hash(fn)
-9223363290690767077

4 个解决方案

#1


37  

Two objects are not guaranteed to hash to the same value unless they compare equal [1].

除非它们比较相等[1],否则不保证两个对象散列到相同的值。

Python functions (including lambdas) don't compare equal even if they have identical code [2]. For example:

Python函数(包括lambdas)即使它们具有相同的代码也不会相等[2]。例如:

>>> (lambda: 1) == (lambda: 1)
False

Implementation-wise, this behaviour is due to the fact that function objects don't provide their own equality operator. Instead, they inherit the default one that uses the object's identity, i.e. its address. From the documentation:

实现方面,这种行为是由于函数对象不提供自己的相等运算符。相反,它们继承了使用对象标识的默认标识,即其地址。从文档:

If no __cmp__(), __eq__() or __ne__() operation is defined, class instances are compared by object identity (“address”).

如果未定义__cmp __(),__ eq __()或__ne __()操作,则通过对象标识(“address”)比较类实例。

Here is what happens in your particular example:

以下是您的特定示例中发生的情况:

fn = lambda: 1  # New function is allocated at address A and stored in fn.
fn = lambda: 1  # New function is allocated at address B and stored in fn.
                # The function at address A is garbage collected.
fn = lambda: 1  # New function is allocated at address A and stored in fn.
                # The function at address B is garbage collected.
fn = lambda: 1  # New function is allocated at address B and stored in fn.
                # The function at address A is garbage collected.
...

Since address A is always hashed to one value, and address B to another, you are seeing hash(fn) alternate between the two values. This alternating behaviour is, however, an implementation artefact and could change one day if, for example, the garbage collector were made to behave slightly differently.

由于地址A始终被散列为一个值,而地址B被散列到另一个值,因此您看到散列(fn)在两个值之间交替。然而,这种交替行为是一种实现假象,并且如果例如使垃圾收集器的行为稍微不同,则可以改变一天。

The following insightful note has been contributed by @ruakh:

以下见解是由@ruakh提供的:

It is worth noting that it's not possible to write a general process for determining if two functions are equivalent. (This is a consequence of the undecidability of the halting problem.) Furthermore, two Python functions can behave differently even if their code is identical (since they may be closures referring to distinct-but-identically-named variables). So it makes sense that Python functions don't overload the equality operator: there's no way to implement anything better than the default object-identity comparison.

值得注意的是,不可能编写一个通用的过程来确定两个函数是否相等。 (这是停止问题不可判定性的结果。)此外,两个Python函数的行为可能不同,即使它们的代码相同(因为它们可能是闭包引用不同但相同命名的变量)。因此,Python函数不会重载相等运算符是有道理的:没有办法比默认的对象 - 身份比较更好地实现任何东西。

[1] The converse is generally not true: two objects that compare unequal can have the same hash value. This is called a hash collision.

[1]反之通常不正确:比较不等的两个对象可以具有相同的哈希值。这称为哈希冲突。

[2] Calling your lambdas and then hashing the result would of course always give the same value since hash(1) is always the same within one program:

[2]调用lambdas然后散列结果当然总是给出相同的值,因为hash(1)在一个程序中始终是相同的:

>>> (lambda: 1)() == (lambda: 1)()
True

#2


10  

The hash of a lambda function object is based on its memory address (in CPython this is what the id function returns). This means that any two function objects will have different hashes (assuming there are no hash collisions), even if the functions contain the same code.

lambda函数对象的散列基于其内存地址(在CPython中,这是id函数返回的内容)。这意味着任何两个函数对象都将具有不同的哈希值(假设没有哈希冲突),即使这些函数包含相同的代码。

To explain what's happening in the question, first note that writing fn = lambda: 1 creates a new function object in memory and binds the name fn to it. This new function will therefore have a different hash value to any existing functions.

为了解释问题中发生的事情,首先要注意写入fn = lambda:1会在内存中创建一个新的函数对象,并将名称fn绑定到它。因此,此新函数将具有与任何现有函数不同的散列值。

Repeating fn = lambda: 1, you get alternating values for the hashes because when fn is bound to the newly created function object, the function that fn previously pointed to is garbage collected by Python. This is because there are no longer any references to it (since the name fn now points to a different object).

重复fn = lambda:1,你得到散列的交替值,因为当fn绑定到新创建的函数对象时,fn先前指向的函数被Python垃圾收集。这是因为不再有任何引用(因为名称fn现在指向不同的对象)。

The Python interpreter then reuses this old memory address for the next new function object created by writing fn = lambda: 1.

然后,Python解释器将此旧内存地址重用于通过编写fn = lambda:1创建的下一个新函数对象。

This behaviour might vary between different systems and Python implementations.

不同系统和Python实现之间的行为可能不同。

#3


5  

Each time you do fn = lambda: 1 a new function object is created, and the old object bound to the name fn is marked for deletion. But Python doesn't simply deallocate the object, passing its memory back to the OS. To minimize system calls for memory allocation and to minimize memory fragmentation Python tries to recycle memory when it can. And so when you create fn = lambda: 1 a third time the interpreter notices it's got a block of RAM handy that's big enough for the new function object, and so it uses that block. And thus your 3rd fn ends up in that block of RAM and hence has the same id as the first fn, since the id of CPython objects is their memory address.

每次执行fn = lambda:1时,都会创建一个新的函数对象,并将绑定到名称fn的旧对象标记为删除。但Python并不是简单地释放对象,而是将其内存传回操作系统。为了最小化系统调用内存分配并最小化内存碎片,Python尝试尽可能地回收内存。因此,当你第三次创建fn = lambda:1时,解释器注意到它有一块RAM方便,对于新的函数对象来说足够大,所以它使用了那个块。因此,你的第3个fn最终在那个RAM块中,因此与第一个fn具有相同的id,因为CPython对象的id是它们的内存地址。

(As others have mentioned the hash of any object type that doesn't provide a specific implementation of __hash__ is based on its id in CPython. And if a class does not define a __cmp__ or __eq__ method it should not define a __hash__ operation either).

(正如其他人提到的那样,没有提供__hash__特定实现的任何对象类型的散列都是基于它在CPython中的id。如果一个类没有定义__cmp__或__eq__方法,它也不应该定义一个__hash__操作) 。

#4


4  

Deciding whether two functions are equal is impossible, as it is a superset of the halting problem.

决定两个函数是否相等是不可能的,因为它是停止问题的超集。

In an ideal world, comparing (and therefore hashing) functions would result in a type error. Python apparently doesn't like this, and instead chooses to use the identity of the functions to compare (and therefore hash) them.

在理想情况下,比较(因此散列)函数会导致类型错误。 Python显然不喜欢这样,而是选择使用函数的标识来比较(并因此散列)它们。

#1


37  

Two objects are not guaranteed to hash to the same value unless they compare equal [1].

除非它们比较相等[1],否则不保证两个对象散列到相同的值。

Python functions (including lambdas) don't compare equal even if they have identical code [2]. For example:

Python函数(包括lambdas)即使它们具有相同的代码也不会相等[2]。例如:

>>> (lambda: 1) == (lambda: 1)
False

Implementation-wise, this behaviour is due to the fact that function objects don't provide their own equality operator. Instead, they inherit the default one that uses the object's identity, i.e. its address. From the documentation:

实现方面,这种行为是由于函数对象不提供自己的相等运算符。相反,它们继承了使用对象标识的默认标识,即其地址。从文档:

If no __cmp__(), __eq__() or __ne__() operation is defined, class instances are compared by object identity (“address”).

如果未定义__cmp __(),__ eq __()或__ne __()操作,则通过对象标识(“address”)比较类实例。

Here is what happens in your particular example:

以下是您的特定示例中发生的情况:

fn = lambda: 1  # New function is allocated at address A and stored in fn.
fn = lambda: 1  # New function is allocated at address B and stored in fn.
                # The function at address A is garbage collected.
fn = lambda: 1  # New function is allocated at address A and stored in fn.
                # The function at address B is garbage collected.
fn = lambda: 1  # New function is allocated at address B and stored in fn.
                # The function at address A is garbage collected.
...

Since address A is always hashed to one value, and address B to another, you are seeing hash(fn) alternate between the two values. This alternating behaviour is, however, an implementation artefact and could change one day if, for example, the garbage collector were made to behave slightly differently.

由于地址A始终被散列为一个值,而地址B被散列到另一个值,因此您看到散列(fn)在两个值之间交替。然而,这种交替行为是一种实现假象,并且如果例如使垃圾收集器的行为稍微不同,则可以改变一天。

The following insightful note has been contributed by @ruakh:

以下见解是由@ruakh提供的:

It is worth noting that it's not possible to write a general process for determining if two functions are equivalent. (This is a consequence of the undecidability of the halting problem.) Furthermore, two Python functions can behave differently even if their code is identical (since they may be closures referring to distinct-but-identically-named variables). So it makes sense that Python functions don't overload the equality operator: there's no way to implement anything better than the default object-identity comparison.

值得注意的是,不可能编写一个通用的过程来确定两个函数是否相等。 (这是停止问题不可判定性的结果。)此外,两个Python函数的行为可能不同,即使它们的代码相同(因为它们可能是闭包引用不同但相同命名的变量)。因此,Python函数不会重载相等运算符是有道理的:没有办法比默认的对象 - 身份比较更好地实现任何东西。

[1] The converse is generally not true: two objects that compare unequal can have the same hash value. This is called a hash collision.

[1]反之通常不正确:比较不等的两个对象可以具有相同的哈希值。这称为哈希冲突。

[2] Calling your lambdas and then hashing the result would of course always give the same value since hash(1) is always the same within one program:

[2]调用lambdas然后散列结果当然总是给出相同的值,因为hash(1)在一个程序中始终是相同的:

>>> (lambda: 1)() == (lambda: 1)()
True

#2


10  

The hash of a lambda function object is based on its memory address (in CPython this is what the id function returns). This means that any two function objects will have different hashes (assuming there are no hash collisions), even if the functions contain the same code.

lambda函数对象的散列基于其内存地址(在CPython中,这是id函数返回的内容)。这意味着任何两个函数对象都将具有不同的哈希值(假设没有哈希冲突),即使这些函数包含相同的代码。

To explain what's happening in the question, first note that writing fn = lambda: 1 creates a new function object in memory and binds the name fn to it. This new function will therefore have a different hash value to any existing functions.

为了解释问题中发生的事情,首先要注意写入fn = lambda:1会在内存中创建一个新的函数对象,并将名称fn绑定到它。因此,此新函数将具有与任何现有函数不同的散列值。

Repeating fn = lambda: 1, you get alternating values for the hashes because when fn is bound to the newly created function object, the function that fn previously pointed to is garbage collected by Python. This is because there are no longer any references to it (since the name fn now points to a different object).

重复fn = lambda:1,你得到散列的交替值,因为当fn绑定到新创建的函数对象时,fn先前指向的函数被Python垃圾收集。这是因为不再有任何引用(因为名称fn现在指向不同的对象)。

The Python interpreter then reuses this old memory address for the next new function object created by writing fn = lambda: 1.

然后,Python解释器将此旧内存地址重用于通过编写fn = lambda:1创建的下一个新函数对象。

This behaviour might vary between different systems and Python implementations.

不同系统和Python实现之间的行为可能不同。

#3


5  

Each time you do fn = lambda: 1 a new function object is created, and the old object bound to the name fn is marked for deletion. But Python doesn't simply deallocate the object, passing its memory back to the OS. To minimize system calls for memory allocation and to minimize memory fragmentation Python tries to recycle memory when it can. And so when you create fn = lambda: 1 a third time the interpreter notices it's got a block of RAM handy that's big enough for the new function object, and so it uses that block. And thus your 3rd fn ends up in that block of RAM and hence has the same id as the first fn, since the id of CPython objects is their memory address.

每次执行fn = lambda:1时,都会创建一个新的函数对象,并将绑定到名称fn的旧对象标记为删除。但Python并不是简单地释放对象,而是将其内存传回操作系统。为了最小化系统调用内存分配并最小化内存碎片,Python尝试尽可能地回收内存。因此,当你第三次创建fn = lambda:1时,解释器注意到它有一块RAM方便,对于新的函数对象来说足够大,所以它使用了那个块。因此,你的第3个fn最终在那个RAM块中,因此与第一个fn具有相同的id,因为CPython对象的id是它们的内存地址。

(As others have mentioned the hash of any object type that doesn't provide a specific implementation of __hash__ is based on its id in CPython. And if a class does not define a __cmp__ or __eq__ method it should not define a __hash__ operation either).

(正如其他人提到的那样,没有提供__hash__特定实现的任何对象类型的散列都是基于它在CPython中的id。如果一个类没有定义__cmp__或__eq__方法,它也不应该定义一个__hash__操作) 。

#4


4  

Deciding whether two functions are equal is impossible, as it is a superset of the halting problem.

决定两个函数是否相等是不可能的,因为它是停止问题的超集。

In an ideal world, comparing (and therefore hashing) functions would result in a type error. Python apparently doesn't like this, and instead chooses to use the identity of the functions to compare (and therefore hash) them.

在理想情况下,比较(因此散列)函数会导致类型错误。 Python显然不喜欢这样,而是选择使用函数的标识来比较(并因此散列)它们。