python sys.intern做什么,什么时候应该使用?

时间:2021-07-29 18:18:41

I came across this question about memory management of dictionaries, which mentions the intern function. What exactly does it do, and when would it be used?

我遇到了关于字典内存管理的问题,提到了实习功能。它究竟做了什么,何时使用?

To give an example:

举个例子:

If I have a set called seen, that contains tuples in the form (string1,string2), which I use to check for duplicates, would storing (intern(string1),intern(string2)) improve performance w.r.t. memory or speed?

如果我有一个名为see的集合,它包含形式(string1,string2)中的元组,我用它来检查重复项,将存储(intern(string1),intern(string2))提高性能w.r.t.记忆还是速度?

5 个解决方案

#1


47  

From the Python 3 documentation:

从Python 3文档:

sys.intern(string)

Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.

在“interned”字符串表中输入字符串并返回实习字符串 - 字符串本身或副本。实习字符串对于在字典查找中获得一点性能很有用 - 如果字典中的键被中断,并且查找键被中断,则可以通过指针比较而不是字符串比较来完成键比较(在散列之后)。通常,Python程序中使用的名称会自动实现,而用于保存模块,类或实例属性的字典具有实习键。

Interned strings are not immortal; you must keep a reference to the return value of intern() around to benefit from it.

实习字符串不是不朽的;你必须保持对intern()的返回值的引用才能从中受益。

Clarification:

澄清:

As the documentation suggests, the sys.intern function is intended to be used for performance optimization.

如文档所示,sys.intern函数旨在用于性能优化。

The sys.intern function maintains a table of interned strings. When you attempt to intern a string, the function looks it up in the table and:

sys.intern函数维护一个实习字符串表。当您尝试实习字符串时,该函数会在表格中查找并执行以下操作:

  1. If the string does not exists (hasn't been interned yet) the function saves it in the table and returns it from the interned strings table.

    如果字符串不存在(尚未实现),则该函数将其保存在表中并从实习字符串表中返回。

    >>> import sys
    >>> a = sys.intern('why do pangolins dream of quiche')
    >>> a
    'why do pangolins dream of quiche'
    

    In the above example, a holds the interned string. Even though it is not visible, the sys.intern function has saved the 'why do pangolins dream of quiche' string object in the interned strings table.

    在上面的例子中,a保存了实习字符串。尽管它不可​​见,但是sys.intern函数已经在interned strings表中保存了“为什么pangolins梦想的乳蛋饼”字符串对象。

  2. If the string exists (has been interned) the function returns it from the interned strings table.

    如果字符串存在(已被实习),则该函数将从实习字符串表中返回该字符串。

    >>> b = sys.intern('why do pangolins dream of quiche')
    >>> b
    'why do pangolins dream of quiche'
    

    Even though it is not immediately visible, because the string 'why do pangolins dream of quiche' has been interned before, b holds now the same string object as a.

    即使它不是立即可见的,因为字符串“为什么穿山甲梦想的乳蛋饼”之前被实习过,b现在拥有与a相同的字符串对象。

    >>> b is a
    True
    

    If we create the same string without using intern, we end up with two different string objects that have the same value.

    如果我们在不使用实习生的情况下创建相同的字符串,我们最终会得到两个具有相同值的不同字符串对象。

    >>> c = 'why do pangolins dream of quiche'
    >>> c is a
    False
    >>> c is b
    False
    

By using sys.intern you ensure that you never create two string objects that have the same value—when you request the creation of a second string object with the same value as an existing string object, you receive a reference to the pre-existing string object. This way, you are saving memory. Also, string objects comparison is now very efficient because it is carried out by comparing the memory addresses of the two string objects instead of their content.

通过使用sys.intern确保您永远不会创建具有相同值的两个字符串对象 - 当您请求创建与现有字符串对象具有相同值的第二个字符串对象时,您将收到对预先存在的字符串的引用目的。这样,您就可以节省内存。此外,字符串对象比较现在非常有效,因为它是通过比较两个字符串对象的内存地址而不是它们的内容来执行的。

#2


17  

Essentially intern looks up (or stores if not present) the string in a collection of interned strings, so all interned instances will share the same identity. You trade the one-time cost of looking up this string for faster comparisons (the compare can return True after just checking for identity, rather than having to compare each character), and reduced memory usage.

本质上实习生在实习字符串集合中查找(或存储,如果不存在)字符串,因此所有实例化实例将共享相同的标识。您可以交换查找此字符串的一次性成本以进行更快的比较(比较可以在检查身份后返回True,而不是必须比较每个字符),并减少内存使用量。

However, python will automatically intern strings that are small, or look like identifiers, so you may find you get no improvement because your strings are already being interned behind the scenes. For example:

但是,python会自动实习小的字符串,或看起来像标识符,所以你可能会发现你没有得到任何改进,因为你的字符串已经在幕后被实习。例如:

>>> a = 'abc'; b = 'abc'
>>> a is b
True

In the past, one disadvantage was that interned strings were permanent. Once interned, the string memory was never freed even after all references were dropped. I think this is no longer the case for more recent versions of python though.

在过去,一个缺点是实习字符串是永久性的。一旦被实现,即使在删除所有引用之后,字符串内存也从未被释放。我认为现在已经不再适用于更新版本的python了。

#3


11  

They weren't talking about keyword intern because there is no such thing in Python. They were talking about non-essential built-in function intern. Which in py3k has been moved to sys.intern. Docs have an exhaustive description.

他们不是在谈论关键字实习生,因为Python中没有这样的东西。他们谈论的是非必要的内置功能实习生。 py3k中已将其移至sys.intern。文档有详尽的描述。

#4


4  

It returns a canonical instance of the string.

它返回字符串的规范实例。

Therefore if you have many string instances that are equal you save memory, and in addition you can also compare canonicalized strings by identity instead of equality which is faster.

因此,如果您有许多相等的字符串实例,则可以节省内存,此外,您还可以按标识比较规范化字符串,而不是更快的相等性。

#5


-3  

This idea seems around us in several languages, including Python, Java etc.

这个想法似乎在我们身边有几种语言,包括Python,Java等。

String Interning

字符串实习

#1


47  

From the Python 3 documentation:

从Python 3文档:

sys.intern(string)

Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.

在“interned”字符串表中输入字符串并返回实习字符串 - 字符串本身或副本。实习字符串对于在字典查找中获得一点性能很有用 - 如果字典中的键被中断,并且查找键被中断,则可以通过指针比较而不是字符串比较来完成键比较(在散列之后)。通常,Python程序中使用的名称会自动实现,而用于保存模块,类或实例属性的字典具有实习键。

Interned strings are not immortal; you must keep a reference to the return value of intern() around to benefit from it.

实习字符串不是不朽的;你必须保持对intern()的返回值的引用才能从中受益。

Clarification:

澄清:

As the documentation suggests, the sys.intern function is intended to be used for performance optimization.

如文档所示,sys.intern函数旨在用于性能优化。

The sys.intern function maintains a table of interned strings. When you attempt to intern a string, the function looks it up in the table and:

sys.intern函数维护一个实习字符串表。当您尝试实习字符串时,该函数会在表格中查找并执行以下操作:

  1. If the string does not exists (hasn't been interned yet) the function saves it in the table and returns it from the interned strings table.

    如果字符串不存在(尚未实现),则该函数将其保存在表中并从实习字符串表中返回。

    >>> import sys
    >>> a = sys.intern('why do pangolins dream of quiche')
    >>> a
    'why do pangolins dream of quiche'
    

    In the above example, a holds the interned string. Even though it is not visible, the sys.intern function has saved the 'why do pangolins dream of quiche' string object in the interned strings table.

    在上面的例子中,a保存了实习字符串。尽管它不可​​见,但是sys.intern函数已经在interned strings表中保存了“为什么pangolins梦想的乳蛋饼”字符串对象。

  2. If the string exists (has been interned) the function returns it from the interned strings table.

    如果字符串存在(已被实习),则该函数将从实习字符串表中返回该字符串。

    >>> b = sys.intern('why do pangolins dream of quiche')
    >>> b
    'why do pangolins dream of quiche'
    

    Even though it is not immediately visible, because the string 'why do pangolins dream of quiche' has been interned before, b holds now the same string object as a.

    即使它不是立即可见的,因为字符串“为什么穿山甲梦想的乳蛋饼”之前被实习过,b现在拥有与a相同的字符串对象。

    >>> b is a
    True
    

    If we create the same string without using intern, we end up with two different string objects that have the same value.

    如果我们在不使用实习生的情况下创建相同的字符串,我们最终会得到两个具有相同值的不同字符串对象。

    >>> c = 'why do pangolins dream of quiche'
    >>> c is a
    False
    >>> c is b
    False
    

By using sys.intern you ensure that you never create two string objects that have the same value—when you request the creation of a second string object with the same value as an existing string object, you receive a reference to the pre-existing string object. This way, you are saving memory. Also, string objects comparison is now very efficient because it is carried out by comparing the memory addresses of the two string objects instead of their content.

通过使用sys.intern确保您永远不会创建具有相同值的两个字符串对象 - 当您请求创建与现有字符串对象具有相同值的第二个字符串对象时,您将收到对预先存在的字符串的引用目的。这样,您就可以节省内存。此外,字符串对象比较现在非常有效,因为它是通过比较两个字符串对象的内存地址而不是它们的内容来执行的。

#2


17  

Essentially intern looks up (or stores if not present) the string in a collection of interned strings, so all interned instances will share the same identity. You trade the one-time cost of looking up this string for faster comparisons (the compare can return True after just checking for identity, rather than having to compare each character), and reduced memory usage.

本质上实习生在实习字符串集合中查找(或存储,如果不存在)字符串,因此所有实例化实例将共享相同的标识。您可以交换查找此字符串的一次性成本以进行更快的比较(比较可以在检查身份后返回True,而不是必须比较每个字符),并减少内存使用量。

However, python will automatically intern strings that are small, or look like identifiers, so you may find you get no improvement because your strings are already being interned behind the scenes. For example:

但是,python会自动实习小的字符串,或看起来像标识符,所以你可能会发现你没有得到任何改进,因为你的字符串已经在幕后被实习。例如:

>>> a = 'abc'; b = 'abc'
>>> a is b
True

In the past, one disadvantage was that interned strings were permanent. Once interned, the string memory was never freed even after all references were dropped. I think this is no longer the case for more recent versions of python though.

在过去,一个缺点是实习字符串是永久性的。一旦被实现,即使在删除所有引用之后,字符串内存也从未被释放。我认为现在已经不再适用于更新版本的python了。

#3


11  

They weren't talking about keyword intern because there is no such thing in Python. They were talking about non-essential built-in function intern. Which in py3k has been moved to sys.intern. Docs have an exhaustive description.

他们不是在谈论关键字实习生,因为Python中没有这样的东西。他们谈论的是非必要的内置功能实习生。 py3k中已将其移至sys.intern。文档有详尽的描述。

#4


4  

It returns a canonical instance of the string.

它返回字符串的规范实例。

Therefore if you have many string instances that are equal you save memory, and in addition you can also compare canonicalized strings by identity instead of equality which is faster.

因此,如果您有许多相等的字符串实例,则可以节省内存,此外,您还可以按标识比较规范化字符串,而不是更快的相等性。

#5


-3  

This idea seems around us in several languages, including Python, Java etc.

这个想法似乎在我们身边有几种语言,包括Python,Java等。

String Interning

字符串实习