How much memory will a list with one million elements take up in Python?

There are more than a million subreddits on Reddit, according to redditmetrics.com.

I wrote a script that repeatedly queries this Reddit API endpoint until all the subreddits are stored in an array, all_subs:

all_subs = []
for sub in <repeated request here>:
    all_subs.append({"name": display_name, "subscribers": subscriber_count})

The script has been running for close to ten hours, and it's about halfway done (it gets rate-limited every three or four requests). When it's finished, I expect an array like this:

[
    { "name": "AskReddit", "subscribers": 16751677 },
    { "name": "news", "subscribers": 13860169 },
    { "name": "politics", "subscribers": 3350326 },
    ... # plus one million more entries
]

Approximately how much space in memory will this list take up?

1 Answer

#1

This depends on your Python version and your system, but I can help you estimate roughly how much memory it will take. First things first: sys.getsizeof only returns the memory used by the object representing the container, not by the elements in the container. As the documentation for sys.getsizeof puts it:

Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.

If given, default will be returned if the object does not provide means to retrieve the size. Otherwise a TypeError will be raised.

getsizeof() calls the object’s __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.

See recursive sizeof recipe for an example of using getsizeof() recursively to find the size of containers and all their contents.

So, I've loaded up that recipe in an interactive interpreter session:
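
The session below calls a total_size helper from that recipe, which is not reproduced here. A minimal sketch of the idea, assuming only built-in containers need to be handled (the actual recipe is more general, and this is not its exact code), might look like this:

```python
import sys


def total_size(obj, seen=None):
    """Approximate deep size of obj in bytes, counting each object once."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        # Already counted: shared objects contribute their size only once.
        return 0
    seen.add(id(obj))

    size = sys.getsizeof(obj)  # the object itself, without what it refers to
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += total_size(key, seen) + total_size(value, seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += total_size(item, seen)
    return size


print(total_size([{"name": "AskReddit", "subscribers": 16751677}]))
```

Because objects are tracked by identity, a string or integer that is referenced from many entries is only counted the first time it is seen.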

A CPython list is actually a heterogeneous, resizable array list. The underlying array only contains pointers to PyObjects, and a pointer takes up one machine word of memory. On a 64-bit system that is 64 bits, or 8 bytes. So, just for the container, a list of 1,000,000 elements will take up roughly 8 million bytes, or 8 megabytes. Building a list with 1,000,000 entries bears that out:

In [5]: x = []

In [6]: for i in range(1000000):
   ...:     x.append([])
   ...:

In [7]: import sys

In [8]: sys.getsizeof(x)
Out[8]: 8697464

The extra memory is accounted for by the overhead of a Python object and by the spare capacity that the underlying array leaves at the end to allow for efficient .append operations.
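
To see both effects on your own machine, here is a small sketch, assuming a CPython interpreter. It compares a list allocated in one step with one grown by repeated appends, against the bare cost of one pointer per element (the variable names are only illustrative):

```python
import struct
import sys

n = 1_000_000
pointer_bytes = struct.calcsize("P")  # size of a C pointer on this platform

exact = [None] * n   # allocated in one step
grown = []
for _ in range(n):   # grown one append at a time, like the session above
    grown.append(None)

print("pointers only:  ", n * pointer_bytes)
print("allocated once: ", sys.getsizeof(exact))
print("grown by append:", sys.getsizeof(grown))
```

The gap between the first two numbers is the list object's fixed overhead; any further gap in the third is spare capacity left by appending.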

Now, a dictionary is rather heavyweight in Python. Just the container:

In [10]: sys.getsizeof({})
Out[10]: 288

So a lower bound on the size of 1 million dicts is 288,000,000 bytes. Adding the 8 bytes per list slot gives a rough lower bound for the whole structure:

In [12]: 1000000*288 + 1000000*8
Out[12]: 296000000

In [13]: 296000000 * 1e-9 # gigabytes
Out[13]: 0.29600000000000004
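
The 288-byte figure comes from the interpreter used in this session, and dict sizes differ between CPython versions. On some newer versions an empty dict also reports a smaller size than one holding two keys, so this sketch measures a two-key dict instead. It redoes the same lower-bound arithmetic with locally measured values; the sample entry is taken from the question:

```python
import struct
import sys

n = 1_000_000
pointer_bytes = struct.calcsize("P")  # one list slot per entry

# Container size of a dict shaped like one entry (keys and values not included).
sample = {"name": "AskReddit", "subscribers": 16751677}
dict_bytes = sys.getsizeof(sample)

lower_bound = n * (dict_bytes + pointer_bytes)
print(f"{lower_bound} bytes, about {lower_bound * 1e-9:.3f} GB")
```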

So you can expect about 0.3 gigabytes of memory. Using the recipe and a more realistic dict:

In [16]: x = []
    ...: for i in range(1000000):
    ...:     x.append(dict(name="my name is what", subscribers=23456644))
    ...:

In [17]: total_size(x)
Out[17]: 296697669
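
One caveat when reading this number: the loop reuses the same string and integer literal for every entry, so a recursive size that counts each object once adds them only a single time. Real data has a distinct name and subscriber count per entry. A rough sketch of that extra per-entry cost, using made-up names and counts and a smaller n so it runs quickly:

```python
import sys

n = 10_000  # far fewer than one million, just to estimate a per-entry average

# Hypothetical distinct values standing in for real subreddit data.
names = [f"subreddit_{i}" for i in range(n)]
counts = [10_000_000 + i for i in range(n)]

# Bytes held by the values themselves, on top of the dict and list containers.
payload_bytes = sum(sys.getsizeof(s) for s in names) + sum(sys.getsizeof(c) for c in counts)
per_entry = payload_bytes / n
print(f"extra bytes per entry for distinct values: {per_entry:.0f}")
```

Multiplying the per-entry figure by one million and adding it to the container estimate gives a more cautious total for real data.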

So, about 0.3 GB. That's not a lot on a modern system, but if you want to save space, you could use a tuple or, even better, a namedtuple:

In [24]: from collections import namedtuple

In [25]: Record = namedtuple('Record', "name subscribers")

In [26]: x = []
    ...: for i in range(1000000):
    ...:     x.append(Record(name="my name is what", subscribers=23456644))
    ...:

In [27]: total_size(x)
Out[27]: 72697556

Or, in gigabytes:

In [29]: total_size(x)*1e-9
Out[29]: 0.07269755600000001
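
The saving comes from the per-record container: a two-field tuple is much smaller than a two-key dict. A quick sketch comparing a single record in both forms, with field values borrowed from the question's example (exact sizes depend on the Python version):

```python
import sys
from collections import namedtuple

Record = namedtuple("Record", "name subscribers")

as_dict = {"name": "AskReddit", "subscribers": 16751677}
as_record = Record(name="AskReddit", subscribers=16751677)

dict_bytes = sys.getsizeof(as_dict)      # container only
record_bytes = sys.getsizeof(as_record)  # container only
print(f"dict: {dict_bytes} bytes, namedtuple: {record_bytes} bytes")

# A record can still be turned back into a mapping when one is needed.
print(as_record._asdict())
```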

namedtuple works just like a tuple, but you can also access the fields by name:

In [30]: r = x[0]

In [31]: r.name
Out[31]: 'my name is what'

In [32]: r.subscribers
Out[32]: 23456644
