What's the fastest way to save/load a large list in Python 2.7?

What's the fastest way to save/load a large list in Python 2.7? I apologize if this has already been asked; I couldn't find an answer to this exact question when I searched.

More specifically, I'm testing out methods for simulating something, and I need to compare the result from each method I test out to an exact solution. I have a Python script that produces a list of values representing the exact solution, and I don't want to re-compute it every time I run a new simulation. Thus, I want to save it somewhere and just load the solution instead of re-computing it every time I want to see how good my simulation results are.

I also don't need the saved file to be human-readable. I just need to be able to load it in Python.
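
The compute-once, load-afterwards workflow described above could look roughly like the sketch below. The names `compute_exact_solution`, `load_or_compute`, and the cache file name are hypothetical placeholders, and pickle is used only as one possible storage format.

```python
import os
import pickle
import tempfile


def compute_exact_solution():
    # Hypothetical stand-in for the expensive exact computation.
    return [i * 0.5 for i in range(1000)]


def load_or_compute(path):
    """Return the cached list if `path` exists; otherwise compute and cache it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    solution = compute_exact_solution()
    with open(path, "wb") as f:
        pickle.dump(solution, f, protocol=pickle.HIGHEST_PROTOCOL)
    return solution


# A temporary directory keeps the sketch self-contained; a real script
# would use a fixed path next to the simulation code.
cache_path = os.path.join(tempfile.mkdtemp(), "exact_solution.pkl")
first = load_or_compute(cache_path)   # computes and writes the cache
second = load_or_compute(cache_path)  # reads the cache back
```

Only the first call pays for the computation; later runs just read the file.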

4 solutions

#1

Using np.load followed by tolist was significantly faster than the pickle and cPickle alternatives in my timings:

In [77]: outfile = open("test.pkl","w")   
In [78]: l = list(range(1000000))   

In [79]:  timeit np.save("test",l)
10 loops, best of 3: 122 ms per loop

In [80]:  timeit np.load("test.npy").tolist()
10 loops, best of 3: 20.9 ms per loop

In [81]: timeit pickle.load(outfile)
1 loops, best of 3: 1.86 s per loop

In [82]: outfile = open("test.pkl","r")

In [83]: timeit pickle.load(outfile)
1 loops, best of 3: 1.88 s per loop

In [84]: cPickle.dump(l,outfile)
   ....: 
1 loops, best of 3: 273 ms per loop

In [85]: outfile = open("test.pkl","r")

In [72]: %%timeit
cPickle.load(outfile)
   ....: 
1 loops, best of 3: 539 ms per loop

In Python 3, NumPy is far more efficient if you keep the data as a NumPy array:

In [24]: %%timeit                  
out = open("test.pkl","wb")
pickle.dump(l, out)
   ....: 
10 loops, best of 3: 27.3 ms per loop

In [25]: %%timeit
out = open("test.pkl","rb")
pickle.load(out)
   ....: 
10 loops, best of 3: 52.2 ms per loop

In [26]: timeit np.save("test",l)
10 loops, best of 3: 115 ms per loop

In [27]: timeit np.load("test.npy")
100 loops, best of 3: 2.35 ms per loop

If you want a list, it is again faster to use np.load and then call tolist:

In [29]: timeit np.load("test.npy").tolist()
10 loops, best of 3: 37 ms per loop
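
Outside of an IPython session, the same save/load round trip might be written as the sketch below. It assumes NumPy is installed and that the list holds plain integers; the temporary path and the variable names are illustrative only.

```python
import os
import tempfile

import numpy as np

values = list(range(1000000))

# Illustrative location; any writable path ending in .npy works.
path = os.path.join(tempfile.mkdtemp(), "test.npy")

# Convert once so the dtype is explicit; np.save would otherwise
# convert the list to an array itself.
arr = np.asarray(values, dtype=np.int64)
np.save(path, arr)

loaded_array = np.load(path)         # fast: returns a NumPy array
loaded_list = loaded_array.tolist()  # extra step only if a list is needed
```

If the comparison code can work on arrays directly, skipping `tolist()` avoids the conversion cost entirely.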

#2

As PadraicCunningham has mentioned, you can pickle the list.

import pickle

lst = [1,2,3,4,5]

with open('file.pkl', 'wb') as pickle_file:
    pickle.dump(lst, pickle_file, protocol=pickle.HIGHEST_PROTOCOL)

This saves the list to a file.

And to extract it:

import pickle

with open('file.pkl', 'rb') as pickle_load:
    lst = pickle.load(pickle_load)
print(lst)  # prints [1, 2, 3, 4, 5]

The HIGHEST_PROTOCOL bit is optional, but is normally recommended. Protocols define how pickle will serialise the object, with lower protocols tending to be compatible with older versions of Python.

It's worth noting two more things:

There is also the cPickle module - written in C to optimise speed. You use this in the same way as above.

Pickle also has known security problems: a crafted pickle can control how the object is deserialised and make Python do more or less whatever the attacker wants. As a result, this library shouldn't be used to open data from unknown or untrusted sources. In extreme cases you can try out a safer version like spickle: https://github.com/ershov/sPickle
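
For the first point, a common pattern is to import cPickle when it exists and fall back to pickle otherwise. This is only a sketch: the fallback branch is an addition for interpreters without a cPickle module (such as Python 3), and the round trip below uses data created in the same process, which is the only kind of data that should be unpickled without further checks.

```python
try:
    import cPickle as pickle  # Python 2: the C implementation
except ImportError:
    import pickle  # no cPickle module available (e.g. Python 3)

lst = [1, 2, 3, 4, 5]

# Same API as in the example above, whichever module was imported.
data = pickle.dumps(lst, pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(data)
```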

Other libraries I'd recommend looking up are json and marshal.
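
As a rough illustration of those two modules, the sketch below round-trips the same small list through each. Both are in the standard library; note that the marshal format is specific to Python and not guaranteed to stay compatible across Python versions, so it suits short-lived caches better than long-term storage.

```python
import json
import marshal

lst = [1, 2, 3, 4, 5]

# json: text output, readable from other languages.
json_text = json.dumps(lst)
from_json = json.loads(json_text)

# marshal: binary output, tied to the Python version that wrote it.
marshal_bytes = marshal.dumps(lst)
from_marshal = marshal.loads(marshal_bytes)
```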

#3

I've profiled many methods (not including the numpy method), and pickle/cPickle is very slow on simple data sets. The fastest way depends on what type of data you are saving. If you are saving a list of strings and/or integers, the fastest way that I've seen is to write it directly to a file using a for loop and ','.join(...), then read it back in using a similar for loop with .split(',').
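
A minimal sketch of that approach for a non-empty list of integers follows. It writes the whole list with a single join rather than an explicit for loop, which is a simplification of what is described above, and the file path is illustrative. Strings containing commas would need a different separator or escaping.

```python
import os
import tempfile

values = list(range(100000))
path = os.path.join(tempfile.mkdtemp(), "values.txt")

# Write: one comma-separated line of decimal integers.
with open(path, "w") as f:
    f.write(",".join(str(v) for v in values))

# Read: split on commas and convert each field back to int.
# Assumes the file is non-empty; an empty file would yield [''].
with open(path) as f:
    loaded = [int(s) for s in f.read().split(",")]
```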

#4

You may want to take a look at Python object serialization with pickle and cPickle: http://pymotw.com/2/pickle/

pickle.dumps(obj[, protocol]) If the protocol parameter is omitted, protocol 0 is used. If protocol is specified as a negative value or HIGHEST_PROTOCOL, the highest protocol version will be used.
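
To see the effect of the protocol argument, the sketch below pickles the same list with protocol 0, with HIGHEST_PROTOCOL, and with a negative value. The default of protocol 0 quoted above applies to Python 2; newer Python versions use a different default, so the sketch always passes the protocol explicitly.

```python
import pickle

lst = list(range(1000))

text_pickle = pickle.dumps(lst, 0)                    # protocol 0: ASCII-based format
highest = pickle.dumps(lst, pickle.HIGHEST_PROTOCOL)  # most compact binary format
negative = pickle.dumps(lst, -1)                      # negative value also selects the highest

# All three decode to the same list; only size and speed differ.
sizes = (len(text_pickle), len(highest))
```

For large lists of numbers, the binary protocols are usually both smaller and faster to load than protocol 0.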
