I often have numpy arrays that are the result of lengthy computations, and I need to use them elsewhere in calculations. I currently 'pickle' them and unpickle the files into variables as and when I need them.
I noticed that for large data sizes (~1M data points) this is slow. I read elsewhere that pickling is not the best way to store huge files. I would like to store them efficiently as ASCII files and read them back directly into a numpy array. What is the best way to do this?
Say I have a 100k x 3 2D array in a variable 'a'. I want to store it in an ASCII file and load it back into a numpy array variable 'b'.
3 Answers
#1
3
Numpy has a range of input and output methods that will do exactly what you are after.
One option would be numpy.save:
import numpy as np

my_array = np.array([1, 2, 3, 4])

# np.save writes numpy's binary format regardless of the file extension
with open('data.txt', 'wb') as f:
    np.save(f, my_array, allow_pickle=False)
To load your data again:
with open('data.txt', 'rb') as f:
    my_loaded_array = np.load(f)
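If you specifically want a human-readable ASCII file, as the question asks, numpy also provides numpy.savetxt and numpy.loadtxt, though these are typically much slower than the binary .npy route for arrays of this size. A minimal sketch, assuming a 100k x 3 array 'a' as in the question (the filename 'data_ascii.txt' is just for illustration):

import numpy as np

a = np.random.rand(100_000, 3)     # stand-in for the computed array

# write as plain ASCII text, one row per line
np.savetxt('data_ascii.txt', a)

# read it back into a numpy array
b = np.loadtxt('data_ascii.txt')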
#2
3
If you want efficiency, ASCII is not the way to go. The problem with pickle is that it depends on the Python version, so it's not a good idea for long-term storage. You can try other binary formats instead; the most straightforward solution would be to use the numpy.save method as documented here.
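A minimal sketch of that approach, assuming the 100k x 3 array from the question is in 'a' (the filename 'array.npy' is just for illustration):

import numpy as np

a = np.random.rand(100_000, 3)   # stand-in for the computed array

np.save('array.npy', a)          # .npy is a stable binary format, not tied to pickle
b = np.load('array.npy')         # load back into a numpy array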
#3
2
The problem you pose is directly related to the size of the dataset.
There are several solutions to this quite common problem, provided by specialized libraries.
- Python-only persistence: joblib offers an alternative to pickle, aimed specifically at objects that are too large to pickle conveniently (a sketch follows this list).
- HDF5 is a file format specifically targeted at storing arrays. The format is multi-language and multi-platform, and a very good Python library exists for it: h5py
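A minimal joblib sketch, assuming the array from the question is in 'a' (the filename 'data.joblib' is just for illustration):

import joblib
import numpy as np

a = np.random.rand(100_000, 3)   # stand-in for the computed array

joblib.dump(a, 'data.joblib')    # persist to disk
b = joblib.load('data.joblib')   # load back into a numpy array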
An example with h5py. To write the data:
import h5py

with h5py.File('data.h5', 'w') as f:
    f.create_dataset('a', data=a)    # store the array under the key 'a'
To read the data:
import h5py

with h5py.File('data.h5', 'r') as f:
    b = f['a'][:]                    # read the full dataset back into memory