Right now I have a Python program that builds a fairly large 2D numpy array and saves it as a tab-delimited text file using numpy.savetxt. The numpy array contains only floats. I then read the file in one row at a time in a separate C++ program.
What I would like to do is find a way to accomplish this same task, changing my code as little as possible such that I can decrease the size of the file I am passing between the two programs.
I found that I can use numpy.savetxt to save to a compressed .gz file instead of a text file. This lowers the file size from ~2MB to ~100kB.
Is there a better way to do this? Could I, perhaps, write the numpy array in binary to the file to save space? If so, how would I do this so that I can still read it into the C++ program?
Thank you for the help. I appreciate any guidance I can get.
EDIT:
There are a lot of zeros (probably 70% of the values in the numpy array are 0.0000). I am not sure how to exploit this to generate a tiny file that my C++ program can read in.
5 answers
#1
3
Since you have a lot of zeroes, you could write out only the non-zero elements, in the form (index, value).
Suppose you have an array with a small number of nonzero values:
In [4]: import numpy as np
In [5]: a = np.zeros((10, 10))
In [6]: a
Out[6]:
array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
In [7]: a[3,1] = 2.0
In [8]: a[7,4] = 17.0
In [9]: a[9,0] = 1.5
First, isolate the interesting numbers and their indices:
In [11]: x, y = a.nonzero()
In [12]: nonzero = list(zip(x.tolist(), y.tolist()))
In [13]: nonzero
Out[13]: [(3, 1), (7, 4), (9, 0)]
Now you only have a small number of data elements left. The easiest thing is to write them to a text file:
In [17]: with open('numbers.txt', 'w') as outf:
   ....:     for r, k in nonzero:
   ....:         outf.write('{:d} {:d} {:g}\n'.format(r, k, a[r,k]))
   ....:
In [18]: cat numbers.txt
3 1 2
7 4 17
9 0 1.5
This also gives you an opportunity to eyeball the data. In your C++ program you can read this data with fscanf.
But you can reduce the size even more by writing binary data using struct:
In [19]: import struct
In [20]: c = struct.Struct('=IId')
In [21]: with open('numbers.bin', 'wb') as outf:
   ....:     for r, k in nonzero:
   ....:         outf.write(c.pack(r, k, a[r,k]))
The argument to the Struct constructor means: '=' selects native byte order with standard sizes and no alignment padding. The first and second data elements are unsigned 32-bit integers ('I'), and the third element is a double ('d'), so each record is 16 bytes.
In your C++ program this data is probably best read as binary data into a packed struct.
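As a rough check of the record layout, here is a minimal Python sketch that writes the same '=IId' records and reads them back. It is not the C++ reader itself; it only illustrates what such a reader has to do, namely consume fixed 16-byte records made of two unsigned 32-bit integers followed by a double. The helper names write_records and read_records and the temporary file location are made up for this illustration.

```python
import os
import struct
import tempfile

# Same layout as in the answer: native byte order, standard sizes, no padding.
RECORD = struct.Struct('=IId')  # 4 + 4 + 8 = 16 bytes per record


def write_records(path, records):
    """Write (row, col, value) triples as fixed-size binary records."""
    with open(path, 'wb') as outf:
        for row, col, value in records:
            outf.write(RECORD.pack(row, col, value))


def read_records(path):
    """Read the whole file and split it into (row, col, value) triples."""
    with open(path, 'rb') as inf:
        data = inf.read()
    return list(RECORD.iter_unpack(data))


# Hypothetical demo data: the three nonzero entries from the example array.
entries = [(3, 1, 2.0), (7, 4, 17.0), (9, 0, 1.5)]
path = os.path.join(tempfile.mkdtemp(), 'numbers.bin')
write_records(path, entries)
restored = read_records(path)
file_size = os.path.getsize(path)
```

Note that the shape of the dense array is not stored in this file, so the reader has to know it already or receive it separately.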
EDIT: Answer updated for a 2D array.
#2
3
Unless you are sure you don't need to worry about endianness and such, it is best to use numpy.savez, as explained in @unutbu's answer and @jorgeca's comment here: numpy's tostring/fromstring --- what do I need to specify to restore the array.
If the resulting size is not small enough, there's always zlib (on Python's side: import zlib; on the C++ side, I'm sure an implementation exists).
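To get a feel for what zlib does with mostly-zero binary data, here is a minimal sketch using only the standard library. The data is synthetic (about 70% zeros, as in the question), so the resulting sizes are only illustrative and will differ for a real array.

```python
import struct
import zlib

# Synthetic data, assumed for illustration: 7 of every 10 values are 0.0.
values = [0.0 if i % 10 < 7 else i * 0.5 for i in range(1000)]

# Pack as little-endian doubles so the byte layout does not depend
# on the machine that writes the file.
raw = struct.pack('<%dd' % len(values), *values)

compressed = zlib.compress(raw, 9)      # level 9: smallest output, slowest
restored = zlib.decompress(compressed)  # byte-for-byte copy of raw

print(len(raw), len(compressed))
```

On the C++ side, the zlib C library's uncompress function performs the inverse step, provided the reader knows how large the uncompressed buffer needs to be.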
An alternative would be to use the HDF5 format: while it does not necessarily reduce the on-disk file size, it does make saving/loading faster (this is what the format was designed for, large data arrays). There are both Python and C++ readers/writers for HDF5.
#3
1
numpy.ndarray.tofile and numpy.fromfile are useful for direct binary output/input from Python. std::ostream::write and std::istream::read are useful for binary output/input in C++.
You should be careful about endianness if the data are transferred from one machine to another.
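A minimal sketch of the round trip on the Python side, assuming NumPy is installed. The dtype '<f8' (little-endian 64-bit float) is an assumption made here to pin the byte order down explicitly. tofile stores neither the shape nor the dtype, so both have to be agreed on with the reader. The file name is arbitrary.

```python
import os
import tempfile

import numpy as np

# Small example array with three nonzero entries.
a = np.zeros((10, 10))
a[3, 1] = 2.0
a[7, 4] = 17.0
a[9, 0] = 1.5

path = os.path.join(tempfile.mkdtemp(), 'array.bin')

# Raw row-major dump: no header, just rows * cols * 8 bytes.
a.astype('<f8').tofile(path)

# The reader must already know the dtype and the shape.
b = np.fromfile(path, dtype='<f8').reshape(a.shape)
file_size = os.path.getsize(path)
```

On a little-endian machine, a C++ program can read the same file with std::istream::read into a buffer of rows * cols doubles.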
#4
1
Use an HDF5 file. They are really simple to use through h5py, and you can set a compression flag. Note that HDF5 also has a C++ interface.
#5
0
If you don't mind installing additional packages (for both Python and C++), you can use BSON (Binary JSON).