How to format in numpy savetxt such that zeros are saved only as "0"

I am saving a numpy sparse array (converted to dense) into a csv, and the result is a 3GB file. The problem is that 95% of the cells are 0.0000, because I used fmt='%5.4f'. How can I format and save the array so that the zeros are written as just 0 while the non-zero floats keep the '%5.4f' format? I am sure I can get the 3GB down to 300MB if I can do this.

I am using

np.savetxt('foo.csv', arrayDense, fmt='%5.4f', delimiter = ',')

Thanks Regards

3 Answers

#1

If you look at the source code of np.savetxt, you'll see that, while there is quite a bit of code to handle the arguments and the differences between Python 2 and Python 3, it is ultimately a simple Python loop over the rows, in which each row is formatted and written to the file. So you won't lose any performance if you write your own. For example, here's a pared-down function that writes compact zeros:

def savetxt_compact(fname, x, fmt="%.6g", delimiter=','):
    with open(fname, 'w') as fh:
        for row in x:
            line = delimiter.join("0" if value == 0 else fmt % value for value in row)
            fh.write(line + '\n')

For example:

In [70]: x
Out[70]: 
array([[ 0.        ,  0.        ,  0.        ,  0.        ,  1.2345    ],
       [ 0.        ,  9.87654321,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  3.14159265,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ]])

In [71]: savetxt_compact('foo.csv', x, fmt='%.4f')

In [72]: !cat foo.csv
0,0,0,0,1.2345
0,9.8765,0,0,0
0,3.1416,0,0,0
0,0,0,0,0
0,0,0,0,0
0,0,0,0,0
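
One caveat with the value == 0 test above: a tiny non-zero value such as 0.00001 still comes out as 0.0000 under '%.4f'. Here is a minimal sketch of a variant that checks the formatted text instead, so anything that rounds to zero at the chosen precision is also written as 0. The name savetxt_compact_rounded is made up for illustration, and it assumes x is a 2-D numeric array.

```python
import numpy as np


def savetxt_compact_rounded(fname, x, fmt="%.4f", delimiter=","):
    """Write x as text, emitting "0" for every value that formats as zero."""
    with open(fname, "w") as fh:
        for row in np.asarray(x):
            fields = []
            for value in row:
                text = fmt % value
                # "0.0000" and "-0.0000" both parse back to zero.
                fields.append("0" if float(text) == 0 else text)
            fh.write(delimiter.join(fields) + "\n")
```

This formats every value, including exact zeros, so it is somewhat slower than the version above. Whether the extra shrinkage is worth it depends on how many near-zero values the data holds.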

Then, as long as you are writing your own savetxt function, you might as well make it handle sparse matrices, so you don't have to convert the whole thing to a (dense) numpy array before saving it. (I assume the sparse array is implemented using one of the sparse representations from scipy.sparse.) In the following function, the only change is from ... for value in row to ... for value in row.A[0]. Each row here is a 1-row sparse matrix, and row.A[0] turns it into a dense 1-D array, so only one row is dense at a time.

def savetxt_sparse_compact(fname, x, fmt="%.6g", delimiter=','):
    with open(fname, 'w') as fh:
        for row in x:
            line = delimiter.join("0" if value == 0 else fmt % value for value in row.A[0])
            fh.write(line + '\n')

Example:

In [112]: a
Out[112]: 
<6x5 sparse matrix of type '<type 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>

In [113]: a.A
Out[113]: 
array([[ 0.        ,  0.        ,  0.        ,  0.        ,  1.2345    ],
       [ 0.        ,  9.87654321,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  3.14159265,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ]])

In [114]: savetxt_sparse_compact('foo.csv', a, fmt='%.4f')

In [115]: !cat foo.csv
0,0,0,0,1.2345
0,9.8765,0,0,0
0,3.1416,0,0,0
0,0,0,0,0
0,0,0,0,0
0,0,0,0,0
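
If the .A shorthand is not available on the sparse class you are using, the same output can be produced straight from the CSR arrays. The following is only a sketch under assumptions: savetxt_csr_compact is a hypothetical name, and the input is copied to CSR first so that duplicate entries can be summed without touching the caller's matrix.

```python
from scipy import sparse


def savetxt_csr_compact(fname, x, fmt="%.6g", delimiter=","):
    """Write a sparse matrix as dense text without densifying any row."""
    x = sparse.csr_matrix(x, copy=True)  # work on a private CSR copy
    x.sum_duplicates()                   # one stored value per (row, col)
    n_rows, n_cols = x.shape
    with open(fname, "w") as fh:
        for i in range(n_rows):
            fields = ["0"] * n_cols      # start from an all-zero row
            start, stop = x.indptr[i], x.indptr[i + 1]
            for j, value in zip(x.indices[start:stop], x.data[start:stop]):
                if value != 0:           # skip explicitly stored zeros
                    fields[j] = fmt % value
            fh.write(delimiter.join(fields) + "\n")
```

Calling savetxt_csr_compact('foo.csv', a, fmt='%.4f') on the matrix a from the example should write the same file as shown above.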

#2

Another simple option that may work given your requirements is the 'g' specifier. If you care more about significant digits than about seeing an exact number of decimal places, and don't mind the output switching between scientific and fixed-point notation, this does the trick well. For example:

np.savetxt("foo.csv", arrayDense, fmt='%5.4g', delimiter=',')

If arrayDense is this:

matrix([[ -5.54900000e-01,   0.00000000e+00,   0.00000000e+00],
    [  0.00000000e+00,   3.43560000e-08,   0.00000000e+00],
    [  0.00000000e+00,   0.00000000e+00,   3.43422000e+01]])

Your way would yield:

-0.5549,0.0000,0.0000
0.0000,0.0000,0.0000
0.0000,0.0000,34.3422

The above would yield instead:

-0.5549,    0,    0
    0,3.436e-08,    0
    0,    0,34.34

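This way is also more flexible. Notice that by using 'g' instead of 'f', small values are not flattened to zero (3.4356e-08 is written as 3.436e-08 instead of 0.0000). How many digits survive obviously depends on the precision you set. Also note that the width of 5 in '%5.4g' pads every zero with spaces; drop the width ('%.4g') if you want a bare 0.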
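
To check what a given format string writes without creating a file, np.savetxt can be pointed at an in-memory buffer. This is a small sketch rather than part of the original answer: array_dense simply repeats the example matrix above, and '%.4g' (no field width) is used so that zeros are not padded.

```python
import io

import numpy as np

array_dense = np.array([[-5.549e-01, 0.0, 0.0],
                        [0.0, 3.4356e-08, 0.0],
                        [0.0, 0.0, 3.43422e+01]])

# Write into a string buffer instead of a file so the result can be inspected.
buffer = io.StringIO()
np.savetxt(buffer, array_dense, fmt="%.4g", delimiter=",")
text = buffer.getvalue()
print(text)
```

With this format each zero takes a single character, and the non-zero values keep four significant digits.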

#3

It would be much better to save only the non-zero entries of your sparse matrix (m in the example below). You can do that with:

fname = 'row_col_data.txt'
m = m.tocoo()
a = np.vstack((m.row, m.col, m.data)).T
header = '{0}, {1}'.format(*m.shape)
np.savetxt(fname, a, header=header, fmt=('%d', '%d', '%5.4f'))

The sparse matrix can then be rebuilt with:

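from scipy.sparse import coo_matrix
row, col, data = np.loadtxt(fname, skiprows=1, unpack=True)
with open(fname) as fh:
    shape = tuple(int(n) for n in fh.readline()[1:].split(','))  # header line is "# rows, cols"
m = coo_matrix((data, (row.astype(int), col.astype(int))), shape=shape)  # indices must be integers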
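
For convenience, the save and load steps can be wrapped into a pair of functions and checked with a round trip. This is a sketch only: save_coo_text and load_coo_text are hypothetical names, not scipy functions, and the file layout is the one used above (a "# rows, cols" header followed by row, col, value triplets).

```python
import numpy as np
from scipy.sparse import coo_matrix


def save_coo_text(fname, m, value_fmt="%.4f"):
    """Save the stored entries of a sparse matrix as 'row col value' lines."""
    m = coo_matrix(m)
    triplets = np.column_stack((m.row, m.col, m.data))
    header = "{0}, {1}".format(*m.shape)  # written by savetxt as "# rows, cols"
    np.savetxt(fname, triplets, header=header, fmt=("%d", "%d", value_fmt))


def load_coo_text(fname):
    """Rebuild the sparse matrix written by save_coo_text."""
    with open(fname) as fh:
        shape = tuple(int(n) for n in fh.readline().lstrip("#").split(","))
    # loadtxt skips the '#' header line; reshape copes with a single data row.
    triplets = np.loadtxt(fname, ndmin=2).reshape(-1, 3)
    row = triplets[:, 0].astype(int)
    col = triplets[:, 1].astype(int)
    return coo_matrix((triplets[:, 2], (row, col)), shape=shape)
```

The values that come back are rounded to whatever precision value_fmt writes, so the round trip is exact only up to that precision.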
