For loop with Dask array and/or h5py

Time: 2022-03-25 21:16:56

I have a time series with over a hundred million rows of data, and I am trying to reshape it to include a time window. My sample data has shape (79499, 9), and with a window of 10 I want to reshape it to (79489, 10, 9). The following for loop works fine in numpy.

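import numpy as np

def munge(data, backprop_window):
    # Collect one (backprop_window, n_features) slice per starting row.
    result = []
    for index in range(len(data) - backprop_window):
        result.append(data[index: index + backprop_window])
    # Stack into shape (len(data) - backprop_window, backprop_window, n_features).
    return np.array(result)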

X_train = munge(X_train, backprop_window)
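As a quick sanity check on the shapes involved, here is a small self-contained sketch. The 25-row array and the name `sample` are made up for illustration. Because the loop runs over `range(len(data) - backprop_window)`, an input of shape (n, f) should yield an output of shape (n - backprop_window, backprop_window, f).

```python
import numpy as np

def munge(data, backprop_window):
    result = []
    for index in range(len(data) - backprop_window):
        result.append(data[index: index + backprop_window])
    return np.array(result)

# Hypothetical toy data: 25 rows, 9 features.
sample = np.arange(25 * 9, dtype="float32").reshape(25, 9)
windows = munge(sample, backprop_window=10)

print(windows.shape)  # (15, 10, 9)
# Each window is a contiguous run of rows from the input.
print(np.array_equal(windows[3], sample[3:13]))  # True
```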

I have tried a few variations with dask, but all of them seem to hang without giving any error messages, including this one:

import h5py
import dask.array as da
f1 = h5py.File("data.hdf5", "a")  # open read/write; h5py 3.x defaults to read-only
X_train = f1.create_dataset('X_train', data=X_train, dtype='float32')
x = da.from_array(X_train, chunks=(10000, X_train.shape[1]))
result = x.compute(munge(x, backprop_window))

Any wise thoughts appreciated.
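
One dask-free way to keep memory bounded is to build the windows one batch at a time rather than all at once. The sketch below is only an illustration under assumptions: `iter_window_batches` and `batch_size` are hypothetical names, and `data` is assumed to be anything that supports `len()` and row slicing (a NumPy array here; an h5py dataset should behave the same way, though that is not exercised in this example). Each batch matches the corresponding slice of what munge would return.

```python
import numpy as np

def iter_window_batches(data, backprop_window, batch_size):
    """Yield munge-style windows in batches (hypothetical helper)."""
    n_windows = len(data) - backprop_window  # same window count as munge
    # Offsets of the rows inside one window: 0, 1, ..., backprop_window - 1.
    offsets = np.arange(backprop_window)
    for start in range(0, n_windows, batch_size):
        stop = min(start + batch_size, n_windows)
        # Read only the rows this batch needs into memory.
        block = np.asarray(data[start: stop + backprop_window - 1])
        # Row indices for every window in the batch, shape (stop - start, window).
        idx = np.arange(stop - start)[:, None] + offsets[None, :]
        yield block[idx]

X = np.arange(100, dtype="float32").reshape(20, 5)
batches = list(iter_window_batches(X, backprop_window=3, batch_size=4))
print([b.shape for b in batches])
```

Each yielded batch can be fed to a model or written to disk before the next one is built, so only `batch_size + backprop_window - 1` rows plus one batch of windows need to be in memory at a time.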

1 Answer

#1


This doesn't necessarily solve your dask issue, but as a much faster alternative to munge, you could instead use numpy's stride_tricks to create a rolling view into your data (based on example here).

def munge_strides(data, backprop_window):
    """ take a rolling view into array by manipulating strides """
    from numpy.lib.stride_tricks import as_strided
    new_shape = (data.shape[0] - backprop_window,
                 backprop_window,
                 data.shape[1])
    new_strides = (data.strides[0], data.strides[0], data.strides[1])
    return as_strided(data, shape=new_shape, strides=new_strides)

X_train = np.arange(100).reshape(20, 5)

np.array_equal(munge(X_train, backprop_window=3),
               munge_strides(X_train, backprop_window=3))
Out[112]: True
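
The speed-up comes from the strided result being a view onto the original buffer rather than a copy. The sketch below illustrates this. It passes `writeable=False` to `as_strided` to guard against accidental writes through the overlapping view; that is a choice made for this example, not something the function above does, and `munge_view` is a hypothetical name.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def munge_view(data, backprop_window):
    """Read-only rolling view (hypothetical variant of munge_strides)."""
    shape = (data.shape[0] - backprop_window, backprop_window, data.shape[1])
    strides = (data.strides[0], data.strides[0], data.strides[1])
    return as_strided(data, shape=shape, strides=strides, writeable=False)

X = np.arange(100).reshape(20, 5)
view = munge_view(X, backprop_window=3)

print(view.shape)                 # (17, 3, 5)
print(np.shares_memory(view, X))  # True: no rows were copied
print(view.flags.writeable)       # False

# A change to the source shows up in every window that covers the changed row.
X[1, 0] = -1
print(view[0, 1, 0], view[1, 0, 0])  # -1 -1
```

Because the windows overlap in memory, calling `np.array(view)` or otherwise copying the result materializes the full (n - window, window, features) array, so the memory saving only holds while the result stays a view.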

as_strided needs to be used very carefully: it is an 'advanced' feature, and incorrect shape or stride parameters can easily cause segfaults (see its docstring).
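
For a less error-prone route, NumPy 1.20 and later provide `sliding_window_view` in the same `stride_tricks` module. It computes the shape and strides itself and returns a read-only view by default. The sketch below is an alternative to, not part of, the approach above, and `munge_sliding` is a hypothetical name. Two adjustments are needed to reproduce munge's output: the window axis is appended last, so it is transposed back into position 1, and the final window is dropped because munge stops one window short of the end.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def munge_sliding(data, backprop_window):
    """Rolling view via sliding_window_view (hypothetical helper, NumPy >= 1.20)."""
    # Shape (n - window + 1, n_features, window): the window axis comes last.
    view = sliding_window_view(data, backprop_window, axis=0)
    # Drop the last window and move the window axis to position 1,
    # giving (n - window, window, n_features) like munge.
    return view[:-1].transpose(0, 2, 1)

X = np.arange(100).reshape(20, 5)
out = munge_sliding(X, backprop_window=3)
expected = np.array([X[i: i + 3] for i in range(len(X) - 3)])

print(out.shape)                      # (17, 3, 5)
print(np.array_equal(out, expected))  # True
```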
