Creating an intervaled ramp array based on a threshold - Python / NumPy

I would like to measure the length of a sub-array fulfilling some condition (like a stop clock), but as soon as the condition is no longer fulfilled, the value should reset to zero. So the resulting array should tell me how many consecutive values have fulfilled the condition (e.g. value > 1):

[0, 0, 2, 2, 2, 2, 0, 3, 3, 0]

should result in the following array:

[0, 0, 1, 2, 3, 4, 0, 1, 2, 0]

One can easily define a function in Python which returns the corresponding NumPy array:

import numpy as np

def StopClock(signal, threshold=1):
    clock = []
    current_time = 0
    for item in signal:
        if item > threshold:
            current_time += 1  # condition holds: keep counting
        else:
            current_time = 0   # condition broken: reset the clock
        clock.append(current_time)
    return np.array(clock)

StopClock([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])

However, I really do not like this for-loop, especially since this counter should run over a longer dataset. I thought of some np.cumsum solution in combination with np.diff, but I could not work out the reset part. Is anyone aware of a more elegant NumPy-style solution to the above problem?

3 solutions

#1

Yes, we can use diff-style differentiation along with cumsum to create such intervaled ramps in a vectorized manner, which should be quite efficient, especially with large input arrays. The reset is handled by writing a negative offset (minus the length of the run) at the position just after each interval ends, so that the cumulative sum drops back to zero there.

Here's one implementation to accomplish all that -

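import numpy as np

def intervaled_ramp(a, thresh=1):
    # a is expected to be a NumPy array
    mask = a > thresh

    # Get start, stop indices of each run of True values.
    # Padding with False lets runs touching either end be detected.
    mask_ext = np.concatenate(([False], mask, [False]))
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
    s0, s1 = idx[::2], idx[1::2]  # starts (inclusive), stops (exclusive)

    # 1 inside each run, 0 elsewhere
    out = mask.astype(int)
    # A run that reaches the end of the array has no position to reset at
    valid_stop = s1[s1 < len(a)]
    # Just after each run, place minus the run length so cumsum returns to 0
    out[valid_stop] = s0[:len(valid_stop)] - valid_stop
    return out.cumsum()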
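
To make the reset trick easier to follow, here is a rough step-by-step trace of the same idea on the sample from the question. The variable names `starts`, `stops` and `steps` are only chosen for this illustration and are not part of the function above.

```python
import numpy as np

a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
thresh = 1

mask = a > thresh

# Pad with False on both sides, then look for positions where the mask flips
mask_ext = np.concatenate(([False], mask, [False]))
idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
starts, stops = idx[::2], idx[1::2]
print(starts, stops)   # [2 7] [6 9]

# 1 inside a run, 0 outside
steps = mask.astype(int)

# Just after each run, write minus the run length
valid_stop = stops[stops < len(a)]
steps[valid_stop] = starts[:len(valid_stop)] - valid_stop
print(steps)           # [ 0  0  1  1  1  1 -4  1  1 -2]

# The cumulative sum climbs by 1 inside a run and falls back to 0 after it
result = steps.cumsum()
print(result)          # [0 0 1 2 3 4 0 1 2 0]
```

The -4 and -2 entries are what cancel the accumulated count, so no explicit loop over the runs is needed.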

Sample runs -

Input (a) : 
[5 3 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) : 
[1 2 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]

Input (a) : 
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) : 
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]

Input (a) : 
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=1)) : 
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 1]

Input (a) : 
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=0)) : 
[1 2 3 4 5 0 0 1 2 3 4 0 1 2 0 1 2 3 0 1 2 3 4 0 1]
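
Beyond eyeballing sample runs, one could also check the vectorized version against the plain loop from the question. The following is only a sketch: `stop_clock_loop` is a name made up here for the loop reference, and the test cases (random data plus a few edge cases) are my own choice.

```python
import numpy as np

def stop_clock_loop(signal, threshold=1):
    """Reference: the plain Python loop from the question."""
    clock, current = [], 0
    for item in signal:
        current = current + 1 if item > threshold else 0
        clock.append(current)
    return np.array(clock, dtype=int)

def intervaled_ramp(a, thresh=1):
    """Vectorized version, same logic as posted above."""
    mask = a > thresh
    mask_ext = np.concatenate(([False], mask, [False]))
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
    s0, s1 = idx[::2], idx[1::2]
    out = mask.astype(int)
    valid_stop = s1[s1 < len(a)]
    out[valid_stop] = s0[:len(valid_stop)] - valid_stop
    return out.cumsum()

rng = np.random.default_rng(0)
cases = [
    rng.integers(0, 4, size=1000),  # random values in 0..3
    np.array([5, 5, 5]),            # one run covering the whole array
    np.array([0, 0, 0]),            # no run at all
    np.array([0, 3, 3]),            # run that reaches the end
]
all_match = all(
    np.array_equal(intervaled_ramp(c), stop_clock_loop(c)) for c in cases
)
print(all_match)
```

The case with a run reaching the end of the array is the one that exercises the `valid_stop` filtering.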

Runtime test

One way to benchmark fairly is to take the sample posted in the question, tile it a large number of times, and use the result as the input array. With that setup, here are the timings -

In [841]: a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])

In [842]: a = np.tile(a,10000)

# @Alexander's soln
In [843]: %timeit pandas_app(a, threshold=1)
1 loop, best of 3: 3.93 s per loop

# @Psidom 's soln
In [844]: %timeit stop_clock(a, threshold=1)
10 loops, best of 3: 119 ms per loop

# Proposed in this post
In [845]: %timeit intervaled_ramp(a, thresh=1)
1000 loops, best of 3: 527 µs per loop
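
For anyone wanting to rerun a comparison like this outside IPython, a rough sketch with the standard `timeit` module could look as follows. It only covers the two NumPy functions from this thread, the `number` and `repeat` values are arbitrary choices of mine, and the absolute numbers will differ from those above depending on machine and library versions.

```python
import timeit
import numpy as np

def intervaled_ramp(a, thresh=1):
    mask = a > thresh
    mask_ext = np.concatenate(([False], mask, [False]))
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
    s0, s1 = idx[::2], idx[1::2]
    out = mask.astype(int)
    valid_stop = s1[s1 < len(a)]
    out[valid_stop] = s0[:len(valid_stop)] - valid_stop
    return out.cumsum()

def stop_clock(signal, threshold=1):
    mask = signal > threshold
    indices = np.flatnonzero(np.diff(mask)) + 1
    return np.concatenate(list(map(np.cumsum, np.array_split(mask, indices))))

# Same setup as above: the question's sample tiled 10000 times
a = np.tile(np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0]), 10000)

# Both approaches should agree before their timings are compared
assert np.array_equal(intervaled_ramp(a), stop_clock(a))

for fn in (stop_clock, intervaled_ramp):
    # Best of 3 repeats, 3 calls each, reported per call
    best = min(timeit.repeat(lambda: fn(a), number=3, repeat=3)) / 3
    print(f"{fn.__name__}: {best * 1e3:.3f} ms per call")
```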

#2

This solution uses pandas to perform a groupby:

import numpy as np
import pandas as pd

s = pd.Series([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
threshold = 0
np.where(
    s > threshold,
    s
    .to_frame()  # Convert series to dataframe.
    .assign(_dummy_=1)  # Add column of ones.
    .groupby((s.gt(threshold) != s.gt(threshold).shift()).cumsum())['_dummy_']  # shift-cumsum pattern
    .transform(lambda x: x.cumsum()),  # Cumsum the ones per group.
    0)  # Fill value with zero where threshold not exceeded.
# array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
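
The shift-cumsum pattern may be easier to see when split into steps. The sketch below is a shorter variant of the same idea, not the exact code above: it cumsums the boolean mask per group instead of a dummy column of ones, and the names `above` and `run_id` are made up for illustration.

```python
import pandas as pd

s = pd.Series([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
threshold = 1

# True where the condition holds
above = s.gt(threshold)

# A new group starts wherever the mask differs from its predecessor;
# cumsum over those change points gives each run its own id
run_id = (above != above.shift()).cumsum()
print(run_id.tolist())   # [1, 1, 2, 2, 2, 2, 3, 4, 4, 5]

# Cumulative sum of the mask within each run: counts up inside
# True runs and stays 0 inside False runs
counts = above.astype(int).groupby(run_id).cumsum()
print(counts.tolist())   # [0, 0, 1, 2, 3, 4, 0, 1, 2, 0]
```

Since False runs sum to zero on their own, this variant does not need the final `np.where` step.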

#3

Another numpy solution:

import numpy as np
a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])

def stop_clock(signal, threshold=1):
    mask = signal > threshold
    indices = np.flatnonzero(np.diff(mask)) + 1
    return np.concatenate(list(map(np.cumsum, np.array_split(mask, indices))))

stop_clock(a)
# array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
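
To see what `stop_clock` does internally, here is a small trace of its intermediate values on the same sample. The name `pieces` is introduced only for this illustration.

```python
import numpy as np

a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
mask = a > 1

# np.diff on a boolean array marks where neighbouring values differ;
# +1 turns that into the first index of each new run
indices = np.flatnonzero(np.diff(mask)) + 1
print(indices)   # [2 6 7 9]

# Split the mask at those indices: one piece per run
pieces = np.array_split(mask, indices)
for p in pieces:
    # cumsum counts up inside a True piece and stays 0 inside a False piece
    print(p, np.cumsum(p))

result = np.concatenate([np.cumsum(p) for p in pieces])
print(result)    # [0 0 1 2 3 4 0 1 2 0]
```

Note that the cumsum is applied piece by piece at the Python level, so the work grows with the number of runs in the input.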
