Setting 1s between two values in a numpy array, fast

I have been looking for a solution to this problem for a while but can't seem to find anything.

For example, I have a numpy array:

[ 0,  0,  2,  3,  2,  4,  3,  4,  0,  0, -2, -1, -4, -2, -1, -3, -4,  0,  2,  3, -2, -1,  0]

What I would like to achieve is to generate another array that marks the elements between a pair of numbers, say between 2 and -2 here. So I want to get an array like this:

[ 0,  0,  1,  1,  1,  1,  1,  1,  1,  1,  1,  0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  0,  0]

Notice that any 2 or -2 inside a (2, -2) pair is ignored. An easy approach is to iterate through each element with a for loop, identify the first occurrence of 2, and set everything from there on to 1 until you hit a -2, then start looking for the next 2 again.

But I would like this process to be faster, as I have over 1000 elements in the numpy array and the process needs to be run many times. Do you know any elegant way to solve this? Thanks in advance!

4 Solutions

#1

Quite a problem that is! Listed in this post is a vectorized solution (hopefully the inline comments help explain the logic behind it). I am assuming A is the input array, with T1 and T2 as the start and stop triggers.

import numpy as np

def setones_between_triggers(A,T1,T2):

    # Get start and stop indices corresponding to rising and falling triggers
    start = np.where(A==T1)[0]
    stop = np.where(A==T2)[0]

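    # Take care of boundary conditions for np.searchsorted to work
    if start.size == 0:
        # No start trigger at all, so nothing gets set
        return np.zeros(A.size,dtype=int)
    if (stop.size == 0 or stop[-1] < start[-1]) and start[-1] != A.size-1:
        stop = np.append(stop,A.size-1)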

    # This is where the magic happens.
    # Validate (filter out) the triggers based on these conditions:
    # 1. Check whether there is more than one stop index between two start indices.
    # If so, use the first one and reject all others in that in-between space.
    # 2. Repeat the same check for start, but use the validated stop indices.

    # First off, take care of out-of-bound cases for proper indexing
    stop_valid_idx = np.unique(np.searchsorted(stop,start,'right'))
    stop_valid_idx = stop_valid_idx[stop_valid_idx < stop.size]

    stop_valid = stop[stop_valid_idx]
    _,idx = np.unique(np.searchsorted(stop_valid,start,'left'),return_index=True)
    start_valid = start[idx]

    # Create shifts array (array filled with zeros, unless triggered by T1 and T2 
    # for which we have +1 and -1 as triggers). 
    shifts = np.zeros(A.size,dtype=int)
    shifts[start_valid] = 1
    shifts[stop_valid] = -1

    # Perform cumulative summation, which almost gives us the desired output
    out = shifts.cumsum()

    # For a worst case when we have two groups of (T1,T2) adjacent to each other, 
    # set the negative trigger position as 1 as well
    out[stop_valid] = 1    
    return out

Sample runs

Original sample case :

In [1589]: A
Out[1589]: 
array([ 0,  0,  2,  3,  2,  4,  3,  4,  0,  0, -2, -1, -4, -2, -1, -3, -4,
        0,  2,  3, -2, -1,  0])

In [1590]: setones_between_triggers(A,2,-2)
Out[1590]: array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

Worst case #1 (adjacent (2,-2) groups) :

In [1595]: A
Out[1595]: 
array([-2,  2,  0,  2, -2,  2,  2,  2,  4, -2,  0, -2, -2, -4, -2, -1,  2,
       -4,  0,  2,  3, -2, -2,  0])

In [1596]: setones_between_triggers(A,2,-2)
Out[1596]: 
array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
       0], dtype=int32)

Worst case #2 (2 without any -2 till end) :

In [1603]: A
Out[1603]: 
array([-2,  2,  0,  2, -2,  2,  2,  2,  4, -2,  0, -2, -2, -4, -2, -1, -2,
       -4,  0,  2,  3,  5,  6,  0])

In [1604]: setones_between_triggers(A,2,-2)
Out[1604]: 
array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1], dtype=int32)
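
For comparison, here is a different vectorized formulation, not the method used in this answer. It is only a sketch and rests on one observation: whether a run is open after a given position depends solely on the most recent trigger seen so far, so a forward fill of the last trigger index is enough. The name `ones_between_ffill` is hypothetical, and the code assumes a 1-D input with T1 != T2.

```python
import numpy as np

def ones_between_ffill(A, T1, T2):
    """Hypothetical alternative: mark each run from T1 to the next T2, inclusive."""
    A = np.asarray(A)
    n = A.size
    if n == 0:
        return np.zeros(0, dtype=int)
    is_start = A == T1
    is_stop = A == T2
    # Index of the most recent trigger at or before each position (0 if none yet)
    pos = np.where(is_start | is_stop, np.arange(n), 0)
    last = np.maximum.accumulate(pos)
    # A run is open after position i iff the most recent trigger is a start
    open_after = is_start[last]
    # The state before position i is the state after position i - 1
    open_before = np.concatenate(([False], open_after[:-1]))
    # A stop that closes an open run still belongs to that run
    return (open_after | (is_stop & open_before)).astype(int)

A = np.array([0, 0, 2, 3, 2, 4, 3, 4, 0, 0, -2, -1, -4, -2, -1, -3, -4,
              0, 2, 3, -2, -1, 0])
print(ones_between_ffill(A, 2, -2))
```

Like the function above, this treats a 2 with no -2 after it as open until the end of the array.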

#2

Assuming you have a huge dataset, I prefer to do a pair of initial searches for the two boundaries and then use a for loop over these indices for validation.

import numpy as np

def between_pairs(x, b1, b2):
    # output vector
    out = np.zeros_like(x)

    # reversed list of indices for possible rising and trailing edges
    rise_edges = list(np.argwhere(x==b1)[::-1,0])
    trail_edges = list(np.argwhere(x==b2)[::-1,0])

    # determine the rising trailing edge pairs
    rt_pairs = []
    t = None
    # look for the next rising edge after the previous trailing edge
    while rise_edges:
        r = rise_edges.pop()
        if t is not None and r < t:
            continue

        # look for the next trailing edge after previous rising edge
        while trail_edges:
            t = trail_edges.pop()
            if t > r:
                rt_pairs.append((r, t))
                break

    # use the rising, trailing pairs to fill in out
    for rt in rt_pairs:
        out[rt[0]:rt[1]+1] = 1
    return out
# Example
a = np.array([0,  0,  2,  3,  2,  4,  3,  4,  0,  0, -2, -1, -4, -2, -1, -3, -4,
        0,  2,  3, -2, -1,  0])
d = between_pairs(a , 2, -2)
print(repr(d))

array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

I did a speed comparison with the alternative answer given by @CactusWoman:

def between_vals(x, val1, val2):
    out = np.zeros(x.shape, dtype = int)
    in_range = False
    for i, v in enumerate(x):
        if v == val1 and not in_range:
            in_range = True
        if in_range:
            out[i] = 1
        if v == val2 and in_range:
            in_range = False
    return out

I found the following:

In [59]: a = np.random.choice(np.arange(-5, 6), 2000)

In [60]: %timeit between_vals(a, 2, -2)
1000 loops, best of 3: 681 µs per loop

In [61]: %timeit between_pairs(a, 2, -2)
1000 loops, best of 3: 182 µs per loop

And for a much smaller dataset:

In [72]: a = np.random.choice(np.arange(-5, 6), 50)

In [73]: %timeit between_vals(a, 2, -2)
10000 loops, best of 3: 17 µs per loop

In [74]: %timeit between_pairs(a, 2, -2)
10000 loops, best of 3: 34.7 µs per loop

Therefore it all depends on your dataset size.
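
To find where the crossover falls for your own array sizes, the same kind of measurement can be scripted with the standard `timeit` module. This is a minimal sketch: `time_per_call` is a hypothetical helper, the loop function is the one quoted in this answer, and the sizes, seed, and repeat counts are arbitrary choices.

```python
import timeit
import numpy as np

def between_vals(x, val1, val2):
    # Plain loop baseline, as quoted in this answer
    out = np.zeros(x.shape, dtype=int)
    in_range = False
    for i, v in enumerate(x):
        if v == val1 and not in_range:
            in_range = True
        if in_range:
            out[i] = 1
        if v == val2 and in_range:
            in_range = False
    return out

def time_per_call(func, size, repeats=3, number=20, seed=0):
    """Best observed seconds per call of func(a, 2, -2) on a random array."""
    rng = np.random.default_rng(seed)
    a = rng.choice(np.arange(-5, 6), size)
    best = min(timeit.repeat(lambda: func(a, 2, -2), repeat=repeats, number=number))
    return best / number

for size in (50, 2000):
    print(size, time_per_call(between_vals, size))
```

Passing any other function with the same `(x, val1, val2)` signature to `time_per_call` gives a like-for-like number on the same random input.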

#3

Is iterating through the array really too slow?

def between_vals(x, val1, val2):
    out = np.zeros(x.shape, dtype = int)
    in_range = False
    for i, v in enumerate(x):
        if v == val1 and not in_range:
            in_range = True
        if in_range:
            out[i] = 1
        if v == val2 and in_range:
            in_range = False
    return out

I'm in the same boat as @Randy C: nothing else I've tried is faster than this.
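
If the loop stays, one cheap thing to try is iterating over `x.tolist()` instead of the array itself, since looping over a NumPy array wraps every element in a NumPy scalar object. The sketch below keeps the same state machine; `between_vals_list` is a hypothetical name, and whether it actually helps should be timed on your own data.

```python
import numpy as np

def between_vals_list(x, val1, val2):
    """Same logic as between_vals, but looping over a plain Python list."""
    values = np.asarray(x).tolist()  # convert to Python scalars once, up front
    out = [0] * len(values)
    in_range = False
    for i, v in enumerate(values):
        if v == val1:
            in_range = True       # a start inside an open run changes nothing
        if in_range:
            out[i] = 1
        if v == val2:
            in_range = False      # a stop outside a run changes nothing
    return np.array(out, dtype=int)

a = np.array([0, 0, 2, 3, 2, 4, 3, 4, 0, 0, -2, -1, -4, -2, -1, -3, -4,
              0, 2, 3, -2, -1, 0])
print(between_vals_list(a, 2, -2))
```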

#4

I've tried a few things at this point, and the need to keep track of state for the start/finish markers has made the more clever things I've tried slower than the dumb iterative approach I used as a check:

import numpy as np

for _ in range(1000):
    a = np.random.choice(np.arange(-5, 6), 2000)
    found2 = False
    l = []
    for el in a:
        if el == 2:
            found2 = True
        l.append(1 if found2 else 0)
        if el == -2:
            found2 = False
    l = np.array(l)
