在numpy数组中查找相同值的序列长度(运行长度编码)

时间:2022-03-04 21:25:59

In a pylab program (which could probably be a matlab program as well) I have a numpy array of numbers representing distances: d[t] is the distance at time t (and the timespan of my data is len(d) time units).

在一个pylab程序中(也可能是一个matlab程序)我有一个代表距离的数字numpy数组:d [t]是时间t的距离(我的数据的时间跨度是len(d)时间单位) 。

The events I'm interested in are when the distance is below a certain threshold, and I want to compute the duration of these events. It's easy to get an array of booleans with b = d<threshold, and the problem comes down to computing the sequence of the lengths of the True-only words in b. But I do not know how to do that efficiently (i.e. using numpy primitives), and I resorted to walk the array and to do manual change detection (i.e. initialize counter when value goes from False to True, increase counter as long as value is True, and output the counter to the sequence when value goes back to False). But this is tremendously slow.

我感兴趣的事件是当距离低于某个阈值时,我想计算这些事件的持续时间。很容易得到一个b = d <阈值的布尔数组,问题归结为计算b中纯真词的长度序列。但我不知道如何有效地做到这一点(即使用numpy原语),并且我使用数组并进行手动更改检测(即当值从false变为true时初始化计数器,只要值为true就增加计数器,当值返回false时,将计数器输出到序列。但这非常缓慢。< p>

How to efficienly detect that sort of sequences in numpy arrays ?

如何在numpy数组中有效地检测那种序列?

Below is some python code that illustrates my problem : the fourth dot takes a very long time to appear (if not, increase the size of the array)

下面是一些说明我的问题的python代码:第四个点需要很长时间才能出现(如果没有,增加数组的大小)

from pylab import *

threshold = 7

print '.'
d = 10*rand(10000000)

print '.'

b = d<threshold

print '.'

durations=[]
for i in xrange(len(b)):
    if b[i] and (i==0 or not b[i-1]):
        counter=1
    if  i>0 and b[i-1] and b[i]:
        counter+=1
    if (b[i-1] and not b[i]) or i==len(b)-1:
        durations.append(counter)

print '.'

5 个解决方案

#1


37  

While not numpy primitives, itertools functions are often very fast, so do give this one a try (and measure times for various solutions including this one, of course):

虽然不是numpy原语,但itertools函数通常非常快,所以请尝试这个(并且测量各种解决方案的时间,当然包括这个):

def runs_of_ones(bits):
  for bit, group in itertools.groupby(bits):
    if bit: yield sum(group)

If you do need the values in a list, just can use list(runs_of_ones(bits)), of course; but maybe a list comprehension might be marginally faster still:

如果确实需要列表中的值,那么当然可以使用list(runs_of_ones(bits));但也许列表理解可能会略微加快:

def runs_of_ones_list(bits):
  return [sum(g) for b, g in itertools.groupby(bits) if b]

Moving to "numpy-native" possibilities, what about:

转向“numpy-native”的可能性,那么:

def runs_of_ones_array(bits):
  # make sure all runs of ones are well-bounded
  bounded = numpy.hstack(([0], bits, [0]))
  # get 1 at run starts and -1 at run ends
  difs = numpy.diff(bounded)
  run_starts, = numpy.where(difs > 0)
  run_ends, = numpy.where(difs < 0)
  return run_ends - run_starts

Again: be sure to benchmark solutions against each others in realistic-for-you examples!

再次:确保在实际的示例中针对彼此对比解决方案!

#2


23  

Fully numpy vectorized and generic RLE for any array (works with strings, booleans etc too).

适用于任何阵列的完全numpy矢量化和通用RLE(也适用于字符串,布尔值等)。

Outputs tuple of run lengths, start positions, and values.

输出运行长度,起始位置和值的元组。

import numpy as np

def rle(inarray):
        """ run length encoding. Partial credit to R rle function. 
            Multi datatype arrays catered for including non Numpy
            returns: tuple (runlengths, startpositions, values) """
        ia = np.asarray(inarray)                  # force numpy
        n = len(ia)
        if n == 0: 
            return (None, None, None)
        else:
            y = np.array(ia[1:] != ia[:-1])     # pairwise unequal (string safe)
            i = np.append(np.where(y), n - 1)   # must include last element posi
            z = np.diff(np.append(-1, i))       # run lengths
            p = np.cumsum(np.append(0, z))[:-1] # positions
            return(z, p, ia[i])

Pretty fast (i7):

很快(i7):

xx = np.random.randint(0, 5, 1000000)
%timeit yy = rle(xx)
100 loops, best of 3: 18.6 ms per loop

Multiple data types:

多种数据类型:

rle([True, True, True, False, True, False, False])
Out[8]: 
(array([3, 1, 1, 2]),
 array([0, 3, 4, 5]),
 array([ True, False,  True, False], dtype=bool))

rle(np.array([5, 4, 4, 4, 4, 0, 0]))
Out[9]: (array([1, 4, 2]), array([0, 1, 5]), array([5, 4, 0]))

rle(["hello", "hello", "my", "friend", "okay", "okay", "bye"])
Out[10]: 
(array([2, 1, 1, 2, 1]),
 array([0, 2, 3, 4, 6]),
 array(['hello', 'my', 'friend', 'okay', 'bye'], 
       dtype='|S6'))

Same results as Alex Martelli above:

与Alex Martelli相同的结果如上:

xx = np.random.randint(0, 2, 20)

xx
Out[60]: array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1])

am = runs_of_ones_array(xx)

tb = rle(xx)

am
Out[63]: array([4, 5, 2, 5])

tb[0][tb[2] == 1]
Out[64]: array([4, 5, 2, 5])

%timeit runs_of_ones_array(xx)
10000 loops, best of 3: 28.5 µs per loop

%timeit rle(xx)
10000 loops, best of 3: 38.2 µs per loop

Slightly slower than Alex (but still very fast), and much more flexible.

比Alex慢一点(但速度非常快),而且更加灵活。

#3


6  

Here is a solution using only arrays: it takes an array containing a sequence of bools and counts the length of the transitions.

这是一个仅使用数组的解决方案:它采用包含bool序列的数组并计算转换的长度。

>>> from numpy import array, arange
>>> b = array([0,0,0,1,1,1,0,0,0,1,1,1,1,0,0], dtype=bool)
>>> sw = (b[:-1] ^ b[1:]); print sw
[False False  True False False  True False False  True False False False
  True False]
>>> isw = arange(len(sw))[sw]; print isw
[ 2  5  8 12]
>>> lens = isw[1::2] - isw[::2]; print lens
[3 4]

sw contains a true where there is a switch, isw converts them in indexes. The items of isw are then subtracted pairwise in lens.

sw包含一个true,其中有一个开关,isw在索引中转换它们。然后在镜头中成对地减去isw的项目。

Notice that if the sequence started with an 1 it would count the length of the 0s sequences: this can be fixed in the indexing to compute lens. Also, I have not tested corner cases such sequences of length 1.

请注意,如果序列以1开始,它将计算0s序列的长度:这可以在索引中修复以计算镜头。另外,我没有测试过角情况,例如长度为1的序列。


Full function that returns start positions and lengths of all True-subarrays.

返回所有True子阵列的起始位置和长度的完整函数。

import numpy as np

def count_adjacent_true(arr):
    assert len(arr.shape) == 1
    assert arr.dtype == np.bool
    if arr.size == 0:
        return np.empty(0, dtype=int), np.empty(0, dtype=int)
    sw = np.insert(arr[1:] ^ arr[:-1], [0, arr.shape[0]-1], values=True)
    swi = np.arange(sw.shape[0])[sw]
    offset = 0 if arr[0] else 1
    lengths = swi[offset+1::2] - swi[offset:-1:2]
    return swi[offset:-1:2], lengths

Tested for different bool 1D-arrays (empty array; single/multiple elements; even/odd lengths; started with True/False; with only True/False elements).

测试了不同的bool 1D阵列(空数组;单个/多个元素;偶数/奇数长度;以True / False开头;仅使用True / False元素)。

#4


5  

Just in case anyone is curious (and since you mentioned MATLAB in passing), here's one way to solve it in MATLAB:

万一有人好奇(因为你传递了MATLAB),这里有一种方法可以在MATLAB中解决它:

threshold = 7;
d = 10*rand(1,100000);  % Sample data
b = diff([false (d < threshold) false]);
durations = find(b == -1)-find(b == 1);

I'm not too familiar with Python, but maybe this could help give you some ideas. =)

我对Python不太熟悉,但也许这可以帮助你提供一些想法。 =)

#5


0  

durations = []
counter   = 0

for bool in b:
    if bool:
        counter += 1
    elif counter > 0:
        durations.append(counter)
        counter = 0

if counter > 0:
    durations.append(counter)

#1


37  

While not numpy primitives, itertools functions are often very fast, so do give this one a try (and measure times for various solutions including this one, of course):

虽然不是numpy原语,但itertools函数通常非常快,所以请尝试这个(并且测量各种解决方案的时间,当然包括这个):

def runs_of_ones(bits):
  for bit, group in itertools.groupby(bits):
    if bit: yield sum(group)

If you do need the values in a list, just can use list(runs_of_ones(bits)), of course; but maybe a list comprehension might be marginally faster still:

如果确实需要列表中的值,那么当然可以使用list(runs_of_ones(bits));但也许列表理解可能会略微加快:

def runs_of_ones_list(bits):
  return [sum(g) for b, g in itertools.groupby(bits) if b]

Moving to "numpy-native" possibilities, what about:

转向“numpy-native”的可能性,那么:

def runs_of_ones_array(bits):
  # make sure all runs of ones are well-bounded
  bounded = numpy.hstack(([0], bits, [0]))
  # get 1 at run starts and -1 at run ends
  difs = numpy.diff(bounded)
  run_starts, = numpy.where(difs > 0)
  run_ends, = numpy.where(difs < 0)
  return run_ends - run_starts

Again: be sure to benchmark solutions against each others in realistic-for-you examples!

再次:确保在实际的示例中针对彼此对比解决方案!

#2


23  

Fully numpy vectorized and generic RLE for any array (works with strings, booleans etc too).

适用于任何阵列的完全numpy矢量化和通用RLE(也适用于字符串,布尔值等)。

Outputs tuple of run lengths, start positions, and values.

输出运行长度,起始位置和值的元组。

import numpy as np

def rle(inarray):
        """ run length encoding. Partial credit to R rle function. 
            Multi datatype arrays catered for including non Numpy
            returns: tuple (runlengths, startpositions, values) """
        ia = np.asarray(inarray)                  # force numpy
        n = len(ia)
        if n == 0: 
            return (None, None, None)
        else:
            y = np.array(ia[1:] != ia[:-1])     # pairwise unequal (string safe)
            i = np.append(np.where(y), n - 1)   # must include last element posi
            z = np.diff(np.append(-1, i))       # run lengths
            p = np.cumsum(np.append(0, z))[:-1] # positions
            return(z, p, ia[i])

Pretty fast (i7):

很快(i7):

xx = np.random.randint(0, 5, 1000000)
%timeit yy = rle(xx)
100 loops, best of 3: 18.6 ms per loop

Multiple data types:

多种数据类型:

rle([True, True, True, False, True, False, False])
Out[8]: 
(array([3, 1, 1, 2]),
 array([0, 3, 4, 5]),
 array([ True, False,  True, False], dtype=bool))

rle(np.array([5, 4, 4, 4, 4, 0, 0]))
Out[9]: (array([1, 4, 2]), array([0, 1, 5]), array([5, 4, 0]))

rle(["hello", "hello", "my", "friend", "okay", "okay", "bye"])
Out[10]: 
(array([2, 1, 1, 2, 1]),
 array([0, 2, 3, 4, 6]),
 array(['hello', 'my', 'friend', 'okay', 'bye'], 
       dtype='|S6'))

Same results as Alex Martelli above:

与Alex Martelli相同的结果如上:

xx = np.random.randint(0, 2, 20)

xx
Out[60]: array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1])

am = runs_of_ones_array(xx)

tb = rle(xx)

am
Out[63]: array([4, 5, 2, 5])

tb[0][tb[2] == 1]
Out[64]: array([4, 5, 2, 5])

%timeit runs_of_ones_array(xx)
10000 loops, best of 3: 28.5 µs per loop

%timeit rle(xx)
10000 loops, best of 3: 38.2 µs per loop

Slightly slower than Alex (but still very fast), and much more flexible.

比Alex慢一点(但速度非常快),而且更加灵活。

#3


6  

Here is a solution using only arrays: it takes an array containing a sequence of bools and counts the length of the transitions.

这是一个仅使用数组的解决方案:它采用包含bool序列的数组并计算转换的长度。

>>> from numpy import array, arange
>>> b = array([0,0,0,1,1,1,0,0,0,1,1,1,1,0,0], dtype=bool)
>>> sw = (b[:-1] ^ b[1:]); print sw
[False False  True False False  True False False  True False False False
  True False]
>>> isw = arange(len(sw))[sw]; print isw
[ 2  5  8 12]
>>> lens = isw[1::2] - isw[::2]; print lens
[3 4]

sw contains a true where there is a switch, isw converts them in indexes. The items of isw are then subtracted pairwise in lens.

sw包含一个true,其中有一个开关,isw在索引中转换它们。然后在镜头中成对地减去isw的项目。

Notice that if the sequence started with an 1 it would count the length of the 0s sequences: this can be fixed in the indexing to compute lens. Also, I have not tested corner cases such sequences of length 1.

请注意,如果序列以1开始,它将计算0s序列的长度:这可以在索引中修复以计算镜头。另外,我没有测试过角情况,例如长度为1的序列。


Full function that returns start positions and lengths of all True-subarrays.

返回所有True子阵列的起始位置和长度的完整函数。

import numpy as np

def count_adjacent_true(arr):
    assert len(arr.shape) == 1
    assert arr.dtype == np.bool
    if arr.size == 0:
        return np.empty(0, dtype=int), np.empty(0, dtype=int)
    sw = np.insert(arr[1:] ^ arr[:-1], [0, arr.shape[0]-1], values=True)
    swi = np.arange(sw.shape[0])[sw]
    offset = 0 if arr[0] else 1
    lengths = swi[offset+1::2] - swi[offset:-1:2]
    return swi[offset:-1:2], lengths

Tested for different bool 1D-arrays (empty array; single/multiple elements; even/odd lengths; started with True/False; with only True/False elements).

测试了不同的bool 1D阵列(空数组;单个/多个元素;偶数/奇数长度;以True / False开头;仅使用True / False元素)。

#4


5  

Just in case anyone is curious (and since you mentioned MATLAB in passing), here's one way to solve it in MATLAB:

万一有人好奇(因为你传递了MATLAB),这里有一种方法可以在MATLAB中解决它:

threshold = 7;
d = 10*rand(1,100000);  % Sample data
b = diff([false (d < threshold) false]);
durations = find(b == -1)-find(b == 1);

I'm not too familiar with Python, but maybe this could help give you some ideas. =)

我对Python不太熟悉,但也许这可以帮助你提供一些想法。 =)

#5


0  

durations = []
counter   = 0

for bool in b:
    if bool:
        counter += 1
    elif counter > 0:
        durations.append(counter)
        counter = 0

if counter > 0:
    durations.append(counter)