
Time: 2022-10-30 12:48:23

Suppose one has an array of observation times ts, each of which corresponds to an observed value in vs. The observation times are the number of elapsed hours (starting from zero) and can contain duplicates. I would like to find the indices of the maximum observed value for each unique observation time. I am asking for the indices rather than the values, unlike a similar question I asked several months ago, so that I can apply the same indices to several arrays. Below is a sample dataset, which I would like to use to adapt the code for a much larger dataset.


import numpy as np
ts = np.array([0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10])
vs = np.array([500, 600, 550, 700, 500, 500, 450, 800, 900, 700, 600, 850, 850, 900, 900, 900])

My current approach is to split the arrays of times and values wherever the observation time changes.


condition = np.where(np.diff(ts) != 0)[0]+1
ts_spl = np.split(ts, condition)
vs_spl = np.split(vs, condition)

>> [array([0, 0]), array([1]), array([2]), array([3, 3, 3]), array([4, 4]), array([5]), array([6]), array([7]), array([8, 8]), array([9]), array([10])]

>> [array([500, 600]), array([550]), array([700]), array([500, 500, 450]), array([800, 900]), array([700]), array([600]), array([850]), array([850, 900]), array([900]), array([900])]

Tied maximum values within the same observation time should all be counted. Given this example, the returned indices would be:


[1, 2, 3, 4, 5, 8, 9, 10, 11, 13, 14, 15]
# indices = 4,5,6 correspond to values = 500, 500, 450 ==> count indices 4,5
# I might modify this part of the algorithm to return either 4 or 5 instead of 4,5 at some future time

Though I have not yet been able to adapt this algorithm for my purpose, I think it must be possible to exploit the size of each previously-split array in vs_spl to keep an index counter. Is this approach feasible for a large dataset (10,000 elements per array before padding; 70,000 elements per array after padding)? If so, how can I adapt it? If not, what are some other approaches that may be useful here?


1 Answer



70,000 elements isn't that large, so yes, it should be feasible. It is, however, faster to avoid the splitting and use the .reduceat method of the relevant ufuncs. reduceat is like reduce applied to chunks, except that you don't have to provide the chunks; you just tell reduceat where you would have cut to get them. For example:

import numpy as np

N = 10**6
ts = np.cumsum(np.random.rand(N) < 0.1)
vs = 50*np.random.randint(10, 20, (N,))

#ts = np.array([0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10])
#vs = np.array([500, 600, 550, 700, 500, 500, 450, 800, 900, 700, 600, 850, 850, 900, 900, 900])

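# start index of every run of equal times, plus len(ts) as a closing sentinel
# (flatnonzero is a bit faster than where); this relies on ts being sorted
condition = np.r_[0, np.flatnonzero(np.diff(ts)) + 1, len(ts)]
# number of observations in each run
sizes = np.diff(condition)
# per-run maximum, repeated so that it lines up element by element with vs
maxima = np.repeat(np.maximum.reduceat(vs, condition[:-1]), sizes)
# True wherever an observation equals the maximum of its run
maxat = maxima == vs
indices = np.flatnonzero(maxat)
# if you want to know how many maxima at each hour
nmax = np.add.reduceat(maxat, condition[:-1])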
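
As a quick check against the sample data from the question, the same steps can be wrapped in a function. This is only a sketch under a few assumptions: the names max_indices_per_time and first_max_index_per_time are made up for illustration, ts must be non-empty and already sorted so that equal times are adjacent, and the second function is just one possible way to keep a single index per time (the first maximum) instead of all ties.

```python
import numpy as np


def max_indices_per_time(ts, vs):
    """Indices of every value equal to the maximum of its run of equal times.

    Hypothetical helper; assumes ts is non-empty and sorted.
    """
    # start index of each run of equal times
    starts = np.r_[0, np.flatnonzero(np.diff(ts)) + 1]
    # length of each run
    sizes = np.diff(np.r_[starts, len(ts)])
    # per-run maximum, broadcast back to the length of vs
    maxima = np.repeat(np.maximum.reduceat(vs, starts), sizes)
    return np.flatnonzero(vs == maxima)


def first_max_index_per_time(ts, vs):
    """Same idea, but keeps only the first maximum of each run (hypothetical)."""
    starts = np.r_[0, np.flatnonzero(np.diff(ts)) + 1]
    sizes = np.diff(np.r_[starts, len(ts)])
    maxima = np.repeat(np.maximum.reduceat(vs, starts), sizes)
    hits = np.flatnonzero(vs == maxima)
    # run number of each hit; hits are ascending, so return_index
    # points at the first hit within each run
    run_id = np.repeat(np.arange(len(starts)), sizes)[hits]
    _, first = np.unique(run_id, return_index=True)
    return hits[first]


ts = np.array([0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10])
vs = np.array([500, 600, 550, 700, 500, 500, 450, 800, 900, 700, 600, 850, 850, 900, 900, 900])

print(max_indices_per_time(ts, vs).tolist())
# [1, 2, 3, 4, 5, 8, 9, 10, 11, 13, 14, 15]
print(first_max_index_per_time(ts, vs).tolist())
# [1, 2, 3, 4, 8, 9, 10, 11, 13, 14, 15]
```

The first result matches the indices listed in the question, and the second drops index 5, the tied maximum at hour 3. If ts is not guaranteed to be sorted, sorting it first with a stable argsort and mapping the result back through that permutation would be one option.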


