How to get the indices of all maximum values in array A corresponding to the unique values in array B?

Time: 2022-10-30 12:48:23

Suppose one has an array of observation times ts, each of which corresponds to some observed value in vs. The observation times are taken to be the number of elapsed hours (starting from zero) and can contain duplicates. I would like to find the indices that correspond to the maximum observed value per unique observation time. I am asking for the indices as opposed to the values, unlike a similar question I asked several months ago. This way, I can apply the same indices to various arrays. Below is a sample dataset, which I would like to use to prototype code for a much larger dataset.


import numpy as np
ts = np.array([0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10])
vs = np.array([500, 600, 550, 700, 500, 500, 450, 800, 900, 700, 600, 850, 850, 900, 900, 900])

My current approach is to split the array of values at every point where the observation time changes.


condition = np.where(np.diff(ts) != 0)[0]+1
ts_spl = np.split(ts, condition)
vs_spl = np.split(vs, condition)

print(ts_spl)
>> [array([0, 0]), array([1]), array([2]), array([3, 3, 3]), array([4, 4]), array([5]), array([6]), array([7]), array([8, 8]), array([9]), array([10])]

print(vs_spl)
>> [array([500, 600]), array([550]), array([700]), array([500, 500, 450]), array([800, 900]), array([700]), array([600]), array([850]), array([850, 900]), array([900]), array([900])]

In this case, duplicate max values at any duplicate times should be counted. Given this example, the returned indices would be:


[1, 2, 3, 4, 5, 8, 9, 10, 11, 13, 14, 15]
# indices = 4,5,6 correspond to values = 500, 500, 450 ==> count indices 4,5
# I might modify this part of the algorithm to return either 4 or 5 instead of 4,5 at some future time

Though I have not yet been able to adapt this algorithm for my purpose, I think it must be possible to exploit the size of each previously-split array in vs_spl to keep an index counter. Is this approach feasible for a large dataset (10,000 elements per array before padding; 70,000 elements per array after padding)? If so, how can I adapt it? If not, what are some other approaches that may be useful here?

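The chunk sizes can indeed serve as an index counter: the start offset of each chunk is the cumulative size of all preceding chunks, so a per-chunk argmax only needs that offset added back to become an index into the full array. A minimal sketch of this idea on the sample data (my own sketch, not the asker's code; the Python-level loop over chunks is the part that limits scalability):

```python
import numpy as np

ts = np.array([0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10])
vs = np.array([500, 600, 550, 700, 500, 500, 450, 800, 900, 700, 600, 850, 850, 900, 900, 900])

# split at every point where the time changes, as in the question
condition = np.where(np.diff(ts) != 0)[0] + 1
vs_spl = np.split(vs, condition)

# offsets[i] is the position in vs at which the i-th chunk starts
offsets = np.concatenate(([0], condition))

# within each chunk, find every position tied for the maximum,
# then shift it back to an index into the full array
indices = np.concatenate([off + np.flatnonzero(chunk == chunk.max())
                          for off, chunk in zip(offsets, vs_spl)])
print(indices.tolist())  # → [1, 2, 3, 4, 5, 8, 9, 10, 11, 13, 14, 15]
```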

1 solution

#1



70,000 elements isn't that insanely large, so yes, it should be feasible. It is, however, faster to avoid the splitting and use the .reduceat method of the relevant ufuncs. reduceat is like reduce applied to chunks, except you don't have to provide the chunks themselves; you just tell reduceat where you would have cut to get them. For example:

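To make the reduceat semantics concrete, here is a tiny standalone illustration (the array here is made up purely for the demo):

```python
import numpy as np

a = np.array([1, 5, 2, 7, 3])
# reduce the chunks a[0:2], a[2:4], a[4:] with maximum;
# the second argument lists the start of each chunk
print(np.maximum.reduceat(a, [0, 2, 4]))  # → [5 7 3]
```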

import numpy as np


N = 10**6
ts = np.cumsum(np.random.rand(N) < 0.1)
vs = 50*np.random.randint(10, 20, (N,))

#ts = np.array([0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10])
#vs = np.array([500, 600, 550, 700, 500, 500, 450, 800, 900, 700, 600, 850, 850, 900, 900, 900])


# flatnonzero is a bit faster than where
condition = np.r_[0, np.flatnonzero(np.diff(ts)) + 1, len(ts)]
sizes = np.diff(condition)
maxima = np.repeat(np.maximum.reduceat(vs, condition[:-1]), sizes)
maxat = maxima == vs
indices = np.flatnonzero(maxat)
# if you want to know how many maxima at each hour
nmax = np.add.reduceat(maxat, condition[:-1])
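Applied to the question's small dataset (the two commented-out lines above), the same recipe reproduces the expected index list, which makes for a quick sanity check (a verification sketch, not part of the original answer):

```python
import numpy as np

ts = np.array([0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10])
vs = np.array([500, 600, 550, 700, 500, 500, 450, 800, 900, 700, 600, 850, 850, 900, 900, 900])

# cut points: start of array, every change of time, end of array
condition = np.r_[0, np.flatnonzero(np.diff(ts)) + 1, len(ts)]
sizes = np.diff(condition)
# broadcast each chunk's maximum back over the chunk, then compare
maxima = np.repeat(np.maximum.reduceat(vs, condition[:-1]), sizes)
indices = np.flatnonzero(maxima == vs)
print(indices.tolist())  # → [1, 2, 3, 4, 5, 8, 9, 10, 11, 13, 14, 15]
```

Note that ties within an hour (such as the two 500s at t=3) are all kept, matching the behavior requested in the question.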
