Fastest way to zero out low values in an array?

Time: 2021-03-29 21:21:45

So, let's say I have 100,000 float arrays with 100 elements each. I need the X highest values in each array, BUT only if they are greater than Y. Any element not matching this should be set to 0. What would be the fastest way to do this in Python? Order must be maintained. Most of the elements are already set to 0.

sample variables:

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

expected result:

array = [0, .25, 0, .15, .5, 0, 0, 0, 0, 0]

9 solutions

#1


73  

This is a typical job for NumPy, which is very fast for these kinds of operations:

import numpy

array_np = numpy.asarray(array)
low_values_flags = array_np < lowValY  # Where values are low
array_np[low_values_flags] = 0  # All low values set to 0

Now, if you only need the highCountX largest elements, you can even "forget" the small elements (instead of setting them to 0 and sorting them) and only sort the list of large elements:

array_np = numpy.asarray(array)
print(numpy.sort(array_np[array_np >= lowValY])[-highCountX:])

Of course, sorting the whole array if you only need a few elements might not be optimal. Depending on your needs, you might want to consider the standard heapq module.
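
Since the question involves 100,000 arrays of 100 elements, one way to avoid both a full sort and a Python-level loop is to stack them into a single 2-D array and use numpy.partition. The following is only a sketch under that assumption; the function name keep_top_values and its parameter names are made up for illustration, and ties at the cutoff may leave more than highCountX non-zero values in a row:

```python
import numpy as np

def keep_top_values(rows, high_count_x, low_val_y):
    """Zero every entry that is not among the high_count_x largest of its row
    or that is not strictly greater than low_val_y. Order is preserved."""
    rows = np.asarray(rows, dtype=float)
    if high_count_x <= 0:
        return np.zeros_like(rows)
    k = min(high_count_x, rows.shape[1])
    # After partitioning, position -k of each row holds that row's k-th largest value.
    kth_largest = np.partition(rows, -k, axis=1)[:, -k]
    keep = (rows >= kth_largest[:, None]) & (rows > low_val_y)
    return np.where(keep, rows, 0.0)

# One row here; in practice `batch` would have shape (100000, 100).
batch = np.array([[.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]])
result = keep_top_values(batch, 3, .1)
print(result)
```

The partition step is linear in the row length, which is why it can be cheaper than sorting when only the cutoff value is needed.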

#2


19  

# NOTE: scipy.stats.threshold was removed in SciPy 1.0, so this needs an older SciPy
from scipy.stats import threshold
thresholded = threshold(array, 0.5)

:)

#3


7  

There's a special MaskedArray class in NumPy that does exactly that. You can "mask" elements based on any condition. This represents your need better than assigning zeroes: NumPy operations will ignore masked values when appropriate (for example, when computing the mean).

>>> from numpy import ma
>>> x = ma.array([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0])
>>> x1 = ma.masked_inside(x, 0, 0.1) # mask everything in 0..0.1 range
>>> x1
masked_array(data = [-- 0.25 -- 0.15 0.5 -- -- -- -- --],
         mask = [ True False True False False True True True True True],
   fill_value = 1e+20)
>>> print(x1.filled(0)) # Fill with zeroes
[ 0 0.25 0 0.15 0.5 0 0 0 0 0 ]
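
To illustrate the point about operations ignoring masked values, here is a small sketch comparing the mean of a masked array with the mean of its zero-filled copy. It uses ma.masked_less_equal rather than masked_inside, which is a substitution chosen for this example; the variable names are also illustrative:

```python
import numpy as np
import numpy.ma as ma

values = np.array([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0])
x = ma.masked_less_equal(values, 0.1)  # mask everything <= 0.1

masked_mean = x.mean()      # averages only the unmasked entries .25, .15 and .5
zeroed = x.filled(0)        # plain ndarray with the masked entries set to 0
plain_mean = zeroed.mean()  # averages all ten entries, zeros included

print(masked_mean, plain_mean)
```

The two means differ because the masked version divides by the number of unmasked entries, not by the array length.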

As an added benefit, masked arrays are well supported in the matplotlib visualisation library, if you need that.

Docs on masked arrays in numpy

#4


6  

Using numpy:

# assign zero to all elements less than or equal to `lowValY`
a[a <= lowValY] = 0
# find the n-th largest element in the array (where n = highCountX)
x = partial_sort(a, highCountX, reverse=True)[:highCountX][-1]
# NOTE: this might leave more than highCountX non-zero elements
# if there are duplicates
a[a < x] = 0

Where partial_sort could be:

def partial_sort(a, n, reverse=False):
    #NOTE: in general it should return full list but in your case this will do
    return sorted(a, reverse=reverse)[:n] 

The expression a[a<value] = 0 can be written without numpy as follows:

for i, x in enumerate(a):
    if x < value:
        a[i] = 0

#5


5  

The simplest way would be:

topX = sorted([x for x in array if x > lowValY], reverse=True)[highCountX-1]
print([x if x >= topX else 0 for x in array])

In pieces, this selects all the elements greater than lowValY:

[x for x in array if x > lowValY]

This list contains only the elements greater than the threshold. Then it is sorted so the largest values are at the start:

sorted(..., reverse=True)

Then a list index takes the threshold for the top highCountX elements:

sorted(...)[highCountX-1]

Finally, the original array is filled out using another list comprehension:

[x if x >= topX else 0 for x in array]

There is a boundary condition when two or more equal elements tie for the last kept position (in your example, the 3rd highest). The resulting array will then contain that value more than once, so more than highCountX elements stay non-zero.

There are other boundary conditions as well, such as if len(array) < highCountX. Handling such conditions is left to the implementor.
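
As a rough sketch of how some of those boundary conditions might be handled, the version below guards against the case where fewer than highCountX values pass the threshold (which would otherwise raise an IndexError). The function name zero_low_values and its parameter names are invented for illustration, and ties at the cutoff are still all kept:

```python
def zero_low_values(array, high_count_x, low_val_y):
    # Values that pass the threshold, largest first.
    candidates = sorted((x for x in array if x > low_val_y), reverse=True)
    if high_count_x <= 0 or not candidates:
        return [0] * len(array)
    # If fewer than high_count_x values qualify, keep all of them.
    top_x = candidates[min(high_count_x, len(candidates)) - 1]
    return [x if x >= top_x else 0 for x in array]

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
print(zero_low_values(array, 3, .1))
```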

#6


2  

Setting elements below some threshold to zero is easy:

array = [ x if x > threshold else 0.0 for x in array ]

(plus the occasional abs() if needed.)

The requirement of the N highest numbers is a bit vague, however. What if there are e.g. N+1 equal numbers above the threshold? Which one to truncate?

You could sort the array first, then set the threshold to the value of the Nth element:

threshold = sorted(array, reverse=True)[N-1]  # value of the Nth largest element
array = [ x if x >= threshold else 0.0 for x in array ]

Note: this solution is optimized for readability, not performance.

#7


1  

You can use map and lambda; it should be fast enough.

new_array = list(map(lambda x: x if x > y else 0, array))  # list() is needed on Python 3

#8


0  

Use a heap.

This runs in O(n*lg(highCountX)) time.

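import heapq

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

# Seed the heap with highCountX copies of lowValY, so heap[0] never drops
# below the threshold (a list of equal values is already a valid heap)
heap = [lowValY] * highCountX

for value in array:
    if value > heap[0]:
        # Drop the smallest kept value and push the larger one
        heapq.heapreplace(heap, value)

# highCountX-th largest value, or lowValY if fewer values qualify
cutoff = heap[0]

# The extra `x > lowValY` check excludes values exactly equal to lowValY
array = [x if x >= cutoff and x > lowValY else 0 for x in array]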

With a heap of size k, deletemin takes O(lg(k)), and insertion takes O(lg(k)) or O(1), depending on which heap type you use.

#9


0  

Using a heap is a good idea, as egon says. But you can use the heapq.nlargest function to cut down on some effort:

import heapq 

array =  [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

threshold = max(heapq.nlargest(highCountX, array)[-1], lowValY)
array = [x if x >= threshold else 0 for x in array]
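
If exactly highCountX values must survive even when there are ties, one possible variation is to run heapq.nlargest over indices instead of values. This is only a sketch: the function name keep_exactly_top is hypothetical, and breaking ties in favour of the earlier index is an arbitrary choice made for this example:

```python
import heapq

def keep_exactly_top(array, high_count_x, low_val_y):
    # Indices of the values strictly above the threshold.
    eligible = [i for i, x in enumerate(array) if x > low_val_y]
    # Pick the indices with the largest values; on equal values the
    # earlier index wins because -i is larger for smaller i.
    winners = set(heapq.nlargest(high_count_x, eligible,
                                 key=lambda i: (array[i], -i)))
    return [x if i in winners else 0 for i, x in enumerate(array)]

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
print(keep_exactly_top(array, 3, .1))
```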
