有没有办法让numpy.argmin（）和min（）一样快？

I'm trying to find the minimum array indices along one dimension of a very large 2D numpy array. I'm finding that this is very slow (already tried speeding it up with bottleneck, which was only a minimal improvement). However, taking the straight minimum appears to be an order of magnitude faster:

我试图找到一个非常大的2D numpy数组的一个维度上的最小数组索引。我发现这很慢（已经尝试过加速瓶颈，这只是一个很小的改进）。但是，采用直线最小值似乎要快一个数量级：

import numpy as np
import time

randvals = np.random.rand(3000,160000)
start = time.time()
minval = randvals.min(axis=0)
print "Took {0:.2f} seconds to compute min".format(time.time()-start)
start = time.time()
minindex = np.argmin(randvals,axis=0)
print "Took {0:.2f} seconds to compute argmin".format(time.time()-start)

On my machine this outputs:

在我的机器上输出：

Took 0.83 seconds to compute min
Took 9.58 seconds to compute argmin

Is there any reason why argmin is so much slower? Is there any way to speed it up to comparable to min?

有没有理由说argmin这么慢？有没有什么方法可以加快到与min相当的速度？

3 个解决方案

#1

In [1]: import numpy as np

In [2]: a = np.random.rand(3000, 16000)

In [3]: %timeit a.min(axis=0)
1 loops, best of 3: 421 ms per loop

In [4]: %timeit a.argmin(axis=0)
1 loops, best of 3: 1.95 s per loop

In [5]: %timeit a.min(axis=1)
1 loops, best of 3: 302 ms per loop

In [6]: %timeit a.argmin(axis=1)
1 loops, best of 3: 303 ms per loop

In [7]: %timeit a.T.argmin(axis=1)
1 loops, best of 3: 1.78 s per loop

In [8]: %timeit np.asfortranarray(a).argmin(axis=0)
1 loops, best of 3: 1.97 s per loop

In [9]: b = np.asfortranarray(a)

In [10]: %timeit b.argmin(axis=0)
1 loops, best of 3: 329 ms per loop

Maybe min is smart enough to do its job sequentially over the array (hence with cache locality), and argmin is jumping around the array (causing a lot of cache misses)?

也许min足够聪明，能够在数组上顺序完成工作（因此具有缓存局部性），并且argmin在数组中跳转（导致大量缓存未命中）？

Anyway, if you're willing to keep randvals as a Fortran-ordered array from the start, it'll be faster, though copying into Fortran-ordered doesn't help.

无论如何，如果你愿意从一开始就将randvals保留为Fortran排序的数组，它会更快，尽管复制到Fortran-ordered也无济于事。

#2

I just took a look at the source code, and while I don't fully understand why things are being done the way they are, this is what happens:

我只是看了一下源代码，虽然我不完全理解为什么事情按照它们的方式完成，但这是发生的事情：

np.min is basically a call to np.minimum.reduce.

np.min基本上是对np.minimum.reduce的调用。
np.argmin first moves the axis you want to operate on to the end of the shape tuple, then makes it a contiguous array, which of course triggers a copy of the full array unless the axis was the last one to begin with.

np.argmin首先将你想要操作的轴移动到形状元组的末尾，然后使它成为一个连续的数组，这当然会触发完整数组的副本，除非轴是最后一个开始的轴。

Since a copy is being made, you can get creative and try to instantiate cheaper arrays:

由于正在制作副本，您可以获得创意并尝试实例化更便宜的数组：

a = np.random.rand(1000, 2000)

def fast_argmin_axis_0(a):
    matches = np.nonzero((a == np.min(a, axis=0)).ravel())[0]
    rows, cols = np.unravel_index(matches, a.shape)
    argmin_array = np.empty(a.shape[1], dtype=np.intp)
    argmin_array[cols] = rows
    return argmin_array

In [8]: np.argmin(a, axis=0)
Out[8]: array([230, 532, 815, ..., 670, 702, 989], dtype=int64)

In [9]: fast_argmin_axis_0(a)
Out[9]: array([230, 532, 815, ..., 670, 702, 989], dtype=int64)

In [10]: %timeit np.argmin(a, axis=0)
10 loops, best of 3: 27.3 ms per loop

In [11]: %timeit fast_argmin_axis_0(a)
100 loops, best of 3: 15 ms per loop

I wouldn't go as far as calling the current implementation a bug, since there may be good reasons for numpy doing what it does the way it does it, but that this kind of trickery can speed up what should be a highly optimized function, strongly suggests that things could be done better.

我不会把当前的实现调用为bug，因为numpy可能有很好的理由做它所做的事情，但是这种技巧可以加速应该是高度优化的函数，强烈建议事情可以做得更好。

#3

I was just hitting the same problem, and found the large difference in performance when axis 0 is selected for finding the minimum. As suggested, the problem seems to be related to internally copying the array.

我刚刚遇到同样的问题，并且在选择轴0以找到最小值时发现性能差异很大。如建议的那样，问题似乎与内部复制数组有关。

I devised a rather simple-minded workaround using Cython that simultaneously establishes the minimum values and their position in a 2D numpy array along a given axis. Note that for axis = 0, the algorithm works on an array of columns (maximum number specified by the constant blocksize - here set equivalent to 8 kByte of data) simultaneously to make good use of the cache:

我设计了一个使用Cython的相当简单的解决方法，同时在给定轴的2D numpy数组中建立最小值及其位置。请注意，对于axis = 0，算法同时处理列数组（由常量块大小指定的最大数量 - 此处设置相当于8 KB数据）以充分利用缓存：

%%cython -c=-O2 -c=-march=native

import numpy as np
cimport numpy as np 
cimport cython
from libc.stdint cimport uint32_t

@cython.boundscheck(False)
@cython.wraparound(False)
cdef void _minargmin_2d_64_axis_0(uint32_t[:] minloc, double[:] minimum, double[:, :] arr) nogil:
    """
    Find the minimum and argmin for a 2D array of 64-bit floats along axis 0
    Parameters:
    -----------
    arr: numpy_array, dtype=np.float64, shape=(m, n)
       The array for which the minima should be computed.
    minloc: numpy_array, dtype=np.uint32, shape=(n)
       Stores the rows where the minima occur for each column.
    minimum: numpy_array, dtype=np.float64, shape=(n)
       The minima along each column.
    """

    # Columns of the matrix are accessed in blocks to increase cache performance.
    # Specify the number of columns here:

    DEF blocksize = 1024

    cdef int i, j, k
    cdef double minim[blocksize]
    cdef uint32_t minimloc[blocksize]

    # Read blocks of data to make good use of the cache

    cdef int blocks
    blocks  = arr.shape[1] / blocksize
    remains = arr.shape[1] % blocksize

    for i in xrange(0, blocksize * blocks, blocksize):

        for k in xrange(blocksize):
            minim[k]    = arr[0, i + k]
            minimloc[k] = 0

        for j in xrange(1, arr.shape[0]):

            for k in xrange(blocksize):

                if arr[j, i + k] < minim[k]:
                    minim[k] = arr[j, i + k]
                    minimloc[k] = j

        for k in xrange(blocksize):
            minimum[i + k] = minim[k]
            minloc[i + k]  = minimloc[k]

    # Work on the final 'incomplete' block

    i = blocksize * blocks

    for k in xrange(remains):
        minim[k]    = arr[0, i + k]
        minimloc[k] = 0

    for j in xrange(1, arr.shape[0]):

        for k in xrange(remains):

            if arr[j, i + k] < minim[k]:
                minim[k] = arr[j, i + k]
                minimloc[k] = j

    for k in xrange(remains):
        minimum[i + k] = minim[k]
        minloc[i + k]  = minimloc[k]

    # Done!


@cython.boundscheck(False)
@cython.wraparound(False)
cdef void _minargmin_2d_64_axis_1(uint32_t[:] minloc, double[:] minimum, double[:, :] arr) nogil:
    """
    Find the minimum and argmin for a 2D array of 64-bit floats along axis 1
    Parameters:
    -----------
    arr: numpy_array, dtype=np.float64, shape=(m, n)
       The array for which the minima should be computed.
    minloc: numpy_array, dtype=np.uint32, shape=(m)
       Stores the rows where the minima occur for each row.
    minimum: numpy_array, dtype=np.float64, shape=(m)
       The minima along each row.
    """
    cdef int i
    cdef int j
    cdef double minim
    cdef uint32_t minimloc

    for i in xrange(arr.shape[0]):
        minim = arr[i, 0]
        minimloc = 0

        for j in xrange(1, arr.shape[1]):
            if arr[i, j] < minim:
                minim = arr[i, j]
                minimloc = j

        minimum[i] = minim
        minloc[i]  = minimloc


@cython.boundscheck(False)
@cython.wraparound(False)
cdef void _minargmin_2d_32_axis_0(uint32_t[:] minloc, float[:] minimum, float[:, :] arr) nogil:
    """
    Find the minimum and argmin for a 2D array of 32-bit floats along axis 0
    Parameters:
    -----------
    arr: numpy_array, dtype=np.float32, shape=(m, n)
       The array for which the minima should be computed.
    minloc: numpy_array, dtype=np.uint32, shape=(n)
       Stores the rows where the minima occur for each column.
    minimum: numpy_array, dtype=np.float32, shape=(n)
       The minima along each column.
    """

    # Columns of the matrix are accessed in blocks to increase cache performance.
    # Specify the number of columns here:  

    DEF blocksize = 2048

    cdef int i, j, k
    cdef float minim[blocksize]
    cdef uint32_t minimloc[blocksize]

    # Read blocks of data to make good use of the cache

    cdef int blocks
    blocks  = arr.shape[1] / blocksize
    remains = arr.shape[1] % blocksize

    for i in xrange(0, blocksize * blocks, blocksize):

        for k in xrange(blocksize):
            minim[k]    = arr[0, i + k]
            minimloc[k] = 0

        for j in xrange(1, arr.shape[0]):

            for k in xrange(blocksize):

                if arr[j, i + k] < minim[k]:
                    minim[k] = arr[j, i + k]
                    minimloc[k] = j

        for k in xrange(blocksize):
            minimum[i + k] = minim[k]
            minloc[i + k]  = minimloc[k]

    # Work on the final 'incomplete' block

    i = blocksize * blocks

    for k in xrange(remains):
        minim[k]    = arr[0, i + k]
        minimloc[k] = 0

    for j in xrange(1, arr.shape[0]):

        for k in xrange(remains):

            if arr[j, i + k] < minim[k]:
                minim[k] = arr[j, i + k]
                minimloc[k] = j

    for k in xrange(remains):
        minimum[i + k] = minim[k]
        minloc[i + k]  = minimloc[k]

    # Done!

@cython.boundscheck(False)
@cython.wraparound(False)
cdef void _minargmin_2d_32_axis_1(uint32_t[:] minloc, float[:] minimum, float[:, :] arr) nogil:
    """
    Find the minimum and argmin for a 2D array of 32-bit floats along axis 1
    Parameters:
    -----------
    arr: numpy_array, dtype=np.float32, shape=(m, n)
       The array for which the minima should be computed.
    minloc: numpy_array, dtype=np.uint32, shape=(m)
       Stores the rows where the minima occur for each row.
    minimum: numpy_array, dtype=np.float32, shape=(m)
       The minima along each row.
    """
    cdef int i
    cdef int j
    cdef float minim
    cdef uint32_t minimloc

    for i in xrange(arr.shape[0]):
        minim = arr[i, 0]
        minimloc = 0

        for j in xrange(1, arr.shape[1]):
            if arr[i, j] < minim:
                minim = arr[i, j]
                minimloc = j

        minimum[i] = minim
        minloc[i]  = minimloc

def Min_Argmin(array_2d, axis = 1):
    """
    Find the minima and corresponding locations (argmin) of a two-dimensional
    numpy array of floats along a given axis simultaneously, and returns them
    as a tuple of arrays (min_2d, argmin_2d).

    (Note: float16 arrays will be internally transformed to float32 arrays.)
    ----------
    array_2d: numpy_array, dtype=np.float32 or np.float64, shape=(m, n)
       The array for which the minima should be computed.
    axis : int, 0 or 1 (default) : 
        The axis along which minima are computed.
    min_2d: numpy_array, dtype=np.uint8, shape=(m) or shape=(n):
       The array where the minima along the given axis are stored.
    argmin_2d:
       The array storing the row/column where the minimum occurs.
    """

    # Sanity checks:
    if len(array_2d.shape) != 2:
        raise IndexError('Min_Argmin: Number of dimensions of array must be 2')

    if not (axis == 0 or axis == 1):
        raise ValueError('Min_Argmin: Invalid axis specified')

    arr_type = array_2d.dtype

    if not arr_type in ('float16', 'float32', 'float64'):
        raise ValueError('Min_Argmin: Not a float array')

    # Transform float16 arrays
    if arr_type == 'float16':
        array_2d = array_2d.astype(np.float32)

    # Run analysis

    if arr_type == 'float64':

        # Double accuracy

        min_array    = np.zeros(array_2d.shape[1 - axis], dtype = np.float64)
        argmin_array = np.zeros(array_2d.shape[1 - axis], dtype = np.uint32)

        if (axis == 0):
            _minargmin_2d_64_axis_0(argmin_array, min_array, array_2d)

        else:
            _minargmin_2d_64_axis_1(argmin_array, min_array, array_2d)

    else:

        # Single accuracy

        min_array    = np.zeros(array_2d.shape[1 - axis], dtype = np.float32)
        argmin_array = np.zeros(array_2d.shape[1 - axis], dtype = np.uint32)

        if (axis == 0):
            _minargmin_2d_32_axis_0(argmin_array, min_array, array_2d)

        else:
            _minargmin_2d_32_axis_1(argmin_array, min_array, array_2d)

        # Transform back to float16 type as necessary

        if arr_type == 'float16':
            min_array = min_array.astype(np.float16)


    # Return results
    return min_array, argmin_array

The code can be placed and compiled in an IPython notebook cell after loading Cython support:

加载Cython支持后，可以在IPython笔记本单元格中放置和编译代码：

%load_ext Cython

and then called in the form:

然后以下面的形式调用：

min_array, argmin_array = Min_Argmin(two_dim_array, axis = 0 or 1)

Timing example:

时间示例：

random_array = np.random.rand(20000, 20000).astype(np.float32)

%timeit min_array, argmin_array = Min_Argmin(random_array, axis = 0)
%timeit min_array, argmin_array = Min_Argmin(random_array, axis = 1)

1 loops, best of 3: 405 ms per loop
1 loops, best of 3: 307 ms per loop

For comparison:

为了比较：

%%timeit 
min_array    = random_array.min(axis = 0)
argmin_array = random_array.argmin(axis = 0)

1 loops, best of 3: 10.3 s per loop

%%timeit 
min_array    = random_array.min(axis = 1)
argmin_array = random_array.argmin(axis = 1)

1 loops, best of 3: 423 ms per loop

So, there is a significant speedup for axis = 0 (and still a small advantage for axis = 1, if one is interested in both minimum and location).

因此，对于轴= 0，存在显着的加速（并且对于轴= 1仍然是小的优势，如果一个人对最小值和位置都感兴趣）。

#1

In [1]: import numpy as np

In [2]: a = np.random.rand(3000, 16000)

In [3]: %timeit a.min(axis=0)
1 loops, best of 3: 421 ms per loop

In [4]: %timeit a.argmin(axis=0)
1 loops, best of 3: 1.95 s per loop

In [5]: %timeit a.min(axis=1)
1 loops, best of 3: 302 ms per loop

In [6]: %timeit a.argmin(axis=1)
1 loops, best of 3: 303 ms per loop

In [7]: %timeit a.T.argmin(axis=1)
1 loops, best of 3: 1.78 s per loop

In [8]: %timeit np.asfortranarray(a).argmin(axis=0)
1 loops, best of 3: 1.97 s per loop

In [9]: b = np.asfortranarray(a)

In [10]: %timeit b.argmin(axis=0)
1 loops, best of 3: 329 ms per loop