两个三维数组的卷积，一边的边距过慢。

In my current project I need to "convolve" two three dimensional arrays in a slightly unusual way:

在我目前的项目中，我需要以一种稍微不同寻常的方式“convolve”两个三维数组:

Assume we have two three dimensional arrays A and B with the dimensions dimA and dimB (same for every axis). Now we want to create a third array C with the dimensions dimA+dimB for every axis.

假设我们有两个三维数组A和B，它们的维数是dimA和dimB(每个轴是相同的)。现在我们想要创建一个第三个数组C，每个轴的维度是dimA+dimB。

The entries of C are computed as:

C的条目计算为:

c_{x1+x2,y1+y2,z1+z2} += a_{x1,y1,z1} * b_{x2,y2,z2}

My current version is straightforward:

我现在的版本很简单:

dimA = A.shape[0]
dimB = B.shape[0]
dimC = dimA+dimB

C = np.zeros((dimC,dimC,dimC))
for x1 in range(dimA):
    for x2 in range(dimB):
        for y1 in range(dimA):
            for y2 in range(dimB):
                for z1 in range(dimA):
                    for z2 in range(dimB):
                        x = x1+x2
                        y = y1+y2
                        z = z1+z2
                        C[x,y,z] += A[x1,y1,z1] * B[x2,y2,z2]

Unfortunately, this version is really slow and not usable.

不幸的是，这个版本非常慢，不能使用。

My second version was:

我的第二个版本是:

C = scipy.signal.fftconvolve(A,B,mode="full")

But this computes only the elements max(dimA,dimB)

但这只计算元素max(dimA,dimB)

Has anyone a better idea?

有人有更好的主意吗?

3 个解决方案

#1

Have you tried using Numba? It's a package that allows you to wrap Python code that is usually slow with a JIT compiler. I took a quick stab at your problem using Numba and got a significant speed up. Using IPython's magic timeit magic function, the custom_convolution function took ~18 s, while Numba's optimized function took 10.4 ms. That's a speed up of more than 1700.

你试过使用Numba吗?它是一个包，允许您用JIT编译器来封装通常很慢的Python代码。我使用了Numba快速地解决了你的问题，并取得了显著的进步。使用IPython的magic timeit magic函数，custom_卷积函数使用了~18 s，而Numba的优化函数取10.4 ms。这是超过1700的速度。

Here's how Numba is implemented.

这是Numba的实现方式。

import numpy as np
from numba import jit, double

s = 15
array_a = np.random.rand(s ** 3).reshape(s, s, s)
array_b = np.random.rand(s ** 3).reshape(s, s, s)

# Original code
def custom_convolution(A, B):

    dimA = A.shape[0]
    dimB = B.shape[0]
    dimC = dimA + dimB

    C = np.zeros((dimC, dimC, dimC))
    for x1 in range(dimA):
        for x2 in range(dimB):
            for y1 in range(dimA):
                for y2 in range(dimB):
                    for z1 in range(dimA):
                        for z2 in range(dimB):
                            x = x1 + x2
                            y = y1 + y2
                            z = z1 + z2
                            C[x, y, z] += A[x1, y1, z1] * B[x2, y2, z2]
    return C

# Numba'ing the function with the JIT compiler
fast_convolution = jit(double[:, :, :](double[:, :, :],
                        double[:, :, :]))(custom_convolution)

If you compute the residual between the results of both functions you will get zeros. This means that the JIT implementation is working without any problems.

如果你计算两个函数的结果之间的剩余，你会得到0。这意味着JIT实现工作没有任何问题。

slow_result = custom_convolution(array_a, array_b) 
fast_result = fast_convolution(array_a, array_b)

print np.max(np.abs(slow_result - fast_result))

The output I get for this is 0.0.

得到的结果是0。0。

You can either install Numba into your current Python setup or try it quickly with the AnacondaCE package from continuum.io.

您可以将Numba安装到当前的Python设置中，也可以使用continuum.io的AnacondaCE包快速尝试。

Last but not least, Numba's function is faster than the scipy.signal.fftconvolve function by a factor of a few.

最后但同样重要的是，Numba的函数要比scipy.signal快。fftconvolve函数的因数有几个。

Note: I'm using Anaconda and not AnacondaCE. There are some differences between the two packages for Numba's performance, but I don't think it will differ too much.

注意:我用的是水蟒，而不是水蟒。对于Numba的性能，这两个包之间有一些区别，但是我认为它不会有太大的差别。

#2

The Numba method above is a neat trick, but will only be an advantage for relatively small N. An O(N^6) algorithm will kill you every time, regardless of how fast it is implemented. In my tests, the fftconvolve method crosses over to be faster around N=20, and by N=32 is 10x as fast. Leaving out the definition of custom_convolution above:

上面的Numba方法是一个很好的技巧，但是对于相对较小的N. O(N . 6)算法来说，它将会是一个优势，无论它执行的多快，它都会杀死你。在我的测试中，fftconvolve方法在N=20的情况下越快越好，而N=32则是10x。省去了上面的custom_卷积的定义:

from timeit import Timer
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import fftconvolve
from numba import jit, double

# Numba'ing the function with the JIT compiler
numba_convolution = jit(double[:, :, :](double[:, :, :],
                        double[:, :, :]))(custom_convolution)

def fft_convolution(A, B):
    return fftconvolve(A, B, mode='full')

if __name__ == '__main__':
    reps = 3
    nt, ft = [], []
    x = range(2, 34)
    for N in x:
        print N
        A = np.random.rand(N, N, N)
        B = np.random.rand(N, N, N)
        C1 = numba_convolution(A, B)
        C2 = fft_convolution(A, B)
        assert np.allclose(C1[:-1, :-1, :-1], C2)
        t = Timer(lambda: numba_convolution(A, B))
        nt.append(t.timeit(number=reps))
        t = Timer(lambda: fft_convolution(A, B))
        ft.append(t.timeit(number=reps))
    plt.plot(x, ft, x, nt)
    plt.show()

It is also very dependent on N, since the FFT will be a lot faster for powers of 2. The time for the FFT version is essentially constant for N=17 to N=32, and it is still faster at N=33, where it starts diverging quickly again.

它也非常依赖于N，因为对于2的幂，FFT的速度要快得多。FFT版本的时间基本上是N=17到N=32的常数，在N=33的时候，它的速度仍然更快。

You could try wrapping an FFT implementation in Numba, but you can't do that directly with the scipy version.

您可以尝试在Numba中包装FFT实现，但是不能直接使用scipy版本。

(Sorry to create a new answer, but I don't have the points to reply directly.)

(不好意思，我想给你一个新的答案，但我没有直接回复的要点。)

#3

The general rule is to use the correct algorithm for the job, which, unless the convolution kernel is short compared to the data, is an FFT based convolution (short roughly means less than log2(n) where n is the length of the data).

一般规则是使用正确的工作算法，除非卷积核与数据相比是短的，它是一个基于FFT的卷积(short粗略地表示小于log2(n)，其中n是数据的长度)。

In your case, since you're convolving two equal sizes datasets, you probably want to be considering an FFT based convolution.

在您的例子中，由于您正在将两个大小相等的数据集进行卷积，您可能想要考虑一个基于FFT的卷积。

Clearly, scipy.signal.fftconvolve is being a touch deficient in this regard. Using a faster FFT algorithm, you can do much better by rolling your own convolution routine (it doesn't help that fftconvolve forces the transform size to powers of two, otherwise it could be monkey patched).

显然,scipy.signal。在这方面，fftconvolve是一种缺陷。使用更快的FFT算法，你可以通过滚动你自己的卷积例程来做得更好(它不会帮助fftconvolve强制转换大小为2，否则它可能是猴子补丁)。

The following code uses pyfftw, my wrappers around FFTW and creates a custom convolution class, CustomFFTConvolution:

下面的代码使用pyfftw，我在FFTW周围的包装器，创建一个自定义的卷积类，customfft卷积:

class CustomFFTConvolution(object):

    def __init__(self, A, B, threads=1):

        shape = (np.array(A.shape) + np.array(B.shape))-1

        if np.iscomplexobj(A) and np.iscomplexobj(B):
            self.fft_A_obj = pyfftw.builders.fftn(
                    A, s=shape, threads=threads)
            self.fft_B_obj = pyfftw.builders.fftn(
                    B, s=shape, threads=threads)
            self.ifft_obj = pyfftw.builders.ifftn(
                    self.fft_A_obj.get_output_array(), s=shape,
                    threads=threads)

        else:
            self.fft_A_obj = pyfftw.builders.rfftn(
                    A, s=shape, threads=threads)
            self.fft_B_obj = pyfftw.builders.rfftn(
                    B, s=shape, threads=threads)
            self.ifft_obj = pyfftw.builders.irfftn(
                    self.fft_A_obj.get_output_array(), s=shape,
                    threads=threads)

    def __call__(self, A, B):

        fft_padded_A = self.fft_A_obj(A)
        fft_padded_B = self.fft_B_obj(B)

        return self.ifft_obj(fft_padded_A * fft_padded_B)

This is used as:

这是用作:

custom_fft_conv = CustomFFTConvolution(A, B)
C = custom_fft_conv(A, B) # This can contain different values to during construction

with an optional threads argument when constructing the class. The purpose of creating a class is to benefit from the ability of FFTW to plan the transform in advance.

在构造类时使用可选的线程参数。创建一个类的目的是为了从FFTW的能力中获益，从而提前规划转换。

The full demo code below simply extends @Kelsey's answer for the timing and so on.

下面的完整演示代码简单地扩展了@Kelsey对时间的回答等等。

The speedup is substantial over both the numba solution and the vanilla fftconvolve solution. For n = 33, it's about 40-45x faster than both.

在numba解决方案和vanilla fftconvolve解决方案中，加速都是非常重要的。对于n = 33，它比两者都快40-45x。

from timeit import Timer
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import fftconvolve
from numba import jit, double
import pyfftw

# Original code
def custom_convolution(A, B):

    dimA = A.shape[0]
    dimB = B.shape[0]
    dimC = dimA + dimB

    C = np.zeros((dimC, dimC, dimC))
    for x1 in range(dimA):
        for x2 in range(dimB):
            for y1 in range(dimA):
                for y2 in range(dimB):
                    for z1 in range(dimA):
                        for z2 in range(dimB):
                            x = x1 + x2
                            y = y1 + y2
                            z = z1 + z2
                            C[x, y, z] += A[x1, y1, z1] * B[x2, y2, z2]
    return C

# Numba'ing the function with the JIT compiler
numba_convolution = jit(double[:, :, :](double[:, :, :],
                        double[:, :, :]))(custom_convolution)

def fft_convolution(A, B):
    return fftconvolve(A, B, mode='full')

class CustomFFTConvolution(object):

    def __init__(self, A, B, threads=1):

        shape = (np.array(A.shape) + np.array(B.shape))-1

        if np.iscomplexobj(A) and np.iscomplexobj(B):
            self.fft_A_obj = pyfftw.builders.fftn(
                    A, s=shape, threads=threads)
            self.fft_B_obj = pyfftw.builders.fftn(
                    B, s=shape, threads=threads)
            self.ifft_obj = pyfftw.builders.ifftn(
                    self.fft_A_obj.get_output_array(), s=shape,
                    threads=threads)

        else:
            self.fft_A_obj = pyfftw.builders.rfftn(
                    A, s=shape, threads=threads)
            self.fft_B_obj = pyfftw.builders.rfftn(
                    B, s=shape, threads=threads)
            self.ifft_obj = pyfftw.builders.irfftn(
                    self.fft_A_obj.get_output_array(), s=shape,
                    threads=threads)

    def __call__(self, A, B):

        fft_padded_A = self.fft_A_obj(A)
        fft_padded_B = self.fft_B_obj(B)

        return self.ifft_obj(fft_padded_A * fft_padded_B)

def run_test():
    reps = 10
    nt, ft, cft, cft2 = [], [], [], []
    x = range(2, 34)

    for N in x:
        print N
        A = np.random.rand(N, N, N)
        B = np.random.rand(N, N, N)

        custom_fft_conv = CustomFFTConvolution(A, B)
        custom_fft_conv_nthreads = CustomFFTConvolution(A, B, threads=2)

        C1 = numba_convolution(A, B)
        C2 = fft_convolution(A, B)
        C3 = custom_fft_conv(A, B)
        C4 = custom_fft_conv_nthreads(A, B)

        assert np.allclose(C1[:-1, :-1, :-1], C2)
        assert np.allclose(C1[:-1, :-1, :-1], C3)
        assert np.allclose(C1[:-1, :-1, :-1], C4)

        t = Timer(lambda: numba_convolution(A, B))
        nt.append(t.timeit(number=reps))
        t = Timer(lambda: fft_convolution(A, B))
        ft.append(t.timeit(number=reps))
        t = Timer(lambda: custom_fft_conv(A, B))
        cft.append(t.timeit(number=reps))
        t = Timer(lambda: custom_fft_conv_nthreads(A, B))
        cft2.append(t.timeit(number=reps))

    plt.plot(x, ft, label='scipy.signal.fftconvolve')
    plt.plot(x, nt, label='custom numba convolve')
    plt.plot(x, cft, label='custom pyfftw convolve')
    plt.plot(x, cft2, label='custom pyfftw convolve with threading')        
    plt.legend()
    plt.show()

if __name__ == '__main__':
    run_test()

EDIT: More recent scipy does a better job of not always padding to powers of 2 length so is closer in output to the pyFFTW case.

编辑:最近的scipy做的更好的工作不是总是填充到2的长度，所以更接近于pyFFTW案例的输出。

#1