I have two pieces of code that do almost the same thing.
code 1:
from __future__ import division
import numpy as np

m = 1
gamma = 1
lam = 1
alpha = 1
step_num = 2 ** 16
dt = 0.02

def E_and_x(x0):
    xi = x0
    vi = 0
    f = 0
    xsum = 0
    Ei, xavg = 0, 0
    for i in range(step_num):
        vi += f / m * dt / 2
        xi += vi * dt
        f = - gamma * xi - lam * xi ** 2 - alpha * xi ** 3
        vi += f / m * dt / 2
        Ei = 1 / 2 * m * vi ** 2 + 1 / 2 * gamma * xi ** 2 + \
            1 / 3 * lam * xi ** 3 + 1 / 4 * alpha * xi ** 4
        xsum += xi
        xavg = xsum / (i + 1)
    return Ei, xavg

E, x = [], []
for x0 in np.linspace(0, 1, 40):
    mdresult = E_and_x(x0)
    E.append(mdresult[0])
    x.append(mdresult[1])
code 2:
from __future__ import division
import numpy as np
from numba import jit

time = 50
niter = 2 ** 16  # number of iterations
t = np.linspace(0, time, num=niter, endpoint=True)

class MolecularDynamics(object):
    def __init__(self, time, niter, initial_pos):
        self.position = np.array([])
        self.velocity = np.array([])
        self.energy = np.array([])
        self.x_average = np.array([])
        self.vel = 0  # intermediate variable
        self.force = 0  # intermediate variable
        self.e = 0  # intermediate energy
        self.initial_pos = initial_pos  # initial position
        self.pos = self.initial_pos
        self.time = time
        self.niter = niter
        self.time_step = self.time / self.niter
        self.mass = 1
        self.k = 1  # stiffness coefficient
        self.lamb = 1  # lambda
        self.alpha = 1  # quartic coefficient

    @jit
    def iter(self):
        for i in np.arange(niter):
            # step 1 of leap frog
            self.vel += self.time_step / 2.0 * self.force / self.mass
            self.pos += self.time_step * self.vel
            # step 2 of leap frog
            self.force = - self.k * self.pos - self.lamb * self.pos ** 2 - self.alpha * self.pos ** 3
            self.vel += self.time_step / 2.0 * self.force / self.mass
            # calculate energy
            self.e = 1 / 2 * self.mass * self.vel ** 2 + \
                1 / 2 * self.k * self.pos ** 2 + \
                1 / 3 * self.lamb * self.pos ** 3 + \
                1 / 4 * self.alpha * self.pos ** 4
            self.velocity = np.append(self.velocity, [self.vel])  # record vel after 1 time step
            self.position = np.append(self.position, self.pos)  # record pos after 1 time step
            self.energy = np.append(self.energy, [self.e])  # record e after 1 time step
            self.x_average = np.append(self.x_average, np.sum(self.position) / (i + 1))

mds = [MolecularDynamics(time, niter, xx) for xx in np.linspace(0, 1, num=40)]
[md.iter() for md in mds]  # loop to change value
mds_x_avg = [md.x_average[-1] for md in mds]
mds_e = [md.e for md in mds]
Well, the main difference is that code 2 uses OO, NumPy, and JIT. However, code 2 is much, much slower than code 1 (it takes several minutes to compute).
In [1]: %timeit code_1
10000000 loops, best of 3: 25.7 ns per loop
By profiling I know the bottleneck is the iter() function, and more specifically the calls to append and sum. But using NumPy is as far as I know how to go. I wonder why code 2 is so much slower, and how can I speed it up?
2 Answers
#1 (score: 3)
In addition to what MSeifert said, you can preallocate the arrays to the correct size instead of appending to them. So instead of creating them like this:
self.position = np.array([]) # No
you would write:
self.position = np.zeros(niter) # Yes
Then instead of appending like this:
self.velocity = np.append(self.velocity, [self.vel])
you would fill in like this:
self.velocity[i] = self.vel
This avoids reallocating the arrays at each iteration. (You can do exactly the same with raw Python lists, by the way, using array = [someValue] * size.)
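Put together, a minimal sketch of what code 2 could look like with preallocation. This is an assumption about how the class might be restructured, not code from the question; the class name MolecularDynamicsPrealloc is made up, and it also keeps a running sum for the average (the way code 1 does) instead of calling np.sum over the whole history at every step:

from __future__ import division
import numpy as np

# Sketch only: preallocated arrays + running sum, otherwise the same physics as code 2
class MolecularDynamicsPrealloc(object):
    def __init__(self, time, niter, initial_pos):
        self.niter = niter
        self.time_step = time / niter
        self.mass = 1
        self.k = 1      # stiffness coefficient
        self.lamb = 1   # lambda
        self.alpha = 1  # quartic coefficient
        self.pos = initial_pos
        self.vel = 0.0
        self.force = 0.0
        self.e = 0.0
        # preallocate to the final size instead of starting from empty arrays
        self.position = np.zeros(niter)
        self.velocity = np.zeros(niter)
        self.energy = np.zeros(niter)
        self.x_average = np.zeros(niter)

    def iter(self):
        x_sum = 0.0  # running sum avoids the O(i) np.sum over the whole history each step
        for i in range(self.niter):
            self.vel += self.time_step / 2.0 * self.force / self.mass
            self.pos += self.time_step * self.vel
            self.force = - self.k * self.pos - self.lamb * self.pos ** 2 - self.alpha * self.pos ** 3
            self.vel += self.time_step / 2.0 * self.force / self.mass
            self.e = 1 / 2 * self.mass * self.vel ** 2 + \
                1 / 2 * self.k * self.pos ** 2 + \
                1 / 3 * self.lamb * self.pos ** 3 + \
                1 / 4 * self.alpha * self.pos ** 4
            # fill the preallocated slots instead of appending
            self.velocity[i] = self.vel
            self.position[i] = self.pos
            self.energy[i] = self.e
            x_sum += self.pos
            self.x_average[i] = x_sum / (i + 1)

It is used the same way as before, e.g. md = MolecularDynamicsPrealloc(time, niter, x0); md.iter().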
Vectorizability
I went on wondering about the vectorizability of the OP's algorithm. It seems that it isn't vectorizable. To quote the Cornell Virtual Workshop:
Read after write ("flow") dependency. This kind of dependency is not vectorizable. It occurs when the values of variables in a particular loop iteration (the "read") are determined by a previous loop iteration (the "write").
The "flow" dependency is seen in the loop, where several members' values are determined by the previous state of that same member. For example:
self.vel += self.time_step / 2.0 * self.force / self.mass
Here, self.force is from the previous iteration and was computed from the previous self.vel.
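To make that concrete, here is a small standalone illustration (simplified to just the linear force term, with made-up variable names): every iteration reads what the previous one wrote, so there is no single array expression (cumsum, broadcasting, etc.) that produces all the positions at once.

import numpy as np

# illustration only: shortened leap-frog with just the linear force term;
# each iteration reads values written by the previous one ("read after write"),
# so the loop cannot be collapsed into one vectorized array expression
n, dt, m, k = 5, 0.02, 1.0, 1.0
x = np.empty(n)
v = np.empty(n)
f = np.empty(n)
x[0], v[0] = 1.0, 0.0
f[0] = -k * x[0]
for i in range(1, n):
    v_half = v[i - 1] + f[i - 1] / m * dt / 2   # reads f[i-1] written last iteration
    x[i] = x[i - 1] + v_half * dt               # reads x[i-1]
    f[i] = -k * x[i]
    v[i] = v_half + f[i] / m * dt / 2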
#2 (score: 4)
You did something wrong with your timings. Just testing your first code (slightly modified):
from __future__ import division

def E_and_x(x0):
    m = 1
    gamma = 1
    lam = 1
    alpha = 1
    step_num = 2 ** 13  # much fewer iterations!
    dt = 0.02
    xi = x0
    vi = 0
    f = 0
    xsum = 0
    Ei, xavg = 0, 0
    for i in range(step_num):
        vi += f / m * dt / 2
        xi += vi * dt
        f = - gamma * xi - lam * xi ** 2 - alpha * xi ** 3
        vi += f / m * dt / 2
        Ei = 1 / 2 * m * vi ** 2 + 1 / 2 * gamma * xi ** 2 + \
            1 / 3 * lam * xi ** 3 + 1 / 4 * alpha * xi ** 4
        xsum += xi
        xavg = xsum / (i + 1)
    return Ei, xavg
The timings are not in the nanosecond regime:
%timeit [E_and_x(x0) for x0 in np.linspace(0, 1, 40)] # 1 loop, best of 3: 3.46 s per loop
However, if numba is an option, I would definitely advise JIT-compiling the E_and_x function:
import numba as nb
numbaE_and_x = nb.njit(E_and_x)
numbaE_and_x(1.2) # warmup for the jit
%timeit [numbaE_and_x(x0) for x0 in np.linspace(0, 1, 40)] # 100 loops, best of 3: 3.38 ms per loop
It's already roughly a thousand times faster (3.38 ms vs. 3.46 s). If you run the first code with PyPy (or Cythonize it) you should get similar results.
Apart from that:
- np.append is a terrible option, because np.append, np.concatenate, and np.stack (all variants) need to allocate a new array and copy all the other arrays into it! And you don't do anything with these arrays except append to them, so the only thing you do with NumPy is something that NumPy is really bad at!
- Check if you can vectorize any of the calculations using numpy-arrays (without append, etc.).
- And if it's still too slow: all these self.xxx attribute accesses are slow; better to read them into locals once and set them again afterwards (see the sketch after this list).
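For that last point, a minimal sketch of how iter() from code 2 might look with the attributes read into locals before the loop and the scalar state written back once at the end. The local names are mine, it assumes the preallocated arrays from the first answer, and it is meant to replace iter() inside the class shown in the question:

def iter(self):
    # read instance attributes into locals once; repeated self.xxx lookups inside
    # the hot loop are comparatively expensive
    pos, vel, force, e = self.pos, self.vel, self.force, self.e
    dt, m = self.time_step, self.mass
    k, lamb, alpha = self.k, self.lamb, self.alpha
    position, velocity = self.position, self.velocity  # preallocated arrays,
    energy, x_average = self.energy, self.x_average    # see the first answer
    x_sum = 0.0
    for i in range(self.niter):
        vel += dt / 2.0 * force / m
        pos += dt * vel
        force = - k * pos - lamb * pos ** 2 - alpha * pos ** 3
        vel += dt / 2.0 * force / m
        e = 1 / 2 * m * vel ** 2 + 1 / 2 * k * pos ** 2 + \
            1 / 3 * lamb * pos ** 3 + 1 / 4 * alpha * pos ** 4
        velocity[i] = vel
        position[i] = pos
        energy[i] = e
        x_sum += pos
        x_average[i] = x_sum / (i + 1)
    # write the scalar state back to the instance once, after the loop
    self.pos, self.vel, self.force, self.e = pos, vel, force, e

This style (plain locals and preallocated arrays, no append) is also the kind of code that tends to JIT well, which is why jitting the plain E_and_x function above gives such a large speedup.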