在Python / Numpy中矢量化牛顿方法

I am trying to figure out if Python/Numpy is a viable alternative to develop my numerical software which is already available in C++. In order to get performance in Python/Numpy, one need to "vectorize" the code. But it turns out that as soon as I move away from very simple examples, I struggle to vectorize the code (I am not talking about SIMD instructions but "efficient Numpy code" without loops). Here is an algorithm that I want to get efficiently in Python/Numpy.

我试图弄清楚Python / Numpy是否是开发我的数字软件的可行替代方案,该软件已经在C ++中提供。为了在Python / Numpy中获得性能,需要“向量化”代码。但事实证明,一旦我摆脱了非常简单的例子,我就很难对代码进行矢量化(我不是在讨论SIMD指令,而是在没有循环的情况下使用“高效的Numpy代码”)。这是一个我希望在Python / Numpy中有效获得的算法。

Create an numpy array containing: 1.0, 1.0 + 1/n, 1.0 + 2/n, ..., 2.0

创建一个numpy数组,包含:1.0,1.0 + 1 / n,1.0 + 2 / n,...,2.0

For every u in the array, compute the root of x^2 - u, using a Newton method, stopping when |dx| <= 1.0e-7. Store the result in an array result.

对于数组中的每个u,使用牛顿方法计算x ^ 2 - u的根,在| dx |时停止<= 1.0e-7。将结果存储在数组结果中。

Sum all the elements of the result array

求和结果数组的所有元素

Here is the algorithm in Python I want to speed up

这是我想加速的Python算法

import numpy as np

n = 1000000
data = np.arange(1.0, 2.0, 1.0 / n)

def newton(u):
  x = 2.0
  while True:
    f = x**2 - u
    df_dx = 2 * x
    dx = f / df_dx
    if (abs(dx) <= 1.0e-7):
      break
    x -= dx
  return x

  result = map(newton, data)

  print result[n - 1]

Here is a version of the algorithm in C++11

这是C ++ 11中的算法版本

#include <iostream>
#include <vector>
#include <cmath>

int main (int argc, char const *argv[]) {
  auto n = std::size_t{100000000};

  auto v = std::vector<double>(n + 1);
  for(size_t k = 0; k < v.size(); ++k) {
    v[k] = 1.0 + static_cast<double>(k) / n;
  }

  auto result = std::vector<double>(n + 1);
  for(size_t k = 0; k < v.size(); ++k) {
    auto x = double{2.0};
    while(true) {
      auto f = double{x * x - v[k]};
      auto df_dx = double{2 * x};
      auto dx = double{f / df_dx};
      if (std::abs(dx) <= 1.0e-7) {
        break;
      }
      x -= dx;
    }
    result[k] = x;
  }

  auto somme = double{0.0};
  for(size_t k = 0; k < result.size(); ++k) {
    somme += result[k];
  }

  std::cout << somme << std::endl;
  return 0;
}

It takes 2.9 seconds to run on my machine. Is there a way to make a fast Python/Numpy algorithm that does the same thing (I am willing to get something that is less than 5 times slower).

在我的机器上运行需要2.9秒。有没有办法制作一个快速的Python / Numpy算法来做同样的事情(我愿意得到一些不到5倍的东西)。

Thanks.

3 个解决方案

#1

You can do step 1. with numpy efficiently:

你可以有效地使用numpy执行第1步:

1.0 + np.arange(n + 1) / n

however I think you would need the np.vectorize() method to feed back x into your calculated values and it's not an efficient function (basically a wrapper for a python loop). If you can use scipy then there are built in methods that might do what you want http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.newton.html

但是我认为你需要使用np.vectorize()方法将x反馈到你的计算值中,它不是一个有效的函数(基本上是python循环的包装器)。如果你可以使用scipy那么内置的方法可能会做你想要的http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.newton.html

EDIT: Having thought a bit more about this I followed up on @ev-br's point and tried some alternatives. The masking uses too much processing but the abs().max() is pretty fast so a compromise might be to "divide the problem into blocks" both in the 1st dimension of the array and in iteration direction. The following doesn't do too badly (< 20s) on my pretty low power laptop - certainly much faster than np.vectorize() or any of the scipy solving systems I could find. (If I set m too big it runs out of something (memory?) and grinds to a complete halt!)

编辑:考虑到这一点后,我跟进了@ ev-br的观点并尝试了一些替代方案。掩码使用了太多的处理但是abs()。max()非常快,因此折衷可能是在数组的第一维和迭代方向上“将问题划分为块”。以下在我的低功耗笔记本电脑上做得不是太差(<20s) - 肯定比np.vectorize()或我能找到的任何scipy解决系统快得多。 (如果我将m设置得太大,那就会耗尽某些东西(内存?)并完全停止!)

n = 100000000
m = 5000000

block = 3
u = 1.0 + np.arange(n + 1) / n
x = np.full(u.shape, 2.0)
dx = np.ones(u.shape)

for i in range(0, n, m):
  while np.abs(dx[i:i+m]).max() > 1.0e-7:
    for j in range(block):
      dx[i:i+m] = (x[i:i+m] ** 2 - u[i:i+m]) / (2 * x[i:i+m])
      x[i:i+m] -= dx[i:i+m]

#2

Here's a toy example. Notice that often vectorization means writing your code as if you're manipulating numbers, and letting numpy do its magic:

这是一个玩具的例子。请注意,通常矢量化意味着编写代码就好像你在操纵数字一样,并让numpy发挥其魔力:

>>> import numpy as np
>>> a = np.array([1., 2., 3.])
>>> def f(x):
...    return x**2 - a, 2.*x    # function and derivative
>>>
>>> def newt(f, x0):
...    x = np.asarray(x0)
...    for _ in range(5):    # hardcode the number of iterations (I know)
...        v, dv = f(x)
...        x -=  v / dv
...    return x
>>> 
>>> newt(f, [1., 1., 1.])
array([ 1.        ,  1.41421356,  1.73205081])

If this is a performance bottleneck, this is unlikely to be competetive with hand-written C++ code: First of all, you're manipulating python objects with all the overhead; then numpy is likely doing a bunch of array allocations under the hood.

如果这是一个性能瓶颈,这不太可能与手写的C ++代码竞争:首先,你正在操纵python对象的所有开销;然后numpy可能会在引擎盖下做一堆数组分配。

An often viable strategy is to start by writing things in python/numpy, and then move bottlenecks into a compiled code --- eg Cython or C++ wrapped by Cython. In this particular case since you already have the C++ code, just wrapping it with Cython is likely easiest but YMMV.

一个经常可行的策略是首先在python / numpy中编写东西,然后将瓶颈移动到编译代码中 - 例如Cython或C ++包装的Cython。在这种特殊情况下,因为你已经有了C ++代码,所以用Cython包装它可能是最容易的但是YMMV。

#3

I'm not looking to wave small snippets of code as a solution, but here's something to get you started. I have a strong suspicion that you're having troubles just declaring such an array in python without spending too much time on it, so I'll mostly help you out there.

我不打算将小代码片段作为解决方案,但这里有一些东西可以帮助您入门。我有一种强烈的怀疑,你只是在python中声明这样一个阵列而不花太多时间就会遇到麻烦,所以我会主要帮助你。

As far as the square roots come in, please add your example python code and I'll see what I can help optimize from that point on. In my example roots and sums are found with the default numpy functions/methods.

就平方根来说,请添加你的示例python代码,我会看到从那时起我可以帮助优化的内容。在我的例子中,使用默认的numpy函数/方法找到了根和总和。

def summing():
    n = 1000000
    ar = np.arange(0, n)
    ar = ar/float(n)
    ar = ar + np.ones(n)
    sqrt = np.sqrt(ar)
    return np.sum(ar)

In short, to get the starting array it's best to use a "workaround".

简而言之,要获得起始阵列,最好使用“解决方法”。

initialize an array ar with values `[1,2,3,....n]

使用值“[1,2,3,... n]初始化数组ar

divide ar with n. This gets us the 1/n, 2/n ... members

将ar与n分开。这让我们成为1 / n,2 / n ......成员

add to that an array of same dimensions that contain just the number 1.0 This gets us the full array [ 1., 1.000001, 1.000002, ..., 1.999998, 1.999999]) we're after. If I understood you right.

添加一个包含数字1.0的相同维度的数组这将得到我们完整的数组[1.,1.000001,1.000002,...,1.999998,1.999999]。如果我理解你的话。

find square roots, sum it

找到平方根,总结一下

Average of 10 sequential execution times is 0.018786 seconds.

10个连续执行时间的平均值为0.018786秒。

#1