I am trying to figure out if Python/Numpy is a viable alternative to develop my numerical software which is already available in C++. In order to get performance in Python/Numpy, one need to "vectorize" the code. But it turns out that as soon as I move away from very simple examples, I struggle to vectorize the code (I am not talking about SIMD instructions but "efficient Numpy code" without loops). Here is an algorithm that I want to get efficiently in Python/Numpy.
我试图弄清楚Python / Numpy是否是开发我的数字软件的可行替代方案,该软件已经在C ++中提供。为了在Python / Numpy中获得性能,需要“向量化”代码。但事实证明,一旦我摆脱了非常简单的例子,我就很难对代码进行矢量化(我不是在讨论SIMD指令,而是在没有循环的情况下使用“高效的Numpy代码”)。这是一个我希望在Python / Numpy中有效获得的算法。
- Create an numpy array containing: 1.0, 1.0 + 1/n, 1.0 + 2/n, ..., 2.0
- For every u in the array, compute the root of x^2 - u, using a Newton method, stopping when |dx| <= 1.0e-7. Store the result in an array result.
- Sum all the elements of the result array
创建一个numpy数组,包含:1.0,1.0 + 1 / n,1.0 + 2 / n,...,2.0
对于数组中的每个u,使用牛顿方法计算x ^ 2 - u的根,在| dx |时停止<= 1.0e-7。将结果存储在数组结果中。
求和结果数组的所有元素
Here is the algorithm in Python I want to speed up
这是我想加速的Python算法
import numpy as np
n = 1000000
data = np.arange(1.0, 2.0, 1.0 / n)
def newton(u):
x = 2.0
while True:
f = x**2 - u
df_dx = 2 * x
dx = f / df_dx
if (abs(dx) <= 1.0e-7):
break
x -= dx
return x
result = map(newton, data)
print result[n - 1]
Here is a version of the algorithm in C++11
这是C ++ 11中的算法版本
#include <iostream>
#include <vector>
#include <cmath>
int main (int argc, char const *argv[]) {
auto n = std::size_t{100000000};
auto v = std::vector<double>(n + 1);
for(size_t k = 0; k < v.size(); ++k) {
v[k] = 1.0 + static_cast<double>(k) / n;
}
auto result = std::vector<double>(n + 1);
for(size_t k = 0; k < v.size(); ++k) {
auto x = double{2.0};
while(true) {
auto f = double{x * x - v[k]};
auto df_dx = double{2 * x};
auto dx = double{f / df_dx};
if (std::abs(dx) <= 1.0e-7) {
break;
}
x -= dx;
}
result[k] = x;
}
auto somme = double{0.0};
for(size_t k = 0; k < result.size(); ++k) {
somme += result[k];
}
std::cout << somme << std::endl;
return 0;
}
It takes 2.9 seconds to run on my machine. Is there a way to make a fast Python/Numpy algorithm that does the same thing (I am willing to get something that is less than 5 times slower).
在我的机器上运行需要2.9秒。有没有办法制作一个快速的Python / Numpy算法来做同样的事情(我愿意得到一些不到5倍的东西)。
Thanks.
3 个解决方案
#1
You can do step 1. with numpy efficiently:
你可以有效地使用numpy执行第1步:
1.0 + np.arange(n + 1) / n
however I think you would need the np.vectorize() method to feed back x into your calculated values and it's not an efficient function (basically a wrapper for a python loop). If you can use scipy then there are built in methods that might do what you want http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.newton.html
但是我认为你需要使用np.vectorize()方法将x反馈到你的计算值中,它不是一个有效的函数(基本上是python循环的包装器)。如果你可以使用scipy那么内置的方法可能会做你想要的http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.newton.html
EDIT: Having thought a bit more about this I followed up on @ev-br's point and tried some alternatives. The masking uses too much processing but the abs().max() is pretty fast so a compromise might be to "divide the problem into blocks" both in the 1st dimension of the array and in iteration direction. The following doesn't do too badly (< 20s) on my pretty low power laptop - certainly much faster than np.vectorize() or any of the scipy solving systems I could find. (If I set m too big it runs out of something (memory?) and grinds to a complete halt!)
编辑:考虑到这一点后,我跟进了@ ev-br的观点并尝试了一些替代方案。掩码使用了太多的处理但是abs()。max()非常快,因此折衷可能是在数组的第一维和迭代方向上“将问题划分为块”。以下在我的低功耗笔记本电脑上做得不是太差(<20s) - 肯定比np.vectorize()或我能找到的任何scipy解决系统快得多。 (如果我将m设置得太大,那就会耗尽某些东西(内存?)并完全停止!)
n = 100000000
m = 5000000
block = 3
u = 1.0 + np.arange(n + 1) / n
x = np.full(u.shape, 2.0)
dx = np.ones(u.shape)
for i in range(0, n, m):
while np.abs(dx[i:i+m]).max() > 1.0e-7:
for j in range(block):
dx[i:i+m] = (x[i:i+m] ** 2 - u[i:i+m]) / (2 * x[i:i+m])
x[i:i+m] -= dx[i:i+m]
#2
Here's a toy example. Notice that often vectorization means writing your code as if you're manipulating numbers, and letting numpy do its magic:
这是一个玩具的例子。请注意,通常矢量化意味着编写代码就好像你在操纵数字一样,并让numpy发挥其魔力:
>>> import numpy as np
>>> a = np.array([1., 2., 3.])
>>> def f(x):
... return x**2 - a, 2.*x # function and derivative
>>>
>>> def newt(f, x0):
... x = np.asarray(x0)
... for _ in range(5): # hardcode the number of iterations (I know)
... v, dv = f(x)
... x -= v / dv
... return x
>>>
>>> newt(f, [1., 1., 1.])
array([ 1. , 1.41421356, 1.73205081])
If this is a performance bottleneck, this is unlikely to be competetive with hand-written C++ code: First of all, you're manipulating python objects with all the overhead; then numpy is likely doing a bunch of array allocations under the hood.
如果这是一个性能瓶颈,这不太可能与手写的C ++代码竞争:首先,你正在操纵python对象的所有开销;然后numpy可能会在引擎盖下做一堆数组分配。
An often viable strategy is to start by writing things in python/numpy, and then move bottlenecks into a compiled code --- eg Cython or C++ wrapped by Cython. In this particular case since you already have the C++ code, just wrapping it with Cython is likely easiest but YMMV.
一个经常可行的策略是首先在python / numpy中编写东西,然后将瓶颈移动到编译代码中 - 例如Cython或C ++包装的Cython。在这种特殊情况下,因为你已经有了C ++代码,所以用Cython包装它可能是最容易的但是YMMV。
#3
I'm not looking to wave small snippets of code as a solution, but here's something to get you started. I have a strong suspicion that you're having troubles just declaring such an array in python without spending too much time on it, so I'll mostly help you out there.
我不打算将小代码片段作为解决方案,但这里有一些东西可以帮助您入门。我有一种强烈的怀疑,你只是在python中声明这样一个阵列而不花太多时间就会遇到麻烦,所以我会主要帮助你。
As far as the square roots come in, please add your example python code and I'll see what I can help optimize from that point on. In my example roots and sums are found with the default numpy functions/methods.
就平方根来说,请添加你的示例python代码,我会看到从那时起我可以帮助优化的内容。在我的例子中,使用默认的numpy函数/方法找到了根和总和。
def summing():
n = 1000000
ar = np.arange(0, n)
ar = ar/float(n)
ar = ar + np.ones(n)
sqrt = np.sqrt(ar)
return np.sum(ar)
In short, to get the starting array it's best to use a "workaround".
简而言之,要获得起始阵列,最好使用“解决方法”。
- initialize an array
ar
with values `[1,2,3,....n] - divide
ar
withn
. This gets us the1/n, 2/n ...
members - add to that an array of same dimensions that contain just the number
1.0
This gets us the full array[ 1., 1.000001, 1.000002, ..., 1.999998, 1.999999])
we're after. If I understood you right. - find square roots, sum it
使用值“[1,2,3,... n]初始化数组ar
将ar与n分开。这让我们成为1 / n,2 / n ......成员
添加一个包含数字1.0的相同维度的数组这将得到我们完整的数组[1.,1.000001,1.000002,...,1.999998,1.999999]。如果我理解你的话。
找到平方根,总结一下
Average of 10 sequential execution times is 0.018786
seconds.
10个连续执行时间的平均值为0.018786秒。
#1
You can do step 1. with numpy efficiently:
你可以有效地使用numpy执行第1步:
1.0 + np.arange(n + 1) / n
however I think you would need the np.vectorize() method to feed back x into your calculated values and it's not an efficient function (basically a wrapper for a python loop). If you can use scipy then there are built in methods that might do what you want http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.newton.html
但是我认为你需要使用np.vectorize()方法将x反馈到你的计算值中,它不是一个有效的函数(基本上是python循环的包装器)。如果你可以使用scipy那么内置的方法可能会做你想要的http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.newton.html
EDIT: Having thought a bit more about this I followed up on @ev-br's point and tried some alternatives. The masking uses too much processing but the abs().max() is pretty fast so a compromise might be to "divide the problem into blocks" both in the 1st dimension of the array and in iteration direction. The following doesn't do too badly (< 20s) on my pretty low power laptop - certainly much faster than np.vectorize() or any of the scipy solving systems I could find. (If I set m too big it runs out of something (memory?) and grinds to a complete halt!)
编辑:考虑到这一点后,我跟进了@ ev-br的观点并尝试了一些替代方案。掩码使用了太多的处理但是abs()。max()非常快,因此折衷可能是在数组的第一维和迭代方向上“将问题划分为块”。以下在我的低功耗笔记本电脑上做得不是太差(<20s) - 肯定比np.vectorize()或我能找到的任何scipy解决系统快得多。 (如果我将m设置得太大,那就会耗尽某些东西(内存?)并完全停止!)
n = 100000000
m = 5000000
block = 3
u = 1.0 + np.arange(n + 1) / n
x = np.full(u.shape, 2.0)
dx = np.ones(u.shape)
for i in range(0, n, m):
while np.abs(dx[i:i+m]).max() > 1.0e-7:
for j in range(block):
dx[i:i+m] = (x[i:i+m] ** 2 - u[i:i+m]) / (2 * x[i:i+m])
x[i:i+m] -= dx[i:i+m]
#2
Here's a toy example. Notice that often vectorization means writing your code as if you're manipulating numbers, and letting numpy do its magic:
这是一个玩具的例子。请注意,通常矢量化意味着编写代码就好像你在操纵数字一样,并让numpy发挥其魔力:
>>> import numpy as np
>>> a = np.array([1., 2., 3.])
>>> def f(x):
... return x**2 - a, 2.*x # function and derivative
>>>
>>> def newt(f, x0):
... x = np.asarray(x0)
... for _ in range(5): # hardcode the number of iterations (I know)
... v, dv = f(x)
... x -= v / dv
... return x
>>>
>>> newt(f, [1., 1., 1.])
array([ 1. , 1.41421356, 1.73205081])
If this is a performance bottleneck, this is unlikely to be competetive with hand-written C++ code: First of all, you're manipulating python objects with all the overhead; then numpy is likely doing a bunch of array allocations under the hood.
如果这是一个性能瓶颈,这不太可能与手写的C ++代码竞争:首先,你正在操纵python对象的所有开销;然后numpy可能会在引擎盖下做一堆数组分配。
An often viable strategy is to start by writing things in python/numpy, and then move bottlenecks into a compiled code --- eg Cython or C++ wrapped by Cython. In this particular case since you already have the C++ code, just wrapping it with Cython is likely easiest but YMMV.
一个经常可行的策略是首先在python / numpy中编写东西,然后将瓶颈移动到编译代码中 - 例如Cython或C ++包装的Cython。在这种特殊情况下,因为你已经有了C ++代码,所以用Cython包装它可能是最容易的但是YMMV。
#3
I'm not looking to wave small snippets of code as a solution, but here's something to get you started. I have a strong suspicion that you're having troubles just declaring such an array in python without spending too much time on it, so I'll mostly help you out there.
我不打算将小代码片段作为解决方案,但这里有一些东西可以帮助您入门。我有一种强烈的怀疑,你只是在python中声明这样一个阵列而不花太多时间就会遇到麻烦,所以我会主要帮助你。
As far as the square roots come in, please add your example python code and I'll see what I can help optimize from that point on. In my example roots and sums are found with the default numpy functions/methods.
就平方根来说,请添加你的示例python代码,我会看到从那时起我可以帮助优化的内容。在我的例子中,使用默认的numpy函数/方法找到了根和总和。
def summing():
n = 1000000
ar = np.arange(0, n)
ar = ar/float(n)
ar = ar + np.ones(n)
sqrt = np.sqrt(ar)
return np.sum(ar)
In short, to get the starting array it's best to use a "workaround".
简而言之,要获得起始阵列,最好使用“解决方法”。
- initialize an array
ar
with values `[1,2,3,....n] - divide
ar
withn
. This gets us the1/n, 2/n ...
members - add to that an array of same dimensions that contain just the number
1.0
This gets us the full array[ 1., 1.000001, 1.000002, ..., 1.999998, 1.999999])
we're after. If I understood you right. - find square roots, sum it
使用值“[1,2,3,... n]初始化数组ar
将ar与n分开。这让我们成为1 / n,2 / n ......成员
添加一个包含数字1.0的相同维度的数组这将得到我们完整的数组[1.,1.000001,1.000002,...,1.999998,1.999999]。如果我理解你的话。
找到平方根,总结一下
Average of 10 sequential execution times is 0.018786
seconds.
10个连续执行时间的平均值为0.018786秒。