I have a square matrix S (160 x 160), and a huge matrix X (160 x 250000). Both are dense numpy arrays.
我有一个方阵S (160 x 160)和一个巨大的矩阵x (160 x 250000)。两者都是密集的numpy数组。
My goal: find Q such that Q = inv(chol(S)) * X, where chol(S) is the lower cholesky factorization of S.
我的目标是:找到这样的Q: Q = inv(chol(S)) * X,其中chol(S)是S的较低的cholesky分解。
Naturally, a simple solution is
当然,一个简单的解决方案是。
cholS = scipy.linalg.cholesky( S, lower=True)
scipy.linalg.solve( cholS, X )
My problem: this solution is noticeably slower (>2x) in python than when I try the same in Matlab. Here are some timing experiments:
我的问题是:在python中,这个解决方案比在Matlab中尝试相同的方法要慢得多(>2x)。以下是一些计时实验:
timeit np.linalg.solve( cholS, X)
1 loops, best of 3: 1.63 s per loop
timeit scipy.linalg.solve_triangular( cholS, X, lower=True)
1 loops, best of 3: 2.19 s per loop
timeit scipy.linalg.solve( cholS, X)
1 loops, best of 3: 2.81 s per loop
[matlab]
cholS \ X
0.675 s
[matlab using only one thread via -singleCompThread]
cholS \ X
1.26 s
Basically, I'd like to know: (1) can I reach Matlab-like speeds in python? and (2) why is the scipy version so slow?
基本上,我想知道:(1)我能在python中达到matlablike的速度吗?(2) scipy版本为什么这么慢?
The solver should be able to take advantage of the fact that chol(S) is triangular. However, using numpy.linalg.solve() is faster than scipy.linalg.solve_triangular(), even though the numpy call doesn't use the triangular structure at all. What gives? The matlab solver seems to auto-detect when my matrix is triangular, but python cannot.
求解器应该能够利用chol(S)是三角形的这一事实。然而,使用numpy.linalg.solve()比scipy.linalg. solve_triangle()要快,尽管numpy调用根本不使用三角形结构。到底发生了什么事?当我的矩阵是三角形时,matlab求解器似乎可以自动检测,但是python不能。
I'd be happy to use a custom call to BLAS/LAPACK routines for solving triangular linear systems, but I really don't want to write that code myself.
我很乐意使用一个自定义调用来解决三角形线性系统的BLAS/LAPACK例程,但我真的不想自己编写代码。
For reference, I'm using scipy version 11.0 and the Enthought python distribution (which uses Intel's MKL library for vectorization), so I think I should be able to reach Matlab-like speeds.
作为参考,我正在使用scipy版本11.0和Enthought python发行版(使用英特尔的MKL库进行矢量化),因此我认为我应该能够达到matlab一样的速度。
3 个解决方案
#1
11
TL;DR: Don't use numpy's or scipy's solve
when you have a triangular system, just use scipy.linalg.solve_triangular
with at least the check_finite=False
keyword argument for fast and non-destructive solutions.
当你有一个三角形系统时,不要用numpy或scipy的解,只要用scipy.linalg就可以了。solve_triangle具有至少check_limited =False关键字参数,用于快速和非破坏性解决方案。
I found this thread after stumbling across some discrepancies between numpy.linalg.solve
and scipy.linalg.solve
(and scipy's lu_solve
, etc.). I don't have Enthought's MKL-based Numpy/Scipy, but I hope my findings can help you in some way.
在偶然发现numpi .linalg之间的一些差异之后,我找到了这条线索。解决和scipy.linalg。solve(以及scipy的lu_solve等)。我没有恩思的基于mklout的Numpy/Scipy,但是我希望我的发现能在某种程度上帮助你。
With the pre-built binaries for Numpy and Scipy (32-bit, running on Windows 7):
使用预构建的二进制文件来实现Numpy和Scipy(32位,运行在Windows 7上):
-
I see a significant difference between
numpy.linalg.solve
andscipy.linalg.solve
when solving for a vectorX
(i.e.,X
is 160 by 1). Scipy runtime is 1.23x numpy's, which is I think substantial.我发现numpy.linalg有很大的不同。解决和scipy.linalg。求解向量X时(即。Scipy运行时是1.23倍numpy's,我认为这很重要。
-
However, most of the difference appears to be due to scipy's
solve
checking for invalid entries. When passingcheck_finite=False
into scipy.linalg.solve, scipy'ssolve
runtime is 1.02x numpy's.然而,大多数差异似乎是由于scipy对无效条目的解决检查。当将check_limited =False传递到scipy.linalg时。解,scipy的解运行时是1。02x numpy的。
-
Scipy's solve using destructive updates, i.e., with
overwrite_a=True, overwrite_b=True
is slightly faster than numpy's solve (which is non-destructive). Numpy's solve runtime is 1.021x destructive scipy.linalg.solve. Scipy with justcheck_finite=False
has runtime 1.04x the destructive case. In summary, destructivescipy.linalg.solve
is very slightly faster than either of these cases.Scipy使用破坏性更新解决,即。,对于overwrite_a=True, overwrite_b=True略快于numpy的解法(它是非破坏性的)。Numpy的求解运行时是1.021倍的破坏性scipy.linalg.solve。只有check_limited =False的Scipy运行时是1.04x。总之,破坏性scipy.linalg。求解速度比这两种情况都要快。
-
The above are for a vector
X
. If I makeX
a wide array, specifically 160 by 10000,scipy.linalg.solve
withcheck_finite=False
is essentially as fast as withcheck_finite=False, overwrite_a=True, overwrite_b=True
. Scipy'ssolve
(without any special keywords) runtime is 1.09x this "unsafe" (check_finite=False
) call. Numpy'ssolve
has runtime 1.03x scipy's fastest for this arrayX
case.上面是向量X,如果我做一个宽数组,具体是160×10000,scipy.linalg。用check_limited =False求解本质上和用check_limited =False一样快,overwrite_a=True, overwrite_b=True。Scipy的求解(没有任何特殊关键字)运行时是1.09倍的这个“不安全”(check_limited =False)调用。Numpy的解决方案在运行时1.03x scipy是这个数组X的最快解。
-
scipy.linalg.solve_triangular
delivers significant speedups in both these cases, but you have to turn off input checking, i.e., pass incheck_finite=False
. The runtime for the fastest solve was 5.68x and 1.76xsolve_triangular
's, for vector and arrayX
, respectively, withcheck_finite=False
.scipy.linalg。solve_triangle在这两种情况下都提供了显著的加速,但是您必须关闭输入检查,即。,通过在check_finite = False。最快解的运行时为5.68x和1.76x solve_三角形的,分别为向量和数组X,用check_= False。
-
solve_triangular
with destructive computation (overwrite_b=True
) gives you no speedup on top ofcheck_finite=False
(and actually hurts slightly for the arrayX
case).solve_triangle with破坏性计算(overwrite_b=True)在check_limited_false之上没有加速(实际上对数组X的情况有点伤害)。
-
I, ignoramus, was previously unaware of
solve_triangular
and was usingscipy.linalg.lu_solve
as a triangular solver, i.e., instead ofsolve_triangular(cholS, X)
doinglu_solve((cholS, numpy.arange(160)), X)
(both produce the same answer). But I discovered thatlu_solve
used in this way has runtime 1.07x unsafesolve_triangular
for the vectorX
case, while its runtime was 1.76x for the arrayX
case. I'm not sure whylu_solve
is so much slower for arrayX
, compared to vectorX
, but the lesson is to usesolve_triangular
(without infinite checks).我,ignoramus,之前没有意识到solve_三角形并且使用scipy.linalg。lu_solve为三角形求解器,即。,而不是solve_triangle (cholS, X)做lu_solve(cholS, numpy.arange(160)), X(两者产生相同的答案)。但我发现,以这种方式使用的lu_solve对于向量X的运行时为1.07x,不安全的solve_triangle,而对于数组X的运行时为1.76x。与向量X相比,我不确定lu_solve为什么要慢得多,但是我们的经验是使用solve_三角形(没有无限的检查)。
-
Copying the data to Fortran format didn't seem to matter at all. Neither does converting to
numpy.matrix
.将数据复制到Fortran格式似乎并不重要。转换到numpi .matrix也是如此。
I might as well compare my non-MKL Python libraries against single-threaded (maxNumCompThreads=1
) Matlab 2013a. The fastest Python implementations above had 4.5x longer runtime for the vector X
case and 6.3x longer runtime for the fat matrix X
case.
我不妨将我的非mkl Python库与单线程(maxNumCompThreads=1) Matlab 2013a进行比较。上面提到的最快的Python实现对于向量X的运行时要长4.5倍,而对于fat矩阵X的运行时要长6.3倍。
However, here's the Python script I used to benchmark these, perhaps someone with MKL-accelerated Numpy/Scipy can post their numbers. Note that I just comment out the line n = 10000
to disable the fat matrix X
case and do the n=1
vector case. (Sorry.)
但是,这里是我用来对它们进行基准测试的Python脚本,也许使用mkl加速的Numpy/Scipy的人可以发布它们的编号。注意,我只是注释掉第n= 10000行,以禁用fat矩阵X的情况,然后做n=1向量的情况。(对不起)。
import scipy.linalg as sla
import numpy.linalg as nla
from numpy.random import RandomState
from timeit import timeit
import numpy as np
RNG = RandomState(69)
m=160
n=1
#n=10000
Ac = RNG.randn(m,m)
if 1:
Ac = np.triu(Ac)
bc = RNG.randn(m,n)
Af = Ac.copy("F")
bf = bc.copy("F")
if 0: # Save to Matlab format
import scipy.io as io
io.savemat("b_%d.mat"%(n,), dict(A=Ac, b=bc))
import sys
sys.exit(0)
def lapper(fn, source, **kwargs):
Alocal = source[0].copy()
blocal = source[1].copy()
fn(Alocal, blocal,**kwargs)
laps = (1000 if n<=1 else 100)
def printer(t, s=''):
print ("%g seconds, %d laps, " % (t/float(laps), laps)) + s
return t/float(laps)
t=[]
print "C"
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac,bc)), number=laps),
"scipy.solve"))
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac,bc), check_finite=False),
number=laps), "scipy.solve, infinite-ok"))
t.append(printer(timeit(lambda: lapper(nla.solve, (Ac,bc)), number=laps),
"numpy.solve"))
#print "F" # Doesn't seem to matter
#printer(timeit(lambda: lapper(sla.solve, (Af,bf)), number=laps))
#printer(timeit(lambda: lapper(nla.solve, (Af,bf)), number=laps))
print "sla with tweaks"
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac,bc), overwrite_a=True,
overwrite_b=True, check_finite=False),
number=laps), "scipy.solve destructive"))
print "Tri"
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac,bc)),
number=laps), "scipy.solve_triangular"))
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac,bc),
check_finite=False), number=laps),
"scipy.solve_triangular, inf-ok"))
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac,bc),
overwrite_b=True, check_finite=False),
number=laps), "scipy.solve_triangular destructive"))
print "LU"
piv = np.arange(m)
t.append(printer(timeit(lambda: lapper(
lambda X,b: sla.lu_solve((X, piv),b,check_finite=False),
(Ac,bc)), number=laps), "LU"))
print "all times:"
print t
Output of the above script for the vector case, n=1
:
对于向量情况,上面脚本的输出,n=1:
C
0.000739405 seconds, 1000 laps, scipy.solve
0.000624746 seconds, 1000 laps, scipy.solve, infinite-ok
0.000590003 seconds, 1000 laps, numpy.solve
sla with tweaks
0.000608365 seconds, 1000 laps, scipy.solve destructive
Tri
0.000208711 seconds, 1000 laps, scipy.solve_triangular
9.38371e-05 seconds, 1000 laps, scipy.solve_triangular, inf-ok
9.37682e-05 seconds, 1000 laps, scipy.solve_triangular destructive
LU
0.000100215 seconds, 1000 laps, LU
all times:
[0.0007394047886284343, 0.00062474593940593, 0.0005900030818282472, 0.0006083650710913095, 0.00020871054023307778, 9.383710445114923e-05, 9.37682389063692e-05, 0.00010021534750467032]
Output of the above script for the matrix case n=10000
:
矩阵情况n=10000时上述脚本的输出:
C
0.118985 seconds, 100 laps, scipy.solve
0.113687 seconds, 100 laps, scipy.solve, infinite-ok
0.115569 seconds, 100 laps, numpy.solve
sla with tweaks
0.113122 seconds, 100 laps, scipy.solve destructive
Tri
0.0725959 seconds, 100 laps, scipy.solve_triangular
0.0634396 seconds, 100 laps, scipy.solve_triangular, inf-ok
0.0638423 seconds, 100 laps, scipy.solve_triangular destructive
LU
0.1115 seconds, 100 laps, LU
all times:
[0.11898513112988955, 0.11368747217793944, 0.11556863916356903, 0.11312182352918797, 0.07259593807427585, 0.0634396208970783, 0.06384230931663318, 0.11150022257648459]
Note that the above Python script can save its arrays as Matlab .MAT data files. This is currently disabled (if 0
, sorry), but if enabled, you can test Matlab's speed on the exact same data. Here's a timing script for Matlab:
注意,上面的Python脚本可以将其数组保存为Matlab . mat数据文件。这是目前禁用的(如果是0,不好意思),但是如果启用,您可以在相同的数据上测试Matlab的速度。下面是Matlab的计时脚本:
clear
q = load('b_10000.mat');
A=q.A;
b=q.b;
clear q
matrix_time = timeit(@() A\b)
q = load('b_1.mat');
A=q.A;
b=q.b;
clear q
vector_time = timeit(@() A\b)
You'll need the timeit
function from Mathworks File Exchange: http://www.mathworks.com/matlabcentral/fileexchange/18798-timeit-benchmarking-function. This produces the following output:
您将需要Mathworks文件交换中的timeit函数:http://www.mathworks.com/matlabcentral/fileexchange/18798-timeit基准测试函数。这将产生以下输出:
matrix_time =
0.0099989
vector_time =
2.2487e-05
The upshot of this empirical analysis is, in Python at least, don't use numpy's or scipy's solve
when you have a triangular system, just use scipy.linalg.solve_triangular
with at least the check_finite=False
keyword argument for fast and non-destructive solutions.
这个经验分析的结果是,至少在Python中,当你有一个三角形系统时,不要使用numpy或scipy的解,只要使用scipy.linalg就可以了。solve_triangle具有至少check_limited =False关键字参数,用于快速和非破坏性解决方案。
#2
3
Why not just use the equation: Q = inv(chol(S)) * X
, here is my test:
为什么不直接用公式:Q = inv(chol(S)) * X,这是我的测试:
import scipy.linalg
import numpy as np
N = 160
M = 100000
S = np.random.randn(N, N)
B = np.random.randn(N, M)
S = np.dot(S, S.T)
cS = scipy.linalg.cholesky(S, lower=True)
Y1 = scipy.linalg.solve(cS, B)
icS = scipy.linalg.inv(cS)
Y2 = np.dot(icS, B)
np.allclose(Y1, Y2)
output:
输出:
True
Here is the time test:
这里是时间测试:
%time scipy.linalg.solve(cholS, B)
%time np.linalg.solve(cholS, B)
%time scipy.linalg.solve_triangular(cholS, B, lower=True)
%time ics=scipy.linalg.inv(cS);np.dot(ics, B)
output:
输出:
CPU times: user 2.07 s, sys: 0.00 s, total: 2.07 s
Wall time: 2.08 s
CPU times: user 1.93 s, sys: 0.00 s, total: 1.93 s
Wall time: 1.92 s
CPU times: user 1.12 s, sys: 0.00 s, total: 1.12 s
Wall time: 1.13 s
CPU times: user 0.71 s, sys: 0.00 s, total: 0.71 s
Wall time: 0.72 s
I don't know why scipy.linalg.solve_triangular
is slower than numpy.linalg.solve
on your system, but the inv
version is the fastest.
我不知道为什么要剪。solve_三角形比numpy.linalg慢。在您的系统上解决问题,但是inv版本是最快的。
#3
2
A couple of things to try:
有几件事值得一试:
-
X = X.copy('F')
# use fortran-order arrays, so that a copy is avoidedX = X.copy('F') #使用fortranorder数组,以避免复制
-
Y = solve_triangular(cholS, X, overwrite_b=True)
# avoid another copy, but trash contents ofX
Y = solve_triangle (cholS, X, overwrite_b=True
-
Y = solve_triangular(cholS, X, check_finite=False)
# Scipy >= 0.12 only --- but doesn't seem to have a large effect on speed...Y = solve_三角形(cholS, X, check_= False) # Scipy >= 0.12,但似乎并没有对速度产生很大影响……
With both of these, it should be pretty much equivalent to a direct call to MKL with no buffer copies.
对于这两种方法,它几乎相当于对MKL的直接调用,没有缓冲区副本。
I can't reproduce the issue with np.linalg.solve
and scipy.linalg.solve
having different speeds --- with the BLAS + LAPACK combination I have, both seem the same speed.
我不能用np.linalg重现这个问题。解决和scipy.linalg。解决有不同的速度---我有BLAS + LAPACK组合,两者看起来速度相同。
#1
11
TL;DR: Don't use numpy's or scipy's solve
when you have a triangular system, just use scipy.linalg.solve_triangular
with at least the check_finite=False
keyword argument for fast and non-destructive solutions.
当你有一个三角形系统时,不要用numpy或scipy的解,只要用scipy.linalg就可以了。solve_triangle具有至少check_limited =False关键字参数,用于快速和非破坏性解决方案。
I found this thread after stumbling across some discrepancies between numpy.linalg.solve
and scipy.linalg.solve
(and scipy's lu_solve
, etc.). I don't have Enthought's MKL-based Numpy/Scipy, but I hope my findings can help you in some way.
在偶然发现numpi .linalg之间的一些差异之后,我找到了这条线索。解决和scipy.linalg。solve(以及scipy的lu_solve等)。我没有恩思的基于mklout的Numpy/Scipy,但是我希望我的发现能在某种程度上帮助你。
With the pre-built binaries for Numpy and Scipy (32-bit, running on Windows 7):
使用预构建的二进制文件来实现Numpy和Scipy(32位,运行在Windows 7上):
-
I see a significant difference between
numpy.linalg.solve
andscipy.linalg.solve
when solving for a vectorX
(i.e.,X
is 160 by 1). Scipy runtime is 1.23x numpy's, which is I think substantial.我发现numpy.linalg有很大的不同。解决和scipy.linalg。求解向量X时(即。Scipy运行时是1.23倍numpy's,我认为这很重要。
-
However, most of the difference appears to be due to scipy's
solve
checking for invalid entries. When passingcheck_finite=False
into scipy.linalg.solve, scipy'ssolve
runtime is 1.02x numpy's.然而,大多数差异似乎是由于scipy对无效条目的解决检查。当将check_limited =False传递到scipy.linalg时。解,scipy的解运行时是1。02x numpy的。
-
Scipy's solve using destructive updates, i.e., with
overwrite_a=True, overwrite_b=True
is slightly faster than numpy's solve (which is non-destructive). Numpy's solve runtime is 1.021x destructive scipy.linalg.solve. Scipy with justcheck_finite=False
has runtime 1.04x the destructive case. In summary, destructivescipy.linalg.solve
is very slightly faster than either of these cases.Scipy使用破坏性更新解决,即。,对于overwrite_a=True, overwrite_b=True略快于numpy的解法(它是非破坏性的)。Numpy的求解运行时是1.021倍的破坏性scipy.linalg.solve。只有check_limited =False的Scipy运行时是1.04x。总之,破坏性scipy.linalg。求解速度比这两种情况都要快。
-
The above are for a vector
X
. If I makeX
a wide array, specifically 160 by 10000,scipy.linalg.solve
withcheck_finite=False
is essentially as fast as withcheck_finite=False, overwrite_a=True, overwrite_b=True
. Scipy'ssolve
(without any special keywords) runtime is 1.09x this "unsafe" (check_finite=False
) call. Numpy'ssolve
has runtime 1.03x scipy's fastest for this arrayX
case.上面是向量X,如果我做一个宽数组,具体是160×10000,scipy.linalg。用check_limited =False求解本质上和用check_limited =False一样快,overwrite_a=True, overwrite_b=True。Scipy的求解(没有任何特殊关键字)运行时是1.09倍的这个“不安全”(check_limited =False)调用。Numpy的解决方案在运行时1.03x scipy是这个数组X的最快解。
-
scipy.linalg.solve_triangular
delivers significant speedups in both these cases, but you have to turn off input checking, i.e., pass incheck_finite=False
. The runtime for the fastest solve was 5.68x and 1.76xsolve_triangular
's, for vector and arrayX
, respectively, withcheck_finite=False
.scipy.linalg。solve_triangle在这两种情况下都提供了显著的加速,但是您必须关闭输入检查,即。,通过在check_finite = False。最快解的运行时为5.68x和1.76x solve_三角形的,分别为向量和数组X,用check_= False。
-
solve_triangular
with destructive computation (overwrite_b=True
) gives you no speedup on top ofcheck_finite=False
(and actually hurts slightly for the arrayX
case).solve_triangle with破坏性计算(overwrite_b=True)在check_limited_false之上没有加速(实际上对数组X的情况有点伤害)。
-
I, ignoramus, was previously unaware of
solve_triangular
and was usingscipy.linalg.lu_solve
as a triangular solver, i.e., instead ofsolve_triangular(cholS, X)
doinglu_solve((cholS, numpy.arange(160)), X)
(both produce the same answer). But I discovered thatlu_solve
used in this way has runtime 1.07x unsafesolve_triangular
for the vectorX
case, while its runtime was 1.76x for the arrayX
case. I'm not sure whylu_solve
is so much slower for arrayX
, compared to vectorX
, but the lesson is to usesolve_triangular
(without infinite checks).我,ignoramus,之前没有意识到solve_三角形并且使用scipy.linalg。lu_solve为三角形求解器,即。,而不是solve_triangle (cholS, X)做lu_solve(cholS, numpy.arange(160)), X(两者产生相同的答案)。但我发现,以这种方式使用的lu_solve对于向量X的运行时为1.07x,不安全的solve_triangle,而对于数组X的运行时为1.76x。与向量X相比,我不确定lu_solve为什么要慢得多,但是我们的经验是使用solve_三角形(没有无限的检查)。
-
Copying the data to Fortran format didn't seem to matter at all. Neither does converting to
numpy.matrix
.将数据复制到Fortran格式似乎并不重要。转换到numpi .matrix也是如此。
I might as well compare my non-MKL Python libraries against single-threaded (maxNumCompThreads=1
) Matlab 2013a. The fastest Python implementations above had 4.5x longer runtime for the vector X
case and 6.3x longer runtime for the fat matrix X
case.
我不妨将我的非mkl Python库与单线程(maxNumCompThreads=1) Matlab 2013a进行比较。上面提到的最快的Python实现对于向量X的运行时要长4.5倍,而对于fat矩阵X的运行时要长6.3倍。
However, here's the Python script I used to benchmark these, perhaps someone with MKL-accelerated Numpy/Scipy can post their numbers. Note that I just comment out the line n = 10000
to disable the fat matrix X
case and do the n=1
vector case. (Sorry.)
但是,这里是我用来对它们进行基准测试的Python脚本,也许使用mkl加速的Numpy/Scipy的人可以发布它们的编号。注意,我只是注释掉第n= 10000行,以禁用fat矩阵X的情况,然后做n=1向量的情况。(对不起)。
import scipy.linalg as sla
import numpy.linalg as nla
from numpy.random import RandomState
from timeit import timeit
import numpy as np
RNG = RandomState(69)
m=160
n=1
#n=10000
Ac = RNG.randn(m,m)
if 1:
Ac = np.triu(Ac)
bc = RNG.randn(m,n)
Af = Ac.copy("F")
bf = bc.copy("F")
if 0: # Save to Matlab format
import scipy.io as io
io.savemat("b_%d.mat"%(n,), dict(A=Ac, b=bc))
import sys
sys.exit(0)
def lapper(fn, source, **kwargs):
Alocal = source[0].copy()
blocal = source[1].copy()
fn(Alocal, blocal,**kwargs)
laps = (1000 if n<=1 else 100)
def printer(t, s=''):
print ("%g seconds, %d laps, " % (t/float(laps), laps)) + s
return t/float(laps)
t=[]
print "C"
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac,bc)), number=laps),
"scipy.solve"))
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac,bc), check_finite=False),
number=laps), "scipy.solve, infinite-ok"))
t.append(printer(timeit(lambda: lapper(nla.solve, (Ac,bc)), number=laps),
"numpy.solve"))
#print "F" # Doesn't seem to matter
#printer(timeit(lambda: lapper(sla.solve, (Af,bf)), number=laps))
#printer(timeit(lambda: lapper(nla.solve, (Af,bf)), number=laps))
print "sla with tweaks"
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac,bc), overwrite_a=True,
overwrite_b=True, check_finite=False),
number=laps), "scipy.solve destructive"))
print "Tri"
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac,bc)),
number=laps), "scipy.solve_triangular"))
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac,bc),
check_finite=False), number=laps),
"scipy.solve_triangular, inf-ok"))
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac,bc),
overwrite_b=True, check_finite=False),
number=laps), "scipy.solve_triangular destructive"))
print "LU"
piv = np.arange(m)
t.append(printer(timeit(lambda: lapper(
lambda X,b: sla.lu_solve((X, piv),b,check_finite=False),
(Ac,bc)), number=laps), "LU"))
print "all times:"
print t
Output of the above script for the vector case, n=1
:
对于向量情况,上面脚本的输出,n=1:
C
0.000739405 seconds, 1000 laps, scipy.solve
0.000624746 seconds, 1000 laps, scipy.solve, infinite-ok
0.000590003 seconds, 1000 laps, numpy.solve
sla with tweaks
0.000608365 seconds, 1000 laps, scipy.solve destructive
Tri
0.000208711 seconds, 1000 laps, scipy.solve_triangular
9.38371e-05 seconds, 1000 laps, scipy.solve_triangular, inf-ok
9.37682e-05 seconds, 1000 laps, scipy.solve_triangular destructive
LU
0.000100215 seconds, 1000 laps, LU
all times:
[0.0007394047886284343, 0.00062474593940593, 0.0005900030818282472, 0.0006083650710913095, 0.00020871054023307778, 9.383710445114923e-05, 9.37682389063692e-05, 0.00010021534750467032]
Output of the above script for the matrix case n=10000
:
矩阵情况n=10000时上述脚本的输出:
C
0.118985 seconds, 100 laps, scipy.solve
0.113687 seconds, 100 laps, scipy.solve, infinite-ok
0.115569 seconds, 100 laps, numpy.solve
sla with tweaks
0.113122 seconds, 100 laps, scipy.solve destructive
Tri
0.0725959 seconds, 100 laps, scipy.solve_triangular
0.0634396 seconds, 100 laps, scipy.solve_triangular, inf-ok
0.0638423 seconds, 100 laps, scipy.solve_triangular destructive
LU
0.1115 seconds, 100 laps, LU
all times:
[0.11898513112988955, 0.11368747217793944, 0.11556863916356903, 0.11312182352918797, 0.07259593807427585, 0.0634396208970783, 0.06384230931663318, 0.11150022257648459]
Note that the above Python script can save its arrays as Matlab .MAT data files. This is currently disabled (if 0
, sorry), but if enabled, you can test Matlab's speed on the exact same data. Here's a timing script for Matlab:
注意,上面的Python脚本可以将其数组保存为Matlab . mat数据文件。这是目前禁用的(如果是0,不好意思),但是如果启用,您可以在相同的数据上测试Matlab的速度。下面是Matlab的计时脚本:
clear
q = load('b_10000.mat');
A=q.A;
b=q.b;
clear q
matrix_time = timeit(@() A\b)
q = load('b_1.mat');
A=q.A;
b=q.b;
clear q
vector_time = timeit(@() A\b)
You'll need the timeit
function from Mathworks File Exchange: http://www.mathworks.com/matlabcentral/fileexchange/18798-timeit-benchmarking-function. This produces the following output:
您将需要Mathworks文件交换中的timeit函数:http://www.mathworks.com/matlabcentral/fileexchange/18798-timeit基准测试函数。这将产生以下输出:
matrix_time =
0.0099989
vector_time =
2.2487e-05
The upshot of this empirical analysis is, in Python at least, don't use numpy's or scipy's solve
when you have a triangular system, just use scipy.linalg.solve_triangular
with at least the check_finite=False
keyword argument for fast and non-destructive solutions.
这个经验分析的结果是,至少在Python中,当你有一个三角形系统时,不要使用numpy或scipy的解,只要使用scipy.linalg就可以了。solve_triangle具有至少check_limited =False关键字参数,用于快速和非破坏性解决方案。
#2
3
Why not just use the equation: Q = inv(chol(S)) * X
, here is my test:
为什么不直接用公式:Q = inv(chol(S)) * X,这是我的测试:
import scipy.linalg
import numpy as np
N = 160
M = 100000
S = np.random.randn(N, N)
B = np.random.randn(N, M)
S = np.dot(S, S.T)
cS = scipy.linalg.cholesky(S, lower=True)
Y1 = scipy.linalg.solve(cS, B)
icS = scipy.linalg.inv(cS)
Y2 = np.dot(icS, B)
np.allclose(Y1, Y2)
output:
输出:
True
Here is the time test:
这里是时间测试:
%time scipy.linalg.solve(cholS, B)
%time np.linalg.solve(cholS, B)
%time scipy.linalg.solve_triangular(cholS, B, lower=True)
%time ics=scipy.linalg.inv(cS);np.dot(ics, B)
output:
输出:
CPU times: user 2.07 s, sys: 0.00 s, total: 2.07 s
Wall time: 2.08 s
CPU times: user 1.93 s, sys: 0.00 s, total: 1.93 s
Wall time: 1.92 s
CPU times: user 1.12 s, sys: 0.00 s, total: 1.12 s
Wall time: 1.13 s
CPU times: user 0.71 s, sys: 0.00 s, total: 0.71 s
Wall time: 0.72 s
I don't know why scipy.linalg.solve_triangular
is slower than numpy.linalg.solve
on your system, but the inv
version is the fastest.
我不知道为什么要剪。solve_三角形比numpy.linalg慢。在您的系统上解决问题,但是inv版本是最快的。
#3
2
A couple of things to try:
有几件事值得一试:
-
X = X.copy('F')
# use fortran-order arrays, so that a copy is avoidedX = X.copy('F') #使用fortranorder数组,以避免复制
-
Y = solve_triangular(cholS, X, overwrite_b=True)
# avoid another copy, but trash contents ofX
Y = solve_triangle (cholS, X, overwrite_b=True
-
Y = solve_triangular(cholS, X, check_finite=False)
# Scipy >= 0.12 only --- but doesn't seem to have a large effect on speed...Y = solve_三角形(cholS, X, check_= False) # Scipy >= 0.12,但似乎并没有对速度产生很大影响……
With both of these, it should be pretty much equivalent to a direct call to MKL with no buffer copies.
对于这两种方法,它几乎相当于对MKL的直接调用,没有缓冲区副本。
I can't reproduce the issue with np.linalg.solve
and scipy.linalg.solve
having different speeds --- with the BLAS + LAPACK combination I have, both seem the same speed.
我不能用np.linalg重现这个问题。解决和scipy.linalg。解决有不同的速度---我有BLAS + LAPACK组合,两者看起来速度相同。