快速解决一个三角形的线性系统?

时间:2021-09-29 21:24:01

I have a square matrix S (160 x 160), and a huge matrix X (160 x 250000). Both are dense numpy arrays.

我有一个方阵S (160 x 160)和一个巨大的矩阵x (160 x 250000)。两者都是密集的numpy数组。

My goal: find Q such that Q = inv(chol(S)) * X, where chol(S) is the lower cholesky factorization of S.

我的目标是:找到这样的Q: Q = inv(chol(S)) * X,其中chol(S)是S的较低的cholesky分解。

Naturally, a simple solution is

当然,一个简单的解决方案是。

cholS = scipy.linalg.cholesky( S, lower=True)
scipy.linalg.solve( cholS, X )

My problem: this solution is noticeably slower (>2x) in python than when I try the same in Matlab. Here are some timing experiments:

我的问题是:在python中,这个解决方案比在Matlab中尝试相同的方法要慢得多(>2x)。以下是一些计时实验:

timeit np.linalg.solve( cholS, X)
1 loops, best of 3: 1.63 s per loop

timeit scipy.linalg.solve_triangular( cholS, X, lower=True)
1 loops, best of 3: 2.19 s per loop

timeit scipy.linalg.solve( cholS, X)
1 loops, best of 3: 2.81 s per loop

[matlab]
cholS \ X
0.675 s

[matlab using only one thread via -singleCompThread]
cholS \ X
1.26 s

Basically, I'd like to know: (1) can I reach Matlab-like speeds in python? and (2) why is the scipy version so slow?

基本上,我想知道:(1)我能在python中达到matlablike的速度吗?(2) scipy版本为什么这么慢?

The solver should be able to take advantage of the fact that chol(S) is triangular. However, using numpy.linalg.solve() is faster than scipy.linalg.solve_triangular(), even though the numpy call doesn't use the triangular structure at all. What gives? The matlab solver seems to auto-detect when my matrix is triangular, but python cannot.

求解器应该能够利用chol(S)是三角形的这一事实。然而,使用numpy.linalg.solve()比scipy.linalg. solve_triangle()要快,尽管numpy调用根本不使用三角形结构。到底发生了什么事?当我的矩阵是三角形时,matlab求解器似乎可以自动检测,但是python不能。

I'd be happy to use a custom call to BLAS/LAPACK routines for solving triangular linear systems, but I really don't want to write that code myself.

我很乐意使用一个自定义调用来解决三角形线性系统的BLAS/LAPACK例程,但我真的不想自己编写代码。

For reference, I'm using scipy version 11.0 and the Enthought python distribution (which uses Intel's MKL library for vectorization), so I think I should be able to reach Matlab-like speeds.

作为参考,我正在使用scipy版本11.0和Enthought python发行版(使用英特尔的MKL库进行矢量化),因此我认为我应该能够达到matlab一样的速度。

3 个解决方案

#1


11  

TL;DR: Don't use numpy's or scipy's solve when you have a triangular system, just use scipy.linalg.solve_triangular with at least the check_finite=False keyword argument for fast and non-destructive solutions.

当你有一个三角形系统时,不要用numpy或scipy的解,只要用scipy.linalg就可以了。solve_triangle具有至少check_limited =False关键字参数,用于快速和非破坏性解决方案。


I found this thread after stumbling across some discrepancies between numpy.linalg.solve and scipy.linalg.solve (and scipy's lu_solve, etc.). I don't have Enthought's MKL-based Numpy/Scipy, but I hope my findings can help you in some way.

在偶然发现numpi .linalg之间的一些差异之后,我找到了这条线索。解决和scipy.linalg。solve(以及scipy的lu_solve等)。我没有恩思的基于mklout的Numpy/Scipy,但是我希望我的发现能在某种程度上帮助你。

With the pre-built binaries for Numpy and Scipy (32-bit, running on Windows 7):

使用预构建的二进制文件来实现Numpy和Scipy(32位,运行在Windows 7上):

  1. I see a significant difference between numpy.linalg.solve and scipy.linalg.solve when solving for a vector X (i.e., X is 160 by 1). Scipy runtime is 1.23x numpy's, which is I think substantial.

    我发现numpy.linalg有很大的不同。解决和scipy.linalg。求解向量X时(即。Scipy运行时是1.23倍numpy's,我认为这很重要。

  2. However, most of the difference appears to be due to scipy's solve checking for invalid entries. When passing check_finite=False into scipy.linalg.solve, scipy's solve runtime is 1.02x numpy's.

    然而,大多数差异似乎是由于scipy对无效条目的解决检查。当将check_limited =False传递到scipy.linalg时。解,scipy的解运行时是1。02x numpy的。

  3. Scipy's solve using destructive updates, i.e., with overwrite_a=True, overwrite_b=True is slightly faster than numpy's solve (which is non-destructive). Numpy's solve runtime is 1.021x destructive scipy.linalg.solve. Scipy with just check_finite=False has runtime 1.04x the destructive case. In summary, destructive scipy.linalg.solve is very slightly faster than either of these cases.

    Scipy使用破坏性更新解决,即。,对于overwrite_a=True, overwrite_b=True略快于numpy的解法(它是非破坏性的)。Numpy的求解运行时是1.021倍的破坏性scipy.linalg.solve。只有check_limited =False的Scipy运行时是1.04x。总之,破坏性scipy.linalg。求解速度比这两种情况都要快。

  4. The above are for a vector X. If I make X a wide array, specifically 160 by 10000, scipy.linalg.solve with check_finite=False is essentially as fast as with check_finite=False, overwrite_a=True, overwrite_b=True. Scipy's solve (without any special keywords) runtime is 1.09x this "unsafe" (check_finite=False) call. Numpy's solve has runtime 1.03x scipy's fastest for this array X case.

    上面是向量X,如果我做一个宽数组,具体是160×10000,scipy.linalg。用check_limited =False求解本质上和用check_limited =False一样快,overwrite_a=True, overwrite_b=True。Scipy的求解(没有任何特殊关键字)运行时是1.09倍的这个“不安全”(check_limited =False)调用。Numpy的解决方案在运行时1.03x scipy是这个数组X的最快解。

  5. scipy.linalg.solve_triangular delivers significant speedups in both these cases, but you have to turn off input checking, i.e., pass in check_finite=False. The runtime for the fastest solve was 5.68x and 1.76x solve_triangular's, for vector and array X, respectively, with check_finite=False.

    scipy.linalg。solve_triangle在这两种情况下都提供了显著的加速,但是您必须关闭输入检查,即。,通过在check_finite = False。最快解的运行时为5.68x和1.76x solve_三角形的,分别为向量和数组X,用check_= False。

  6. solve_triangular with destructive computation (overwrite_b=True) gives you no speedup on top of check_finite=False (and actually hurts slightly for the array X case).

    solve_triangle with破坏性计算(overwrite_b=True)在check_limited_false之上没有加速(实际上对数组X的情况有点伤害)。

  7. I, ignoramus, was previously unaware of solve_triangular and was using scipy.linalg.lu_solve as a triangular solver, i.e., instead of solve_triangular(cholS, X) doing lu_solve((cholS, numpy.arange(160)), X) (both produce the same answer). But I discovered that lu_solve used in this way has runtime 1.07x unsafe solve_triangular for the vector X case, while its runtime was 1.76x for the array X case. I'm not sure why lu_solve is so much slower for array X, compared to vector X, but the lesson is to use solve_triangular (without infinite checks).

    我,ignoramus,之前没有意识到solve_三角形并且使用scipy.linalg。lu_solve为三角形求解器,即。,而不是solve_triangle (cholS, X)做lu_solve(cholS, numpy.arange(160)), X(两者产生相同的答案)。但我发现,以这种方式使用的lu_solve对于向量X的运行时为1.07x,不安全的solve_triangle,而对于数组X的运行时为1.76x。与向量X相比,我不确定lu_solve为什么要慢得多,但是我们的经验是使用solve_三角形(没有无限的检查)。

  8. Copying the data to Fortran format didn't seem to matter at all. Neither does converting to numpy.matrix.

    将数据复制到Fortran格式似乎并不重要。转换到numpi .matrix也是如此。

I might as well compare my non-MKL Python libraries against single-threaded (maxNumCompThreads=1) Matlab 2013a. The fastest Python implementations above had 4.5x longer runtime for the vector X case and 6.3x longer runtime for the fat matrix X case.

我不妨将我的非mkl Python库与单线程(maxNumCompThreads=1) Matlab 2013a进行比较。上面提到的最快的Python实现对于向量X的运行时要长4.5倍,而对于fat矩阵X的运行时要长6.3倍。

However, here's the Python script I used to benchmark these, perhaps someone with MKL-accelerated Numpy/Scipy can post their numbers. Note that I just comment out the line n = 10000 to disable the fat matrix X case and do the n=1 vector case. (Sorry.)

但是,这里是我用来对它们进行基准测试的Python脚本,也许使用mkl加速的Numpy/Scipy的人可以发布它们的编号。注意,我只是注释掉第n= 10000行,以禁用fat矩阵X的情况,然后做n=1向量的情况。(对不起)。

import scipy.linalg as sla
import numpy.linalg as nla
from numpy.random import RandomState
from timeit import timeit
import numpy as np

RNG = RandomState(69)

m=160
n=1
#n=10000
Ac = RNG.randn(m,m)
if 1:
    Ac = np.triu(Ac)

bc = RNG.randn(m,n)
Af = Ac.copy("F")
bf = bc.copy("F")

if 0: # Save to Matlab format
    import scipy.io as io
    io.savemat("b_%d.mat"%(n,), dict(A=Ac, b=bc))
    import sys
    sys.exit(0)

def lapper(fn, source, **kwargs):
    Alocal = source[0].copy()
    blocal = source[1].copy()
    fn(Alocal, blocal,**kwargs)

laps = (1000 if n<=1 else 100)
def printer(t, s=''):
    print ("%g seconds, %d laps, " % (t/float(laps), laps)) + s
    return t/float(laps)

t=[]
print "C"
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac,bc)), number=laps),
                 "scipy.solve"))
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac,bc), check_finite=False),
                        number=laps), "scipy.solve, infinite-ok"))
t.append(printer(timeit(lambda: lapper(nla.solve, (Ac,bc)), number=laps),
                 "numpy.solve"))

#print "F" # Doesn't seem to matter
#printer(timeit(lambda: lapper(sla.solve, (Af,bf)), number=laps))
#printer(timeit(lambda: lapper(nla.solve, (Af,bf)), number=laps))

print "sla with tweaks"
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac,bc), overwrite_a=True,
                              overwrite_b=True,  check_finite=False),
                        number=laps), "scipy.solve destructive"))

print "Tri"
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac,bc)),
                        number=laps), "scipy.solve_triangular"))
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac,bc),
                              check_finite=False), number=laps),
                 "scipy.solve_triangular, inf-ok"))
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac,bc),
                                       overwrite_b=True, check_finite=False),
                        number=laps), "scipy.solve_triangular destructive"))

print "LU"
piv = np.arange(m)
t.append(printer(timeit(lambda: lapper(
    lambda X,b: sla.lu_solve((X, piv),b,check_finite=False),
    (Ac,bc)), number=laps), "LU"))

print "all times:"
print t

Output of the above script for the vector case, n=1:

对于向量情况,上面脚本的输出,n=1:

C
0.000739405 seconds, 1000 laps, scipy.solve
0.000624746 seconds, 1000 laps, scipy.solve, infinite-ok
0.000590003 seconds, 1000 laps, numpy.solve
sla with tweaks
0.000608365 seconds, 1000 laps, scipy.solve destructive
Tri
0.000208711 seconds, 1000 laps, scipy.solve_triangular
9.38371e-05 seconds, 1000 laps, scipy.solve_triangular, inf-ok
9.37682e-05 seconds, 1000 laps, scipy.solve_triangular destructive
LU
0.000100215 seconds, 1000 laps, LU
all times:
[0.0007394047886284343, 0.00062474593940593, 0.0005900030818282472, 0.0006083650710913095, 0.00020871054023307778, 9.383710445114923e-05, 9.37682389063692e-05, 0.00010021534750467032]

Output of the above script for the matrix case n=10000:

矩阵情况n=10000时上述脚本的输出:

C
0.118985 seconds, 100 laps, scipy.solve
0.113687 seconds, 100 laps, scipy.solve, infinite-ok
0.115569 seconds, 100 laps, numpy.solve
sla with tweaks
0.113122 seconds, 100 laps, scipy.solve destructive
Tri
0.0725959 seconds, 100 laps, scipy.solve_triangular
0.0634396 seconds, 100 laps, scipy.solve_triangular, inf-ok
0.0638423 seconds, 100 laps, scipy.solve_triangular destructive
LU
0.1115 seconds, 100 laps, LU
all times:
[0.11898513112988955, 0.11368747217793944, 0.11556863916356903, 0.11312182352918797, 0.07259593807427585, 0.0634396208970783, 0.06384230931663318, 0.11150022257648459]

Note that the above Python script can save its arrays as Matlab .MAT data files. This is currently disabled (if 0, sorry), but if enabled, you can test Matlab's speed on the exact same data. Here's a timing script for Matlab:

注意,上面的Python脚本可以将其数组保存为Matlab . mat数据文件。这是目前禁用的(如果是0,不好意思),但是如果启用,您可以在相同的数据上测试Matlab的速度。下面是Matlab的计时脚本:

clear
q = load('b_10000.mat');
A=q.A;
b=q.b;
clear q
matrix_time = timeit(@() A\b)

q = load('b_1.mat');
A=q.A;
b=q.b;
clear q
vector_time = timeit(@() A\b)

You'll need the timeit function from Mathworks File Exchange: http://www.mathworks.com/matlabcentral/fileexchange/18798-timeit-benchmarking-function. This produces the following output:

您将需要Mathworks文件交换中的timeit函数:http://www.mathworks.com/matlabcentral/fileexchange/18798-timeit基准测试函数。这将产生以下输出:

matrix_time =
    0.0099989
vector_time =
   2.2487e-05

The upshot of this empirical analysis is, in Python at least, don't use numpy's or scipy's solve when you have a triangular system, just use scipy.linalg.solve_triangular with at least the check_finite=False keyword argument for fast and non-destructive solutions.

这个经验分析的结果是,至少在Python中,当你有一个三角形系统时,不要使用numpy或scipy的解,只要使用scipy.linalg就可以了。solve_triangle具有至少check_limited =False关键字参数,用于快速和非破坏性解决方案。

#2


3  

Why not just use the equation: Q = inv(chol(S)) * X, here is my test:

为什么不直接用公式:Q = inv(chol(S)) * X,这是我的测试:

import scipy.linalg
import numpy as np

N = 160
M = 100000
S = np.random.randn(N, N)
B = np.random.randn(N, M)
S = np.dot(S, S.T)

cS = scipy.linalg.cholesky(S, lower=True)
Y1 = scipy.linalg.solve(cS, B)
icS = scipy.linalg.inv(cS)
Y2 = np.dot(icS, B)

np.allclose(Y1, Y2)

output:

输出:

True

Here is the time test:

这里是时间测试:

%time scipy.linalg.solve(cholS, B)
%time np.linalg.solve(cholS, B)
%time scipy.linalg.solve_triangular(cholS, B, lower=True)
%time ics=scipy.linalg.inv(cS);np.dot(ics, B)

output:

输出:

CPU times: user 2.07 s, sys: 0.00 s, total: 2.07 s
Wall time: 2.08 s
CPU times: user 1.93 s, sys: 0.00 s, total: 1.93 s
Wall time: 1.92 s
CPU times: user 1.12 s, sys: 0.00 s, total: 1.12 s
Wall time: 1.13 s
CPU times: user 0.71 s, sys: 0.00 s, total: 0.71 s
Wall time: 0.72 s

I don't know why scipy.linalg.solve_triangular is slower than numpy.linalg.solve on your system, but the inv version is the fastest.

我不知道为什么要剪。solve_三角形比numpy.linalg慢。在您的系统上解决问题,但是inv版本是最快的。

#3


2  

A couple of things to try:

有几件事值得一试:

  • X = X.copy('F') # use fortran-order arrays, so that a copy is avoided

    X = X.copy('F') #使用fortranorder数组,以避免复制

  • Y = solve_triangular(cholS, X, overwrite_b=True) # avoid another copy, but trash contents of X

    Y = solve_triangle (cholS, X, overwrite_b=True

  • Y = solve_triangular(cholS, X, check_finite=False) # Scipy >= 0.12 only --- but doesn't seem to have a large effect on speed...

    Y = solve_三角形(cholS, X, check_= False) # Scipy >= 0.12,但似乎并没有对速度产生很大影响……

With both of these, it should be pretty much equivalent to a direct call to MKL with no buffer copies.

对于这两种方法,它几乎相当于对MKL的直接调用,没有缓冲区副本。

I can't reproduce the issue with np.linalg.solve and scipy.linalg.solve having different speeds --- with the BLAS + LAPACK combination I have, both seem the same speed.

我不能用np.linalg重现这个问题。解决和scipy.linalg。解决有不同的速度---我有BLAS + LAPACK组合,两者看起来速度相同。

#1


11  

TL;DR: Don't use numpy's or scipy's solve when you have a triangular system, just use scipy.linalg.solve_triangular with at least the check_finite=False keyword argument for fast and non-destructive solutions.

当你有一个三角形系统时,不要用numpy或scipy的解,只要用scipy.linalg就可以了。solve_triangle具有至少check_limited =False关键字参数,用于快速和非破坏性解决方案。


I found this thread after stumbling across some discrepancies between numpy.linalg.solve and scipy.linalg.solve (and scipy's lu_solve, etc.). I don't have Enthought's MKL-based Numpy/Scipy, but I hope my findings can help you in some way.

在偶然发现numpi .linalg之间的一些差异之后,我找到了这条线索。解决和scipy.linalg。solve(以及scipy的lu_solve等)。我没有恩思的基于mklout的Numpy/Scipy,但是我希望我的发现能在某种程度上帮助你。

With the pre-built binaries for Numpy and Scipy (32-bit, running on Windows 7):

使用预构建的二进制文件来实现Numpy和Scipy(32位,运行在Windows 7上):

  1. I see a significant difference between numpy.linalg.solve and scipy.linalg.solve when solving for a vector X (i.e., X is 160 by 1). Scipy runtime is 1.23x numpy's, which is I think substantial.

    我发现numpy.linalg有很大的不同。解决和scipy.linalg。求解向量X时(即。Scipy运行时是1.23倍numpy's,我认为这很重要。

  2. However, most of the difference appears to be due to scipy's solve checking for invalid entries. When passing check_finite=False into scipy.linalg.solve, scipy's solve runtime is 1.02x numpy's.

    然而,大多数差异似乎是由于scipy对无效条目的解决检查。当将check_limited =False传递到scipy.linalg时。解,scipy的解运行时是1。02x numpy的。

  3. Scipy's solve using destructive updates, i.e., with overwrite_a=True, overwrite_b=True is slightly faster than numpy's solve (which is non-destructive). Numpy's solve runtime is 1.021x destructive scipy.linalg.solve. Scipy with just check_finite=False has runtime 1.04x the destructive case. In summary, destructive scipy.linalg.solve is very slightly faster than either of these cases.

    Scipy使用破坏性更新解决,即。,对于overwrite_a=True, overwrite_b=True略快于numpy的解法(它是非破坏性的)。Numpy的求解运行时是1.021倍的破坏性scipy.linalg.solve。只有check_limited =False的Scipy运行时是1.04x。总之,破坏性scipy.linalg。求解速度比这两种情况都要快。

  4. The above are for a vector X. If I make X a wide array, specifically 160 by 10000, scipy.linalg.solve with check_finite=False is essentially as fast as with check_finite=False, overwrite_a=True, overwrite_b=True. Scipy's solve (without any special keywords) runtime is 1.09x this "unsafe" (check_finite=False) call. Numpy's solve has runtime 1.03x scipy's fastest for this array X case.

    上面是向量X,如果我做一个宽数组,具体是160×10000,scipy.linalg。用check_limited =False求解本质上和用check_limited =False一样快,overwrite_a=True, overwrite_b=True。Scipy的求解(没有任何特殊关键字)运行时是1.09倍的这个“不安全”(check_limited =False)调用。Numpy的解决方案在运行时1.03x scipy是这个数组X的最快解。

  5. scipy.linalg.solve_triangular delivers significant speedups in both these cases, but you have to turn off input checking, i.e., pass in check_finite=False. The runtime for the fastest solve was 5.68x and 1.76x solve_triangular's, for vector and array X, respectively, with check_finite=False.

    scipy.linalg。solve_triangle在这两种情况下都提供了显著的加速,但是您必须关闭输入检查,即。,通过在check_finite = False。最快解的运行时为5.68x和1.76x solve_三角形的,分别为向量和数组X,用check_= False。

  6. solve_triangular with destructive computation (overwrite_b=True) gives you no speedup on top of check_finite=False (and actually hurts slightly for the array X case).

    solve_triangle with破坏性计算(overwrite_b=True)在check_limited_false之上没有加速(实际上对数组X的情况有点伤害)。

  7. I, ignoramus, was previously unaware of solve_triangular and was using scipy.linalg.lu_solve as a triangular solver, i.e., instead of solve_triangular(cholS, X) doing lu_solve((cholS, numpy.arange(160)), X) (both produce the same answer). But I discovered that lu_solve used in this way has runtime 1.07x unsafe solve_triangular for the vector X case, while its runtime was 1.76x for the array X case. I'm not sure why lu_solve is so much slower for array X, compared to vector X, but the lesson is to use solve_triangular (without infinite checks).

    我,ignoramus,之前没有意识到solve_三角形并且使用scipy.linalg。lu_solve为三角形求解器,即。,而不是solve_triangle (cholS, X)做lu_solve(cholS, numpy.arange(160)), X(两者产生相同的答案)。但我发现,以这种方式使用的lu_solve对于向量X的运行时为1.07x,不安全的solve_triangle,而对于数组X的运行时为1.76x。与向量X相比,我不确定lu_solve为什么要慢得多,但是我们的经验是使用solve_三角形(没有无限的检查)。

  8. Copying the data to Fortran format didn't seem to matter at all. Neither does converting to numpy.matrix.

    将数据复制到Fortran格式似乎并不重要。转换到numpi .matrix也是如此。

I might as well compare my non-MKL Python libraries against single-threaded (maxNumCompThreads=1) Matlab 2013a. The fastest Python implementations above had 4.5x longer runtime for the vector X case and 6.3x longer runtime for the fat matrix X case.

我不妨将我的非mkl Python库与单线程(maxNumCompThreads=1) Matlab 2013a进行比较。上面提到的最快的Python实现对于向量X的运行时要长4.5倍,而对于fat矩阵X的运行时要长6.3倍。

However, here's the Python script I used to benchmark these, perhaps someone with MKL-accelerated Numpy/Scipy can post their numbers. Note that I just comment out the line n = 10000 to disable the fat matrix X case and do the n=1 vector case. (Sorry.)

但是,这里是我用来对它们进行基准测试的Python脚本,也许使用mkl加速的Numpy/Scipy的人可以发布它们的编号。注意,我只是注释掉第n= 10000行,以禁用fat矩阵X的情况,然后做n=1向量的情况。(对不起)。

import scipy.linalg as sla
import numpy.linalg as nla
from numpy.random import RandomState
from timeit import timeit
import numpy as np

RNG = RandomState(69)

m=160
n=1
#n=10000
Ac = RNG.randn(m,m)
if 1:
    Ac = np.triu(Ac)

bc = RNG.randn(m,n)
Af = Ac.copy("F")
bf = bc.copy("F")

if 0: # Save to Matlab format
    import scipy.io as io
    io.savemat("b_%d.mat"%(n,), dict(A=Ac, b=bc))
    import sys
    sys.exit(0)

def lapper(fn, source, **kwargs):
    Alocal = source[0].copy()
    blocal = source[1].copy()
    fn(Alocal, blocal,**kwargs)

laps = (1000 if n<=1 else 100)
def printer(t, s=''):
    print ("%g seconds, %d laps, " % (t/float(laps), laps)) + s
    return t/float(laps)

t=[]
print "C"
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac,bc)), number=laps),
                 "scipy.solve"))
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac,bc), check_finite=False),
                        number=laps), "scipy.solve, infinite-ok"))
t.append(printer(timeit(lambda: lapper(nla.solve, (Ac,bc)), number=laps),
                 "numpy.solve"))

#print "F" # Doesn't seem to matter
#printer(timeit(lambda: lapper(sla.solve, (Af,bf)), number=laps))
#printer(timeit(lambda: lapper(nla.solve, (Af,bf)), number=laps))

print "sla with tweaks"
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac,bc), overwrite_a=True,
                              overwrite_b=True,  check_finite=False),
                        number=laps), "scipy.solve destructive"))

print "Tri"
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac,bc)),
                        number=laps), "scipy.solve_triangular"))
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac,bc),
                              check_finite=False), number=laps),
                 "scipy.solve_triangular, inf-ok"))
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac,bc),
                                       overwrite_b=True, check_finite=False),
                        number=laps), "scipy.solve_triangular destructive"))

print "LU"
piv = np.arange(m)
t.append(printer(timeit(lambda: lapper(
    lambda X,b: sla.lu_solve((X, piv),b,check_finite=False),
    (Ac,bc)), number=laps), "LU"))

print "all times:"
print t

Output of the above script for the vector case, n=1:

对于向量情况,上面脚本的输出,n=1:

C
0.000739405 seconds, 1000 laps, scipy.solve
0.000624746 seconds, 1000 laps, scipy.solve, infinite-ok
0.000590003 seconds, 1000 laps, numpy.solve
sla with tweaks
0.000608365 seconds, 1000 laps, scipy.solve destructive
Tri
0.000208711 seconds, 1000 laps, scipy.solve_triangular
9.38371e-05 seconds, 1000 laps, scipy.solve_triangular, inf-ok
9.37682e-05 seconds, 1000 laps, scipy.solve_triangular destructive
LU
0.000100215 seconds, 1000 laps, LU
all times:
[0.0007394047886284343, 0.00062474593940593, 0.0005900030818282472, 0.0006083650710913095, 0.00020871054023307778, 9.383710445114923e-05, 9.37682389063692e-05, 0.00010021534750467032]

Output of the above script for the matrix case n=10000:

矩阵情况n=10000时上述脚本的输出:

C
0.118985 seconds, 100 laps, scipy.solve
0.113687 seconds, 100 laps, scipy.solve, infinite-ok
0.115569 seconds, 100 laps, numpy.solve
sla with tweaks
0.113122 seconds, 100 laps, scipy.solve destructive
Tri
0.0725959 seconds, 100 laps, scipy.solve_triangular
0.0634396 seconds, 100 laps, scipy.solve_triangular, inf-ok
0.0638423 seconds, 100 laps, scipy.solve_triangular destructive
LU
0.1115 seconds, 100 laps, LU
all times:
[0.11898513112988955, 0.11368747217793944, 0.11556863916356903, 0.11312182352918797, 0.07259593807427585, 0.0634396208970783, 0.06384230931663318, 0.11150022257648459]

Note that the above Python script can save its arrays as Matlab .MAT data files. This is currently disabled (if 0, sorry), but if enabled, you can test Matlab's speed on the exact same data. Here's a timing script for Matlab:

注意,上面的Python脚本可以将其数组保存为Matlab . mat数据文件。这是目前禁用的(如果是0,不好意思),但是如果启用,您可以在相同的数据上测试Matlab的速度。下面是Matlab的计时脚本:

clear
q = load('b_10000.mat');
A=q.A;
b=q.b;
clear q
matrix_time = timeit(@() A\b)

q = load('b_1.mat');
A=q.A;
b=q.b;
clear q
vector_time = timeit(@() A\b)

You'll need the timeit function from Mathworks File Exchange: http://www.mathworks.com/matlabcentral/fileexchange/18798-timeit-benchmarking-function. This produces the following output:

您将需要Mathworks文件交换中的timeit函数:http://www.mathworks.com/matlabcentral/fileexchange/18798-timeit基准测试函数。这将产生以下输出:

matrix_time =
    0.0099989
vector_time =
   2.2487e-05

The upshot of this empirical analysis is, in Python at least, don't use numpy's or scipy's solve when you have a triangular system, just use scipy.linalg.solve_triangular with at least the check_finite=False keyword argument for fast and non-destructive solutions.

这个经验分析的结果是,至少在Python中,当你有一个三角形系统时,不要使用numpy或scipy的解,只要使用scipy.linalg就可以了。solve_triangle具有至少check_limited =False关键字参数,用于快速和非破坏性解决方案。

#2


3  

Why not just use the equation: Q = inv(chol(S)) * X, here is my test:

为什么不直接用公式:Q = inv(chol(S)) * X,这是我的测试:

import scipy.linalg
import numpy as np

N = 160
M = 100000
S = np.random.randn(N, N)
B = np.random.randn(N, M)
S = np.dot(S, S.T)

cS = scipy.linalg.cholesky(S, lower=True)
Y1 = scipy.linalg.solve(cS, B)
icS = scipy.linalg.inv(cS)
Y2 = np.dot(icS, B)

np.allclose(Y1, Y2)

output:

输出:

True

Here is the time test:

这里是时间测试:

%time scipy.linalg.solve(cholS, B)
%time np.linalg.solve(cholS, B)
%time scipy.linalg.solve_triangular(cholS, B, lower=True)
%time ics=scipy.linalg.inv(cS);np.dot(ics, B)

output:

输出:

CPU times: user 2.07 s, sys: 0.00 s, total: 2.07 s
Wall time: 2.08 s
CPU times: user 1.93 s, sys: 0.00 s, total: 1.93 s
Wall time: 1.92 s
CPU times: user 1.12 s, sys: 0.00 s, total: 1.12 s
Wall time: 1.13 s
CPU times: user 0.71 s, sys: 0.00 s, total: 0.71 s
Wall time: 0.72 s

I don't know why scipy.linalg.solve_triangular is slower than numpy.linalg.solve on your system, but the inv version is the fastest.

我不知道为什么要剪。solve_三角形比numpy.linalg慢。在您的系统上解决问题,但是inv版本是最快的。

#3


2  

A couple of things to try:

有几件事值得一试:

  • X = X.copy('F') # use fortran-order arrays, so that a copy is avoided

    X = X.copy('F') #使用fortranorder数组,以避免复制

  • Y = solve_triangular(cholS, X, overwrite_b=True) # avoid another copy, but trash contents of X

    Y = solve_triangle (cholS, X, overwrite_b=True

  • Y = solve_triangular(cholS, X, check_finite=False) # Scipy >= 0.12 only --- but doesn't seem to have a large effect on speed...

    Y = solve_三角形(cholS, X, check_= False) # Scipy >= 0.12,但似乎并没有对速度产生很大影响……

With both of these, it should be pretty much equivalent to a direct call to MKL with no buffer copies.

对于这两种方法,它几乎相当于对MKL的直接调用,没有缓冲区副本。

I can't reproduce the issue with np.linalg.solve and scipy.linalg.solve having different speeds --- with the BLAS + LAPACK combination I have, both seem the same speed.

我不能用np.linalg重现这个问题。解决和scipy.linalg。解决有不同的速度---我有BLAS + LAPACK组合,两者看起来速度相同。