Matrix multiplication: Clojure vs. NumPy

Date: 2021-09-24 13:54:18

I'm working on an application in Clojure that needs to multiply large matrices, and I'm running into serious performance problems compared to an identical Numpy version. Numpy seems to be able to multiply a 1,000,000x23 matrix by its transpose in under a second, while the equivalent Clojure code takes over six minutes. (I can print out the resulting matrix from Numpy, so it's definitely evaluating everything.)

Am I doing something terribly wrong in this Clojure code? Is there some trick of Numpy that I can try to mimic?

Here's the Python:

import time
import numpy as np

def test_my_mult(n):
    A = np.random.rand(n*23).reshape(n,23)
    At = A.T

    t0 = time.time()
    res = np.dot(At, A)
    print(time.time() - t0)
    print(np.shape(res))

    return res

# Example (returns a 23x23 matrix):
# >>> results = test_my_mult(1000000)
# 
# 0.906938076019
# (23, 23)

And the Clojure:

(defn feature-vec [n]
  (map (partial cons 1)
       (for [x (range n)]
         (take 22 (repeatedly rand)))))

(defn dot-product [x y]
  (reduce + (map * x y)))

(defn transpose
  "returns the transposition of a `coll` of vectors"
  [coll]
  (apply map vector coll))

(defn matrix-mult
  [mat1 mat2]
  (let [row-mult (fn [mat row]
                   (map (partial dot-product row)
                        (transpose mat)))]
    (map (partial row-mult mat2)
         mat1)))

(defn test-my-mult
  [n afn]
  (let [xs  (feature-vec n)
        xst (transpose xs)]
    (time (dorun (afn xst xs)))))

;; Example (yields a 23x23 matrix):
;; (test-my-mult 1000 i/mmult) => "Elapsed time: 32.626 msecs"
;; (test-my-mult 10000 i/mmult) => "Elapsed time: 628.841 msecs"

;; (test-my-mult 1000 matrix-mult) => "Elapsed time: 14.748 msecs"
;; (test-my-mult 10000 matrix-mult) => "Elapsed time: 434.128 msecs"
;; (test-my-mult 1000000 matrix-mult) => "Elapsed time: 375751.999 msecs"


;; Test from wikipedia
;; (def A [[14 9 3] [2 11 15] [0 12 17] [5 2 3]])
;; (def B [[12 25] [9 10] [8 5]])

;; user> (matrix-mult A B)
;; ((273 455) (243 235) (244 205) (102 160))

UPDATE: I implemented the same benchmark using the jBLAS library and found massive speed improvements. Thanks to everyone for their input! Time to wrap this sucker in Clojure. Here's the new code:

(import '[org.jblas FloatMatrix])

(defn feature-vec [n]
  (FloatMatrix.
   (into-array (for [x (range n)]
                 (float-array (cons 1 (take 22 (repeatedly rand))))))))

(defn test-mult [n]
  (let [xs  (feature-vec n)
        xst (.transpose xs)]
    (time (let [result (.mmul xst xs)]
            [(.rows result)
             (.columns result)]))))

;; user> (test-mult 10000)
;; "Elapsed time: 6.99 msecs"
;; [23 23]

;; user> (test-mult 100000)
;; "Elapsed time: 43.88 msecs"
;; [23 23]

;; user> (test-mult 1000000)
;; "Elapsed time: 383.439 msecs"
;; [23 23]

(defn matrix-stream [rows cols]
  (repeatedly #(FloatMatrix/randn rows cols)))

(defn square-benchmark
  "Times the multiplication of a square matrix."
  [n]
  (let [[a b c] (matrix-stream n n)]
    (time (.mmuli a b c))
    nil))

;; forma.matrix.jblas> (square-benchmark 10)
;; "Elapsed time: 0.113 msecs"
;; nil
;; forma.matrix.jblas> (square-benchmark 100)
;; "Elapsed time: 0.548 msecs"
;; nil
;; forma.matrix.jblas> (square-benchmark 1000)
;; "Elapsed time: 107.555 msecs"
;; nil
;; forma.matrix.jblas> (square-benchmark 2000)
;; "Elapsed time: 793.022 msecs"
;; nil

9 Answers

#1


32  

The Python version is compiling down to a loop in C, while the Clojure version is building a new intermediate sequence for each of the calls to map in this code. It is likely that the performance difference you see is coming from the difference in data structures.

To do better than this, you could play with a library like Incanter or write your own version as explained in this SO question. See also this one, Neanderthal, or nd4j. If you really want to stay with sequences to keep the lazy-evaluation properties etc., then you may get a real boost by looking into transients for the internal matrix calculations.
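The cost of those intermediate sequences is easy to demonstrate in Python itself, using its sequence functions as a stand-in for Clojure's (a rough sketch; exact timings vary by machine):

```python
import time
import numpy as np

n = 100_000
xs = [0.5] * n
ys = [2.0] * n

# Sequence-style dot product: allocates an intermediate value for every
# element, much like Clojure's (reduce + (map * x y)).
t0 = time.time()
seq_result = sum(map(lambda a, b: a * b, xs, ys))
seq_time = time.time() - t0

# Vectorized dot product: a single call into optimized native code.
ax, ay = np.array(xs), np.array(ys)
t0 = time.time()
vec_result = float(np.dot(ax, ay))
vec_time = time.time() - t0

print(seq_result, vec_result)  # both 100000.0; only the speed differs
```

On a typical machine the sequence version is one to two orders of magnitude slower, for the same underlying reason as the Clojure version above.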

EDIT: I forgot to add the first step in tuning Clojure: turn on "warn on reflection".

#2


26  

Numpy is linking to BLAS/LAPACK routines that have been optimized for decades at the level of machine architecture, while the Clojure code implements the multiplication in the most straightforward and naive manner.

Any time you have non-trivial matrix/vector operations to perform, you should probably link to BLAS/LAPACK.

The only time this won't be faster is for small matrices, where the overhead of translating the data representation between the language runtime and LAPACK exceeds the time spent doing the calculation.
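That crossover can be sketched in Python: for a tiny matrix the result is identical either way, and converting the data into the native library's representation is a real part of the cost (illustrative only; where the balance tips depends on the runtime and the bridge):

```python
import numpy as np

# Two tiny matrices stored as plain Python lists.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

# Going through the native library means converting the data first...
native = np.dot(np.array(a), np.array(b)).tolist()

# ...while a direct loop works on the host-language structures as-is.
direct = [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]

print(native)  # [[19.0, 22.0], [43.0, 50.0]]
print(direct)  # [[19.0, 22.0], [43.0, 50.0]]
```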

#3


14  

I've just staged a small shootout between Incanter 1.3 and jBLAS 1.2.1. Here's the code:

(ns ml-class.experiments.mmult
  [:use [incanter core]]
  [:import [org.jblas DoubleMatrix]])

(defn -main [m]
  (let [n 23 m (Integer/parseInt m)
        ai (matrix (vec (double-array (* m n) (repeatedly rand))) n)
        ab (DoubleMatrix/rand m n)
        ti (copy (trans ai))
        tb (.transpose ab)]
    (dotimes [i 20]
      (print "Incanter: ") (time (mmult ti ai))
      (print "   jBLAS: ") (time (.mmul tb ab)))))

In my test, Incanter is consistently slower than jBLAS by about 45% in plain matrix multiplication. However, Incanter's trans function does not create a new copy of the matrix, so (.mmul (.transpose ab) ab) in jBLAS takes twice as much memory and is only 15% faster than (mmult (trans ai) ai) in Incanter.
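NumPy behaves like Incanter's trans here: its transpose returns a view, not a copy, so no extra memory is allocated unless you force a copy explicitly (which is what doubles the memory in the jBLAS case above):

```python
import numpy as np

A = np.random.rand(4, 3)

# NumPy's transpose is a view: no data is copied.
At = A.T
print(At.base is A)         # True: At shares A's buffer

# An explicit copy allocates a second buffer.
At_copy = A.T.copy()
print(At_copy.base is A)    # False: independent storage
```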

Given Incanter's rich feature set (especially its plotting library), I don't think I'll switch to jBLAS any time soon. Still, I would love to see another shootout between jBLAS and Parallel Colt, and maybe it's worth considering replacing Parallel Colt with jBLAS in Incanter? :-)

EDIT: Here are the absolute numbers (in msec) I got on my (rather slow) PC:

Incanter: 665.362452
   jBLAS: 459.311598
   numpy: 353.777885

For each library, I've picked the best time out of 20 runs, matrix size 23x400000.

PS. Haskell hmatrix results are close to numpy, but I am not sure how to benchmark it correctly.

#4


12  

Numpy code uses built-in libraries written in Fortran over the last few decades and optimized by the authors, your CPU vendor, and your OS distributor (as well as the Numpy people) for maximal performance. You just took the completely direct, obvious approach to matrix multiplication. It's no surprise, really, that performance differs.

But if you're insistent on doing it in Clojure, consider looking up better algorithms, using direct loops as opposed to higher-order functions like reduce, or finding a proper matrix-algebra library for Java (I doubt there are good ones in Clojure, but I don't really know) written by a competent mathematician.
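The "direct loops instead of higher-order functions" advice translates to any language; a small Python sketch of the two styles for a dot product:

```python
from functools import reduce
import operator

x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]

# Higher-order style, analogous to (reduce + (map * x y)):
hof = reduce(operator.add, map(operator.mul, x, y))

# Direct-loop style, analogous to a loop/recur rewrite:
acc = 0.0
for i in range(len(x)):
    acc += x[i] * y[i]

print(hof, acc)  # 32.0 32.0
```

Same result; the loop version gives the compiler a much simpler shape to optimize, and in Clojure it also avoids per-element boxing.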

Finally, look up how to properly write fast Clojure. Use type hints, run a profiler on your code (surprise! your dot-product function is using up the most time), and drop the high-level features inside your tight loops.

#5


9  

As @littleidea and others have pointed out, your Numpy version is using LAPACK/BLAS/ATLAS, which will be much faster than anything you write in Clojure since it has been finely tuned for years. :)

That said, the biggest problem with the Clojure code is that it is using Doubles, as in boxed doubles. I call this the "lazy Double" problem and I've run into it at work a number of times. As of now, even with 1.3, Clojure's collections are not primitive-friendly. (You can create a vector of primitives, but it won't help you any, since all of the seq functions will end up boxing them! I should also say that the primitive improvements in 1.3 are quite nice and do end up helping... we just aren't 100% there WRT primitive support in collections.)

When doing any kind of matrix math in Clojure, you really need to use Java arrays or, better yet, matrix libraries. Incanter does use Parallel Colt, but you need to be careful about which Incanter functions you use... since a lot of them make the matrices seqable, which ends up boxing the doubles and giving you performance similar to what you are currently seeing. (BTW, I have my own set of Parallel Colt wrappers that I could release if you think they would be helpful.)

In order to use the BLAS libraries, you have a couple of options in Java-land. With all of these options you have to pay a JNA tax... all of your data has to be copied before it can be processed. This tax makes sense when you are doing CPU-bound operations like matrix decompositions, whose processing time takes longer than the time it takes to copy the data. For simpler operations with small matrices, staying in Java-land will probably be faster. You just need to do a few tests, like you've done above, to see what works best for you.
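The "tax" is easy to picture in Python, where NumPy plays the role of the native library (a toy example; real costs depend on sizes and the bridge in use):

```python
import numpy as np

# Data living in the host language's own structures (a plain list)...
raw = [float(i) for i in range(10)]

# ...must first be copied into a native buffer before BLAS-backed code
# can touch it. That conversion is the "copy tax".
arr = np.asarray(raw)
result = float(np.dot(arr, arr))

print(result)  # 285.0, i.e. the sum of i**2 for i in 0..9
```

For a single cheap operation on a small input, the conversion can cost more than the arithmetic; for something like a decomposition, it is amortized away.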

Here are your options for using BLAS from Java:

http://jblas.org/

http://code.google.com/p/netlib-java/

I should point out that Parallel Colt uses the netlib-java project, which means, I believe, that if you set it up correctly it will use BLAS. However, I have not verified this. For an explanation of the differences between jBLAS and netlib-java, see this thread I started about it on jBLAS's mailing list:

http://groups.google.com/group/jblas-users/browse_thread/thread/c9b3867572331aa5

I should also point out the Universal Java Matrix Package library:

http://sourceforge.net/projects/ujmp/

It wraps all of the libraries I have mentioned, and then some! I haven't looked at the API enough to know how leaky their abstraction is, but it seems like a nice project. I've ended up using my own Parallel Colt Clojure wrappers, since they were fast enough and I actually quite liked the Colt API. (Colt uses function objects, which means I was able to just pass Clojure functions with little trouble!)

#6


5  

If you want to do numerics in Clojure, I'd strongly recommend using Incanter rather than trying to roll your own matrix functions and suchlike.

Incanter uses Parallel Colt under the hood, which is pretty fast.

EDIT:

As of early 2013, if you want to do numerics in Clojure, I strongly recommend checking out core.matrix.

#7


4  

Numpy is highly optimized for linear algebra, certainly for large matrices, where most of the processing happens in native C code.

In order to match this performance (assuming it's possible in Java), you would have to strip most of Clojure's abstractions away: don't use map with anonymous functions when iterating over large matrices, add type hints to enable the use of raw Java arrays, etc.

Probably the best option is just to use a ready made Java library optimized for numerical computing (http://math.nist.gov/javanumerics/ or similar).

#8


0  

I don't have any specific answers for you; just some suggestions.

  1. Use a profiler to figure out where time is being spent.
  2. Set warn-on-reflection and use type hints where needed.
  3. You may have to give up some high-level constructs and go with loop/recur to squeeze out that last ounce of performance.
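For the profiling step, the Python side of this benchmark can be inspected with the standard library's cProfile; a JVM profiler plays the same role for the Clojure code (a sketch only):

```python
import cProfile
import io
import pstats

def dot(x, y):
    # The hot spot: called once per output element in a naive multiply.
    return sum(a * b for a, b in zip(x, y))

def bench():
    xs = [0.5] * 10_000
    for _ in range(50):
        dot(xs, xs)

# Run the workload under the profiler and capture the report.
pr = cProfile.Profile()
pr.enable()
bench()
pr.disable()

out = io.StringIO()
pstats.Stats(pr, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
print("function calls" in report)  # True: the report lists the hot spots
```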

IME, Clojure code can perform pretty close to Java (within 2 or 3x). But you have to work at it.

#9


-3  

Only use map() when it makes sense. Which means: if you have a specific problem, like multiplying two matrices, do not try to map() it; just multiply the matrices.

I tend to use map() only when it makes linguistic sense (i.e. when the program is really more readable with it than without it). Multiplying matrices is so obviously a loop that mapping it makes no sense.

Yours.

Pedro Fortuny.
