Why are lil_matrix and dok_matrix so slow compared to a common dict of dicts?

Date: 2021-03-06 18:06:04

I want to iteratively build sparse matrices, and noticed that there are two suitable options for this according to the SciPy documentation:

LiL matrix:

class scipy.sparse.lil_matrix(arg1, shape=None, dtype=None, copy=False)[source] Row-based linked list sparse matrix

This is an efficient structure for constructing sparse matrices incrementally.

DoK matrix:

class scipy.sparse.dok_matrix(arg1, shape=None, dtype=None, copy=False)[source] Dictionary Of Keys based sparse matrix.

This is an efficient structure for constructing sparse matrices incrementally.

But when I run benchmarks comparing them to building a dictionary of dictionaries of values (which can later easily be converted to a sparse matrix), the latter turns out to be about 10-20 times faster than using either of the sparse matrix models:

from scipy.sparse import dok_matrix, lil_matrix
from timeit import timeit
from collections import defaultdict

def common_dict(rows, cols):
    freqs = defaultdict(lambda: defaultdict(int))
    for row, col in zip(rows, cols):
        freqs[row][col] += 1

    return freqs

def dok(rows, cols):
    freqs = dok_matrix((1000,1000))
    for row, col in zip(rows, cols):
        freqs[row,col] += 1

    return freqs

def lil(rows, cols):
    freqs = lil_matrix((1000,1000))
    for row, col in zip(rows, cols):
        freqs[row,col] += 1

    return freqs


def benchmark():
    cols = range(1000)
    rows = range(1000)

    res = timeit("common_dict({},{})".format(rows, cols), 
                 "from __main__ import common_dict", 
                 number=100)

    print("common_dict: {}".format(res))

    res = timeit("dok({},{})".format(rows, cols), 
                 "from __main__ import dok", 
                 number=100)

    print("dok: {}".format(res))

    res = timeit("lil({},{})".format(rows, cols), 
                 "from __main__ import lil", 
                 number=100)

    print("lil: {}".format(res))

Results:

benchmark()

common_dict: 0.11778324202168733
dok: 2.2927695910912007
lil: 1.3541790939634666

What is it that causes such an overhead for the matrix models, and is there some way to speed it up? Are there use cases where either dok or lil is to be preferred over a common dict of dicts?

2 Answers

#1


When I change your += to just = for your two sparse matrices:

for row, col in zip(rows, cols):
    #freqs[row,col] += 1
    freqs[row,col] = 1

their respective times are cut in half. What's consuming the most time is the indexing. With += it has to do both a __getitem__ and a __setitem__.

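Conceptually, each freqs[row, col] += 1 expands into a separate read and a separate write, roughly like this sketch:

tmp = freqs.__getitem__((row, col))      # the read: a full sparse index lookup
freqs.__setitem__((row, col), tmp + 1)   # the write: a second full pass to store the value
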
When the docs say that dok and lil are better for iterative construction they mean that it's easier to expand their underlying data structures than for the other formats.

When I try to make a csr matrix with your code, I get a:

/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:690: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient. SparseEfficiencyWarning)

and the code runs about 30x slower.

So the speed claims are relative to formats like csr, not relative to pure Python or numpy structures.

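For reference, here is a sketch of the csr variant that produces that warning; it mirrors the dok() and lil() functions from the question:

from scipy.sparse import csr_matrix

def csr(rows, cols):
    freqs = csr_matrix((1000, 1000))
    for row, col in zip(rows, cols):
        # each insertion changes the sparsity structure, which is what
        # triggers the SparseEfficiencyWarning and the ~30x slowdown
        freqs[row, col] += 1
    return freqs
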
You might want to look at the Python code for dok_matrix.__getitem__ and dok_matrix.__setitem__ to see what happens when you do freqs[r, c].

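One way to pull up those methods interactively (the module that defines them has moved around between SciPy versions, but both are plain Python functions):

import inspect
from scipy.sparse import dok_matrix

print(inspect.getsource(dok_matrix.__getitem__))
print(inspect.getsource(dok_matrix.__setitem__))
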

A faster way to construct your dok would be:

freqs = dok_matrix((1000,1000))
d = dict()
for row, col in zip(rows, cols):
    d[(row, col)] = 1
freqs.update(d)

taking advantage of the fact that a dok is a subclassed dictionary. Note that a dok matrix is not a dictionary of dictionaries; its keys are tuples like (50, 50).

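A quick check of that tuple-keyed, dict-like behaviour (in the SciPy versions this answer targets, dok_matrix subclasses dict):

from scipy.sparse import dok_matrix

m = dok_matrix((100, 100))
m[50, 50] = 1.0
print(list(m.keys()))    # [(50, 50)]
print(m.get((50, 50)))   # 1.0
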
Another fast way of constructing the same sparse array is:

freqs = sparse.coo_matrix((np.ones(1000,int),(rows,cols)))

In other words, since you already have the rows and cols arrays (or ranges), calculate the corresponding data array, and THEN construct the sparse array.

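A fuller sketch of that coo route, with the imports and an explicit shape (the variable names are just illustrative):

import numpy as np
from scipy import sparse

rows = np.arange(1000)
cols = np.arange(1000)
data = np.ones(len(rows), dtype=int)   # one count per (row, col) pair
freqs = sparse.coo_matrix((data, (rows, cols)), shape=(1000, 1000))
# duplicate (row, col) entries are summed on conversion, which reproduces
# the frequency counting done by the original loops:
freqs_csr = freqs.tocsr()
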
But if you must perform sparse operations on your matrix between incremental growth steps, then dok or lil may be your best choices.


Sparse matrices were developed for linear algebra problems, such as solving a linear equation with a large sparse matrix. I used them years ago in MATLAB to solve finite difference problems. For that work the calculation-friendly csr format is the ultimate goal, and the coo format was a convenient initialization format.

Now many of the scipy sparse questions on SO arise from scikit-learn and text analysis problems. Sparse matrices are also used in biological database files. But still the (data, (row, col)) definition method works best.

So sparse matrices were never intended for fast incremental creation. The traditional Python structures like dictionaries and lists are much better for that.


Here's a faster dok iteration that takes advantage of its dictionary methods. update seems to work as fast as on a plain dictionary, and get is about 3x faster than the equivalent indexing (freqs[row, col]). Indexing probably uses get, but must add a lot of overhead on top.

def fast_dok(rows, cols):
    freqs = dok_matrix((1000, 1000))
    for row, col in zip(rows, cols):
        # dict-style read (get) and write (update), bypassing sparse indexing
        i = freqs.get((row, col), 0)
        freqs.update({(row, col): i + 1})
    return freqs

Skipping the get, and just doing

        freqs.update({(row, col): 1})

is even faster - faster than the defaultdict-of-defaultdicts example, and nearly as fast as simple dictionary initialization ({(r, c): 1 for r, c in zip(rows, cols)}).

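Putting that together, a sketch that builds the whole dictionary first and hands it to the dok in one update call (dict_dok is just an illustrative name):

from scipy.sparse import dok_matrix

def dict_dok(rows, cols):
    freqs = dok_matrix((1000, 1000))
    # one bulk update instead of per-element sparse indexing
    freqs.update({(r, c): 1 for r, c in zip(rows, cols)})
    return freqs

Like the update-only variant above, this sets repeated (row, col) pairs to 1 rather than counting them.
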
#2


There are various reasons why your test is not fair. Firstly, you're including the overhead of constructing the sparse matrices as part of your timed loop.

Secondly, and arguably more importantly, you should use data structures as they are designed to be used, with operations on the whole array at once. That is, rather than iterating over the rows and columns and adding 1 each time, simply add 1 to the whole array.

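For this particular counting task, a whole-array version might look like the following sketch, using numpy's unbuffered np.add.at so that repeated (row, col) pairs accumulate correctly (a plain freqs[rows, cols] += 1 would not; vectorized_counts is an illustrative name):

import numpy as np

def vectorized_counts(rows, cols, shape=(1000, 1000)):
    freqs = np.zeros(shape, dtype=int)
    # unbuffered add: each occurrence of a (row, col) pair contributes 1
    np.add.at(freqs, (np.asarray(rows), np.asarray(cols)), 1)
    # scipy.sparse.csr_matrix(freqs) will sparsify the result afterwards if needed
    return freqs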