How to assign a scipy.sparse matrix to a NumPy array via indexing?

When I try to assign a scipy.sparse matrix s (any of the available sparse types) to a NumPy array a like this:

a[:] = s

I get a TypeError:

TypeError: float() argument must be a string or a number

Is there a way to get around this?

I know about the todense() and toarray() methods, but I'd really like to avoid the unnecessary copy, and I'd prefer to use the same code for both NumPy arrays and SciPy sparse matrices. For now, I don't mind if reading the values out of the sparse matrix is inefficient.

Is there perhaps some kind of wrapper around sparse matrices that works with NumPy indexing assignment?

If not, any advice on how I could build such a thing myself?

Is there a different sparse array library that cooperates with NumPy in this situation?

UPDATE:

I poked around in the NumPy sources and, searching for the error message string, I think I found the section where the indexing assignment happens in numpy/core/src/multiarray/arraytypes.c.src around line 187 in the function @TYPE@_setitem().

I still don't really get it, but at some point, the float() function seems to be called (if a is a floating-point array). So I tried to monkey-patch one of the SciPy sparse matrix classes to allow this function to be called:

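import numpy as np
import scipy.sparse

a = np.zeros((5, 1))
s = scipy.sparse.dok_matrix((5, 1))

def myfloat(self):
    # only a 1x1 matrix can sensibly be converted to a single float
    assert self.shape == (1, 1)
    return self[0, 0]

scipy.sparse.dok_matrix.__float__ = myfloat
a[:] = s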

Sadly, this doesn't work because float() is called on the whole sparse matrix and not on the individual items thereof.

So I guess my new question is: how can I further change the sparse matrix class to make NumPy iterate over all the items and call float() on each of them?

ANOTHER UPDATE:

I found a sparse array module on GitHub (https://github.com/FRidh/sparse), which allows assignment to a NumPy array. Sadly, the features of the module are quite limited (e.g. slicing doesn't really work yet), but it might help to understand how assigning to NumPy arrays can be achieved. I'll investigate that further ...

YET ANOTHER UPDATE:

I did some more digging and found that a more interesting source file is probably numpy/core/src/multiarray/ctors.c. I suspect that the function PySequence_Check() is called at some point during the assignment. The simple sparse array class from https://github.com/FRidh/sparse passes the test, but it looks like the sparse matrix classes from SciPy don't (although in my opinion they are sequences).

They get checked for __array_struct__, __array_interface__ and __array__, and then it's somehow decided that they are not sequences. The attributes __getitem__ and __len__ (which all the sparse array classes have!) are not checked.

This leads me to yet another question: How can I manipulate the sparse matrix classes (or objects thereof) in a way that they pass PySequence_Check()?

I think as soon as they are recognized as sequences, assignment should work, because __getitem__() and __len__() should be sufficient for that.

2 Solutions

#1

As mentioned in a comment on my question, the sequence interface won't work for sparse matrices, because they don't lose a dimension when indexed with a single number. To try it anyway, I wrote a very limited quick-and-dirty sparse array class in pure Python. Indexing it with a single number returns a "row" object (which holds a view of the original data), and indexing that with a single number yields the actual value at this index. With an instance s of my class, assigning to a NumPy array a works exactly as requested:

a[:] = s

I expected this to be somewhat inefficient, but it is really, really, really, extremely slow. Assigning a 500,000 x 100 sparse array took several minutes! The good news, though, is that no full-sized temporary array is created during the assignment. The memory usage stays about constant during the assignment (while one of the CPUs maxes out).

So this is basically one solution to the original question.

To make the assignment more efficient and still use no temporary copy of the dense array data, NumPy would have to internally do something similar to

s.toarray(out=a)

As far as I know, there is currently no way to get NumPy to do that.
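
For what it's worth, SciPy's own toarray() accepts an out argument, so the call can be made directly instead of going through NumPy's indexing assignment. A minimal sketch, assuming a has the same shape and dtype as s and is C- or F-contiguous (the matrix contents here are made up for illustration):

```python
import numpy as np
import scipy.sparse

# Small COO matrix with two stored values; float data gives dtype float64.
s = scipy.sparse.coo_matrix(([2.0, 3.0], ([0, 1], [0, 2])), shape=(3, 4))

a = np.ones((3, 4))   # existing dense target, same shape and dtype as s
s.toarray(out=a)      # zeroes a, then writes the stored values into it
```

This is not the same as a[:] = s, so it does not give identical code for dense and sparse inputs, but it avoids allocating a second dense array.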

However, there is a way to do something very similar, by providing an __array__() method that returns a NumPy array. Incidentally, SciPy sparse matrices already have such a method, just with a different name: toarray(). So I just renamed it:

scipy.sparse.dok_matrix.__array__ = scipy.sparse.dok_matrix.toarray
a[:] = s

This works like a charm (also with the other sparse matrix classes) and is totally fast!
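
One caveat: NumPy may pass extra arguments to __array__() (dtype, and copy in newer versions), while toarray() takes order and out instead. A more defensive sketch wraps toarray() in a small function. The name sparse_as_array is hypothetical, and the patch affects every dok_matrix instance in the process:

```python
import numpy as np
import scipy.sparse

def sparse_as_array(self, dtype=None, copy=None):
    """Hypothetical __array__ hook: densify, optionally casting to dtype."""
    dense = self.toarray()
    return dense if dtype is None else dense.astype(dtype, copy=False)

# Monkey-patch one sparse class (illustration only).
scipy.sparse.dok_matrix.__array__ = sparse_as_array

s = scipy.sparse.dok_matrix((5, 1))
s[2, 0] = 7.0

a = np.ones((5, 1))
a[:] = s  # NumPy finds __array__ and copies the dense result into a
```

The same function could be attached to the other sparse classes in the same way.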

According to my limited understanding of the situation, this should create a temporary NumPy array with the same size as a which holds all the values from s (and many zeros) and which is then assigned to a. But strangely, even when I use a very large a that occupies nearly all my available RAM, the assignment still happens very quickly and no additional RAM is used.

So I guess this is another, much better solution to my original question.

Which leaves another question: why does this work without a temporary array?

#2

How about using nonzero to identify which elements are not zero?

import numpy as np
from scipy import sparse

x = np.ones((3,4))
s = sparse.csr_matrix((3,4))
s[0,0] = 2
s[1,2] = 4
I,J = s.nonzero()
x[:] = 0  # omit if just changing nonzero values
x[I,J] = s.data
x

nonzero works the same way for both dense and csr arrays. I haven't tried it with the other formats.

For a csr (and coo) sparse matrix, the values are stored in the s.data array. In this example it looks like:

array([ 2.,  4.])

x values are in a data buffer, x.data. In this case it is 12 contiguous floats.

x.ravel()
# array([ 2.,  0.,  0.,  0.,  0.,  0.,  4.,  0.,  0.,  0.,  0.,  0.])

There's no way that those 2 values of s can be mapped on to the 12 values of x without copying. The sparse data values do not, in general, match with a contiguous block of values in its dense equivalent.

You worry about the size of the I and J arrays. If the sparse matrix were in coo format, its row and col attributes could be used in the same way:

sc=s.tocoo()
x[sc.row, sc.col]=sc.data

But converting from one sparse format to another involves copying data. And coo arrays are not subscriptable.
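
Putting the pieces together, a helper along these lines would let the same call handle dense and sparse sources. This is only a sketch: the name assign_into is hypothetical, and it assumes the sparse matrix holds no duplicate entries (with duplicates, plain index assignment keeps only one value per position instead of summing them).

```python
import numpy as np
from scipy import sparse

def assign_into(dst, src):
    """Hypothetical helper: copy src (dense or sparse) into dst in place."""
    if sparse.issparse(src):
        coo = src.tocoo()                 # copies only the stored entries
        dst[...] = 0                      # everything not stored is zero
        dst[coo.row, coo.col] = coo.data  # scatter the stored values
    else:
        dst[...] = src                    # ordinary NumPy assignment
    return dst

x = np.ones((3, 4))
s = sparse.csr_matrix(([2.0, 4.0], ([0, 1], [0, 2])), shape=(3, 4))
assign_into(x, s)
```

The temporary storage is proportional to the number of stored values, not to the size of the dense array.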

x = np.zeros((3,4))
x[:]=['123','321','0','1']

produces

array([[ 123.,  321.,    0.,    1.],
       [ 123.,  321.,    0.,    1.],
       [ 123.,  321.,    0.,    1.]])

It does apply float to each item on the right side, and then 'broadcasts' the result to fit the shape of x.

The [] translates into a call to __setitem__.

x.__setitem__((1,2),3)  # same as x[1,2]=3
x.__setitem__((None,2),'3') # sets the last row

It appears to call float on each item of any iterable (need to double check this). But if the value is some other object, we get an error similar to your original one:

class Foo(): pass
x.__setitem__((1,2), Foo())
# TypeError: float() argument must be a string or a number, not 'Foo'

Sparse coo and dok formats produce a similar error, while csr and lil produce a

ValueError: setting an array element with a sequence.

I haven't figured out which method or attribute of the sparse matrix is being used here.

Take a look at np.broadcast. I think that replicates the kind of iteration used in these assignments.

b = np.broadcast(x[:], [1,2,3,4])
list(b)

We could remove the float conversion complication by starting with an array with dtype object, which can hold anything:

xa=np.zeros((3,4), dtype=object)
xa[:]=s

But now s appears in each element of xa. It hasn't distributed the values of s over xa.

I am guessing that when s is not an np.array, numpy first wraps it when doing the assignment, e.g.:

x[:] = np.array(s)

When s is a scalar, or list, this produces an array that can be broadcast to fit x. But when it is an object (a sparse array is not a numpy array), this wrapping is just a 0d array with dtype=object. You need to pass s through a function that turns it into an iterable that can be broadcast. The most obvious one is toarray().
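
The wrapping step can be checked directly. A small sketch, assuming the sparse class defines no __array__ method (which, as far as I know, is the default in SciPy); the matrix contents are made up for illustration:

```python
import numpy as np
from scipy import sparse

s = sparse.csr_matrix(([2.0, 4.0], ([0, 1], [0, 2])), shape=(3, 4))

wrapped = np.array(s)   # NumPy cannot look inside s, so it wraps the object itself
print(wrapped.shape)    # shape of the wrapper, not of s
print(wrapped.dtype)    # dtype of the wrapper

dense = s.toarray()     # an ordinary 2-D float array that can be broadcast
x = np.zeros((3, 4))
x[:] = dense
```

The wrapper has no dimensions of its own, which is why the assignment has nothing to iterate over until s is converted explicitly.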
