Getting the norm of each row of a numpy sparse matrix

I have a sparse matrix that I obtained by using Sklearn's TfidfVectorizer object:

vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', vocabulary=my_vocab, stop_words='english')
tfidf = vect.fit_transform([my_docs])

The sparse matrix is (taking out the numbers for generality):

<sparse matrix of type '<type 'numpy.float64'>'
with stored elements in Compressed Sparse Row format>

I am trying to get a single numeric value for each row that tells me how strongly a document matches the terms I am looking for. I don't really care which words it contained, I just want to know how many it contained. So I want to get the norm of each row, or row*row.T. However, I am having a very hard time getting this out of numpy.

My first approach was to just simply do:

tfidf[i] * numpy.transpose(tfidf[i])

However, numpy apparently will not transpose an array with fewer than two dimensions, so that just squares the vector. So I tried doing:

tfidf[i] * numpy.transpose(numpy.atleast_2d(tfidf[0]))

But numpy.transpose(numpy.atleast_2d(tfidf[0])) still would not transpose the row.

I moved on to trying to get the norm of the row (that approach is probably better anyways). My initial approach was using numpy.linalg.

numpy.linalg.norm(tfidf[0])

But that gave me a "dimension mismatch" error. So I tried to calculate the norm manually. I started by just setting a variable equal to a numpy array version of the sparse matrix and printing out the len of the first row:

my_array = numpy.array(tfidf)
print my_array
print len(my_array[0])

It prints out my_array correctly, but when I try to access the len it tells me:

IndexError: 0-d arrays can't be indexed

I simply want to get a numeric value for each row of the sparse matrix returned by fit_transform. Getting the norm would be best. Any help here is much appreciated.

2 Solutions

#1

Some simple fake data:

import numpy as np
from scipy import sparse
a = np.arange(9.).reshape(3,3)
s = sparse.csr_matrix(a)

To get the norm of each row of the sparse matrix, you can use:

np.sqrt(s.multiply(s).sum(1))
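
The expression above returns a 2-D column rather than a flat vector. A minimal sketch, reusing the fake data from this answer, of flattening it to a 1-D array and cross-checking it against scipy.sparse.linalg.norm (which accepts an axis argument in reasonably recent SciPy releases); the variable names here are only illustrative:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import norm as sparse_norm

a = np.arange(9.).reshape(3, 3)
s = sparse.csr_matrix(a)

# .sum(axis=1) on a sparse matrix gives an (n_rows, 1) result;
# np.asarray(...).ravel() flattens it to a plain 1-D array.
row_norms = np.sqrt(np.asarray(s.multiply(s).sum(axis=1)).ravel())

# scipy.sparse.linalg.norm with axis=1 computes the same row-wise 2-norms.
row_norms_builtin = sparse_norm(s, axis=1)

print(row_norms)
print(row_norms_builtin)
```

Both approaches avoid building the dense matrix, so either should be usable on a large tf-idf matrix.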

And the renormalized s would be

s.multiply(1/np.sqrt(s.multiply(s).sum(1)))

or to keep it sparse before renormalizing:

s.multiply(sparse.csr_matrix(1/np.sqrt(s.multiply(s).sum(1))))
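
One caveat with dividing by the norms: a row with no nonzero entries (for example, a document containing none of the vocabulary terms) has norm 0, and 1/0 produces inf. A hedged sketch of one way around that, scaling the rows with a diagonal matrix built by sparse.diags; the zeroed-out middle row is a made-up test case, not part of the original data:

```python
import numpy as np
from scipy import sparse

a = np.arange(9.).reshape(3, 3)
a[1] = 0.0                      # hypothetical all-zero row
s = sparse.csr_matrix(a)

norms = np.sqrt(np.asarray(s.multiply(s).sum(axis=1)).ravel())

# Invert only the nonzero norms; zero rows get a scale of 0 and stay zero.
scale = np.zeros_like(norms)
nonzero = norms > 0
scale[nonzero] = 1.0 / norms[nonzero]

# Left-multiplying by a diagonal matrix scales each row and keeps the result sparse.
s_normalized = sparse.diags(scale).dot(s).tocsr()
```

The result stays in a sparse format throughout, so no dense intermediate is created.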

To get an ordinary matrix or array from it, use:

m = s.todense()
a = s.toarray()
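
The difference between these conversions, and why numpy.array(tfidf) in the question produced a 0-d array, can be seen in a small sketch. It assumes the classic csr_matrix class used above; the names m, dense and wrapped are just for illustration:

```python
import numpy as np
from scipy import sparse

s = sparse.csr_matrix(np.arange(9.).reshape(3, 3))

m = s.todense()        # numpy.matrix: indexing a row keeps it 2-D
dense = s.toarray()    # plain numpy.ndarray: indexing a row gives a 1-D array
wrapped = np.array(s)  # does NOT convert; wraps the sparse object itself

print(m[0].shape)                    # 2-D row
print(dense[0].shape)                # 1-D row
print(wrapped.dtype, wrapped.shape)  # object dtype, 0-d
```

In other words, numpy does not know how to unpack a sparse matrix on its own, so the conversion has to go through the matrix's own toarray() or todense() method.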

If you have enough memory for the dense version, you can get the norm of each row with:

n = np.sqrt(np.einsum('ij,ij->i',a,a))

or, equivalently:

n = np.apply_along_axis(np.linalg.norm, 1, a)
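
In NumPy versions where np.linalg.norm accepts an axis argument (it has for a long time), the row norms of a dense array can also be computed directly. A short sketch comparing it with the einsum form, using the same fake data:

```python
import numpy as np

a = np.arange(9.).reshape(3, 3)

n_einsum = np.sqrt(np.einsum('ij,ij->i', a, a))  # sum of squares per row, then sqrt
n_builtin = np.linalg.norm(a, axis=1)            # 2-norm of each row

print(n_einsum)
print(n_builtin)
```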

To normalize, you can do

an = a / n[:, None]

or, to normalize the original array in place:

a /= n[:, None]

The [:, None] index adds a second axis to n, turning it from shape (3,) into a (3, 1) column so that the division broadcasts across each row of a.
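
To make the shapes concrete, here is a small sketch on the same fake data. It also shows what happens without the extra axis: the division still runs on a square array, but it scales columns instead of rows. The name wrong is only for illustration:

```python
import numpy as np

a = np.arange(9.).reshape(3, 3)
n = np.linalg.norm(a, axis=1)

print(n.shape)           # (3,)
print(n[:, None].shape)  # (3, 1)

an = a / n[:, None]      # each row divided by its own norm
wrong = a / n            # divides column j by n[j] instead
```

On a non-square array the version without [:, None] would fail with a broadcasting error rather than silently scaling the wrong axis.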

#2

scipy.sparse is a great package, and it keeps getting better with every release, but a lot of things are still only half-baked, and you can get big performance improvements if you implement some of the algorithms yourself. For instance, here is a roughly 7x speedup over @askewchan's implementation, which uses scipy functions:

In [18]: a = sps.rand(1000, 1000, format='csr')

In [19]: a
Out[19]: 
<1000x1000 sparse matrix of type '<type 'numpy.float64'>'
    with 10000 stored elements in Compressed Sparse Row format>

In [20]: %timeit a.multiply(a).sum(1)
1000 loops, best of 3: 288 us per loop

In [21]: %timeit np.add.reduceat(a.data * a.data, a.indptr[:-1])
10000 loops, best of 3: 36.8 us per loop

In [24]: np.allclose(a.multiply(a).sum(1).ravel(),
    ...:             np.add.reduceat(a.data * a.data, a.indptr[:-1]))
Out[24]: True

You can similarly normalize the matrix in place by doing the following:

norm_rows = np.sqrt(np.add.reduceat(a.data * a.data, a.indptr[:-1]))
nnz_per_row = np.diff(a.indptr)
a.data /= np.repeat(norm_rows, nnz_per_row)
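
One thing to watch with the reduceat approach: np.add.reduceat does not return 0 for an empty slice. When two consecutive indices are equal it returns the single element at that index, and an index equal to len(a.data) raises an error, so rows with no stored elements give wrong results or fail. A hedged sketch of an alternative built on np.bincount that handles empty rows; the helper name row_norms_csr and the test matrix are made up for illustration, and no claim is made about how its speed compares with the timings above:

```python
import numpy as np
from scipy import sparse

def row_norms_csr(m):
    """Row-wise 2-norms of a CSR matrix, including rows with no stored elements."""
    nnz_per_row = np.diff(m.indptr)
    # Row index of every stored element, in storage order.
    rows = np.repeat(np.arange(m.shape[0]), nnz_per_row)
    # Sum the squared values per row; rows without entries stay at 0.
    sq_sums = np.bincount(rows, weights=m.data * m.data, minlength=m.shape[0])
    return np.sqrt(sq_sums)

# Hypothetical matrix with an empty row in the middle and at the end.
m = sparse.csr_matrix(np.array([[3., 4., 0.],
                                [0., 0., 0.],
                                [0., 0., 5.],
                                [0., 0., 0.]]))
norms = row_norms_csr(m)
print(norms)
```

For a tf-idf matrix this matters whenever some document contains none of the vocabulary terms.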

If you are going to be using sparse matrices often, read the Wikipedia page on compressed sparse formats; you will often find better ways of doing things than the defaults.
