I gather text documents (in Node.js) where one document is represented as a list of words. What is an efficient way to compute the similarity between these documents, taking into account that new documents keep arriving as a sort of stream?
I currently use cosine similarity on the normalized frequency of the words within each document. I don't use TF-IDF (term frequency, inverse document frequency) because of the scalability issue: I keep getting more and more documents.
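For concreteness, here is a minimal sketch of that representation, assuming each document arrives as a plain array of word strings and reading "normalized frequency" as L2-normalized counts, so that a plain dot product of two such vectors is already their cosine similarity (the function name is illustrative):

```typescript
// Turn a document (list of words) into an L2-normalized term-frequency
// vector, stored sparsely as word -> weight.
function toVector(words: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const w of words) counts.set(w, (counts.get(w) ?? 0) + 1);
  let norm = 0;
  for (const c of counts.values()) norm += c * c;
  norm = Math.sqrt(norm);
  const vec = new Map<string, number>();
  for (const [w, c] of counts) vec.set(w, c / norm);
  return vec;
}
```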
Initially

My first version was to start with the currently available documents, compute a big term-document matrix A, and then compute S = A^T x A so that S(i, j) is (after normalization by both norm(doc(i)) and norm(doc(j))) the cosine similarity between documents i and j, whose word frequencies are respectively doc(i) and doc(j).
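A sketch of that first version, reusing the hypothetical toVector above; each document vector corresponds to one column of A, so every entry of S is a pairwise dot product:

```typescript
// Sparse dot product of two L2-normalized document vectors: this is
// exactly the (i, j) entry of S = A^T x A after normalization.
function dot(a: Map<string, number>, b: Map<string, number>): number {
  // Iterate over the smaller vector for speed.
  const [small, big] = a.size <= b.size ? [a, b] : [b, a];
  let s = 0;
  for (const [word, v] of small) {
    const u = big.get(word);
    if (u !== undefined) s += v * u;
  }
  return s;
}

// Full pairwise similarity matrix over the initial batch of documents.
function similarityMatrix(docs: Map<string, number>[]): number[][] {
  const n = docs.length;
  const S: number[][] = Array.from({ length: n }, () => new Array(n).fill(0));
  for (let i = 0; i < n; i++) {
    S[i][i] = 1; // each vector has unit norm
    for (let j = i + 1; j < n; j++) {
      S[i][j] = S[j][i] = dot(docs[i], docs[j]); // S is symmetric
    }
  }
  return S;
}
```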
For new documents

What do I do when I get a new document doc(k)? Well, I have to compute the similarity of this document with all the previous ones, which doesn't require building a whole matrix. I can just take the inner product doc(k) dot doc(j) for all previous j, and that results in S(k, j), which is great.
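Under the same assumptions, the incremental step is just one row of sparse dot products, reusing the hypothetical dot helper from the sketch above:

```typescript
// Similarities of a new document doc(k) against all previous documents:
// one sparse dot product each, no matrix rebuild required.
function similaritiesForNewDoc(
  newDoc: Map<string, number>,
  previousDocs: Map<string, number>[]
): number[] {
  return previousDocs.map((d) => dot(newDoc, d));
}
```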
The troubles

- Computing S in Node.js takes really long. Way too long, in fact! So I decided to create a C++ module which does the whole thing much faster. And it does! But I cannot wait for it; I should be able to use intermediate results. What I mean by "not wait for it" is both
  a. not having to wait for the computation to be done, but also
  b. not having to wait for the matrix A to be built (it's a big one).
- Computing a new S(k, j) can take advantage of the fact that documents have far fewer words than the set of all known words (which I use to build the whole matrix A). Thus, it looks faster to do it in Node.js, avoiding a lot of extra resources spent just on accessing the data; see the inverted-index sketch after this list.
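One way to take that sparsity argument further is an inverted index (word to document ids): scoring a new document then only touches previous documents that share at least one word with it, and documents whose similarity would be zero are never visited. A hypothetical sketch, building on the vectors from toVector above:

```typescript
// Hypothetical incremental index: keeps all document vectors plus an
// inverted index (word -> ids of documents containing that word).
class SimilarityIndex {
  private docs: Map<string, number>[] = [];
  private postings = new Map<string, number[]>();

  // Add a new document vector and return its cosine similarity to every
  // previous document it shares at least one word with (others are 0).
  add(vec: Map<string, number>): Map<number, number> {
    const scores = new Map<number, number>();
    for (const [word, weight] of vec) {
      const ids = this.postings.get(word);
      if (!ids) continue;
      for (const id of ids) {
        const other = this.docs[id].get(word)!;
        scores.set(id, (scores.get(id) ?? 0) + weight * other);
      }
    }
    // Register the new document for future queries.
    const id = this.docs.length;
    this.docs.push(vec);
    for (const word of vec.keys()) {
      let ids = this.postings.get(word);
      if (!ids) this.postings.set(word, (ids = []));
      ids.push(id);
    }
    return scores;
  }
}
```

Because the vectors are L2-normalized up front, the accumulated dot products are already cosine similarities; no per-pair normalization step is needed.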
But is there any better way to do that?
Note: the reason I started computing S is that I can easily build A in Node.js, where I have access to all the data, and then do the matrix multiplication in C++ and get the result back in Node.js, which speeds the whole thing up a lot. But now that computing S is becoming impracticable, this looks useless.
Note 2: yep, I don't have to compute the whole S; I can just compute the upper-right elements (or the lower-left ones). But that's not the issue: the computation time problem is not of that order.
1 solution
#1 (score: 0)
If one has to solve it today, just use pre-trained word vectors from fasttext or word2vec
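For what it's worth, a sketch of that suggestion: represent each document as the average of its words' pre-trained vectors and compare documents with plain cosine similarity. Loading the vectors (e.g. parsing a fastText .vec file into a Map) is omitted, and embeddings, dim, and the function names are assumptions, not an established API:

```typescript
// Average the pre-trained vectors of a document's words; words without
// an embedding are skipped.
function docEmbedding(
  words: string[],
  embeddings: Map<string, Float64Array>,
  dim: number
): Float64Array {
  const sum = new Float64Array(dim);
  let hits = 0;
  for (const w of words) {
    const e = embeddings.get(w);
    if (!e) continue;
    for (let i = 0; i < dim; i++) sum[i] += e[i];
    hits++;
  }
  if (hits > 0) for (let i = 0; i < dim; i++) sum[i] /= hits;
  return sum;
}

// Cosine similarity between two dense embeddings.
function cosine(a: Float64Array, b: Float64Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom > 0 ? dot / denom : 0;
}
```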