How to efficiently find similar documents

Time: 2022-12-06 19:14:15

I have lots of documents that I have clustered using a clustering algorithm. In the clustering algorithm, each document may belong to more than one cluster. I've created a table storing the document-cluster assignment and another one which stores the cluster-document info. When I look for the list of documents similar to a given document (let's say d_i), I first retrieve the list of clusters to which it belongs (from the document-cluster table), and then for each cluster c_j in that list I retrieve the list of documents which belong to c_j from the cluster-document table. There is more than one c_j, so obviously there will be multiple lists. Each list has many documents, and there might be overlaps among these lists.


In the next phase, in order to find the documents most similar to d_i, I rank the candidate documents by the number of clusters they have in common with d_i.


My question is about this last phase. A naive solution is to create a sorted HashMap-like structure which has the document as the key and the number of common clusters as the value. However, as each list might contain many, many documents, this may not be the best solution. Is there any other way to rank the similar items? Any preprocessing, or something else?

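For concreteness, a minimal Python sketch of that naive approach, assuming the two tables are held in hypothetical dictionaries doc_to_clusters and cluster_to_docs (the identifiers and sample data are illustrative only, not from the original setup):

from collections import Counter

# Hypothetical lookup tables; names and sample data are illustrative only.
doc_to_clusters = {"d1": ["c1", "c2", "c3"]}
cluster_to_docs = {"c1": ["d1", "d2", "d3"],
                   "c2": ["d1", "d3", "d4"],
                   "c3": ["d1", "d3", "d2"]}

def rank_similar(doc_id):
    # Count, for every candidate document, how many clusters it shares
    # with doc_id, then sort by that count (descending).
    common = Counter()
    for c in doc_to_clusters[doc_id]:          # clusters containing doc_id
        for d in cluster_to_docs[c]:           # documents assigned to that cluster
            if d != doc_id:
                common[d] += 1                 # one more shared cluster
    return common.most_common()                # most similar documents first

# rank_similar("d1") -> [('d3', 3), ('d2', 2), ('d4', 1)]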

1 solution

#1


Assuming that the number of arrays is relatively small compared to the number of elements (and in particular, that the number of arrays is in o(log n)), you can do it with a modification of bucket sort:


Let m be the number of arrays. Create a list buckets[] containing m buckets, where each bucket[i] is a hash set.


for each array arr:
   for each element x in arr:
      find whether x is already in some bucket; if so, let that bucket's id be i:
          remove x from bucket i
          i <- i + 1
      if no such bucket exists, set i = 1
      add x to bucket i          # bucket i now reflects how many arrays have contained x so far

for each bucket i = m, m-1, ..., 1 in descending order:
   for each element x in bucket[i]:
      yield x                    # elements appearing in the most arrays come out first

The above runs in O(m^2*n):


  • Iterating over each array
  • Iterating over all elements in each array
  • Finding the relevant bucket.

Note that the last step can be done in O(1) by also maintaining a map element -> bucket_id in a hash table, so we can improve the total to O(m*n).

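For illustration, a minimal Python sketch of this bucket-based ranking, including the element -> bucket_id map that gives O(1) bucket lookup (the function and variable names are my own, not from the answer):

def rank_by_occurrences(arrays):
    # Bucket-sort variant sketch: rank elements by how many of the m
    # candidate lists they appear in, most frequent first.
    # `arrays` is a hypothetical list of per-cluster document lists for d_i.
    m = len(arrays)
    buckets = [set() for _ in range(m + 1)]   # buckets[i] holds elements seen in i lists so far
    bucket_of = {}                            # element -> current bucket id, the O(1) lookup map

    for arr in arrays:
        for x in arr:
            i = bucket_of.get(x, 0)           # 0 means "not in any bucket yet"
            if i:
                buckets[i].discard(x)         # move x up one bucket
            buckets[i + 1].add(x)
            bucket_of[x] = i + 1

    for i in range(m, 0, -1):                 # highest bucket first
        for x in buckets[i]:
            yield x, i                        # (document, number of common clusters)

# Example with hypothetical candidate lists:
# list(rank_by_occurrences([["d2", "d3"], ["d3", "d4"], ["d3", "d2"]]))
# -> [('d3', 3), ('d2', 2), ('d4', 1)]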


An alternative is to use a hashmap as a histogram that maps each element to its number of occurrences, and then sort an array containing all the elements based on the histogram. The benefit of this approach is that it can be distributed very nicely with map-reduce:


map(partial list of elements l):
    for each element x in l:
       emit(x, 1)
reduce(x, list<number>):
   s = sum{list}
   emit(x, s)
combine(x, list<number>):
   s = sum{list}   // or size{list} for a combiner
   emit(x, s)
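For illustration, a minimal single-process Python sketch of this map-reduce flow; the shard layout and identifiers are assumptions for the example, and in a real deployment each phase would run on a map-reduce framework:

from collections import defaultdict
from itertools import chain

def map_phase(partial_lists):
    # Mapper sketch: emit (document, 1) for every occurrence in this worker's share.
    return [(x, 1) for x in chain.from_iterable(partial_lists)]

def reduce_phase(pairs):
    # Reducer sketch: sum the counts emitted for each document.
    counts = defaultdict(int)
    for x, c in pairs:
        counts[x] += c
    return counts

# Single-process simulation with hypothetical shards (each shard is one worker's input):
shards = [[["d2", "d3"], ["d3", "d4"]], [["d3", "d2"]]]
pairs = [p for shard in shards for p in map_phase(shard)]          # shuffle step collapsed locally
histogram = reduce_phase(pairs)
ranking = sorted(histogram.items(), key=lambda kv: kv[1], reverse=True)
# ranking -> [('d3', 3), ('d2', 2), ('d4', 1)]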
