How to know which documents are in a cluster in scikit-learn?

Time: 2022-06-23 07:00:50

I am new to both Python and scikit-learn, so please bear with me.

I took the source code for the k-means clustering algorithm from the k means clustering example.

I then modified it to run on my local dataset using the load_files function.

Although the algorithm terminates, it does not produce any output showing which documents are clustered together.

I found that the km object has a "km.labels_" array which lists the centroid id of each document.

It also has the centroid vectors in "km.cluster_centers_".

But which document is it? I have to map it back to "dataset", which is a "Bunch" object.

If I print dataset.data[0], I get the contents of the first file, which I think have been shuffled. But I just want to know its name.

I am confused by questions like: is the document at dataset.data[0] assigned to the centroid at km.labels_[0]?

My basic problem is to find out which files are clustered together. How can I find that?

2 Answers

#1


11  

Forget about the Bunch object. It's just an implementation detail to load the toy datasets that are bundled with scikit-learn.

In real life, with your real data, you just have to call directly:

km = KMeans(n_clusters=n_clusters).fit(my_document_features)

then collect cluster assignments from:

km.labels_

my_document_features is a 2D data structure: either a numpy array or a scipy.sparse matrix with shape (n_documents, n_features).

km.labels_ is a 1D numpy array with shape (n_documents,). Hence the first element of labels_ is the cluster index of the document described in the first row of the my_document_features feature matrix.
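For illustration, here is a minimal sketch with a made-up toy feature matrix, showing how row i of the features maps to km.labels_[i]:

import numpy as np
from sklearn.cluster import KMeans

# Toy feature matrix: 4 documents, 2 features each (values are made up).
my_document_features = np.array([[0.9, 0.1],
                                 [0.8, 0.2],
                                 [0.1, 0.9],
                                 [0.2, 0.8]])

km = KMeans(n_clusters=2, random_state=0).fit(my_document_features)

# km.labels_[i] is the cluster index of the document in row i.
print(km.labels_)           # 1D array of shape (n_documents,)
print(km.cluster_centers_)  # one centroid row per cluster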

Typically you would build my_document_features with a TfidfVectorizer object:

my_document_features = TfidfVectorizer().fit_transform(my_text_documents)

and my_text_documents would be either a list of Python unicode objects, if you read the documents directly (e.g. from a database, rows from a single CSV file, or whatever you want), or alternatively:

vec = TfidfVectorizer(input='filename')
my_document_features = vec.fit_transform(my_text_files)

where my_text_files is a Python list of the paths of your document files on your hard drive (assuming they are encoded using UTF-8).

The length of the my_text_files or my_text_documents list should be n_documents, hence the mapping to km.labels_ is direct.
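Putting the pieces together, a minimal end-to-end sketch (the file paths here are hypothetical placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical list of UTF-8 encoded text files on disk.
my_text_files = ["docs/a.txt", "docs/b.txt", "docs/c.txt", "docs/d.txt"]

vec = TfidfVectorizer(input='filename')
my_document_features = vec.fit_transform(my_text_files)

km = KMeans(n_clusters=2).fit(my_document_features)

# Direct mapping: file i belongs to cluster km.labels_[i].
for path, label in zip(my_text_files, km.labels_):
    print(path, "->", label)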

As scikit-learn is not just for clustering or categorizing documents, we use the name "sample" instead of "document". This is why you will see that we use n_samples instead of n_documents to document the expected shapes of the arguments and attributes of all the estimators in the library.

#2


2  

dataset.filenames is the key :)

This is how I did it.

The load_files declaration is:

def load_files(container_path, description=None, categories=None,
               load_content=True, shuffle=True, charset=None,
               charset_error='strict', random_state=0)

So do:

dataset_files = load_files("path_to_directory_containing_category_folders")

Then, when I got the result, I put them into clusters, which is a dictionary:

from collections import defaultdict

# km.labels_[k] is the cluster of the k-th file in dataset_files.filenames.
clusters = defaultdict(list)
for k, label in enumerate(km.labels_):
    clusters[label].append(dataset_files.filenames[k])

And then I print it:

for clust in clusters:
    print("\n************************\n")
    for filename in clusters[clust]:
        print(filename)
