I am new to both python and scikit-learn so please bear with me.
I took the source code for the k-means clustering algorithm from k means clustering.
I then modified it to run on my local dataset using the load_files function.
Although the algorithm terminates, it does not produce any output showing which documents are clustered together.
I found that the km object has a "km.labels_" array which lists the centroid id of each document.
It also has the centroid vectors in "km.cluster_centers_".
But which document is which? I have to map this back to "dataset", which is a "Bunch" object.
If I print dataset.data[0], I get the contents of the first file, which I think are shuffled, but I just want to know its name.
I am confused by questions like: is the document at dataset.data[0] assigned to the centroid at km.labels_[0]?
My basic problem is to find which files are clustered together. How can I find that?
2 Answers
#1
Forget about the Bunch object. It's just an implementation detail for loading the toy datasets that are bundled with scikit-learn.
In real life, with your real data, you just have to call directly:
km = KMeans(n_clusters=n_clusters).fit(my_document_features)
then collect cluster assignments from:
km.labels_
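For instance, a minimal end-to-end sketch (the toy documents and variable names here are illustrative, not from the question, and the features come from the TfidfVectorizer approach described below):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["apples and oranges", "oranges and bananas", "cars and trucks"]
my_document_features = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2).fit(my_document_features)
print(km.labels_)  # one cluster id per document, in input order, e.g. [0 0 1]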
my_document_features is a 2D data structure: either a numpy array or a scipy.sparse matrix with shape (n_documents, n_features).
km.labels_ is a 1D numpy array with shape (n_documents,). Hence the first element in labels_ is the index of the cluster of the document described in the first row of the my_document_features feature matrix.
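For example, to list which document rows ended up in each cluster (a sketch assuming km has already been fit as above):

import numpy as np

for cluster_id in np.unique(km.labels_):
    rows = np.where(km.labels_ == cluster_id)[0]
    print("cluster %d contains document rows %s" % (cluster_id, rows))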
Typically you would build my_document_features with a TfidfVectorizer object:
my_document_features = TfidfVectorizer().fit_transform(my_text_documents)
and my_text_documents would be either a list of python unicode objects, if you read the documents directly (e.g. from a database, rows of a single CSV file, or whatever you want), or alternatively:
vec = TfidfVectorizer(input='filename')
my_document_features = vec.fit_transform(my_text_files)
where my_text_files is a python list of the paths of your document files on your hard drive (assuming they are encoded in UTF-8).
The length of the my_text_files or my_text_documents list should be n_documents, hence the mapping with km.labels_ is direct.
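So to see which file went to which cluster, you can simply zip the two together (a sketch, assuming the filename-based vectorizer above):

for path, cluster_id in zip(my_text_files, km.labels_):
    print("%s -> cluster %d" % (path, cluster_id))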
As scikit-learn is not just for clustering or categorizing documents, we use the name "sample" instead of "document". This is why you will see n_samples instead of n_documents used to document the expected shapes of the arguments and attributes of all the estimators in the library.
#2
dataset.filenames is the key :)
This is how I did it.
The load_files declaration is:
def load_files(container_path, description=None, categories=None,
               load_content=True, shuffle=True, charset=None,
               charset_error='strict', random_state=0)
so do:
dataset_files = load_files("path_to_directory_containing_category_folders")
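Note that the Bunch returned by load_files keeps filenames and data aligned (they are shuffled together), so index k refers to the same document in both lists. A quick check:

print(dataset_files.filenames[0])   # path of the first (shuffled) document
print(dataset_files.data[0][:100])  # first 100 characters of that same document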
Then, when I got the result, I put the filenames into clusters, which is a dictionary:
from collections import defaultdict

clusters = defaultdict(list)
k = 0
for i in km.labels_:
    clusters[i].append(dataset_files.filenames[k])
    k += 1
And then I print it :)
for clust in clusters:
    print("\n************************\n")
    for filename in clusters[clust]:
        print(filename)
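The same grouping can also be written a bit more compactly with zip, since km.labels_ and dataset_files.filenames are parallel (just a sketch of an alternative):

from collections import defaultdict

clusters = defaultdict(list)
for label, filename in zip(km.labels_, dataset_files.filenames):
    clusters[label].append(filename)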