1) I am using scipy's hcluster module.
So the variable I have control over is the threshold. How do I measure performance for each threshold? In K-means, for example, this would be the sum of the distances from all points to their centroids. Of course this has to be adjusted, since more clusters generally means smaller distances.
Is there a similar measure I can compute with hcluster?
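To make the question concrete, something like the following toy sketch (random data, illustrative names) is what I mean by per-threshold performance: sweep the fcluster threshold and compute the mean distance of points to their cluster centroids at each setting:

```python
import numpy as np
import scipy.cluster.hierarchy as hier
import scipy.spatial.distance as dist

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 2))          # toy data, stand-in for real features

Z = hier.linkage(dist.pdist(X), method="average")

def mean_within_dist(X, labels):
    """ mean distance of each point to its cluster's centroid,
        averaged over clusters """
    d = [dist.cdist(X[labels == c], [X[labels == c].mean(axis=0)]).mean()
         for c in np.unique(labels)]
    return float(np.mean(d))

results = {}
for t in [0.5, 1.0, 2.0, 4.0]:            # candidate thresholds
    labels = hier.fcluster(Z, t, criterion="distance")
    results[t] = (labels.max(), mean_within_dist(X, labels))
    print(t, results[t])                   # (nclusters, mean within-cluster dist)
```

Smaller thresholds give more, tighter clusters, so the raw score has to be traded off against the cluster count, as noted above.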
2) I realize there are tons of metrics available for fclusterdata. I am clustering text documents based on tf-idf of key terms. The thing is, some documents are longer than others, and I think cosine is a good way to "normalize" this length issue: however long a document gets, its "direction" in n-dimensional space SHOULD stay the same as long as its content is consistent. Are there any other methods someone can suggest? And how can I evaluate them?
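Cosine is available directly as a metric in scipy.spatial.distance; a minimal sketch on a made-up tf-idf matrix (the numbers are illustrative, not real tf-idf output) showing that scaling a document by length does not change its cluster:

```python
import numpy as np
import scipy.spatial.distance as dist
import scipy.cluster.hierarchy as hier

# toy "tf-idf" matrix: 4 documents, 3 terms.
# doc 1 is doc 0 scaled 3x (a longer document with the same content direction);
# doc 3 is about something else entirely.
X = np.array([
    [1.0, 0.2, 0.0],
    [3.0, 0.6, 0.0],
    [0.9, 0.3, 0.1],
    [0.0, 0.1, 1.0],
])

Y = dist.pdist(X, metric="cosine")        # 1 - cosine similarity, length-invariant
Z = hier.linkage(Y, method="average")
T = hier.fcluster(Z, t=2, criterion="maxclust")
print(T)  # docs 0 and 1 share a label despite the 3x length difference
```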
Thanks
1 Answer
One can calculate the average distances |x - cluster centre| for the x in each cluster, just as for K-means. The code below does this brute-force. (There must be a builtin for this in scipy.cluster or scipy.spatial.distance, but I can't find it either.)
On your question 2, pass. Any links to good tutorials on hierarchical clustering would be welcome.
#!/usr/bin/env python
""" cluster cities: pdist linkage fcluster plot
    util: clusterlists() avdist()
"""
import sys
import numpy as np
import scipy.cluster.hierarchy as hier  # $scipy/cluster/hierarchy.py
import scipy.spatial.distance as dist
import matplotlib.pyplot as pl
from citiesin import citiesin  # 1000 US cities

__date__ = "27may 2010 denis"

def clusterlists(T):
    """ T = hier.fcluster( Z, t )  e.g. [a b a b a c]
        -> [ [0 2 4] [1 3] [5] ]  sorted by len
    """
    clists = [[] for j in range(max(T) + 1)]
    for j, c in enumerate(T):
        clists[c].append(j)
    clists.sort(key=len, reverse=True)
    return clists[:-1]  # clip the empty list for label 0

def avdist(X, to=None):
    """ average distance from the rows of X to "to"; None: mean(X) """
    if to is None:
        to = np.mean(X, axis=0)
    return np.mean(dist.cdist(X, [to]))

#...............................................................................
Ndata = 100
method = "average"
t = 0
crit = "maxclust"
    # 'maxclust': finds a minimum threshold r so that the cophenetic distance
    # between any two original observations in the same flat cluster
    # is no more than r, and no more than t flat clusters are formed.
    # but t affects cluster sizes only weakly ?
    # t 25: [10, 9, 8, 7, 6 ...
    # t 20: [12, 11, 10, 9, 7 ...
plot = 0
seed = 1

exec("\n".join(sys.argv[1:]))  # run this.py Ndata= t= ...  from the command line
np.random.seed(seed)
np.set_printoptions(2, threshold=100, edgeitems=10, suppress=True)  # .2f
me = __file__.split('/')[-1]

# biggest US cities --
cities = np.array(citiesin(n=Ndata)[0])  # N,2
if t == 0:
    t = Ndata // 4

#...............................................................................
print("# %s  Ndata=%d  t=%d  method=%s  crit=%s" % (me, Ndata, t, method, crit))
Y = dist.pdist(cities)                     # n*(n-1) / 2
Z = hier.linkage(Y, method)                # n-1
T = hier.fcluster(Z, t, criterion=crit)    # n
clusters = clusterlists(T)
print("cluster sizes:", list(map(len, clusters)))
print("# average distance to centre in the biggest clusters:")
for c in clusters:
    if len(c) < len(clusters[0]) // 3:
        break
    cit = cities[c].T
    print("%.2g %s" % (avdist(cit.T), cit))
    if plot:
        pl.plot(cit[0], cit[1])

if plot:
    pl.title("scipy.cluster.hierarchy of %d US cities, %s  t=%d" % (Ndata, crit, t))
    pl.grid(False)
    if plot >= 2:
        pl.savefig("cities-%d-%d.png" % (Ndata, t), dpi=80)
    pl.show()
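A complementary check that scipy.cluster.hierarchy does provide built in is the cophenetic correlation coefficient, which measures how faithfully the dendrogram as a whole preserves the original pairwise distances (closer to 1 is better). A sketch on random stand-in data, comparing linkage methods:

```python
import numpy as np
import scipy.cluster.hierarchy as hier
import scipy.spatial.distance as dist

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))          # stand-in for the cities array

Y = dist.pdist(X)
scores = {}
for method in ["single", "average", "complete", "ward"]:
    Z = hier.linkage(Y, method)
    c, coph = hier.cophenet(Z, Y)         # cophenetic correlation + distances
    scores[method] = c
    print("%-9s %.3f" % (method, c))
```

This evaluates the tree rather than one flat cut, so it is a sanity check on the linkage method, not a per-threshold score.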