I am trying to find the right number of clusters, k, according to silhouette scores using sklearn.cluster.MiniBatchKMeans.
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import HashingVectorizer

docs = ['hello monkey goodbye thank you', 'goodbye thank you hello', 'i am going home goodbye thanks', 'thank you very much sir', 'good golly i am going home finally']
vectorizer = HashingVectorizer()
X = vectorizer.fit_transform(docs)

for k in range(5):
    model = MiniBatchKMeans(n_clusters = k)
    model.fit(X)
And I receive this error:
Warning (from warnings module):
  File "C:\Python34\lib\site-packages\sklearn\cluster\k_means_.py", line 1279
    0, n_samples - 1, init_size)
DeprecationWarning: This function is deprecated. Please call randint(0, 4 + 1) instead
Traceback (most recent call last):
  File "<pyshell#85>", line 3, in <module>
    model.fit(X)
  File "C:\Python34\lib\site-packages\sklearn\cluster\k_means_.py", line 1300, in fit
    init_size=init_size)
  File "C:\Python34\lib\site-packages\sklearn\cluster\k_means_.py", line 640, in _init_centroids
    x_squared_norms=x_squared_norms)
  File "C:\Python34\lib\site-packages\sklearn\cluster\k_means_.py", line 88, in _k_init
    n_local_trials = 2 + int(np.log(n_clusters))
OverflowError: cannot convert float infinity to integer
I know that type(k) is int, so I don't know where this issue is coming from. I can run the following just fine, but I can't seem to iterate over a range of integers, even though type(2) is the same as type(k) after setting k = 2:
model = MiniBatchKMeans(n_clusters = 2)
model.fit(X)
Even running a different model works:
>>> model = KMeans(n_clusters = 2)
>>> model.fit(X)
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,
n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
verbose=0)
1 Answer
Let's analyze your code:

- for k in range(5) yields the sequence 0, 1, 2, 3, 4.
- model = MiniBatchKMeans(n_clusters = k) initializes the model with n_clusters=k.
- Let's look at the first iteration:
  - n_clusters=0 is used.
  - Within the optimization code (look at the output, the last frame in _k_init): int(np.log(n_clusters)) = int(np.log(0)) = int(-inf)
  - ERROR: there is no infinity for integers, so casting the floating-point value -inf to int is not possible!
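The chain above can be reproduced without scikit-learn at all. This is only a minimal sketch of the failing expression from the traceback, not the library's code; the names log_zero and message are made up for illustration, and np.errstate is used just to silence NumPy's divide-by-zero warning.

```python
import math

import numpy as np

# np.log(0) does not raise; it returns -inf (and normally emits a RuntimeWarning).
with np.errstate(divide="ignore"):
    log_zero = np.log(0)

print(log_zero)

# Converting that -inf to int is what fails, exactly as in the traceback.
message = None
try:
    n_local_trials = 2 + int(log_zero)
except OverflowError as exc:
    message = str(exc)

print(message)
```

So the type of k was never the problem; the value 0 is.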
Setting n_clusters=0 does not make sense!
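Since the goal was to compare silhouette scores, one possible way to write the loop is to start at k = 2 and stop at n_samples - 1, because the silhouette score is only defined for that range of cluster counts. The following is a rough sketch, assuming a scikit-learn version that provides sklearn.metrics.silhouette_score; the names scores and best_k, and the choices random_state=0 and n_init=3, are assumptions for illustration and not part of the original question.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics import silhouette_score

docs = ['hello monkey goodbye thank you', 'goodbye thank you hello',
        'i am going home goodbye thanks', 'thank you very much sir',
        'good golly i am going home finally']
X = HashingVectorizer().fit_transform(docs)

scores = {}
# Valid cluster counts for the silhouette score: 2 .. n_samples - 1.
for k in range(2, X.shape[0]):
    model = MiniBatchKMeans(n_clusters=k, random_state=0, n_init=3)
    labels = model.fit_predict(X)
    # Skip degenerate fits where every document ended up in one cluster.
    if len(set(labels)) < 2:
        continue
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest score, if any fit produced a usable labeling.
best_k = max(scores, key=scores.get) if scores else None
print(scores, best_k)
```

With only five documents the scores will not be very meaningful; the point is just that the loop never passes 0 (or 1) as n_clusters.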