检查样本空间中是否存在一个样本(来自PCA或其他聚类分析)

时间:2021-02-10 20:11:17

I have a 200 by 50 matrix, here 200 means 200 compounds (row) and 50 means 50 independent varialbes (column), and then I use the 200 * 50 matrix to do cluster analysis (e.g. k-mean etc.), I can get a plot to show the distributions for these 2000 compounds.

我有一个200×50的矩阵,这里200表示200个化合物(行),50表示50个独立的变量(列),然后我用200 * 50矩阵做聚类分析(例如k-mean等),我可以得到一个图来显示这些2000个化合物的分布。

My question is that when I have a new compound, which have the same 50 independent variable as the 200 * 50 matrix, how can I test if the new compound is located in the cluster space?

我的问题是,当我有一个新的化合物,它和200 * 50矩阵有相同的50个自变量时,我如何测试新化合物是否位于簇空间中?

Thanks.

谢谢。

Edit: Plz note that I do not need find the element in the data.frame. I think the first step is to cluster the data (for example, using pca and plot(pca1, pca2)), then test the if the new record is located in the plot or out. Like this picture, where (2) belongs to the cluster and (1) does not belong to the cluster space, just like this.

编辑:Plz注意到,我不需要在data.frame中找到元素。我认为第一步是对数据进行聚类(例如,使用pca和plot(pca1, pca2)),然后测试新记录是否位于plot或out中。如图所示,(2)属于集群,(1)不属于集群空间,就像这样。

3 个解决方案

#1


1  

Here is a simple solution:

这里有一个简单的解决方案:

Step1: Setup the data

步骤1:设置数据

set.seed(1)
refData <- data.frame(matrix(runif(200*50),nrow=200))

newRec01 <- refData[11,]    # A record that exists in data
newRec02 <- runif(50)       # A record that does not exist in data

Step2: Testing:

步骤2:测试:

TRUE %in% sapply(1:nrow(refData),function(i) all(newRec01 == refData[i,]))
TRUE %in% sapply(1:nrow(refData),function(i) all(newRec02 == refData[i,]))

If needed you can package it in a function:

如果需要,您可以将其打包在一个函数中:

checkNewRec <- function(refData, newRec) {
  TRUE %in% sapply(1:nrow(refData),function(i) all(newRec == refData[i,]))
}

checkNewRec(refData, newRec01)
checkNewRec(refData, newRec02)

EDIT: Based on your new input below, try the following:

编辑:根据你的新输入,试试下面的:

Prep: Your code from the comments:

准备:你的代码来自评论:

  ALL <- rbind(refData, newRec02) 

  pca <- prcomp(ALL) 
  pca1 <- pca$x[, 1] 
  pca2 <- pca$x[, 2] 
  pca1.in <- pca1[-length(pca1)]
  pca2.in <- pca2[-length(pca2)]

Now we need to define the cluster in some way. For simplicity, lets assume a single cluster.

现在我们需要以某种方式定义集群。为了简单起见,让我们假设一个集群。

Step1: Find out the centroid of the refData:

Step1:找出refData的质心:

  cent <- c(mean(pca1.in),mean(pca2.in))

Step2: Find out the distance of all the data points from the center of refData:

Step2:从refData中心找出所有数据点的距离:

  ssq <- (pca1 - mean(pca1.in))^2 + (pca2 - mean(pca2.in))^2

Step3: Now we need to choose a cut off distance from the center beyond which the new incoming record will be considered as "outside" the cluster. For simplicity, I am taking a decision for it to be at 95th % quantile:

步骤3:现在我们需要选择一个远离中心的距离,在这个距离之外,新的传入记录将被认为是“外部”的集群。为了简单起见,我决定将它以95%的分位数计算:

  dec <- (quantile(head(ssq,-1), 0.95) > tail(ssq,1)) 

Step4: Now that a decision has been made on classification of newRec, we can plot it:

第4步:现在已经决定了newRec的分类,我们可以把它标出来:

  plot(pca1, pca2) 
  points(pca1[length(pca1)], pca2[length(pca2)], 
         col = ifelse(dec, "red", "green"),pch="X")

Additionally, to verify our decision, lets plot the errors, and see where does the newRec fall!!

另外,为了验证我们的决定,我们来画出错误,看看newRec的下落!!

  hist(ssq, main="Error Histogram",xlab="Square Error")
  points(pca1[length(pca1)], pca2[length(pca2)],
         col = ifelse(dec, "red", "green"),pch="X")
  text(pca1[length(pca1)], pca2[length(pca2)],labels="New Rec",col="red",pos=3)

Hope this helps!!

希望这有助于! !

#2


2  

So here is a different (but conceptually similar) approach, along with a cautionary tale. Since you did not provide any data, I'll use the built-in mtcars dataset for the example.

所以这里有一个不同的(但概念上相似的)方法,以及一个警示性的故事。由于您没有提供任何数据,我将使用内置的mtcars数据集作为示例。

First, we set up the data, run principle components analysis, and run a K-means cluster analysis.

首先,我们设置数据,运行主成分分析,运行K-means聚类分析。

set.seed(5)                                   # for reprpduceable example
df    <- mtcars[,c(1,3,4,5,6,7)]              # subset of mtcars dataset
trn   <- sample(1:nrow(df),nrow(df)-3)          
train <- mtcars[trn,]                         # training set: 29 obs.
test  <- mtcars[-trn,]                        # test set: 3 obs.

pca <- prcomp(train, scale.=T, retx=T)        # pca on training set
summary(pca)$importance[3,1:4]                # 84% of variation in first 2 PC
# PC1     PC2     PC3     PC4 
# 0.60268 0.83581 0.89643 0.92139             
scores <- data.frame(pca$x)[1:2]              # so use first two PC
km     <- kmeans(scores,centers=3,nstart=25)  # kmeans cluster analysis

pc.test  <- predict(pca,test)[,1:2]           # transform the test set
pc.test  <- rbind(pc.test,c(-1.25,-1))        # add "special point"
rownames(pc.test) <- c(LETTERS[1:3],"X")      # letters to make things simpler

Now, we plot the PC, the centroids, and the test set.

现在,我们绘制PC, centroid和测试集。

library(ggplot2)
# plot first two PC with cluster id
gg.train  <- data.frame(cluster=factor(km$cluster), scores)
centroids <- aggregate(cbind(PC1,PC2)~cluster,data=gg.train,mean)
gg.train  <- merge(gg.train,centroids,by="cluster",suffixes=c("",".centroid"))
gg.test   <- data.frame(pc.test[,1:2])
# generate cluster plot...
cluster.plot <- ggplot(gg.train, aes(x=PC1, y=PC2, color=cluster)) +
  geom_point(size=3) +
  geom_point(data=centroids, size=4) +
  geom_segment(aes(x=PC1.centroid, y=PC2.centroid, xend=PC1, yend=PC2))+
  geom_point(data=gg.test,color="purple",size=8,shape=1)+
  geom_text(data=gg.test,label=rownames(gg.test),color="purple")+
  coord_fixed()
plot(cluster.plot)

检查样本空间中是否存在一个样本(来自PCA或其他聚类分析)

Based on a visual examination, we'd likely place B and C in cluster 3 (the blue cluster) and A in cluster 1 (red). X is questionable (intentionally; that's what makes it "special"). But notice that if we assign to clusters based on proximity to the centroids, we would put A in cluster 3!

基于视觉检查,我们很可能将B和C放在第3组(蓝色集群)和a组(红色)中。X是可疑(故意;这就是为什么它“特别”的原因。但是请注意,如果我们将基于接近中心的集群分配给集群,那么我们将在集群3中设置A !

# "hard" prediction: assign to whichever cluster has closest centroid
predict.cluster <- function(z) {
  closest <-function(z)which.min(apply(km$centers,1,function(x,z)sum((x-z)^2),z))
  data.frame(pred.clust=apply(z,1,closest))
}
predict.cluster(pc.test)
#   pred.clust
# A          3
# B          3
# C          3
# X          2

So a different approach calculates the probability of cluster membership based on both distance from the centroid and a measure of scatter (how tightly grouped are the points in the cluster). This approach requires that we assume a distribution, which is risky, especially with a small number of points.

因此,一种不同的方法计算了基于距离中心和度量散点距离的集群成员的概率(分组中的点是如何紧密地分组的)。这种方法要求我们假设一个分布,这是有风险的,特别是少量的点。

The simplest approach is to assume that the points in a given cluster follow a multivariate normal distribution. Under this assumption,

最简单的方法是假定给定集群中的点遵循多变量正态分布。在这种假设下,

检查样本空间中是否存在一个样本(来自PCA或其他聚类分析)

That is, a random variable formed as above is distributed as chi-sq with k degrees of freedom (where k is the number of dimension, here 2). Here, x is a point under consideration for membership in a cluster, μ is the cluster centroid, Σ is the covariance matrix for the points in the cluster, and χ2 is the chi-sq statistic with k degrees of freedom at probability 1-α.

上面一个随机变量,形成分布式与k chi-sq*度(k是维度的数量,2)。在这里,x是一个点在考虑加入一个集群,μ是集群重心,Σ是集群中的点的协方差矩阵,和χ2 chi-sq统计与概率1-αk的*度。

We can use this to calculate the probability of membership by applying this equation, for a given x, (in the test set) to calculate α. We can also use this to calculate the "cluster boundaries" by calculating the set of points, x, which meet this condition for a given α. This latter exercise results in a confidence region of probability 1- α. Fortunately, this is already implemented in R (for 2 dimensions) using ellispe(...) in the ellipse package.

我们可以用这个来计算的概率会员通过应用这个方程,对于给定的x,(测试集)来计算α。我们还可以通过计算集合点x来计算“簇边界”,它满足给定的条件。后者运动结果的置信区域概率1 -α。幸运的是,在椭圆包中使用ellispe(…)实现了R(2维)。

library(ellipse)
conf.rgn  <- do.call(rbind,lapply(1:3,function(i)
  cbind(cluster=i,ellipse(cov(scores[km$cluster==i,]),centre=km$centers[i,]))))
conf.rgn  <- data.frame(conf.rgn)
conf.rgn$cluster <- factor(conf.rgn$cluster)
plot(cluster.plot + geom_path(data=conf.rgn, aes(x=PC1,y=PC2)))

检查样本空间中是否存在一个样本(来自PCA或其他聚类分析)

Based on this we would assign A to cluster 1 (red), even though it is closer to cluster 3 (blue). This is because cluster 3 is much more tightly grouped, so the hurdle for membership is higher. Note that X is "outside" of all the clusters.

基于此,我们将把A分配给1(红色),即使它更接近集群3(蓝色)。这是因为集群3更紧密地分组,所以成员的门槛更高。注意,X是所有集群的“外部”。

The code below calculates the probability of membership in each cluster for a given set of test points.

下面的代码计算给定测试点集的每个集群中成员的概率。

# "soft" prediction: probability that point belongs in each cluster
pclust <- function(point,km,df){
  get.p <- function(clust,x){
    d         <- as.numeric(x-km$centers[clust,])
    sigma.inv <- solve(cov(df[km$cluster==clust,]))
    X.sq      <- d %*% sigma.inv %*% d
    p         <- pchisq(X.sq,length(d),lower.tail=FALSE)
  }
  sapply(1:max(km$cluster),get.p,x=point)
}
p <- apply(pc.test,1,pclust, km=km, df=scores)
print(p)
#                 A            B            C            X
# [1,] 9.178631e-02 6.490108e-04 9.969140e-07 8.754585e-04
# [2,] 1.720396e-28 4.391488e-26 2.821694e-43 3.630565e-05
# [3,] 2.664676e-05 8.928103e-01 8.660860e-02 2.188450e-05

Here the value in the ith row is the probability of membership in cluster i. So we can see that there is a 9.2% probability that A belongs in cluster 1, while the probability of membership in the other clusters is less than 0.003%. Similarly, B and C clearly belong in cluster 3 (p = 89.2% and 8.6% respectively). Finally, we can identify most likely clusters as follows:

这里第i行的值是集群i中成员的概率,所以我们可以看到,a属于集群1的概率是9.2%,而其他集群中的成员的概率小于0.003%。同样,B和C显然属于cluster 3 (p = 89.2%和8.6%)。最后,我们可以确定最有可能的集群如下:

data.frame(t(sapply(data.frame(p),function(x)
     list(cluster=which.max(x),p.value=x[which.max(x)]))))
#   cluster      p.value
# A       1   0.09178631
# B       3    0.8928103
# C       3    0.0866086
# X       1 0.0008754585

By assigning a cutoff value of p (say, 0.05), we can assert that a point does not belong in the "cluster space" (using your terminology) if the most likely cluster has p.value < cutoff.

通过分配p的截止值(例如,0.05),我们可以断言一个点不属于“集群空间”(使用您的术语),如果最有可能的集群有p。值 <截止。< p>

The cautionary tale is that X gets excluded based in this analysis, even though is is quite near the grand mean of PC1 and PC2. This is because, while X sits in the middle of the dataset, it is in an "empty" region, where there are no clusters. Does this mean it should be excluded?

这个警示性的故事是,X在这个分析中被排除在外,尽管它与PC1和PC2的大平均值非常接近。这是因为,当X位于数据集中的中间时,它位于一个“空”区域,那里没有集群。这是否意味着它应该被排除在外?

#3


1  

This post follows jihoward's answer, extending it to python and showing some interesting cluster assignment conditions under multivariate gaussians. The generated data is sampled from three 2d gaussian distributions, with means and covariances provided in the source code. Points W, X, Y, Z are used to describe soft assignment to clusters. 检查样本空间中是否存在一个样本(来自PCA或其他聚类分析) We assume that each cluster has a chi-squared distribution with 2 degrees of freedom.

这篇文章遵循了jihoward的答案,将其扩展到python,并在多变量gausans下显示了一些有趣的集群分配条件。所生成的数据来自于三个二维高斯分布,其中包含了源代码中提供的方法和协方差。点W, X, Y, Z被用来描述对集群的软分配。我们假设每个星系团的分布有2个*度。

In the plot, the shaded areas represent 2 standard deviations from the mean. Note that as expected X does not belong to any cluster. Although Y is closer to the green centroid, Y is not assigned to the green cluster given its distribution. Consequences of using hard thresholding for cluster assignment are shown in the blue cluster. Note how points outside the 0.05 cutoff value would be classified as not belonging to the blue cluster.

在图中,阴影部分表示2个标准差。注意,正如预期的X不属于任何集群。虽然Y更接近绿色的质心,但由于它的分布,Y并没有被分配到绿色集群中。使用硬阈值进行集群分配的结果显示在蓝色集群中。注意在0.05截止值之外的点如何被归类为不属于蓝色集群。

Probabilities assuming a chi-squared distribution.

概率假设是卡方分布。

         Blue              Red             Green
W [  1.50465863e-01   0.00000000e+00   0.00000000e+00]
X [  2.44710474e-10   1.20447952e-05   0.00000000e+00]
Y [  0.00000000e+00   0.00000000e+00   0.00000000e+00]
Z [  0.00000000e+00   9.91055078e-01   0.00000000e+00]

Knowing that the data is multivariate gaussian, one could use scipy's implementation of Alan Genz’s multivariate normal CDF functions. I was not able to get convincing results from it using this example. For more details on scipy's implementation check this link.

知道数据是多变量高斯函数,可以使用scipy实现Alan Genz的multivariate normal CDF函数。使用这个例子,我无法得到令人信服的结果。有关scipy实现的更多细节,请检查此链接。

import numpy as np
import matplotlib.pylab as plt
from matplotlib.patches import Ellipse
from sklearn.cluster import KMeans
from scipy import stats
chi2_cdf = stats.chi2.cdf
plt.ion()

def eigenDecomposition(cov_mat):
    vals, vecs = np.linalg.eigh(cov_mat)
    order = vals.argsort()[::-1]
    return vals[order], vecs[:,order]


def plotEllipse(center, cov_mat, n_std, color):
    vals, vecs = eigenDecomposition(cov_mat)
    angle = np.degrees(np.arctan2(*vecs[:,0][::-1]))
    width, height = 2 * n_std * np.sqrt(vals)
    return Ellipse(xy=center, width=width, height=height, angle=angle, color=color, alpha=0.2)


def computeMembership(point, center, data):
    # (x - mu).T cov.inv (x - mu)
    cov_mat = np.cov(data.T)
    dist = np.array([point - center]).T
    X_sq = np.dot(dist.T, np.dot(np.linalg.inv(cov_mat), dist))
    return 1 - chi2_cdf(X_sq, len(center))[0][0]    

n_obs = 128
a = np.random.multivariate_normal((0, 0), [[1, 0], [0, 1]], n_obs)
b = np.random.multivariate_normal((10, 0), [[1, -0.9], [-0.9, 1]], n_obs)
c = np.random.multivariate_normal((10, 10), [[1, 0.05], [1, 0.05]], n_obs)
d = np.array([[0,2], [5, 5], [10, 9.5], [10, 0]])

markers = [r"$ {} $".format(lbl) for lbl in ('W', 'X', 'Y', 'Z')]
clustering = KMeans(n_clusters=3).fit(np.vstack((a, b, c)))
_, idx = np.unique(clustering.labels_, return_index=True)
ids = clustering.labels_[np.sort(idx)]
colors = 'rgb'

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(a[:,0], a[:,1], color=colors[ids[0]])
ax.scatter(b[:,0], b[:,1], color=colors[ids[1]])
ax.scatter(c[:,0], c[:,1], color=colors[ids[2]])
for i in xrange(len(d)):
    ax.scatter(d[i,0], d[i,1], color='k', s=128, marker=markers[i])
ax.scatter(clustering.cluster_centers_[:,0], clustering.cluster_centers_[:,1], color='k', marker='D')

# plot ellipses with 2 std
n_std = 2
probs = []
for i, data in enumerate((a, b, c)):
    ax.add_artist(plotEllipse(clustering.cluster_centers_[ids[i]], 
                              np.cov(data.T), 
                              n_std, 
                              color=colors[ids[i]]))
    probs.append([computeMembership(x, clustering.cluster_centers_[ids[i]], data) for x in d])
print np.array(probs).T

#1


1  

Here is a simple solution:

这里有一个简单的解决方案:

Step1: Setup the data

步骤1:设置数据

set.seed(1)
refData <- data.frame(matrix(runif(200*50),nrow=200))

newRec01 <- refData[11,]    # A record that exists in data
newRec02 <- runif(50)       # A record that does not exist in data

Step2: Testing:

步骤2:测试:

TRUE %in% sapply(1:nrow(refData),function(i) all(newRec01 == refData[i,]))
TRUE %in% sapply(1:nrow(refData),function(i) all(newRec02 == refData[i,]))

If needed you can package it in a function:

如果需要,您可以将其打包在一个函数中:

checkNewRec <- function(refData, newRec) {
  TRUE %in% sapply(1:nrow(refData),function(i) all(newRec == refData[i,]))
}

checkNewRec(refData, newRec01)
checkNewRec(refData, newRec02)

EDIT: Based on your new input below, try the following:

编辑:根据你的新输入,试试下面的:

Prep: Your code from the comments:

准备:你的代码来自评论:

  ALL <- rbind(refData, newRec02) 

  pca <- prcomp(ALL) 
  pca1 <- pca$x[, 1] 
  pca2 <- pca$x[, 2] 
  pca1.in <- pca1[-length(pca1)]
  pca2.in <- pca2[-length(pca2)]

Now we need to define the cluster in some way. For simplicity, lets assume a single cluster.

现在我们需要以某种方式定义集群。为了简单起见,让我们假设一个集群。

Step1: Find out the centroid of the refData:

Step1:找出refData的质心:

  cent <- c(mean(pca1.in),mean(pca2.in))

Step2: Find out the distance of all the data points from the center of refData:

Step2:从refData中心找出所有数据点的距离:

  ssq <- (pca1 - mean(pca1.in))^2 + (pca2 - mean(pca2.in))^2

Step3: Now we need to choose a cut off distance from the center beyond which the new incoming record will be considered as "outside" the cluster. For simplicity, I am taking a decision for it to be at 95th % quantile:

步骤3:现在我们需要选择一个远离中心的距离,在这个距离之外,新的传入记录将被认为是“外部”的集群。为了简单起见,我决定将它以95%的分位数计算:

  dec <- (quantile(head(ssq,-1), 0.95) > tail(ssq,1)) 

Step4: Now that a decision has been made on classification of newRec, we can plot it:

第4步:现在已经决定了newRec的分类,我们可以把它标出来:

  plot(pca1, pca2) 
  points(pca1[length(pca1)], pca2[length(pca2)], 
         col = ifelse(dec, "red", "green"),pch="X")

Additionally, to verify our decision, lets plot the errors, and see where does the newRec fall!!

另外,为了验证我们的决定,我们来画出错误,看看newRec的下落!!

  hist(ssq, main="Error Histogram",xlab="Square Error")
  points(pca1[length(pca1)], pca2[length(pca2)],
         col = ifelse(dec, "red", "green"),pch="X")
  text(pca1[length(pca1)], pca2[length(pca2)],labels="New Rec",col="red",pos=3)

Hope this helps!!

希望这有助于! !

#2


2  

So here is a different (but conceptually similar) approach, along with a cautionary tale. Since you did not provide any data, I'll use the built-in mtcars dataset for the example.

所以这里有一个不同的(但概念上相似的)方法,以及一个警示性的故事。由于您没有提供任何数据,我将使用内置的mtcars数据集作为示例。

First, we set up the data, run principle components analysis, and run a K-means cluster analysis.

首先,我们设置数据,运行主成分分析,运行K-means聚类分析。

set.seed(5)                                   # for reprpduceable example
df    <- mtcars[,c(1,3,4,5,6,7)]              # subset of mtcars dataset
trn   <- sample(1:nrow(df),nrow(df)-3)          
train <- mtcars[trn,]                         # training set: 29 obs.
test  <- mtcars[-trn,]                        # test set: 3 obs.

pca <- prcomp(train, scale.=T, retx=T)        # pca on training set
summary(pca)$importance[3,1:4]                # 84% of variation in first 2 PC
# PC1     PC2     PC3     PC4 
# 0.60268 0.83581 0.89643 0.92139             
scores <- data.frame(pca$x)[1:2]              # so use first two PC
km     <- kmeans(scores,centers=3,nstart=25)  # kmeans cluster analysis

pc.test  <- predict(pca,test)[,1:2]           # transform the test set
pc.test  <- rbind(pc.test,c(-1.25,-1))        # add "special point"
rownames(pc.test) <- c(LETTERS[1:3],"X")      # letters to make things simpler

Now, we plot the PC, the centroids, and the test set.

现在,我们绘制PC, centroid和测试集。

library(ggplot2)
# plot first two PC with cluster id
gg.train  <- data.frame(cluster=factor(km$cluster), scores)
centroids <- aggregate(cbind(PC1,PC2)~cluster,data=gg.train,mean)
gg.train  <- merge(gg.train,centroids,by="cluster",suffixes=c("",".centroid"))
gg.test   <- data.frame(pc.test[,1:2])
# generate cluster plot...
cluster.plot <- ggplot(gg.train, aes(x=PC1, y=PC2, color=cluster)) +
  geom_point(size=3) +
  geom_point(data=centroids, size=4) +
  geom_segment(aes(x=PC1.centroid, y=PC2.centroid, xend=PC1, yend=PC2))+
  geom_point(data=gg.test,color="purple",size=8,shape=1)+
  geom_text(data=gg.test,label=rownames(gg.test),color="purple")+
  coord_fixed()
plot(cluster.plot)

检查样本空间中是否存在一个样本(来自PCA或其他聚类分析)

Based on a visual examination, we'd likely place B and C in cluster 3 (the blue cluster) and A in cluster 1 (red). X is questionable (intentionally; that's what makes it "special"). But notice that if we assign to clusters based on proximity to the centroids, we would put A in cluster 3!

基于视觉检查,我们很可能将B和C放在第3组(蓝色集群)和a组(红色)中。X是可疑(故意;这就是为什么它“特别”的原因。但是请注意,如果我们将基于接近中心的集群分配给集群,那么我们将在集群3中设置A !

# "hard" prediction: assign to whichever cluster has closest centroid
predict.cluster <- function(z) {
  closest <-function(z)which.min(apply(km$centers,1,function(x,z)sum((x-z)^2),z))
  data.frame(pred.clust=apply(z,1,closest))
}
predict.cluster(pc.test)
#   pred.clust
# A          3
# B          3
# C          3
# X          2

So a different approach calculates the probability of cluster membership based on both distance from the centroid and a measure of scatter (how tightly grouped are the points in the cluster). This approach requires that we assume a distribution, which is risky, especially with a small number of points.

因此,一种不同的方法计算了基于距离中心和度量散点距离的集群成员的概率(分组中的点是如何紧密地分组的)。这种方法要求我们假设一个分布,这是有风险的,特别是少量的点。

The simplest approach is to assume that the points in a given cluster follow a multivariate normal distribution. Under this assumption,

最简单的方法是假定给定集群中的点遵循多变量正态分布。在这种假设下,

检查样本空间中是否存在一个样本(来自PCA或其他聚类分析)

That is, a random variable formed as above is distributed as chi-sq with k degrees of freedom (where k is the number of dimension, here 2). Here, x is a point under consideration for membership in a cluster, μ is the cluster centroid, Σ is the covariance matrix for the points in the cluster, and χ2 is the chi-sq statistic with k degrees of freedom at probability 1-α.

上面一个随机变量,形成分布式与k chi-sq*度(k是维度的数量,2)。在这里,x是一个点在考虑加入一个集群,μ是集群重心,Σ是集群中的点的协方差矩阵,和χ2 chi-sq统计与概率1-αk的*度。

We can use this to calculate the probability of membership by applying this equation, for a given x, (in the test set) to calculate α. We can also use this to calculate the "cluster boundaries" by calculating the set of points, x, which meet this condition for a given α. This latter exercise results in a confidence region of probability 1- α. Fortunately, this is already implemented in R (for 2 dimensions) using ellispe(...) in the ellipse package.

我们可以用这个来计算的概率会员通过应用这个方程,对于给定的x,(测试集)来计算α。我们还可以通过计算集合点x来计算“簇边界”,它满足给定的条件。后者运动结果的置信区域概率1 -α。幸运的是,在椭圆包中使用ellispe(…)实现了R(2维)。

library(ellipse)
conf.rgn  <- do.call(rbind,lapply(1:3,function(i)
  cbind(cluster=i,ellipse(cov(scores[km$cluster==i,]),centre=km$centers[i,]))))
conf.rgn  <- data.frame(conf.rgn)
conf.rgn$cluster <- factor(conf.rgn$cluster)
plot(cluster.plot + geom_path(data=conf.rgn, aes(x=PC1,y=PC2)))

检查样本空间中是否存在一个样本(来自PCA或其他聚类分析)

Based on this we would assign A to cluster 1 (red), even though it is closer to cluster 3 (blue). This is because cluster 3 is much more tightly grouped, so the hurdle for membership is higher. Note that X is "outside" of all the clusters.

基于此,我们将把A分配给1(红色),即使它更接近集群3(蓝色)。这是因为集群3更紧密地分组,所以成员的门槛更高。注意,X是所有集群的“外部”。

The code below calculates the probability of membership in each cluster for a given set of test points.

下面的代码计算给定测试点集的每个集群中成员的概率。

# "soft" prediction: probability that point belongs in each cluster
pclust <- function(point,km,df){
  get.p <- function(clust,x){
    d         <- as.numeric(x-km$centers[clust,])
    sigma.inv <- solve(cov(df[km$cluster==clust,]))
    X.sq      <- d %*% sigma.inv %*% d
    p         <- pchisq(X.sq,length(d),lower.tail=FALSE)
  }
  sapply(1:max(km$cluster),get.p,x=point)
}
p <- apply(pc.test,1,pclust, km=km, df=scores)
print(p)
#                 A            B            C            X
# [1,] 9.178631e-02 6.490108e-04 9.969140e-07 8.754585e-04
# [2,] 1.720396e-28 4.391488e-26 2.821694e-43 3.630565e-05
# [3,] 2.664676e-05 8.928103e-01 8.660860e-02 2.188450e-05

Here the value in the ith row is the probability of membership in cluster i. So we can see that there is a 9.2% probability that A belongs in cluster 1, while the probability of membership in the other clusters is less than 0.003%. Similarly, B and C clearly belong in cluster 3 (p = 89.2% and 8.6% respectively). Finally, we can identify most likely clusters as follows:

这里第i行的值是集群i中成员的概率,所以我们可以看到,a属于集群1的概率是9.2%,而其他集群中的成员的概率小于0.003%。同样,B和C显然属于cluster 3 (p = 89.2%和8.6%)。最后,我们可以确定最有可能的集群如下:

data.frame(t(sapply(data.frame(p),function(x)
     list(cluster=which.max(x),p.value=x[which.max(x)]))))
#   cluster      p.value
# A       1   0.09178631
# B       3    0.8928103
# C       3    0.0866086
# X       1 0.0008754585

By assigning a cutoff value of p (say, 0.05), we can assert that a point does not belong in the "cluster space" (using your terminology) if the most likely cluster has p.value < cutoff.

通过分配p的截止值(例如,0.05),我们可以断言一个点不属于“集群空间”(使用您的术语),如果最有可能的集群有p。值 <截止。< p>

The cautionary tale is that X gets excluded based in this analysis, even though is is quite near the grand mean of PC1 and PC2. This is because, while X sits in the middle of the dataset, it is in an "empty" region, where there are no clusters. Does this mean it should be excluded?

这个警示性的故事是,X在这个分析中被排除在外,尽管它与PC1和PC2的大平均值非常接近。这是因为,当X位于数据集中的中间时,它位于一个“空”区域,那里没有集群。这是否意味着它应该被排除在外?

#3


1  

This post follows jihoward's answer, extending it to python and showing some interesting cluster assignment conditions under multivariate gaussians. The generated data is sampled from three 2d gaussian distributions, with means and covariances provided in the source code. Points W, X, Y, Z are used to describe soft assignment to clusters. 检查样本空间中是否存在一个样本(来自PCA或其他聚类分析) We assume that each cluster has a chi-squared distribution with 2 degrees of freedom.

这篇文章遵循了jihoward的答案,将其扩展到python,并在多变量gausans下显示了一些有趣的集群分配条件。所生成的数据来自于三个二维高斯分布,其中包含了源代码中提供的方法和协方差。点W, X, Y, Z被用来描述对集群的软分配。我们假设每个星系团的分布有2个*度。

In the plot, the shaded areas represent 2 standard deviations from the mean. Note that as expected X does not belong to any cluster. Although Y is closer to the green centroid, Y is not assigned to the green cluster given its distribution. Consequences of using hard thresholding for cluster assignment are shown in the blue cluster. Note how points outside the 0.05 cutoff value would be classified as not belonging to the blue cluster.

在图中,阴影部分表示2个标准差。注意,正如预期的X不属于任何集群。虽然Y更接近绿色的质心,但由于它的分布,Y并没有被分配到绿色集群中。使用硬阈值进行集群分配的结果显示在蓝色集群中。注意在0.05截止值之外的点如何被归类为不属于蓝色集群。

Probabilities assuming a chi-squared distribution.

概率假设是卡方分布。

         Blue              Red             Green
W [  1.50465863e-01   0.00000000e+00   0.00000000e+00]
X [  2.44710474e-10   1.20447952e-05   0.00000000e+00]
Y [  0.00000000e+00   0.00000000e+00   0.00000000e+00]
Z [  0.00000000e+00   9.91055078e-01   0.00000000e+00]

Knowing that the data is multivariate gaussian, one could use scipy's implementation of Alan Genz’s multivariate normal CDF functions. I was not able to get convincing results from it using this example. For more details on scipy's implementation check this link.

知道数据是多变量高斯函数,可以使用scipy实现Alan Genz的multivariate normal CDF函数。使用这个例子,我无法得到令人信服的结果。有关scipy实现的更多细节,请检查此链接。

import numpy as np
import matplotlib.pylab as plt
from matplotlib.patches import Ellipse
from sklearn.cluster import KMeans
from scipy import stats
chi2_cdf = stats.chi2.cdf
plt.ion()

def eigenDecomposition(cov_mat):
    vals, vecs = np.linalg.eigh(cov_mat)
    order = vals.argsort()[::-1]
    return vals[order], vecs[:,order]


def plotEllipse(center, cov_mat, n_std, color):
    vals, vecs = eigenDecomposition(cov_mat)
    angle = np.degrees(np.arctan2(*vecs[:,0][::-1]))
    width, height = 2 * n_std * np.sqrt(vals)
    return Ellipse(xy=center, width=width, height=height, angle=angle, color=color, alpha=0.2)


def computeMembership(point, center, data):
    # (x - mu).T cov.inv (x - mu)
    cov_mat = np.cov(data.T)
    dist = np.array([point - center]).T
    X_sq = np.dot(dist.T, np.dot(np.linalg.inv(cov_mat), dist))
    return 1 - chi2_cdf(X_sq, len(center))[0][0]    

n_obs = 128
a = np.random.multivariate_normal((0, 0), [[1, 0], [0, 1]], n_obs)
b = np.random.multivariate_normal((10, 0), [[1, -0.9], [-0.9, 1]], n_obs)
c = np.random.multivariate_normal((10, 10), [[1, 0.05], [1, 0.05]], n_obs)
d = np.array([[0,2], [5, 5], [10, 9.5], [10, 0]])

markers = [r"$ {} $".format(lbl) for lbl in ('W', 'X', 'Y', 'Z')]
clustering = KMeans(n_clusters=3).fit(np.vstack((a, b, c)))
_, idx = np.unique(clustering.labels_, return_index=True)
ids = clustering.labels_[np.sort(idx)]
colors = 'rgb'

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(a[:,0], a[:,1], color=colors[ids[0]])
ax.scatter(b[:,0], b[:,1], color=colors[ids[1]])
ax.scatter(c[:,0], c[:,1], color=colors[ids[2]])
for i in xrange(len(d)):
    ax.scatter(d[i,0], d[i,1], color='k', s=128, marker=markers[i])
ax.scatter(clustering.cluster_centers_[:,0], clustering.cluster_centers_[:,1], color='k', marker='D')

# plot ellipses with 2 std
n_std = 2
probs = []
for i, data in enumerate((a, b, c)):
    ax.add_artist(plotEllipse(clustering.cluster_centers_[ids[i]], 
                              np.cov(data.T), 
                              n_std, 
                              color=colors[ids[i]]))
    probs.append([computeMembership(x, clustering.cluster_centers_[ids[i]], data) for x in d])
print np.array(probs).T