How to compare the "similarity" between two dendrograms (in R)?

Time: 2021-03-01 23:08:31

I have two dendrograms that I wish to compare to each other in order to find out how "similar" they are. But I don't know of any method to do so (let alone code to implement it, say, in R).

Any leads?

UPDATE (2014-09-13):

Since asking this question, I have written an R package called dendextend, for the visualization, manipulation and comparison of dendrograms. This package is on CRAN and comes with a detailed vignette. It includes functions such as cor_cophenetic, cor_bakers_gamma and Bk / Bk_plot, as well as a tanglegram function for visually comparing two trees.
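
As a rough illustration of the functions named above, here is a minimal sketch. It assumes dendextend is installed (the block is skipped otherwise) and uses the built-in USArrests data purely as a stand-in; the variable names are illustrative and not taken from the package vignette.

```r
# Two hierarchical clusterings of the same ten objects.
d     <- dist(USArrests[1:10, ])
dend1 <- as.dendrogram(hclust(d, "complete"))
dend2 <- as.dendrogram(hclust(d, "single"))

res <- NULL
if (requireNamespace("dendextend", quietly = TRUE)) {
  res <- c(
    # correlation between the two cophenetic distance matrices
    cophenetic   = dendextend::cor_cophenetic(dend1, dend2),
    # Baker's Gamma correlation between the two trees
    bakers_gamma = dendextend::cor_bakers_gamma(dend1, dend2)
  )
  print(res)
  # dendextend::tanglegram(dend1, dend2)  # visual side-by-side comparison
}
```

Both values are correlations, so values closer to 1 suggest more similar trees.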

6 Answers

#1


14  

Comparing dendrograms is not quite the same as comparing hierarchical clusterings, because the former includes the lengths of branches as well as the splits, but I also think that's a good start. I would suggest you read E. B. Fowlkes & C. L. Mallows (1983). "A Method for Comparing Two Hierarchical Clusterings". Journal of the American Statistical Association 78 (383): 553–584 (link).

Their approach is based on cutting the trees at each level k, getting a measure Bk that compares the groupings into k clusters, and then examining the Bk vs k plots. The measure Bk is based upon looking at pairs of objects and seeing whether they fall into the same cluster or not.

I am sure that one can write code based on this method, but first we would need to know how the dendrograms are represented in R.
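
The Bk idea described above can be sketched in base R, assuming both trees are hclust objects built on the same objects (so that cutree returns labels in the same order). The helper names fm_index and bk_curve are made up for this illustration; this is a sketch of the published formula, not the authors' own code.

```r
# Fowlkes-Mallows index for two flat partitions of the same objects.
fm_index <- function(labels1, labels2) {
  tab <- table(labels1, labels2)      # contingency table m_ij
  n   <- sum(tab)
  tk  <- sum(tab^2) - n               # 2 x pairs joined in both partitions
  pk  <- sum(rowSums(tab)^2) - n      # 2 x pairs joined in partition 1
  qk  <- sum(colSums(tab)^2) - n      # 2 x pairs joined in partition 2
  tk / sqrt(pk * qk)
}

# B_k for each k: cut both trees into k clusters and compare the cuts.
bk_curve <- function(hc1, hc2, ks) {
  sapply(ks, function(k) fm_index(cutree(hc1, k), cutree(hc2, k)))
}

d   <- dist(USArrests[1:30, ])
hc1 <- hclust(d, "complete")
hc2 <- hclust(d, "single")
ks  <- 2:10
bk  <- bk_curve(hc1, hc2, ks)
print(data.frame(k = ks, Bk = round(bk, 3)))
```

Calling plot(ks, bk, type = "b") then gives the Bk vs k plot; Bk equals 1 when the two cuts agree exactly and falls toward 0 as they diverge.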

#2


4  

As you know, dendrograms arise from hierarchical clustering, so what you are really asking is how to compare the results of two hierarchical clustering runs. There are no standard metrics I know of, but I would look at the number of clusters found and compare membership similarity between like clusters. Here is a good overview of hierarchical clustering that my colleague wrote on clustering Scotch whiskies.

#3


2  

Have a look at this page:

I also asked a similar question here.

It seems we can use cophenetic correlation to measure the similarity between two dendrograms, but there seems to be no function for this purpose in R currently.

EDIT (2014-09-18): The cophenetic function in the stats package can calculate the cophenetic dissimilarity matrix, and the correlation can then be calculated using the cor function. As @Tal has pointed out, the as.dendrogram function returns the tree with a different order, so the cophenetic matrices of two dendrograms can list the items in different orders, which will cause wrong results if we calculate the correlation directly on the dendrogram results. This is shown in the example for the cor_cophenetic function in the dendextend package:

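library(dendextend) # provides cor_cophenetic and re-exports %>%

set.seed(23235)
ss <- sample(1:150, 10)
hc1 <- iris[ss, -5] %>% dist %>% hclust("complete")
hc2 <- iris[ss, -5] %>% dist %>% hclust("single")
dend1 <- as.dendrogram(hc1)
dend2 <- as.dendrogram(hc2)
# cutree(dend1)
cophenetic(hc1)
cophenetic(hc2)
# notice how the dist matrices for the dendrograms have different orders:
cophenetic(dend1)
cophenetic(dend2)
# the values below are those reported with the original example; sample()
# changed in R 3.6, so a newer R may draw different rows and give other numbers
cor(cophenetic(hc1), cophenetic(hc2)) # 0.874
cor(cophenetic(dend1), cophenetic(dend2)) # 0.16
# the difference is because the order of the distance table in the case of
# stats:::cophenetic.dendrogram will change between dendrograms!
# cor_cophenetic matches the labels first, so both calls should agree:
cor_cophenetic(hc1, hc2)
cor_cophenetic(dend1, dend2)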
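
The ordering problem can also be worked around in base R by matching the two cophenetic matrices on their labels before correlating. The sketch below assumes both trees carry the same set of labels; the helper name cophenetic_cor_aligned is hypothetical, and this is not how dendextend implements cor_cophenetic internally.

```r
# Correlate two cophenetic matrices after aligning rows/columns by label.
cophenetic_cor_aligned <- function(tree1, tree2) {
  m1   <- as.matrix(cophenetic(tree1))
  m2   <- as.matrix(cophenetic(tree2))
  labs <- rownames(m1)
  stopifnot(setequal(labs, rownames(m2)))  # same objects in both trees
  m2   <- m2[labs, labs]                   # reorder tree2 to match tree1
  cor(m1[lower.tri(m1)], m2[lower.tri(m2)])
}

d   <- dist(USArrests[1:10, ])             # row names give stable labels
hc1 <- hclust(d, "complete")
hc2 <- hclust(d, "single")

r_hc   <- cophenetic_cor_aligned(hc1, hc2)
r_dend <- cophenetic_cor_aligned(as.dendrogram(hc1), as.dendrogram(hc2))
print(c(hclust = r_hc, dendrogram = r_dend))
```

With the labels aligned, the hclust and dendrogram versions of the same pair of trees give the same correlation.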

#4


1  

If you have access to the underlying distance matrix that generated each dendrogram (you probably do if you generated the dendrograms in R), couldn't you just use correlation between the corresponding values of the two matrices? I know this doesn't address the letter of what you asked, but it's a good solution to the spirit of what you asked.
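
A minimal sketch of this suggestion, assuming the two distance matrices were computed on the same objects in the same row order (here the same data with two different distance measures, chosen only for illustration):

```r
x  <- scale(USArrests[1:15, ])
d1 <- dist(x, method = "euclidean")
d2 <- dist(x, method = "manhattan")

# dist objects store the lower triangle as a vector, so corresponding
# entries line up as long as the objects are in the same order.
r <- cor(as.vector(d1), as.vector(d2))
print(r)
```

Note that this compares the input distances rather than the trees themselves, so two different linkage methods applied to the same distance matrix would not be distinguished.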

#5


1  

Take a look at this page that has lots of information about software that deals with trees, including dendrograms. I noticed several tools that deal with tree comparison, although I haven't personally used any of them yet. There are a number of references cited there also.

#6


0  

There is a rich body of literature on tree distance metrics in the phylogenetics community that seems to have been neglected from the computer science perspective. See dist.topo in the ape package for two tree distance metrics and several citations (Penny and Hendy 1985, Kuhner and Felsenstein 1994) that consider the similarity of tree partitions, and also the Robinson-Foulds metric, which has an R implementation in the phangorn package.

One problem is that these metrics don't have a fixed scale, so they are only useful in the cases of 1) tree comparison or 2) comparison to some generated baseline, perhaps via permutation tests similar to what Tal has done with Baker's Gamma in his fantastic dendextend package.

If you have hclust or dendrogram objects generated from R hierarchical clustering, using as.phylo from the ape package will convert your dendrograms to phylogenetic trees for usage in these functions.
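
A hedged sketch of that workflow, assuming the ape package is installed (the block is skipped otherwise) and starting from hclust objects; the data and variable names are illustrative only.

```r
d   <- dist(USArrests[1:10, ])
hc1 <- hclust(d, "complete")
hc2 <- hclust(d, "single")

topo_dist <- NULL
if (requireNamespace("ape", quietly = TRUE)) {
  tr1 <- ape::as.phylo(hc1)   # hclust -> phylo
  tr2 <- ape::as.phylo(hc2)
  topo_dist <- c(
    # partition-based topological distance (Penny and Hendy 1985)
    PH85  = as.numeric(ape::dist.topo(tr1, tr2, method = "PH85")),
    # branch-length score (Kuhner and Felsenstein 1994)
    score = as.numeric(ape::dist.topo(tr1, tr2, method = "score"))
  )
  print(topo_dist)
}
```

The same phylo objects could presumably be passed to RF.dist in phangorn for the Robinson-Foulds metric. As noted above, these distances have no fixed scale, so they are best read relative to other tree pairs or to a permutation baseline.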
