Euclidean distance between vectors grouped by another variable in SPSS, R or Excel

I have a dataset containing something like this:

case,group,val1,val2,val3,val4
1,1,3,5,6,8
2,1,2,7,5,4
3,2,1,3,6,8
4,2,5,4,3,7
5,1,8,6,5,3

I'm trying to compute programmatically the Euclidean distance between the vectors of values in groups.

That is, I have x cases spread across n groups. The Euclidean distance is computed between each pair of rows within a group and then averaged for that group. So, in the example above, I first compute the mean and std dev for group 1 (cases 1, 2 and 5), then standardise the values, i.e. (original value - mean) / std dev, then compute the ED between cases 1 and 2, cases 2 and 5, and cases 1 and 5, and finally average the three EDs for the group.
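
Written out, and assuming the standardisation is done separately for each variable within a group (which is how the answers below treat it; the symbols are introduced here only for illustration), the quantity for a group with n cases and p variables would be:

```latex
z_{ik} = \frac{x_{ik} - \bar{x}_k}{s_k}, \qquad
d_{ij} = \sqrt{\sum_{k=1}^{p} \left(z_{ik} - z_{jk}\right)^2}, \qquad
\bar{d} = \frac{2}{n(n-1)} \sum_{i<j} d_{ij}
```

Here \bar{x}_k and s_k are the mean and standard deviation of variable k within the group. For group 1, n = 3 and p = 4, so \bar{d} is the average of three distances.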

Can anyone suggest a neat way of achieving this in a reasonably efficient way?

3 solutions

#1

As an example of how I would approach this in SPSS, first let's read the example data into SPSS.

data list list (",") / case group val1 val2 val3 val4 (6F1.0).
begin data
1,1,3,5,6,8
2,1,2,7,5,4
3,2,1,3,6,8
4,2,5,4,3,7
5,1,8,6,5,3
end data.
dataset name orig.

Then we can use SPLIT FILE and PROXIMITIES to get our distance matrix by group. Note that, as you mentioned in the comments to flodel's answer, this produces a separate dataset that we need to work with. (Also note that case practically never matters in SPSS syntax, e.g. split file and SPLIT FILE are equivalent.)

sort cases by group.
split file by group.
dataset declare dist.
PROXIMITIES val1, val2, val3, val4
/STANDARDIZE = Z
/MEASURE = EUCLID
/PRINT = NONE
/MATRIX = OUT('dist').

In SPSS, basically everything within a data matrix is like an R data.frame, so SPLIT FILE functionally replaces nearly all of the different *ply functions in R. Very convenient, but less flexible in general. So now we need to aggregate the distances in the dist dataset I saved the results to. We first sum across rows, and then sum by group via an AGGREGATE command.

dataset activate dist.
compute dist_sum = SUM(VAR1 to VAR3).
*it appears SPSS keeps empty cases - we dont want them in the aggregation.
select if MISSING(dist_sum) = 0.
dataset activate dist.
DATASET DECLARE dist_agg.
AGGREGATE
  /OUTFILE='dist_agg'
  /BREAK=group
  /dist_sum = SUM(dist_sum)
  /N_Cases=N.
dataset activate dist_agg.
compute mean_dist = dist_sum /(N_Cases*(N_Cases - 1)).

Here I save the aggregated results into another dataset named dist_agg. Because SPSS (annoyingly) saves the full distance matrix, the denominator of the mean is not n*(n-1)/2 (as in the equivalent R syntax) but n*(n-1), assuming you do not want to count the diagonal elements towards the mean. Then we can just merge these back into the orig data file via a MATCH FILES command.
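
To see why the denominator changes, here is a small sketch in R (used instead of SPSS purely for illustration; the object names are my own) for group 1 of the example data. Summing the full symmetric matrix counts every pair twice, so dividing by n*(n-1) should give the same number as averaging the n*(n-1)/2 unique pairs.

```r
# Group 1 of the example data (cases 1, 2 and 5), value columns only.
g1 <- matrix(c(3, 5, 6, 8,
               2, 7, 5, 4,
               8, 6, 5, 3), nrow = 3, byrow = TRUE)

d_lower <- dist(scale(g1))      # the 3 unique pairs (lower triangle)
d_full  <- as.matrix(d_lower)   # full 3 x 3 matrix, zeros on the diagonal
n       <- nrow(g1)

# Average over unique pairs, which is what mean(dist(...)) computes.
mean_unique <- sum(d_lower) / (n * (n - 1) / 2)

# Average over the full matrix minus the diagonal, as in the SPSS syntax.
mean_full <- sum(d_full) / (n * (n - 1))

all.equal(mean_unique, mean_full)  # TRUE
```

The diagonal entries are zero, so they add nothing to the sum; only the count in the denominator has to leave them out.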

*merge back into the original dataset.
dataset activate orig.
match files file = *
/table = 'dist_agg'
/by group.
exe.

*clean out old datasets if you like.
dataset close dist.
dataset close dist_agg.

The flexibility of R to go back and forth between matrix and data.frame objects makes SPSS a bit more clunky for this job. I could write a much more concise program to do this in SPSS's MATRIX language, but to do it across groups in MATRIX is a pain in the butt (compared to R's *ply syntax).

#2

Yes, it is probably easier in R...

Your data:

dat <- data.frame(case  = 1:5, 
                  group = c(1, 1, 2, 2, 1),
                  val1  = c(3, 2, 1, 5, 8),
                  val2  = c(5, 7, 3, 4, 6),
                  val3  = c(6, 5, 6, 3, 5),
                  val4  = c(8, 4, 8, 7, 3))

A short solution:

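library(plyr)
# Pass only the value columns to scale(): the constant group column would
# turn into NaN and distort the distances.
vals <- c("val1", "val2", "val3", "val4")
ddply(dat, "group",
      function(x) c(mean.ED = mean(dist(scale(as.matrix(x[vals]))))))
#   group  mean.ED
# 1     1 2.791629
# 2     2 2.828427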
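
For anyone who wants to check the numbers, here is a sketch of the same steps unrolled in base R (the helper name `mean_ed` is made up for this illustration). It also shows why the grouping column should stay out of the matrix: a constant column becomes NaN under scale(), and dist() then drops that column and scales the sum of squares up in proportion to the number of columns actually used, which inflates every distance by sqrt(5/4) here.

```r
dat <- data.frame(case  = 1:5,
                  group = c(1, 1, 2, 2, 1),
                  val1  = c(3, 2, 1, 5, 8),
                  val2  = c(5, 7, 3, 4, 6),
                  val3  = c(6, 5, 6, 3, 5),
                  val4  = c(8, 4, 8, 7, 3))
vals <- c("val1", "val2", "val3", "val4")

# Hypothetical helper: mean pairwise Euclidean distance after
# z-scoring each column within the rows it is given.
mean_ed <- function(x) mean(dist(scale(as.matrix(x))))

g1 <- dat[dat$group == 1, ]

ed_values_only <- mean_ed(g1[vals])               # about 2.7916
ed_with_group  <- mean_ed(g1[c("group", vals)])   # about 3.1211

# The ratio is sqrt(5/4): 5 columns passed in, 4 usable after scaling.
ed_with_group / ed_values_only

# All groups at once, value columns only.
by_group <- sapply(split(dat[vals], dat$group), mean_ed)
by_group
```

Group 2 has only two cases, so each standardised column is +/- 1/sqrt(2) and the single distance works out to sqrt(8), roughly 2.83.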

#3

Here is a much simpler solution using base R.

# Columns 3:6 are val1..val4; scale() standardises them within each group.
d <- by(dat[, 3:6], dat$group, function(x) dist(scale(x)))

sapply(d,mean)
