Proper way to normalize/scale/standardize multiple variables following a power law distribution for use in a linear combination

I'd like to combine a few metrics of nodes in a social network graph into a single value for rank ordering the nodes:

in_degree + betweenness_centrality = informal_power_index

The problem is that in_degree and betweenness_centrality are measured on different scales, say 0-15 vs. 0-35000, and follow a power law distribution (or at least definitely not a normal distribution).

Is there a good way to rescale the variables so that one won't dominate the other in determining the informal_power_index?

Three obvious approaches are:

Standardizing the variables (subtract the mean and divide by the standard deviation). This seems like it would squash the distribution too much, hiding the massive difference between a value in the long tail and one near the peak.

Rescaling the variables to the range [0,1] by subtracting min(variable) and dividing by max(variable) - min(variable). This seems closer to fixing the problem since it won't change the shape of the distribution, but maybe it won't really address the issue? In particular, the means will be different.

Equalize the means by dividing each value by mean(variable). This won't address the difference in scales, but perhaps the mean values are more important for the comparison?
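As a rough illustration of the three approaches above, here is a minimal Python sketch using only the standard library. The sample values and the function names `standardize`, `min_max`, and `mean_scale` are invented for illustration and do not come from the question; using `pstdev` (the population standard deviation) is also an assumption, since the question does not say which variant is meant.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def min_max(values):
    """Map the observed range linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def mean_scale(values):
    """Divide by the mean so that every rescaled variable has mean 1."""
    m = mean(values)
    return [v / m for v in values]


# Hypothetical heavy-tailed samples, invented for illustration only.
in_degree = [0, 1, 1, 1, 2, 2, 3, 5, 15]
betweenness = [0, 10, 20, 20, 50, 100, 400, 3000, 35000]

for name, values in [("in_degree", in_degree), ("betweenness", betweenness)]:
    for fn in (standardize, min_max, mean_scale):
        scaled = fn(values)
        print(f"{name:12s} {fn.__name__:12s} "
              f"mean={mean(scaled):7.3f} max={max(scaled):7.3f}")
```

All three are linear rescalings, so none of them changes the shape of the distribution. They differ only in what ends up equal across the variables: mean and spread for `standardize`, the range for `min_max`, and the mean for `mean_scale`.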

Any other ideas?

4 solutions

#1

You seem to have a strong sense of the underlying distributions. A natural rescaling is to replace each variate with its probability under that distribution. Or, if your model is incomplete, choose a transformation that approximately achieves that. Failing that, here's a related approach: if you have a lot of univariate data from which to build a histogram (of each variate), you could convert each to a 10-point scale based on whether it falls in the 0-10th percentile, the 10-20th percentile, ..., or the 90-100th percentile. These transformed variates have, by construction, a uniform distribution on 1, 2, ..., 10, and you can combine them however you wish.
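The 10-point percentile scale could be sketched in Python roughly as follows. This is one possible reading of the suggestion, not a definitive implementation: the helper name `decile_scores` and the sample data are hypothetical, and the percentile of a value is taken here to be the fraction of observations strictly below it.

```python
from bisect import bisect_left


def decile_scores(values):
    """Score each value from 1 to 10 according to the decile it falls in."""
    ordered = sorted(values)
    n = len(ordered)
    # bisect_left counts the observations strictly below v, so the rank is
    # in 0..n-1 and the integer division below always yields 0..9.
    return [(bisect_left(ordered, v) * 10) // n + 1 for v in values]


# Hypothetical node metrics, invented for illustration only.
in_degree = [0, 1, 1, 1, 2, 2, 3, 5, 8, 15]
betweenness = [0, 10, 20, 25, 50, 100, 400, 3000, 9000, 35000]

# One way to combine the two scores: a plain sum per node.
informal_power_index = [
    a + b for a, b in zip(decile_scores(in_degree), decile_scores(betweenness))
]
print(informal_power_index)
```

Tied values receive the same score, so a variable with many repeated values (such as a small integer in_degree) will only be approximately uniform on 1 to 10.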

#2

You could translate each to a percentage and then apply each to a known quantity. Then use the sum of the new values.

((1 - (in_degree / 15)) * 2000) + ((1 - (betweenness_centrality / 35000)) * 2000) = ?

#3

Very interesting question. Could something like this work:

Let's assume that we want to scale both variables to the range [-1,1]. Take the example of betweenness_centrality, which has a range of 0-35000.

Choose a large number on the order of the range of the variable. As an example, let's choose 25,000.

Create 25,000 bins in the original range [0,35000] and 25,000 bins in the new range [-1,1].

For each number x_i, find the bin it falls into in the original range. Let this be B_i.

Find the interval that bin B_i covers in the new range [-1,1].

Use either the max or the min of that interval as the scaled version of x_i.

This preserves the power law distribution while also scaling it down to [-1,1], and it does not have the problem experienced with (x-mean)/sd.
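The binning steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `bin_rescale` and its parameters are hypothetical, the bins are taken to be of equal width, and the lower edge (the "min") of each target bin is used as the scaled value.

```python
def bin_rescale(values, n_bins=25_000, new_lo=-1.0, new_hi=1.0):
    """Rescale values to [new_lo, new_hi] via equal-width bins.

    Assumes max(values) > min(values).
    """
    lo, hi = min(values), max(values)
    old_width = (hi - lo) / n_bins          # bin width in the original range
    new_width = (new_hi - new_lo) / n_bins  # bin width in the new range
    scaled = []
    for x in values:
        # Bin index B_i in the original range; the maximum value is
        # clamped into the last bin rather than opening a new one.
        b = min(int((x - lo) / old_width), n_bins - 1)
        # Lower edge of the matching bin in the new range.
        scaled.append(new_lo + b * new_width)
    return scaled


# Tiny example with 4 bins so the result is easy to follow by hand.
print(bin_rescale([0, 1, 2, 3, 4], n_bins=4))
```

With equal-width bins this amounts to a linear min-max rescaling to [-1,1] that is rounded down to the nearest bin edge, which is why the shape of the distribution is preserved.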

#4

Normalizing to [0,1] would be my short-answer recommendation for combining the two values, as it will maintain the distribution shape, as you mentioned, and should solve the problem of combining the values.

If the distributions of the two variables are different, which sounds likely, this won't really give you what I think you're after, which is a combined measure of where each variable sits within its given distribution. You would have to come up with a metric that determines where in the given distribution the value lies. This could be done in many ways, one of which would be to determine how many standard deviations away from the mean the given value is. You could then combine these two values in some way to get your index (addition may no longer be sufficient).

You'd have to work out what makes the most sense for the data sets you're looking at. Standard deviations may well be meaningless for your application, but you need to look at statistical measures that relate to the distribution and combine those, rather than combining absolute values, normalized or not.
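As one hedged example of combining "position in the distribution" measures, the sketch below uses the number of standard deviations from the mean for each metric and simply adds them. The node labels, the sample values, the helper name `z_scores`, and the choice of a plain sum are all assumptions made for illustration; as noted above, standard deviations may not be the right measure for heavy-tailed data.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Number of (population) standard deviations each value is from the mean."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Hypothetical nodes and metrics, invented for illustration only.
nodes = ["a", "b", "c", "d", "e"]
in_degree = [1, 5, 15, 3, 1]
betweenness = [10, 35000, 400, 50, 0]

# Combine the two per-node positions; addition is just one possible choice.
index = {
    node: zi + zb
    for node, zi, zb in zip(nodes, z_scores(in_degree), z_scores(betweenness))
}

# Rank nodes from highest to lowest combined index.
ranking = sorted(index, key=index.get, reverse=True)
print(ranking)
```

Swapping `z_scores` for another position measure, such as a percentile rank, leaves the rest of the sketch unchanged.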
