Calculating a standard deviation in a stream

Time: 2023-02-01 14:57:25

Using Python, assume I'm running through a known quantity of items I, and can time how long it takes to process each one, t, as well as keep a running total of the time spent processing, T, and the number of items processed so far, c. I'm currently calculating the average on the fly as A = T / c, but this can be skewed by, say, a single item taking an extraordinarily long time to process (a few seconds compared to a few milliseconds).

I would like to show a running Standard Deviation. How can I do this without keeping a record of each t?
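
For illustration, a minimal sketch of the setup described above; process_item and the sample timings below are hypothetical stand-ins, not part of the question:

import time

def process_item(item):              # hypothetical stand-in for the real per-item work
    time.sleep(item)

items = [0.011, 0.012, 0.010, 2.0]   # I = 4 items; the last one skews the average

T = 0.0                              # running total of processing time
c = 0                                # number of items processed so far
for item in items:
    start = time.perf_counter()
    process_item(item)
    t = time.perf_counter() - start  # time to process this item
    T += t
    c += 1
    A = T / c                        # running average, pulled up by the slow item
print(A)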

3 solutions

#1

I use Welford's method, which gives more accurate results; John D. Cook's overview of it is a good reference. Here's a paragraph from it that summarizes why it is a preferred approach:

This better way of computing variance goes back to a 1962 paper by B. P. Welford and is presented in Donald Knuth’s Art of Computer Programming, Vol 2, page 232, 3rd edition. Although this solution has been known for decades, not enough people know about it. Most people are probably unaware that computing sample variance can be difficult until the first time they compute a standard deviation and get an exception for taking the square root of a negative number.
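
As a rough sketch of the recurrences from that overview (an illustration added here, not code from the original answer), keeping only a count k, a running mean M, and a running sum of squared deviations S:

import math

k, M, S = 0, 0.0, 0.0    # count, running mean, sum of squared deviations from the mean

def include(x):
    # Welford update: M_k = M_{k-1} + (x - M_{k-1})/k, S_k = S_{k-1} + (x - M_{k-1})*(x - M_k)
    global k, M, S
    k += 1
    delta = x - M
    M += delta / k
    S += delta * (x - M)

def running_std():
    # Sample standard deviation of the values seen so far (requires k >= 2).
    return math.sqrt(S / (k - 1))

for t in (0.011, 0.012, 0.010, 2.0):   # hypothetical per-item processing times
    include(t)
print(running_std())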

#2

As outlined in the Wikipedia article on the standard deviation, it is enough to keep track of the following three sums:

s0 = sum(1 for x in samples)    # count of the samples
s1 = sum(x for x in samples)    # sum of the samples
s2 = sum(x*x for x in samples)  # sum of the squared samples

These sums are easily updated as new values arrive. The standard deviation can be calculated as

std_dev = math.sqrt((s0 * s2 - s1 * s1)/(s0 * (s0 - 1)))
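
For example, a minimal sketch of that incremental update (the per-item timings below are hypothetical):

import math

s0 = s1 = s2 = 0.0       # running count, sum, and sum of squares

def add_sample(t):
    # Fold one new processing time t into the three running sums.
    global s0, s1, s2
    s0 += 1
    s1 += t
    s2 += t * t

def running_std_dev():
    # Sample standard deviation from the sums (requires at least two samples).
    return math.sqrt((s0 * s2 - s1 * s1) / (s0 * (s0 - 1)))

for t in (0.011, 0.012, 0.010, 2.0):   # hypothetical per-item processing times
    add_sample(t)
print(running_std_dev())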

Note that this way of computing the standard deviation can be numerically ill-conditioned if your samples are floating point numbers and the standard deviation is small compared to the mean of the samples. If you expect samples of this type, you should resort to Welford's method (see the accepted answer).

#3

Based on Welford's algorithm:

import numpy as np

class OnlineVariance(object):
    """
    Welford's algorithm computes the sample variance incrementally.
    """

    def __init__(self, iterable=None, ddof=1):
        # ddof = delta degrees of freedom: 1 gives the sample variance, 0 the population variance.
        # n is the number of values seen, mean the running mean, and
        # M2 the running sum of squared deviations from the current mean.
        self.ddof, self.n, self.mean, self.M2 = ddof, 0, 0.0, 0.0
        if iterable is not None:
            for datum in iterable:
                self.include(datum)

    def include(self, datum):
        # Welford update: fold one new value into n, mean and M2.
        self.n += 1
        self.delta = datum - self.mean
        self.mean += self.delta / self.n
        self.M2 += self.delta * (datum - self.mean)

    @property
    def variance(self):
        # Defined once n > ddof (raises ZeroDivisionError when n == ddof).
        return self.M2 / (self.n - self.ddof)

    @property
    def std(self):
        return np.sqrt(self.variance)

Update the variance with each new piece of data:

N = 100
data = np.random.random(N)
ov = OnlineVariance(ddof=0)
for d in data:
    ov.include(d)
std = ov.std
print(std)

Check our result against the standard deviation computed by numpy:

assert np.allclose(std, data.std())
