Recovering Wald distribution parameters: from numpy to scipy

Could someone please help with a question about the parametrization of scipy distributions and how to transform between them?

I basically would like to recover distribution parameters of data that I simulate with numpy...

some_data = np.random.normal(loc=81, scale=7, size=100000)

...by fitting a distribution with scipy

recovered_parms = scipy.stats.norm.fit(some_data)

For the normal distribution, this works: recovered_parms ~= (81, 7).

However, for a Wald distribution, for example, it does not.

some_data = np.random.wald(mean=4, scale=41, size=100000)

recovered_parms = scipy.stats.wald.fit(some_data)

Result: recovered_parms ~= (1.28,3.66)

I understand that the parameters need to be transformed, but I just can't figure out how. Any help is appreciated.

3 Solutions

#1

If the problem is just to estimate the lambda and mean of the Wald distribution, you can simply do

mean = np.mean(some_data)
lambda_ = 1/(np.mean(1/some_data) - 1/mean) # lambda is a reserved keyword :/

This estimate seems to be much closer to the true parameters than whatever scipy.stats.wald.fit returns (if we interpret one of its outputs as the mean, or if we even know how to interpret them).
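As a self-contained illustration of the estimate above, here is a minimal sketch that simulates data and applies the two formulas. As far as I can tell these are the standard closed-form maximum-likelihood estimates for the inverse Gaussian distribution, with lambda playing the role of numpy's scale argument. The seeded generator and the names mean_hat and lambda_hat are my own choices, not part of the original answer.

```python
import numpy as np

# Assumption: a seeded Generator is used only to make the sketch reproducible.
rng = np.random.default_rng(0)
some_data = rng.wald(mean=4, scale=41, size=100_000)

# The sample mean estimates the 'mean' argument of numpy's wald.
mean_hat = np.mean(some_data)

# 1/lambda is estimated by the average of (1/x - 1/mean_hat).
lambda_hat = 1.0 / (np.mean(1.0 / some_data) - 1.0 / mean_hat)

print(mean_hat, lambda_hat)
```

With the values used in the question, the two printed numbers should land near 4 and 41 respectively.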

#2

I don't know that you can; this appears to be a can of worms. See if you agree with my reasoning.

from numpy.random import wald
import scipy.stats

means = [1, 2, 4, 8]
samples = [wald(mean=mean, scale=1, size=100000) for mean in means]

print(('{:>10d}'*len(means)).format(*means))
stats = [scipy.stats.wald.fit(sample) for sample in samples]
print(('{:>10.2f}'*len(means)).format(*[stat[1] for stat in stats]))
print(('{:>10.2f}'*len(means)).format(*[stat[0] for stat in stats]))

scales = [1, 4, 16, 64]
samples = [wald(mean=1, scale=scale, size=100000) for scale in scales]

print(('{:>10d}'*len(scales)).format(*scales))
stats = [scipy.stats.wald.fit(sample) for sample in samples]
print(('{:>10.2f}'*len(scales)).format(*[stat[1] for stat in stats]))
print(('{:>10.2f}'*len(scales)).format(*[stat[0] for stat in stats]))

First I generate four samples, one for each of the means 1, 2, 4 and 8, keeping the scale the same at 1. I calculate a fit for each sample. Then I generate another four samples, one for each of the scales 1, 4, 16 and 64, this time keeping the mean the same at 1.

Here are the results.

     1         2         4         8
  1.00      1.90      3.53      6.43
 -0.00     -0.13     -0.43     -1.06
     1         4        16        64
  1.00      1.14      0.92      0.68
  0.00      0.12      0.35      0.55

I would expect the location to appear first in each pair of results, but it appears that the location is second. Still, at least the location does approximate the mean, even if it shows an increasing negative bias. The scale is difficult to interpret. Over a large range, the scale estimates might be on a logarithmic scale.

This might be a question to put on the developer's site.

#3

numpy.random.wald has two parameters, mean and scale. scale is, as the name suggests, a scale parameter, in the sense of a location-scale family. mean is a shape parameter; it is not a location parameter.

If you look at the docstring for numpy.random.wald, it says "Draw samples from a Wald, or inverse Gaussian, distribution." The docstring for scipy.stats.wald, however, says that it is "a special case of invgauss with mu == 1", where mu is a shape parameter of scipy.stats.invgauss. scipy.stats.wald has only two parameters, loc and scale. (All the continuous distributions in scipy.stats have these parameters.) So the parameters of numpy.random.wald and scipy.stats.wald don't match up: numpy.random.wald has a shape and a scale parameter, but scipy.stats.wald has a location and a scale parameter.

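The docstring statement that scipy.stats.wald is invgauss with mu == 1 can be checked numerically. The following is a small sketch, not taken from the original answer; the grid x and the test scale of 3.0 are arbitrary choices of mine.

```python
import numpy as np
from scipy.stats import invgauss, wald

# Arbitrary grid of positive points at which to compare the two densities.
x = np.linspace(0.05, 10.0, 200)

# Standard form: wald should coincide with invgauss at shape mu = 1.
same_pdf = np.allclose(wald.pdf(x), invgauss.pdf(x, 1.0))

# The same should hold once a (loc, scale) pair is applied to both.
same_scaled = np.allclose(
    wald.pdf(x, loc=0, scale=3.0),
    invgauss.pdf(x, 1.0, loc=0, scale=3.0),
)

print(same_pdf, same_scaled)
```

Because the shape is pinned at 1, scipy.stats.wald has no free parameter left that could absorb numpy's mean argument independently of the scale.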

Instead of scipy.stats.wald, you must use scipy.stats.invgauss to fit data generated with numpy.random.wald. scipy.stats.invgauss is an implementation of the inverse Gaussian distribution that is mentioned in the docstring of numpy.random.wald. scipy.stats.invgauss has three parameters: one shape parameter called mu, along with the standard location (loc) and scale parameters.

The shape parameter mu of scipy.stats.invgauss is not the same as the shape parameter mean of numpy.random.wald. If you do a little algebra with the PDFs of the two functions, you'll find that the relation is

mean = mu * scale

where mu is the invgauss shape parameter, mean is the shape parameter used in numpy.random.wald, and scale has the same meaning in both functions.

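The relation mean = mu * scale can be sanity-checked by comparing densities. In this sketch, wald_pdf is a hypothetical helper of mine that writes out the inverse Gaussian density in the (mean, scale) form that the numpy.random.wald docstring uses; it is compared with scipy.stats.invgauss evaluated at mu = mean / scale.

```python
import numpy as np
from scipy.stats import invgauss


def wald_pdf(x, mean, scale):
    """Inverse Gaussian density in numpy's (mean, scale) parametrization.

    Hypothetical helper written for this check only.
    """
    return np.sqrt(scale / (2 * np.pi * x**3)) * np.exp(
        -scale * (x - mean) ** 2 / (2 * mean**2 * x)
    )


mean, scale = 4.0, 41.0          # values from the question
x = np.linspace(0.5, 15.0, 300)  # arbitrary grid of positive points

# invgauss with shape mu = mean / scale, loc = 0 and the same scale.
pdfs_match = np.allclose(
    wald_pdf(x, mean, scale),
    invgauss.pdf(x, mean / scale, loc=0, scale=scale),
)
print(pdfs_match)
```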

If you generate a sample using numpy.random.wald and you then want to recover the parameters by fitting the inverse Gaussian distribution to it, you must use the above relation to convert the result of the fit to the mean used by numpy.random.wald. Also, numpy.random.wald doesn't have a location parameter, so you must restrict the location of scipy.stats.invgauss to be 0 by using the argument floc=0 in scipy.stats.invgauss.fit().

Here's an example. First, generate some data using numpy.random.wald:

In [55]: m = 4

In [56]: s = 41

In [57]: some_data = np.random.wald(mean=m, scale=s, size=100000)

Now fit scipy.stats.invgauss to that data, with the restriction that the location parameter is 0:

In [58]: from scipy.stats import invgauss

In [59]: mu, loc, scale = invgauss.fit(some_data, floc=0)

In [60]: mu, loc, scale
Out[60]: (0.097186409353576975, 0, 41.155034600558793)

As expected, the scale parameter is close to the parameter that was used to generate the data. To get the estimate of the shape parameter that was used, multiply mu and scale:

In [61]: mu*scale
Out[61]: 3.9997100396505312

It is approximately 4, as expected.

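The session above can be condensed into a small script. This is only a sketch: fit_numpy_wald is a hypothetical helper name, and the seed and sample size are arbitrary choices of mine.

```python
import numpy as np
from scipy.stats import invgauss


def fit_numpy_wald(data):
    """Return (mean, scale) in numpy.random.wald's parametrization.

    Hypothetical helper: fits invgauss with the location pinned at 0,
    then converts the shape mu to numpy's mean via mean = mu * scale.
    """
    mu, loc, scale = invgauss.fit(data, floc=0)
    return mu * scale, scale


# Assumption: seeded Generator for reproducibility only.
rng = np.random.default_rng(12345)
sample = rng.wald(mean=4, scale=41, size=50_000)

mean_hat, scale_hat = fit_numpy_wald(sample)
print(mean_hat, scale_hat)
```

The two printed values should be close to the mean and scale passed to the generator.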

A plot is always useful for visualizing the fit. In the plot, the blue bars show the normalized histogram of the data, and the black curve is the PDF of the fitted inverse Gaussian distribution.

In [86]: import matplotlib.pyplot as plt

In [87]: _ = plt.hist(some_data, bins=40, density=True, alpha=0.6)  # density replaces the removed normed argument

In [88]: xx = np.linspace(some_data.min(), some_data.max(), 500)

In [89]: yy = invgauss.pdf(xx, mu, loc, scale)

In [90]: plt.plot(xx, yy, 'k')
Out[90]: [<matplotlib.lines.Line2D at 0x11b6d64e0>]
