Although I have read many posts about fitting distributions in Python, I am still confused about how to use the floc and fscale parameters. For general information I mainly used this, this and this source.
I know that a given distribution with density f(x) becomes a more general family once the loc and scale parameters are introduced: the density of the shifted and scaled distribution is f((x - loc)/scale)/scale.
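A quick numerical check of that relation, as a minimal sketch. The lognormal distribution, the parameter values and the grid are arbitrary choices made for illustration only.

```python
import numpy as np
from scipy.stats import lognorm

# Arbitrary illustrative values (assumptions, not taken from the data below)
s, loc, scale = 0.5, 2.0, 30.0
x = np.linspace(5.0, 100.0, 50)

# Density of the shifted and scaled distribution, as computed by scipy
shifted = lognorm.pdf(x, s, loc=loc, scale=scale)

# The same density built by hand from the standard form (loc=0, scale=1)
by_hand = lognorm.pdf((x - loc) / scale, s) / scale

identity_holds = np.allclose(shifted, by_hand)
print(identity_holds)
```

The division by scale is what keeps the rescaled density integrating to 1.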
In scipy, we have two choices. When fitting a distribution with distr.fit(x), the initial guess for the loc parameter is 0 and the initial guess for the scale parameter is 1 (so we assume that the parametrized distribution is close to the non-parametrized one). We can also force scipy to fit the 'original' distribution f(x) by calling distr.fit(x, floc = 0, fscale = 1).
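The difference between the two kinds of keyword can be sketched as follows. The gamma distribution, the synthetic sample and the seed are assumptions chosen only to keep the example small; they are not the data used below.

```python
import numpy as np
from scipy.stats import gamma

# Synthetic sample (assumed for illustration): shape 3, loc 0, scale 1
rng = np.random.default_rng(0)
data = gamma.rvs(3.0, loc=0.0, scale=1.0, size=500, random_state=rng)

# loc and scale are starting guesses: the optimizer is free to move away from them
free = gamma.fit(data, loc=0, scale=1)

# floc and fscale are constraints: only the shape parameter is estimated
fixed = gamma.fit(data, floc=0, fscale=1)

print("free :", free)
print("fixed:", fixed)
```

Both calls return a tuple (shape, loc, scale). In the constrained call the last two entries are the values that were passed in.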
My question is: is there any general advice on when to force scipy to fit the 'original' distribution instead of the 'parametrized' one?
Here is the example:
# generate some data
from scipy.stats import lognorm, fisk, gamma
from statsmodels.distributions.empirical_distribution import ECDF
import numpy as np
import matplotlib.pyplot as plt
x1 = [18. for i in range(36)]
x2 = [19. for i in range(17)]
x3 = [22. for i in range(44)]
x4 = [27. for i in range(63)]
x5 = [28.2 for i in range(8)]
x6 = [32. for i in range(104)]
x7 = [32.6 for i in range(29)]
x8 = [33. for i in range(85)]
x9 = [33.4 for i in range(27)]
x10 = [34.2 for i in range(49)]
x11 = [36. for i in range(99)]
x12 = [36.2 for i in range(35)]
x13 = [37. for i in range(98)]
x14 = [38. for i in range(25)]
x15 = [38.4 for i in range(39)]
x16 = [39. for i in range(25)]
x17 = [42. for i in range(54)]
# empirical distribution function
xp = x1 + x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x13+x14+x15+x16+x17
yp = ECDF(xp)
# fit lognormal distribution with parametrization
pars1 = lognorm.fit(xp)
# fit lognormal distribution with floc = 0
pars2 = lognorm.fit(xp, floc = 0)
#plot the result
X = np.linspace(min(xp), max(xp), 10000)
plt.plot(yp.x, yp.y, 'ro')
plt.plot(X, lognorm.cdf(X, pars1[0], pars1[1], pars1[2]), 'b-')
plt.plot(X, lognorm.cdf(X, pars2[0], pars2[1], pars2[2]), 'g-')
plt.show()
#fit the gamma distribution
pars1 = gamma.fit(xp)
pars2 = gamma.fit(xp, floc = 0)
#plot the result
X = np.linspace(min(xp), max(xp), 10000)
plt.plot(yp.x, yp.y, 'ro')
plt.plot(X, gamma.cdf(X, pars1[0], pars1[1], pars1[2]), 'b-')
plt.plot(X, gamma.cdf(X, pars2[0], pars2[1], pars2[2]), 'g-')
plt.show()
As you can see, floc = 0 improved the fit a lot in the lognorm case, while in the gamma case it did not change the fit at all.
Sorry for the long demonstration. Here is my question again: is there any general advice on when to specify floc = 0 and fscale = 1, and when to pass custom starting values such as loc = 0 and scale = 1?
1 Answer
Short answer
Provide rough estimates for loc and scale whenever you can. Provide floc and fscale only when you actually need them for your subsequent use of the model; that is, when an answer with, say, a distribution mean different from 0 is simply not acceptable to you.
For example, if you model elastic force by Hooke's law F = k*x and want to find k from experimental force F and deformation x, there is no point in fitting a general linear model k*x+b; we know that zero force produces zero deformation. A nonzero value of b may achieve a better fit, but only because it follows the experimental errors more closely, which is not the goal. So this is a situation where we want to force a certain parameter to be zero.
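The Hooke's law case can be sketched like this. The spring constant, noise level and seed are made-up values for illustration.

```python
import numpy as np

# Made-up experiment (assumption): true spring constant and noisy force readings
rng = np.random.default_rng(1)
k_true = 2.5
x = np.linspace(0.1, 1.0, 20)                    # deformation
F = k_true * x + rng.normal(0.0, 0.05, x.size)   # measured force

# Constrained model F = k*x: least squares through the origin
k_fixed = np.sum(x * F) / np.sum(x * x)

# General model F = k*x + b: the intercept b can only absorb noise here
k_free, b_free = np.polyfit(x, F, 1)

print(k_fixed, k_free, b_free)
```

The general model always has a residual at least as small as the constrained one, yet its b carries no physical meaning. That is the sense in which pinning a parameter is justified by the model, not by the quality of the fit.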
Never use floc or fscale if you just want to improve the fit; use loc and scale instead.
Explanation
Fitting a distribution to data is a multivariable optimization problem. Such problems are difficult, and solvers frequently fail when the starting point is far from the optimum. If floc gives a better result than the unconstrained fit, that only means the unconstrained fit failed.
To improve the outcome, you should provide tentative loc and scale parameters whenever you are able to come up with something reasonable.
In your lognormal example, you compare giving no hint at all with imposing the restriction floc=0. But the best strategy is simply to give a hint with loc=0:
pars1 = lognorm.fit(xp, loc=0)
The resulting blue curve is better than the green one obtained with floc=0.
Of course it is better. loc=0 points the optimizer to a pretty good place to start and lets it work from there. floc=0 points the optimizer to the same good starting place, but then tells it to stay there.
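One way to check this numerically is to compare the negative log-likelihood of the two fits. The sketch below is only an illustration under stated assumptions: it uses synthetic gamma data with a true loc of 5 (not your data), and it takes the hint for the free fit from the constrained result, which is just one possible way of coming up with reasonable starting values.

```python
import numpy as np
from scipy.stats import gamma

# Synthetic sample (assumed for illustration) whose true loc is 5, not 0
rng = np.random.default_rng(42)
data = gamma.rvs(4.0, loc=5.0, scale=2.0, size=500, random_state=rng)

# Constrained fit: loc is pinned at 0
fixed = gamma.fit(data, floc=0)

# Free fit, with the constrained solution passed as a starting hint
free = gamma.fit(data, fixed[0], loc=fixed[1], scale=fixed[2])

# Negative log-likelihood of each fit (lower means a better fit)
nll_fixed = gamma.nnlf(fixed, data)
nll_free = gamma.nnlf(free, data)
print(nll_fixed, nll_free)
```

Because the free fit starts where the constrained one ends and is then allowed to move, its negative log-likelihood should not come out higher than the constrained value.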