What is the equivalent of R's ecdf(x)(x) function in Python, in either numpy or scipy? Is ecdf(x)(x) basically the same as:
import numpy as np

def ecdf(x):
    # normalize x to sum to 1
    x = x / np.sum(x)
    return np.cumsum(x)
or is something else required?
EDIT: How can one control the number of bins used by ecdf?
3 solutions
#1
6
Try these links:
statsmodels.ECDF
ECDF in python without step function?
#2
6
The OP's implementation of ecdf is wrong: you are not supposed to cumsum() the values themselves. So not ys = np.cumsum(x)/np.sum(x) but ys = np.cumsum(np.ones(len(x)))/float(len(x)), or better, ys = np.arange(1, len(x)+1)/float(len(x)), paired with the sorted values of x.
You can either go with statsmodels's ECDF, if you are OK with that extra dependency, or provide your own implementation. See below:
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF
# In a Jupyter notebook, also run: %matplotlib inline

grades = (93.5, 93, 60.8, 94.5, 82, 87.5, 91.5, 99.5, 86, 93.5, 92.5, 78, 76, 69, 94.5,
          89.5, 92.8, 78, 65.5, 98, 98.5, 92.3, 95.5, 76, 91, 95, 61)

def ecdf_wrong(x):
    xs = np.sort(x)  # needs to be sorted
    ys = np.cumsum(xs) / np.sum(xs)  # wrong: accumulates the values, not the counts
    return xs, ys

def ecdf(x):
    xs = np.sort(x)
    ys = np.arange(1, len(xs) + 1) / float(len(xs))  # i-th sorted point gets height i/n
    return xs, ys

xs, ys = ecdf_wrong(grades)
plt.plot(xs, ys, label="wrong cumsum")

xs, ys = ecdf(grades)
plt.plot(xs, ys, label="handwritten", marker=">", markerfacecolor='none')

cdf = ECDF(grades)
plt.plot(cdf.x, cdf.y, label="statsmodels", marker="<", markerfacecolor='none')

plt.legend()
plt.show()
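To reproduce R's ecdf(x)(x) specifically, the ECDF has to be evaluated at the original data points in their original order. A minimal NumPy-only sketch follows; the name ecdf_values is hypothetical, not part of numpy or scipy. One assumption worth stating: tied values share a single height here, following the "number of values <= x" definition, whereas the handwritten ecdf above gives each tied point its own step.

```python
import numpy as np

def ecdf_values(x):
    """Evaluate the ECDF of x at each element of x, keeping the original order."""
    x = np.asarray(x, dtype=float)
    xs = np.sort(x)
    # side="right" counts how many observations are <= each value,
    # so repeated values all receive the same height
    return np.searchsorted(xs, x, side="right") / len(x)

sample = [3.0, 1.0, 2.0, 2.0]
print(ecdf_values(sample))  # values: 1.0, 0.25, 0.75, 0.75
```

Sorting once and using a binary search keeps this at O(n log n), which should be fine for large samples.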
#3
1
John Stachurski's Python lectures have a very nice example of a user-written ECDF function. His lecture series is geared towards graduate students in computational economics; however, it is my go-to recommendation for anyone learning general scientific computing in Python.
Edit: This is a year old now, but I thought I'd still answer the "Edit" part of your question, in case you (or others) still find it useful.
There really aren't any "bins" with ECDFs as there are with histograms. If G is your empirical distribution function formed using data vector Z, G(x) is literally the number of elements of Z that are <= x, divided by len(Z). This requires no "binning" to determine. Thus there is a sense in which the ECDF retains all possible information about a dataset (since it must retain the entire dataset for calculations), whereas a histogram actually loses some information about the dataset by binning. For this reason, I much prefer to work with ECDFs over histograms when possible.
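That definition can be written almost verbatim in NumPy, with no bin parameter anywhere. This is only a sketch: ecdf_at is a hypothetical helper name, and the pairwise comparison needs memory proportional to len(Z) * len(x), so it suits small inputs.

```python
import numpy as np

def ecdf_at(z, x):
    """G(x): fraction of the observations in z that are <= each query point in x."""
    z = np.asarray(z, dtype=float)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # compare every query point (rows) against every observation (columns),
    # then average each row to get the fraction of observations <= that point
    return (z[None, :] <= x[:, None]).mean(axis=1)

z = [3.0, 1.0, 2.0, 2.0]
print(ecdf_at(z, [0.5, 1.0, 2.0, 2.5, 3.0]))  # values: 0.0, 0.25, 0.75, 0.75, 1.0
```

The query points need not be data points: G is defined everywhere and only jumps at observed values.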
Fun bonus: if you need to create a small-footprint ECDF-like object from very large streaming data, you should look into this "Data Skeletons" paper by McDermott et al.