numpy/scipy equivalent of R's ecdf(x)(x) function?

Time: 2021-05-28 21:20:53

What is the equivalent of R's ecdf(x)(x) function in Python, in either numpy or scipy? Is ecdf(x)(x) basically the same as:

import numpy as np
def ecdf(x):
  # normalize X to sum to 1
  x = x / np.sum(x)
  return np.cumsum(x)

or is something else required?

EDIT: How can one control the number of bins used by ecdf?

3 answers

#1


6  

Try these links:

statsmodels.ECDF

ECDF in python without step function?

#2


6  

The OP's implementation of ecdf is wrong: you are not supposed to cumsum() the values. So not ys = np.cumsum(x)/np.sum(x) but ys = np.cumsum(np.ones(len(x)))/float(len(x)) or, better, ys = np.arange(1, len(x)+1)/float(len(x)), with x sorted in both cases.

You can either go with statsmodels's ECDF, if you are OK with that extra dependency, or provide your own implementation. See below:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF
# %matplotlib inline  # IPython magic: uncomment only inside a Jupyter/IPython notebook

grades = (93.5,93,60.8,94.5,82,87.5,91.5,99.5,86,93.5,92.5,78,76,69,94.5,
          89.5,92.8,78,65.5,98,98.5,92.3,95.5,76,91,95,61)


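def ecdf_wrong(x):
    xs = np.sort(x) # need to be sorted
    # wrong: accumulates the data values themselves instead of counting observations
    ys = np.cumsum(xs)/np.sum(xs) # normalized so the last value is 1
    return (xs,ys)

def ecdf(x):
    xs = np.sort(x)
    # correct: the i-th smallest value (1-based) gets cumulative fraction i/n
    ys = np.arange(1, len(xs)+1)/float(len(xs))
    return xs, ys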

xs, ys = ecdf_wrong(grades)
plt.plot(xs, ys, label="wrong cumsum")
xs, ys = ecdf(grades)
plt.plot(xs, ys, label="handwritten", marker=">", markerfacecolor='none')
cdf = ECDF(grades)
plt.plot(cdf.x, cdf.y, label="statsmodels", marker="<", markerfacecolor='none')
plt.legend()
plt.show()
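
For readers who want something closer to R's ecdf(x)(x), which returns a function and then evaluates it, here is a minimal numpy-only sketch. The names make_ecdf, F and sample are my own and purely illustrative, not part of numpy, scipy or statsmodels. It relies on np.searchsorted with side="right" to count how many sorted observations are <= each query point, so tied values all get the same result:

```python
import numpy as np

def make_ecdf(data):
    """Return a function F where F(q) is the fraction of observations <= q."""
    xs = np.sort(np.asarray(data, dtype=float))
    n = xs.size

    def F(q):
        # side="right" returns the number of sorted observations that are <= q
        return np.searchsorted(xs, q, side="right") / n

    return F

sample = [3.0, 1.0, 2.0, 2.0]   # hypothetical data with one tie
F = make_ecdf(sample)
values = F(sample)              # analogue of R's ecdf(x)(x)
print(values)                   # [1.   0.25 0.75 0.75]
```

Unlike the (xs, ys) pair used for plotting above, F can be evaluated at any point, not just at the observed data.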

#3


1  

John Stachurski's Python lectures include a very nice example of a user-written ECDF function. The lecture series is geared towards graduate students in computational economics; however, it is the resource I recommend to anyone learning general scientific computing in Python.

Edit: This is a year old now, but I thought I'd still answer the "Edit" part of your question, in case you (or others) still find it useful.

There really aren't any "bins" with ECDFs as there are with histograms. If G is your empirical distribution function formed from the data vector Z, then G(x) is literally the number of elements of Z that are <= x, divided by len(Z). This requires no "binning" to determine. Thus there is a sense in which the ECDF retains all possible information about a dataset (since it must retain the entire dataset for its calculations), whereas a histogram actually loses some information about the dataset by binning. For this reason, I much prefer to work with ECDFs rather than histograms when possible.
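
To make the point concrete, here is a small sketch under my own assumptions: the function name ecdf_at and the two toy datasets a and b are invented for illustration. It evaluates G(x) straight from the definition above, and shows two different datasets that a 2-bin histogram cannot tell apart while their ECDFs differ:

```python
import numpy as np

def ecdf_at(z, x):
    """G(x): fraction of the observations in z that are <= x (no bins involved)."""
    z = np.asarray(z, dtype=float)
    return np.count_nonzero(z <= x) / z.size

# Two hypothetical datasets that are clearly not the same
a = [1.0, 2.0, 3.0, 7.0]
b = [1.5, 2.5, 3.5, 7.5]

# With 2 bins over [0, 8] both datasets produce identical counts
counts_a, _ = np.histogram(a, bins=2, range=(0, 8))
counts_b, _ = np.histogram(b, bins=2, range=(0, 8))
print(counts_a, counts_b)       # [3 1] [3 1]

# The ECDFs still distinguish them
print(ecdf_at(a, 2.0))          # 0.5
print(ecdf_at(b, 2.0))          # 0.25
```

The histogram result depends on the choice of bins and range; the ECDF has no such parameter to choose.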

Fun bonus: if you need to create a small-footprint ECDF-like object from very large streaming data, you should look into this "Data Skeletons" paper by McDermott et al.
