I have noticed something unusual in pandas concerning the ddof (delta degrees of freedom) parameter of the standard deviation calculation (std).
For a regular std, the speed is the same whether I leave ddof at its default of 1 or specify 0. When I do it as part of a group by, however, ddof=0 is around 10 times slower (the test DataFrame I set up has a similar structure to the one I am working on). The slowdown gets worse with more columns, rows, or unique groups.
Any idea what is going on here? Does pandas need a bug fix? Is there a way to replicate the ddof=1 behaviour at a faster speed (I'm running these std calculations a lot)?
import datetime
import numpy as np  # needed for np.random.rand below
import pandas as pd
test = pd.DataFrame(np.random.rand(100000,10))
%timeit test.std()
100 loops, best of 3: 18.2 ms per loop
%timeit test.std(ddof=0)
100 loops, best of 3: 18.3 ms per loop
test['group'] = (test[0]*20+1).astype(int)
test['date'] = [datetime.date(2018, 3, g) for g in test['group']]
test = test.set_index(['date','group'])
%timeit test.groupby(level='date').std()
100 loops, best of 3: 6.78 ms per loop
%timeit test.groupby(level='date').std(ddof=0)
10 loops, best of 3: 68.5 ms per loop
1 Answer
This is not a bug, but it is a known issue.
Below is some pandas source code from groupby.py.
- ddof == 1 (the default value): a Cythonised algorithm is applied.
- ddof != 1: a Python-level loop is applied.
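To make "Python-level loop" concrete, here is a rough sketch of what such an aggregation amounts to: one Python call per group. It is an illustration only, not the pandas internals; the column names and the random test data are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical test data: three value columns and 20 integer group labels.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 3)), columns=["a", "b", "c"])
df["group"] = rng.integers(0, 20, size=len(df))

grouped = df.groupby("group")

# One Python-level call to .var() per group, collected into a frame
# with one row per group and one column per value column.
looped = pd.DataFrame(
    {key: sub[["a", "b", "c"]].var(ddof=0) for key, sub in grouped}
).T
looped.index.name = "group"

# The same numbers through the public groupby API.
builtin = grouped.var(ddof=0)
assert np.allclose(looped.to_numpy(), builtin.to_numpy())
```

The per-group call overhead grows with the number of groups and columns, which is consistent with the slowdown described in the question.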
Therefore, you won't be able to speed up this method for ddof != 1 from within pandas.
@Substitution(name='groupby')
@Appender(_doc_template)
def var(self, ddof=1, *args, **kwargs):
    """
    Compute variance of groups, excluding missing values

    For multiple groupings, the result index will be a MultiIndex

    Parameters
    ----------
    ddof : integer, default 1
        degrees of freedom
    """
    nv.validate_groupby_func('var', args, kwargs)
    if ddof == 1:
        return self._cython_agg_general('var')
    else:
        self._set_group_selection()
        f = lambda x: x.var(ddof=ddof)
        return self._python_agg_general(f)
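One possible workaround, which is not part of the source quoted above, is to stay on the fast ddof=1 path and rescale the result, since var_0 = var_1 * (n - 1) / n for a group with n non-missing values. The sketch below assumes that identity is all you need; the helper name grouped_std_ddof0 and the test data are hypothetical.

```python
import numpy as np
import pandas as pd

def grouped_std_ddof0(grouped):
    """Per-group std with ddof=0, derived from the default ddof=1 result."""
    std1 = grouped.std()   # default ddof=1
    n = grouped.count()    # non-NaN observations per group and column
    # var_0 = var_1 * (n - 1) / n, hence std_0 = std_1 * sqrt((n - 1) / n)
    return std1 * np.sqrt((n - 1) / n)

# Hypothetical test data, loosely following the question's setup.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 3)), columns=["a", "b", "c"])
df["group"] = (df["a"] * 20 + 1).astype(int)

fast = grouped_std_ddof0(df.groupby("group"))
direct = df.groupby("group").std(ddof=0)
assert np.allclose(fast.to_numpy(), direct.to_numpy())
```

One caveat: for a group with a single observation the ddof=1 result is NaN, so this sketch returns NaN where std(ddof=0) would return 0.0. Time both variants on your own data before relying on any speed-up.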