为什么nanty函数在pandas系列/数据帧上如此慢?

时间:2021-05-21 21:19:21

Consider a small MWE, taken from another question:

考虑一个小型MWE,取自另一个问题:

DateTime                Data
2017-11-21 18:54:31     1
2017-11-22 02:26:48     2
2017-11-22 10:19:44     3
2017-11-22 15:11:28     6
2017-11-22 23:21:58     7
2017-11-28 14:28:28    28
2017-11-28 14:36:40     0
2017-11-28 14:59:48     1

The goal is to clip all values with an upper bound of 1. My answer uses np.clip, which works fine.

目标是剪切所有值的上限为1.我的答案使用np.clip,它工作正常。

np.clip(df.Data, a_min=None, a_max=1)
array([1, 1, 1, 1, 1, 1, 0, 1])

Or,

要么,

np.clip(df.Data.values, a_min=None, a_max=1)
array([1, 1, 1, 1, 1, 1, 0, 1])

Both of which return the same answer. My question is about the relative performance of these two methods. Consider -

两者都返回相同的答案。我的问题是关于这两种方法的相对表现。考虑 -

df = pd.concat([df]*1000).reset_index(drop=True)

%timeit np.clip(df.Data, a_min=None, a_max=1)
1000 loops, best of 3: 270 µs per loop

%timeit np.clip(df.Data.values, a_min=None, a_max=1)
10000 loops, best of 3: 23.4 µs per loop

Why is there such a massive difference between the two, just by calling values on the latter? In other words...

为什么两者之间存在如此巨大的差异,仅仅通过调用后者的价值?换一种说法...

Why are numpy functions so slow on pandas objects?

为什么numpy函数在pandas对象上如此慢?

4 个解决方案

#1


46  

Yes, it seems like np.clip is a lot slower on pandas.Series than on numpy.ndarrays. That's correct but it's actually (at least asymptomatically) not that bad. 8000 elements is still in the regime where constant factors are major contributors in the runtime. I think this is a very important aspect to the question, so I'm visualizing this (borrowing from another answer):

是的,似乎np.clip在pandas.Series上比在numpy.ndarrays上要慢得多。这是正确的,但它实际上(至少是无症状地)并没有那么糟糕。 8000个元素仍然处于运行时常数因素是主要贡献者的制度中。我认为这是问题的一个非常重要的方面,所以我想象这个(借用另一个答案):

# Setup

import pandas as pd
import numpy as np

def on_series(s):
    return np.clip(s, a_min=None, a_max=1)

def on_values_of_series(s):
    return np.clip(s.values, a_min=None, a_max=1)

# Timing setup
timings = {on_series: [], on_values_of_series: []}
sizes = [2**i for i in range(1, 26, 2)]

# Timing
for size in sizes:
    func_input = pd.Series(np.random.randint(0, 30, size=size))
    for func in timings:
        res = %timeit -o func(func_input)
        timings[func].append(res)

%matplotlib notebook

import matplotlib.pyplot as plt
import numpy as np

fig, (ax1, ax2) = plt.subplots(1, 2)

for func in timings:
    ax1.plot(sizes, 
             [time.best for time in timings[func]], 
             label=str(func.__name__))
ax1.set_xscale('log')
ax1.set_yscale('log')
ax1.set_xlabel('size')
ax1.set_ylabel('time [seconds]')
ax1.grid(which='both')
ax1.legend()

baseline = on_values_of_series # choose one function as baseline
for func in timings:
    ax2.plot(sizes, 
             [time.best / ref.best for time, ref in zip(timings[func], timings[baseline])], 
             label=str(func.__name__))
ax2.set_yscale('log')
ax2.set_xscale('log')
ax2.set_xlabel('size')
ax2.set_ylabel('time relative to {}'.format(baseline.__name__))
ax2.grid(which='both')
ax2.legend()

plt.tight_layout()

为什么nanty函数在pandas系列/数据帧上如此慢?

It's a log-log plot because I think this shows the important features more clearly. For example it shows that np.clip on a numpy.ndarray is faster but it also has a much smaller constant factor in that case. The difference for large arrays is only ~3! That's still a big difference but way less than the difference on small arrays.

这是一个对数日志图,因为我认为这更清楚地显示了重要的特征。例如,它显示numpy.ndarray上的np.clip更快,但在这种情况下它也具有小得多的常数因子。大数组的差异只有3个!这仍然是一个很大的差异,但比小阵列的差异要小。

However, that's still not an answer to the question where the time difference comes from.

然而,这仍然不是时差的来源问题的答案。

The solution is actually quite simple: np.clip delegates to the clip method of the first argument:

解决方案实际上非常简单:np.clip委托第一个参数的clip方法:

>>> np.clip??
Source:   
def clip(a, a_min, a_max, out=None):
    """
    ...
    """
    return _wrapfunc(a, 'clip', a_min, a_max, out=out)

>>> np.core.fromnumeric._wrapfunc??
Source:   
def _wrapfunc(obj, method, *args, **kwds):
    try:
        return getattr(obj, method)(*args, **kwds)
    # ...
    except (AttributeError, TypeError):
        return _wrapit(obj, method, *args, **kwds)

The getattr line of the _wrapfunc function is the important line here, because np.ndarray.clip and pd.Series.clip are different methods, yes, completely different methods:

_wrapfunc函数的getattr行是这里的重要行,因为np.ndarray.clip和pd.Series.clip是不同的方法,是的,完全不同的方法:

>>> np.ndarray.clip
<method 'clip' of 'numpy.ndarray' objects>
>>> pd.Series.clip
<function pandas.core.generic.NDFrame.clip>

Unfortunately is np.ndarray.clip a C-function so it's hard to profile it, however pd.Series.clip is a regular Python function so it's easy to profile. Let's use a Series of 5000 integers here:

不幸的是np.ndarray.clip是一个C函数,所以很难对它进行分析,但是pd.Series.clip是一个常规的Python函数,所以它很容易分析。我们在这里使用一系列5000个整数:

s = pd.Series(np.random.randint(0, 100, 5000))

For the np.clip on the values I get the following line-profiling:

对于值的np.clip,我得到以下行分析:

%load_ext line_profiler
%lprun -f np.clip -f np.core.fromnumeric._wrapfunc np.clip(s.values, a_min=None, a_max=1)

Timer unit: 4.10256e-07 s

Total time: 2.25641e-05 s
File: numpy\core\fromnumeric.py
Function: clip at line 1673

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1673                                           def clip(a, a_min, a_max, out=None):
  1674                                               """
  ...
  1726                                               """
  1727         1           55     55.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)

Total time: 1.51795e-05 s
File: numpy\core\fromnumeric.py
Function: _wrapfunc at line 55

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    55                                           def _wrapfunc(obj, method, *args, **kwds):
    56         1            2      2.0      5.4      try:
    57         1           35     35.0     94.6          return getattr(obj, method)(*args, **kwds)
    58                                           
    59                                               # An AttributeError occurs if the object does not have
    60                                               # such a method in its class.
    61                                           
    62                                               # A TypeError occurs if the object does have such a method
    63                                               # in its class, but its signature is not identical to that
    64                                               # of NumPy's. This situation has occurred in the case of
    65                                               # a downstream library like 'pandas'.
    66                                               except (AttributeError, TypeError):
    67                                                   return _wrapit(obj, method, *args, **kwds)

But for np.clip on the Series I get a totally different profiling result:

但是对于系列中的np.clip,我得到了一个完全不同的分析结果:

%lprun -f np.clip -f np.core.fromnumeric._wrapfunc -f pd.Series.clip -f pd.Series._clip_with_scalar np.clip(s, a_min=None, a_max=1)

Timer unit: 4.10256e-07 s

Total time: 0.000823794 s
File: numpy\core\fromnumeric.py
Function: clip at line 1673

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1673                                           def clip(a, a_min, a_max, out=None):
  1674                                               """
  ...
  1726                                               """
  1727         1         2008   2008.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)

Total time: 0.00081846 s
File: numpy\core\fromnumeric.py
Function: _wrapfunc at line 55

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    55                                           def _wrapfunc(obj, method, *args, **kwds):
    56         1            2      2.0      0.1      try:
    57         1         1993   1993.0     99.9          return getattr(obj, method)(*args, **kwds)
    58                                           
    59                                               # An AttributeError occurs if the object does not have
    60                                               # such a method in its class.
    61                                           
    62                                               # A TypeError occurs if the object does have such a method
    63                                               # in its class, but its signature is not identical to that
    64                                               # of NumPy's. This situation has occurred in the case of
    65                                               # a downstream library like 'pandas'.
    66                                               except (AttributeError, TypeError):
    67                                                   return _wrapit(obj, method, *args, **kwds)

Total time: 0.000804922 s
File: pandas\core\generic.py
Function: clip at line 4969

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  4969                                               def clip(self, lower=None, upper=None, axis=None, inplace=False,
  4970                                                        *args, **kwargs):
  4971                                                   """
  ...
  5021                                                   """
  5022         1           12     12.0      0.6          if isinstance(self, ABCPanel):
  5023                                                       raise NotImplementedError("clip is not supported yet for panels")
  5024                                           
  5025         1           10     10.0      0.5          inplace = validate_bool_kwarg(inplace, 'inplace')
  5026                                           
  5027         1           69     69.0      3.5          axis = nv.validate_clip_with_axis(axis, args, kwargs)
  5028                                           
  5029                                                   # GH 17276
  5030                                                   # numpy doesn't like NaN as a clip value
  5031                                                   # so ignore
  5032         1          158    158.0      8.1          if np.any(pd.isnull(lower)):
  5033         1            3      3.0      0.2              lower = None
  5034         1           26     26.0      1.3          if np.any(pd.isnull(upper)):
  5035                                                       upper = None
  5036                                           
  5037                                                   # GH 2747 (arguments were reversed)
  5038         1            1      1.0      0.1          if lower is not None and upper is not None:
  5039                                                       if is_scalar(lower) and is_scalar(upper):
  5040                                                           lower, upper = min(lower, upper), max(lower, upper)
  5041                                           
  5042                                                   # fast-path for scalars
  5043         1            1      1.0      0.1          if ((lower is None or (is_scalar(lower) and is_number(lower))) and
  5044         1           28     28.0      1.4                  (upper is None or (is_scalar(upper) and is_number(upper)))):
  5045         1         1654   1654.0     84.3              return self._clip_with_scalar(lower, upper, inplace=inplace)
  5046                                           
  5047                                                   result = self
  5048                                                   if lower is not None:
  5049                                                       result = result.clip_lower(lower, axis, inplace=inplace)
  5050                                                   if upper is not None:
  5051                                                       if inplace:
  5052                                                           result = self
  5053                                                       result = result.clip_upper(upper, axis, inplace=inplace)
  5054                                           
  5055                                                   return result

Total time: 0.000662153 s
File: pandas\core\generic.py
Function: _clip_with_scalar at line 4920

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  4920                                               def _clip_with_scalar(self, lower, upper, inplace=False):
  4921         1            2      2.0      0.1          if ((lower is not None and np.any(isna(lower))) or
  4922         1           25     25.0      1.5                  (upper is not None and np.any(isna(upper)))):
  4923                                                       raise ValueError("Cannot use an NA value as a clip threshold")
  4924                                           
  4925         1           22     22.0      1.4          result = self.values
  4926         1          571    571.0     35.4          mask = isna(result)
  4927                                           
  4928         1           95     95.0      5.9          with np.errstate(all='ignore'):
  4929         1            1      1.0      0.1              if upper is not None:
  4930         1          141    141.0      8.7                  result = np.where(result >= upper, upper, result)
  4931         1           33     33.0      2.0              if lower is not None:
  4932                                                           result = np.where(result <= lower, lower, result)
  4933         1           73     73.0      4.5          if np.any(mask):
  4934                                                       result[mask] = np.nan
  4935                                           
  4936         1           90     90.0      5.6          axes_dict = self._construct_axes_dict()
  4937         1          558    558.0     34.6          result = self._constructor(result, **axes_dict).__finalize__(self)
  4938                                           
  4939         1            2      2.0      0.1          if inplace:
  4940                                                       self._update_inplace(result)
  4941                                                   else:
  4942         1            1      1.0      0.1              return result

I stopped going into the subroutines at that point because it already highlights where the pd.Series.clip does much more work than the np.ndarray.clip. Just compare the total time of the np.clip call on the values (55 timer units) to one of the first checks in the pandas.Series.clip method, the if np.any(pd.isnull(lower)) (158 timer units). At that point the pandas method didn't even start at clipping and it already takes 3 times longer.

那时我停止进入子程序,因为它已经突出显示了pd.Series.clip比np.ndarray.clip做得更多的工作。只需将np.clip调用的总时间(55个计时器单位)与pandas.Series.clip方法中的第一个检查之一进行比较,if np.any(pd.isnull(lower))(158计时器)单位)。那时,熊猫方法甚至没有开始削波,它已经花了3倍的时间。

However several of these "overheads" become insignificant when the array is big:

然而,当阵列很大时,其中一些“开销”变得无关紧要:

s = pd.Series(np.random.randint(0, 100, 1000000))

%lprun -f np.clip -f np.core.fromnumeric._wrapfunc -f pd.Series.clip -f pd.Series._clip_with_scalar np.clip(s, a_min=None, a_max=1)

Timer unit: 4.10256e-07 s

Total time: 0.00593476 s
File: numpy\core\fromnumeric.py
Function: clip at line 1673

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1673                                           def clip(a, a_min, a_max, out=None):
  1674                                               """
  ...
  1726                                               """
  1727         1        14466  14466.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)

Total time: 0.00592779 s
File: numpy\core\fromnumeric.py
Function: _wrapfunc at line 55

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    55                                           def _wrapfunc(obj, method, *args, **kwds):
    56         1            1      1.0      0.0      try:
    57         1        14448  14448.0    100.0          return getattr(obj, method)(*args, **kwds)
    58                                           
    59                                               # An AttributeError occurs if the object does not have
    60                                               # such a method in its class.
    61                                           
    62                                               # A TypeError occurs if the object does have such a method
    63                                               # in its class, but its signature is not identical to that
    64                                               # of NumPy's. This situation has occurred in the case of
    65                                               # a downstream library like 'pandas'.
    66                                               except (AttributeError, TypeError):
    67                                                   return _wrapit(obj, method, *args, **kwds)

Total time: 0.00591302 s
File: pandas\core\generic.py
Function: clip at line 4969

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  4969                                               def clip(self, lower=None, upper=None, axis=None, inplace=False,
  4970                                                        *args, **kwargs):
  4971                                                   """
  ...
  5021                                                   """
  5022         1           17     17.0      0.1          if isinstance(self, ABCPanel):
  5023                                                       raise NotImplementedError("clip is not supported yet for panels")
  5024                                           
  5025         1           14     14.0      0.1          inplace = validate_bool_kwarg(inplace, 'inplace')
  5026                                           
  5027         1           97     97.0      0.7          axis = nv.validate_clip_with_axis(axis, args, kwargs)
  5028                                           
  5029                                                   # GH 17276
  5030                                                   # numpy doesn't like NaN as a clip value
  5031                                                   # so ignore
  5032         1          125    125.0      0.9          if np.any(pd.isnull(lower)):
  5033         1            2      2.0      0.0              lower = None
  5034         1           30     30.0      0.2          if np.any(pd.isnull(upper)):
  5035                                                       upper = None
  5036                                           
  5037                                                   # GH 2747 (arguments were reversed)
  5038         1            2      2.0      0.0          if lower is not None and upper is not None:
  5039                                                       if is_scalar(lower) and is_scalar(upper):
  5040                                                           lower, upper = min(lower, upper), max(lower, upper)
  5041                                           
  5042                                                   # fast-path for scalars
  5043         1            2      2.0      0.0          if ((lower is None or (is_scalar(lower) and is_number(lower))) and
  5044         1           32     32.0      0.2                  (upper is None or (is_scalar(upper) and is_number(upper)))):
  5045         1        14092  14092.0     97.8              return self._clip_with_scalar(lower, upper, inplace=inplace)
  5046                                           
  5047                                                   result = self
  5048                                                   if lower is not None:
  5049                                                       result = result.clip_lower(lower, axis, inplace=inplace)
  5050                                                   if upper is not None:
  5051                                                       if inplace:
  5052                                                           result = self
  5053                                                       result = result.clip_upper(upper, axis, inplace=inplace)
  5054                                           
  5055                                                   return result

Total time: 0.00575753 s
File: pandas\core\generic.py
Function: _clip_with_scalar at line 4920

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  4920                                               def _clip_with_scalar(self, lower, upper, inplace=False):
  4921         1            2      2.0      0.0          if ((lower is not None and np.any(isna(lower))) or
  4922         1           28     28.0      0.2                  (upper is not None and np.any(isna(upper)))):
  4923                                                       raise ValueError("Cannot use an NA value as a clip threshold")
  4924                                           
  4925         1          120    120.0      0.9          result = self.values
  4926         1         3525   3525.0     25.1          mask = isna(result)
  4927                                           
  4928         1           86     86.0      0.6          with np.errstate(all='ignore'):
  4929         1            2      2.0      0.0              if upper is not None:
  4930         1         9314   9314.0     66.4                  result = np.where(result >= upper, upper, result)
  4931         1           61     61.0      0.4              if lower is not None:
  4932                                                           result = np.where(result <= lower, lower, result)
  4933         1          283    283.0      2.0          if np.any(mask):
  4934                                                       result[mask] = np.nan
  4935                                           
  4936         1           78     78.0      0.6          axes_dict = self._construct_axes_dict()
  4937         1          532    532.0      3.8          result = self._constructor(result, **axes_dict).__finalize__(self)
  4938                                           
  4939         1            2      2.0      0.0          if inplace:
  4940                                                       self._update_inplace(result)
  4941                                                   else:
  4942         1            1      1.0      0.0              return result

There are still multiple function calls, for example isna and np.where, that take a significant amount of time, but overall this is at least comparable to the np.ndarray.clip time (that's in the regime where the timing difference is ~3 on my computer).

仍然存在多个函数调用,例如isna和np.where,这需要花费大量时间,但总体而言这至少与np.ndarray.clip时间相当(在时序差为~3的情况下)在我的电脑上)。

The takeaway should probably be:

外卖应该是:

  • Many NumPy functions just delegate to a method of the object passed in, so there can be huge differences when you pass in different objects.
  • 许多NumPy函数只是委托传入的对象的方法,因此传入不同的对象时可能会有很大的差异。
  • Profiling, especially line-profiling, can be a great tool to find the places where the performance difference comes from.
  • 分析,尤其是线性分析,可以成为找到性能差异来源的好工具。
  • Always make sure to test differently sized objects in such cases. You could be comparing constant factors that probably don't matter except if you process lots of small arrays.
  • 在这种情况下,务必确保测试不同尺寸的物体。您可以比较可能无关紧要的常数因素,除非您处理大量小数组。

Used versions:

二手版本:

Python 3.6.3 64-bit on Windows 10
Numpy 1.13.3
Pandas 0.21.1

#2


8  

Just read the source code, it's clear.

只需阅读源代码即可。

def clip(a, a_min, a_max, out=None):
    """a : array_like Array containing elements to clip."""
    return _wrapfunc(a, 'clip', a_min, a_max, out=out)

def _wrapfunc(obj, method, *args, **kwds):
    try:
        return getattr(obj, method)(*args, **kwds)
    #This situation has occurred in the case of
    # a downstream library like 'pandas'.
    except (AttributeError, TypeError):
        return _wrapit(obj, method, *args, **kwds)

def _wrapit(obj, method, *args, **kwds):
    try:
        wrap = obj.__array_wrap__
    except AttributeError:
        wrap = None
    result = getattr(asarray(obj), method)(*args, **kwds)
    if wrap:
        if not isinstance(result, mu.ndarray):
            result = asarray(result)
        result = wrap(result)
    return result

rectify:

after pandas v0.13.0_ahl1,pandas has it's own implement of clip.

在pandas v0.13.0_ahl1之后,pandas拥有它自己的clip实现。

#3


7  

There are two parts to the performance difference to be aware of here:

这里要注意性能差异有两个部分:

  • Python overhead in each library (pandas being extra helpful)
  • 每个库中的Python开销(熊猫更有帮助)
  • Difference in numeric algorithm implementation (pd.clip actually calls np.where)
  • 数值算法实现的差异(pd.clip实际调用np.where)

Running this on a very small array should demonstrate the difference in Python overhead. For numpy, this is understandably very small, however pandas does a lot of checking (null values, more flexible argument processing, etc) before getting to the heavy number crunching. I've tried to show a rough breakdown of the stages which the two codes go through before hitting C code bedrock.

在一个非常小的数组上运行它应该证明Python开销的差异。对于numpy,这是可以理解的非常小,但是在进行大量数字运算之前,pandas会进行大量检查(空值,更灵活的参数处理等)。我试图在击中C代码基岩之前显示两个代码经历的阶段的粗略分解。

data = pd.Series(np.random.random(100))

When using np.clip on an ndarray, the overhead is simply the numpy wrapper function calling the object's method:

在ndarray上使用np.clip时,开销只是调用对象方法的numpy包装函数:

>>> %timeit np.clip(data.values, 0.2, 0.8)        # numpy wrapper, calls .clip() on the ndarray
>>> %timeit data.values.clip(0.2, 0.8)            # C function call

2.22 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.32 µs ± 20.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Pandas spends more time checking for edge cases before getting to the algorithm:

在进入算法之前,Pandas花费更多时间检查边缘情况:

>>> %timeit np.clip(data, a_min=0.2, a_max=0.8)   # numpy wrapper, calls .clip() on the Series
>>> %timeit data.clip(lower=0.2, upper=0.8)       # pandas API method
>>> %timeit data._clip_with_scalar(0.2, 0.8)      # lowest level python function

102 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
90.4 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
73.7 µs ± 805 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Relative to overall time, the overhead of both libraries before hitting C code is pretty significant. For numpy, the single wrapping instruction takes as much time to execute as the numeric processing. Pandas has ~30x more overhead just in the first two layers of function calls.

相对于总时间,两个库在命中C代码之前的开销非常大。对于numpy,单个包装指令需要花费与数值处理相同的时间。在前两层函数调用中,Pandas的开销大约增加了30倍。

To isolate what is happening at the algorithm level, we should check this on a larger array and benchmark the same functions:

为了隔离算法级别发生的事情,我们应该在更大的阵列上检查它并对相同的函数进行基准测试:

>>> data = pd.Series(np.random.random(1000000))

>>> %timeit np.clip(data.values, 0.2, 0.8)
>>> %timeit data.values.clip(0.2, 0.8)

2.85 ms ± 37.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.85 ms ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit np.clip(data, a_min=0.2, a_max=0.8)
>>> %timeit data.clip(lower=0.2, upper=0.8)
>>> %timeit data._clip_with_scalar(0.2, 0.8)

12.3 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.3 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.2 ms ± 76.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The python overhead in both cases is now negligible; time for wrapper functions and argument checking is small relative to the calculation time on 1 million values. However there is a 3-4x speed difference which can be attributed to numeric implementation. By investigating a bit in the source code, we see that the pandas implementation of clip actually uses np.where, not np.clip:

两种情况下的python开销现在可以忽略不计;包装函数和参数检查的时间相对于100万个值的计算时间而言较小。然而,速度差异为3-4倍,这可归因于数字实现。通过调查源代码中的一点,我们看到clip的pandas实现实际上使用的是np.where,而不是np.clip:

def clip_where(data, lower, upper):
    ''' Actual implementation in pd.Series._clip_with_scalar (minus NaN handling). '''
    result = data.values
    result = np.where(result >= upper, upper, result)
    result = np.where(result <= lower, lower, result)
    return pd.Series(result)

def clip_clip(data, lower, upper):
    ''' What would happen if we used ndarray.clip instead. '''
    return pd.Series(data.values.clip(lower, upper))

The additional effort required to check each boolean condition separately before doing a conditional replace would seem to account for the speed difference. Specifying both upper and lower would result in 4 passes through the numpy array (two inequality checks and two calls to np.where). Benchmarking these two functions shows that 3-4x speed ratio:

在进行条件替换之前单独检查每个布尔条件所需的额外工作似乎考虑了速度差异。指定upper和lower将导致4次遍历numpy数组(两次不等式检查和两次调用np.where)。对这两个功能进行基准测试表明,3-4倍速比:

>>> %timeit clip_clip(data, lower=0.2, upper=0.8)
>>> %timeit clip_where(data, lower=0.2, upper=0.8)

11.1 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.97 ms ± 76.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I'm not sure why the pandas devs went with this implementation. np.clip may be a newer API function that previously required a workaround. There is also a little more to it than I've gone into here, since pandas checks for various case before running the final algorithm, and this is only one of the implementations that may be called.

我不确定为什么熊猫开发者会采用这种方式。 np.clip可能是以前需要解决方法的较新的API函数。还有一点比我进入这里还要多,因为在运行最终算法之前,pandas会检查各种情况,这只是可以调用的一个实现。

#4


5  

The reason why the performance differs is because numpy first tends to search for pandas implementation of the function using getattr than doing the same in builtin numpy functions when a pandas object is passed.

性能不同的原因是因为numpy首先倾向于使用getattr搜索函数的pandas实现,而不是在传递pandas对象时在内置numpy函数中执行相同操作。

Its not the numpy over the pandas object that is slow, its the pandas version.

它不是pandas对象的缓慢,它是熊猫版本。

When you do

当你这样做

np.clip(pd.Series([1,2,3,4,5]),a_min=None,amax=1)  

_wrapfunc is being called :

正在调用_wrapfunc:

# Code from source 
def _wrapfunc(obj, method, *args, **kwds):
    try:
        return getattr(obj, method)(*args, **kwds)

Due to _wrapfunc's getattr method :

由于_wrapfunc的getattr方法:

getattr(pd.Series([1,2,3,4,5]),'clip')(None, 1)
# Equivalent to `pd.Series([1,2,3,4,5]).clip(lower=None,upper=1)`
# 0    1
# 1    1
# 2    1
# 3    1
# 4    1
# dtype: int64

If you go through the pandas implementation there is a lot of pre checking work that is done. Its the reason why the functions which has the pandas implementation done via numpy has such difference in speed.

如果你通过pandas实现,那么有很多预检工作已经完成。它之所以通过numpy实现pandas实现的函数在速度上有这样的差异。

Not only clip, functions like cumsum,cumprod,reshape,searchsorted,transpose and much more uses pandas version of them than numpy when you pass them a pandas object.

当你传递一个pandas对象时,不仅剪辑,像cumsum,cumprod,reshape,searchsorted,transpose等功能使用pandas版本而不是numpy。

It might appear numpy is doing the work over those objects but under the hood its the pandas function.

可能看起来numpy正在对这些对象进行工作但在引擎盖下它的pandas功能。

#1


46  

Yes, it seems like np.clip is a lot slower on pandas.Series than on numpy.ndarrays. That's correct but it's actually (at least asymptomatically) not that bad. 8000 elements is still in the regime where constant factors are major contributors in the runtime. I think this is a very important aspect to the question, so I'm visualizing this (borrowing from another answer):

是的,似乎np.clip在pandas.Series上比在numpy.ndarrays上要慢得多。这是正确的,但它实际上(至少是无症状地)并没有那么糟糕。 8000个元素仍然处于运行时常数因素是主要贡献者的制度中。我认为这是问题的一个非常重要的方面,所以我想象这个(借用另一个答案):

# Setup

import pandas as pd
import numpy as np

def on_series(s):
    return np.clip(s, a_min=None, a_max=1)

def on_values_of_series(s):
    return np.clip(s.values, a_min=None, a_max=1)

# Timing setup
timings = {on_series: [], on_values_of_series: []}
sizes = [2**i for i in range(1, 26, 2)]

# Timing
for size in sizes:
    func_input = pd.Series(np.random.randint(0, 30, size=size))
    for func in timings:
        res = %timeit -o func(func_input)
        timings[func].append(res)

%matplotlib notebook

import matplotlib.pyplot as plt
import numpy as np

fig, (ax1, ax2) = plt.subplots(1, 2)

for func in timings:
    ax1.plot(sizes, 
             [time.best for time in timings[func]], 
             label=str(func.__name__))
ax1.set_xscale('log')
ax1.set_yscale('log')
ax1.set_xlabel('size')
ax1.set_ylabel('time [seconds]')
ax1.grid(which='both')
ax1.legend()

baseline = on_values_of_series # choose one function as baseline
for func in timings:
    ax2.plot(sizes, 
             [time.best / ref.best for time, ref in zip(timings[func], timings[baseline])], 
             label=str(func.__name__))
ax2.set_yscale('log')
ax2.set_xscale('log')
ax2.set_xlabel('size')
ax2.set_ylabel('time relative to {}'.format(baseline.__name__))
ax2.grid(which='both')
ax2.legend()

plt.tight_layout()

为什么nanty函数在pandas系列/数据帧上如此慢?

It's a log-log plot because I think this shows the important features more clearly. For example it shows that np.clip on a numpy.ndarray is faster but it also has a much smaller constant factor in that case. The difference for large arrays is only ~3! That's still a big difference but way less than the difference on small arrays.

这是一个对数日志图,因为我认为这更清楚地显示了重要的特征。例如,它显示numpy.ndarray上的np.clip更快,但在这种情况下它也具有小得多的常数因子。大数组的差异只有3个!这仍然是一个很大的差异,但比小阵列的差异要小。

However, that's still not an answer to the question where the time difference comes from.

然而,这仍然不是时差的来源问题的答案。

The solution is actually quite simple: np.clip delegates to the clip method of the first argument:

解决方案实际上非常简单:np.clip委托第一个参数的clip方法:

>>> np.clip??
Source:   
def clip(a, a_min, a_max, out=None):
    """
    ...
    """
    return _wrapfunc(a, 'clip', a_min, a_max, out=out)

>>> np.core.fromnumeric._wrapfunc??
Source:   
def _wrapfunc(obj, method, *args, **kwds):
    try:
        return getattr(obj, method)(*args, **kwds)
    # ...
    except (AttributeError, TypeError):
        return _wrapit(obj, method, *args, **kwds)

The getattr line of the _wrapfunc function is the important line here, because np.ndarray.clip and pd.Series.clip are different methods, yes, completely different methods:

_wrapfunc函数的getattr行是这里的重要行,因为np.ndarray.clip和pd.Series.clip是不同的方法,是的,完全不同的方法:

>>> np.ndarray.clip
<method 'clip' of 'numpy.ndarray' objects>
>>> pd.Series.clip
<function pandas.core.generic.NDFrame.clip>

Unfortunately is np.ndarray.clip a C-function so it's hard to profile it, however pd.Series.clip is a regular Python function so it's easy to profile. Let's use a Series of 5000 integers here:

不幸的是np.ndarray.clip是一个C函数,所以很难对它进行分析,但是pd.Series.clip是一个常规的Python函数,所以它很容易分析。我们在这里使用一系列5000个整数:

s = pd.Series(np.random.randint(0, 100, 5000))

For the np.clip on the values I get the following line-profiling:

对于值的np.clip,我得到以下行分析:

%load_ext line_profiler
%lprun -f np.clip -f np.core.fromnumeric._wrapfunc np.clip(s.values, a_min=None, a_max=1)

Timer unit: 4.10256e-07 s

Total time: 2.25641e-05 s
File: numpy\core\fromnumeric.py
Function: clip at line 1673

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1673                                           def clip(a, a_min, a_max, out=None):
  1674                                               """
  ...
  1726                                               """
  1727         1           55     55.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)

Total time: 1.51795e-05 s
File: numpy\core\fromnumeric.py
Function: _wrapfunc at line 55

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    55                                           def _wrapfunc(obj, method, *args, **kwds):
    56         1            2      2.0      5.4      try:
    57         1           35     35.0     94.6          return getattr(obj, method)(*args, **kwds)
    58                                           
    59                                               # An AttributeError occurs if the object does not have
    60                                               # such a method in its class.
    61                                           
    62                                               # A TypeError occurs if the object does have such a method
    63                                               # in its class, but its signature is not identical to that
    64                                               # of NumPy's. This situation has occurred in the case of
    65                                               # a downstream library like 'pandas'.
    66                                               except (AttributeError, TypeError):
    67                                                   return _wrapit(obj, method, *args, **kwds)

But for np.clip on the Series I get a totally different profiling result:

但是对于系列中的np.clip,我得到了一个完全不同的分析结果:

%lprun -f np.clip -f np.core.fromnumeric._wrapfunc -f pd.Series.clip -f pd.Series._clip_with_scalar np.clip(s, a_min=None, a_max=1)

Timer unit: 4.10256e-07 s

Total time: 0.000823794 s
File: numpy\core\fromnumeric.py
Function: clip at line 1673

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1673                                           def clip(a, a_min, a_max, out=None):
  1674                                               """
  ...
  1726                                               """
  1727         1         2008   2008.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)

Total time: 0.00081846 s
File: numpy\core\fromnumeric.py
Function: _wrapfunc at line 55

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    55                                           def _wrapfunc(obj, method, *args, **kwds):
    56         1            2      2.0      0.1      try:
    57         1         1993   1993.0     99.9          return getattr(obj, method)(*args, **kwds)
    58                                           
    59                                               # An AttributeError occurs if the object does not have
    60                                               # such a method in its class.
    61                                           
    62                                               # A TypeError occurs if the object does have such a method
    63                                               # in its class, but its signature is not identical to that
    64                                               # of NumPy's. This situation has occurred in the case of
    65                                               # a downstream library like 'pandas'.
    66                                               except (AttributeError, TypeError):
    67                                                   return _wrapit(obj, method, *args, **kwds)

Total time: 0.000804922 s
File: pandas\core\generic.py
Function: clip at line 4969

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  4969                                               def clip(self, lower=None, upper=None, axis=None, inplace=False,
  4970                                                        *args, **kwargs):
  4971                                                   """
  ...
  5021                                                   """
  5022         1           12     12.0      0.6          if isinstance(self, ABCPanel):
  5023                                                       raise NotImplementedError("clip is not supported yet for panels")
  5024                                           
  5025         1           10     10.0      0.5          inplace = validate_bool_kwarg(inplace, 'inplace')
  5026                                           
  5027         1           69     69.0      3.5          axis = nv.validate_clip_with_axis(axis, args, kwargs)
  5028                                           
  5029                                                   # GH 17276
  5030                                                   # numpy doesn't like NaN as a clip value
  5031                                                   # so ignore
  5032         1          158    158.0      8.1          if np.any(pd.isnull(lower)):
  5033         1            3      3.0      0.2              lower = None
  5034         1           26     26.0      1.3          if np.any(pd.isnull(upper)):
  5035                                                       upper = None
  5036                                           
  5037                                                   # GH 2747 (arguments were reversed)
  5038         1            1      1.0      0.1          if lower is not None and upper is not None:
  5039                                                       if is_scalar(lower) and is_scalar(upper):
  5040                                                           lower, upper = min(lower, upper), max(lower, upper)
  5041                                           
  5042                                                   # fast-path for scalars
  5043         1            1      1.0      0.1          if ((lower is None or (is_scalar(lower) and is_number(lower))) and
  5044         1           28     28.0      1.4                  (upper is None or (is_scalar(upper) and is_number(upper)))):
  5045         1         1654   1654.0     84.3              return self._clip_with_scalar(lower, upper, inplace=inplace)
  5046                                           
  5047                                                   result = self
  5048                                                   if lower is not None:
  5049                                                       result = result.clip_lower(lower, axis, inplace=inplace)
  5050                                                   if upper is not None:
  5051                                                       if inplace:
  5052                                                           result = self
  5053                                                       result = result.clip_upper(upper, axis, inplace=inplace)
  5054                                           
  5055                                                   return result

Total time: 0.000662153 s
File: pandas\core\generic.py
Function: _clip_with_scalar at line 4920

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  4920                                               def _clip_with_scalar(self, lower, upper, inplace=False):
  4921         1            2      2.0      0.1          if ((lower is not None and np.any(isna(lower))) or
  4922         1           25     25.0      1.5                  (upper is not None and np.any(isna(upper)))):
  4923                                                       raise ValueError("Cannot use an NA value as a clip threshold")
  4924                                           
  4925         1           22     22.0      1.4          result = self.values
  4926         1          571    571.0     35.4          mask = isna(result)
  4927                                           
  4928         1           95     95.0      5.9          with np.errstate(all='ignore'):
  4929         1            1      1.0      0.1              if upper is not None:
  4930         1          141    141.0      8.7                  result = np.where(result >= upper, upper, result)
  4931         1           33     33.0      2.0              if lower is not None:
  4932                                                           result = np.where(result <= lower, lower, result)
  4933         1           73     73.0      4.5          if np.any(mask):
  4934                                                       result[mask] = np.nan
  4935                                           
  4936         1           90     90.0      5.6          axes_dict = self._construct_axes_dict()
  4937         1          558    558.0     34.6          result = self._constructor(result, **axes_dict).__finalize__(self)
  4938                                           
  4939         1            2      2.0      0.1          if inplace:
  4940                                                       self._update_inplace(result)
  4941                                                   else:
  4942         1            1      1.0      0.1              return result

I stopped going into the subroutines at that point because it already highlights where the pd.Series.clip does much more work than the np.ndarray.clip. Just compare the total time of the np.clip call on the values (55 timer units) to one of the first checks in the pandas.Series.clip method, the if np.any(pd.isnull(lower)) (158 timer units). At that point the pandas method didn't even start at clipping and it already takes 3 times longer.

那时我停止进入子程序,因为它已经突出显示了pd.Series.clip比np.ndarray.clip做得更多的工作。只需将np.clip调用的总时间(55个计时器单位)与pandas.Series.clip方法中的第一个检查之一进行比较,if np.any(pd.isnull(lower))(158计时器)单位)。那时,熊猫方法甚至没有开始削波,它已经花了3倍的时间。

However several of these "overheads" become insignificant when the array is big:

然而,当阵列很大时,其中一些“开销”变得无关紧要:

s = pd.Series(np.random.randint(0, 100, 1000000))

%lprun -f np.clip -f np.core.fromnumeric._wrapfunc -f pd.Series.clip -f pd.Series._clip_with_scalar np.clip(s, a_min=None, a_max=1)

Timer unit: 4.10256e-07 s

Total time: 0.00593476 s
File: numpy\core\fromnumeric.py
Function: clip at line 1673

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1673                                           def clip(a, a_min, a_max, out=None):
  1674                                               """
  ...
  1726                                               """
  1727         1        14466  14466.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)

Total time: 0.00592779 s
File: numpy\core\fromnumeric.py
Function: _wrapfunc at line 55

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    55                                           def _wrapfunc(obj, method, *args, **kwds):
    56         1            1      1.0      0.0      try:
    57         1        14448  14448.0    100.0          return getattr(obj, method)(*args, **kwds)
    58                                           
    59                                               # An AttributeError occurs if the object does not have
    60                                               # such a method in its class.
    61                                           
    62                                               # A TypeError occurs if the object does have such a method
    63                                               # in its class, but its signature is not identical to that
    64                                               # of NumPy's. This situation has occurred in the case of
    65                                               # a downstream library like 'pandas'.
    66                                               except (AttributeError, TypeError):
    67                                                   return _wrapit(obj, method, *args, **kwds)

Total time: 0.00591302 s
File: pandas\core\generic.py
Function: clip at line 4969

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  4969                                               def clip(self, lower=None, upper=None, axis=None, inplace=False,
  4970                                                        *args, **kwargs):
  4971                                                   """
  ...
  5021                                                   """
  5022         1           17     17.0      0.1          if isinstance(self, ABCPanel):
  5023                                                       raise NotImplementedError("clip is not supported yet for panels")
  5024                                           
  5025         1           14     14.0      0.1          inplace = validate_bool_kwarg(inplace, 'inplace')
  5026                                           
  5027         1           97     97.0      0.7          axis = nv.validate_clip_with_axis(axis, args, kwargs)
  5028                                           
  5029                                                   # GH 17276
  5030                                                   # numpy doesn't like NaN as a clip value
  5031                                                   # so ignore
  5032         1          125    125.0      0.9          if np.any(pd.isnull(lower)):
  5033         1            2      2.0      0.0              lower = None
  5034         1           30     30.0      0.2          if np.any(pd.isnull(upper)):
  5035                                                       upper = None
  5036                                           
  5037                                                   # GH 2747 (arguments were reversed)
  5038         1            2      2.0      0.0          if lower is not None and upper is not None:
  5039                                                       if is_scalar(lower) and is_scalar(upper):
  5040                                                           lower, upper = min(lower, upper), max(lower, upper)
  5041                                           
  5042                                                   # fast-path for scalars
  5043         1            2      2.0      0.0          if ((lower is None or (is_scalar(lower) and is_number(lower))) and
  5044         1           32     32.0      0.2                  (upper is None or (is_scalar(upper) and is_number(upper)))):
  5045         1        14092  14092.0     97.8              return self._clip_with_scalar(lower, upper, inplace=inplace)
  5046                                           
  5047                                                   result = self
  5048                                                   if lower is not None:
  5049                                                       result = result.clip_lower(lower, axis, inplace=inplace)
  5050                                                   if upper is not None:
  5051                                                       if inplace:
  5052                                                           result = self
  5053                                                       result = result.clip_upper(upper, axis, inplace=inplace)
  5054                                           
  5055                                                   return result

Total time: 0.00575753 s
File: pandas\core\generic.py
Function: _clip_with_scalar at line 4920

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  4920                                               def _clip_with_scalar(self, lower, upper, inplace=False):
  4921         1            2      2.0      0.0          if ((lower is not None and np.any(isna(lower))) or
  4922         1           28     28.0      0.2                  (upper is not None and np.any(isna(upper)))):
  4923                                                       raise ValueError("Cannot use an NA value as a clip threshold")
  4924                                           
  4925         1          120    120.0      0.9          result = self.values
  4926         1         3525   3525.0     25.1          mask = isna(result)
  4927                                           
  4928         1           86     86.0      0.6          with np.errstate(all='ignore'):
  4929         1            2      2.0      0.0              if upper is not None:
  4930         1         9314   9314.0     66.4                  result = np.where(result >= upper, upper, result)
  4931         1           61     61.0      0.4              if lower is not None:
  4932                                                           result = np.where(result <= lower, lower, result)
  4933         1          283    283.0      2.0          if np.any(mask):
  4934                                                       result[mask] = np.nan
  4935                                           
  4936         1           78     78.0      0.6          axes_dict = self._construct_axes_dict()
  4937         1          532    532.0      3.8          result = self._constructor(result, **axes_dict).__finalize__(self)
  4938                                           
  4939         1            2      2.0      0.0          if inplace:
  4940                                                       self._update_inplace(result)
  4941                                                   else:
  4942         1            1      1.0      0.0              return result

There are still multiple function calls, for example isna and np.where, that take a significant amount of time, but overall this is at least comparable to the np.ndarray.clip time (that's in the regime where the timing difference is ~3 on my computer).

仍然存在多个函数调用,例如isna和np.where,这需要花费大量时间,但总体而言这至少与np.ndarray.clip时间相当(在时序差为~3的情况下)在我的电脑上)。

The takeaway should probably be:

外卖应该是:

  • Many NumPy functions just delegate to a method of the object passed in, so there can be huge differences when you pass in different objects.
  • 许多NumPy函数只是委托传入的对象的方法,因此传入不同的对象时可能会有很大的差异。
  • Profiling, especially line-profiling, can be a great tool to find the places where the performance difference comes from.
  • 分析,尤其是线性分析,可以成为找到性能差异来源的好工具。
  • Always make sure to test differently sized objects in such cases. You could be comparing constant factors that probably don't matter except if you process lots of small arrays.
  • 在这种情况下,务必确保测试不同尺寸的物体。您可以比较可能无关紧要的常数因素,除非您处理大量小数组。

Used versions:

二手版本:

Python 3.6.3 64-bit on Windows 10
Numpy 1.13.3
Pandas 0.21.1

#2


8  

Just read the source code, it's clear.

只需阅读源代码即可。

def clip(a, a_min, a_max, out=None):
    """a : array_like Array containing elements to clip."""
    return _wrapfunc(a, 'clip', a_min, a_max, out=out)

def _wrapfunc(obj, method, *args, **kwds):
    try:
        return getattr(obj, method)(*args, **kwds)
    #This situation has occurred in the case of
    # a downstream library like 'pandas'.
    except (AttributeError, TypeError):
        return _wrapit(obj, method, *args, **kwds)

def _wrapit(obj, method, *args, **kwds):
    try:
        wrap = obj.__array_wrap__
    except AttributeError:
        wrap = None
    result = getattr(asarray(obj), method)(*args, **kwds)
    if wrap:
        if not isinstance(result, mu.ndarray):
            result = asarray(result)
        result = wrap(result)
    return result

rectify:

after pandas v0.13.0_ahl1,pandas has it's own implement of clip.

在pandas v0.13.0_ahl1之后,pandas拥有它自己的clip实现。

#3


7  

There are two parts to the performance difference to be aware of here:

这里要注意性能差异有两个部分:

  • Python overhead in each library (pandas being extra helpful)
  • 每个库中的Python开销(熊猫更有帮助)
  • Difference in numeric algorithm implementation (pd.clip actually calls np.where)
  • 数值算法实现的差异(pd.clip实际调用np.where)

Running this on a very small array should demonstrate the difference in Python overhead. For numpy, this is understandably very small, however pandas does a lot of checking (null values, more flexible argument processing, etc) before getting to the heavy number crunching. I've tried to show a rough breakdown of the stages which the two codes go through before hitting C code bedrock.

在一个非常小的数组上运行它应该证明Python开销的差异。对于numpy,这是可以理解的非常小,但是在进行大量数字运算之前,pandas会进行大量检查(空值,更灵活的参数处理等)。我试图在击中C代码基岩之前显示两个代码经历的阶段的粗略分解。

data = pd.Series(np.random.random(100))

When using np.clip on an ndarray, the overhead is simply the numpy wrapper function calling the object's method:

在ndarray上使用np.clip时,开销只是调用对象方法的numpy包装函数:

>>> %timeit np.clip(data.values, 0.2, 0.8)        # numpy wrapper, calls .clip() on the ndarray
>>> %timeit data.values.clip(0.2, 0.8)            # C function call

2.22 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.32 µs ± 20.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Pandas spends more time checking for edge cases before getting to the algorithm:

在进入算法之前,Pandas花费更多时间检查边缘情况:

>>> %timeit np.clip(data, a_min=0.2, a_max=0.8)   # numpy wrapper, calls .clip() on the Series
>>> %timeit data.clip(lower=0.2, upper=0.8)       # pandas API method
>>> %timeit data._clip_with_scalar(0.2, 0.8)      # lowest level python function

102 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
90.4 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
73.7 µs ± 805 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Relative to overall time, the overhead of both libraries before hitting C code is pretty significant. For numpy, the single wrapping instruction takes as much time to execute as the numeric processing. Pandas has ~30x more overhead just in the first two layers of function calls.

相对于总时间,两个库在命中C代码之前的开销非常大。对于numpy,单个包装指令需要花费与数值处理相同的时间。在前两层函数调用中,Pandas的开销大约增加了30倍。

To isolate what is happening at the algorithm level, we should check this on a larger array and benchmark the same functions:

为了隔离算法级别发生的事情,我们应该在更大的阵列上检查它并对相同的函数进行基准测试:

>>> data = pd.Series(np.random.random(1000000))

>>> %timeit np.clip(data.values, 0.2, 0.8)
>>> %timeit data.values.clip(0.2, 0.8)

2.85 ms ± 37.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.85 ms ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit np.clip(data, a_min=0.2, a_max=0.8)
>>> %timeit data.clip(lower=0.2, upper=0.8)
>>> %timeit data._clip_with_scalar(0.2, 0.8)

12.3 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.3 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.2 ms ± 76.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The python overhead in both cases is now negligible; time for wrapper functions and argument checking is small relative to the calculation time on 1 million values. However there is a 3-4x speed difference which can be attributed to numeric implementation. By investigating a bit in the source code, we see that the pandas implementation of clip actually uses np.where, not np.clip:

两种情况下的python开销现在可以忽略不计;包装函数和参数检查的时间相对于100万个值的计算时间而言较小。然而,速度差异为3-4倍,这可归因于数字实现。通过调查源代码中的一点,我们看到clip的pandas实现实际上使用的是np.where,而不是np.clip:

def clip_where(data, lower, upper):
    ''' Actual implementation in pd.Series._clip_with_scalar (minus NaN handling). '''
    result = data.values
    result = np.where(result >= upper, upper, result)
    result = np.where(result <= lower, lower, result)
    return pd.Series(result)

def clip_clip(data, lower, upper):
    ''' What would happen if we used ndarray.clip instead. '''
    return pd.Series(data.values.clip(lower, upper))

The additional effort required to check each boolean condition separately before doing a conditional replace would seem to account for the speed difference. Specifying both upper and lower would result in 4 passes through the numpy array (two inequality checks and two calls to np.where). Benchmarking these two functions shows that 3-4x speed ratio:

在进行条件替换之前单独检查每个布尔条件所需的额外工作似乎考虑了速度差异。指定upper和lower将导致4次遍历numpy数组(两次不等式检查和两次调用np.where)。对这两个功能进行基准测试表明,3-4倍速比:

>>> %timeit clip_clip(data, lower=0.2, upper=0.8)
>>> %timeit clip_where(data, lower=0.2, upper=0.8)

11.1 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.97 ms ± 76.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I'm not sure why the pandas devs went with this implementation. np.clip may be a newer API function that previously required a workaround. There is also a little more to it than I've gone into here, since pandas checks for various case before running the final algorithm, and this is only one of the implementations that may be called.

我不确定为什么熊猫开发者会采用这种方式。 np.clip可能是以前需要解决方法的较新的API函数。还有一点比我进入这里还要多,因为在运行最终算法之前,pandas会检查各种情况,这只是可以调用的一个实现。

#4


5  

The reason why the performance differs is because numpy first tends to search for pandas implementation of the function using getattr than doing the same in builtin numpy functions when a pandas object is passed.

性能不同的原因是因为numpy首先倾向于使用getattr搜索函数的pandas实现,而不是在传递pandas对象时在内置numpy函数中执行相同操作。

Its not the numpy over the pandas object that is slow, its the pandas version.

它不是pandas对象的缓慢,它是熊猫版本。

When you do

当你这样做

np.clip(pd.Series([1,2,3,4,5]),a_min=None,amax=1)  

_wrapfunc is being called :

正在调用_wrapfunc:

# Code from source 
def _wrapfunc(obj, method, *args, **kwds):
    try:
        return getattr(obj, method)(*args, **kwds)

Due to _wrapfunc's getattr method :

由于_wrapfunc的getattr方法:

getattr(pd.Series([1,2,3,4,5]),'clip')(None, 1)
# Equivalent to `pd.Series([1,2,3,4,5]).clip(lower=None,upper=1)`
# 0    1
# 1    1
# 2    1
# 3    1
# 4    1
# dtype: int64

If you go through the pandas implementation there is a lot of pre checking work that is done. Its the reason why the functions which has the pandas implementation done via numpy has such difference in speed.

如果你通过pandas实现,那么有很多预检工作已经完成。它之所以通过numpy实现pandas实现的函数在速度上有这样的差异。

Not only clip, functions like cumsum,cumprod,reshape,searchsorted,transpose and much more uses pandas version of them than numpy when you pass them a pandas object.

当你传递一个pandas对象时,不仅剪辑,像cumsum,cumprod,reshape,searchsorted,transpose等功能使用pandas版本而不是numpy。

It might appear numpy is doing the work over those objects but under the hood its the pandas function.

可能看起来numpy正在对这些对象进行工作但在引擎盖下它的pandas功能。