How to apply a pandas function to multiple columns?

Time: 2021-03-29 22:31:03

I am having trouble with the pandas apply function when it needs to use multiple columns of the following DataFrame:

import numpy as np
from pandas import DataFrame

df = DataFrame({'a': np.random.randn(6),
                'b': ['foo', 'bar'] * 3,
                'c': np.random.randn(6)})

and the following function:

def my_test(a, b):
    return a % b

When I try to apply this function with:

df['Value'] = df.apply(lambda row: my_test(row[a], row[c]), axis=1)

I get the error message:


NameError: ("global name 'a' is not defined", u'occurred at index 0')

I do not understand this message; I thought I had defined the name properly.

I would greatly appreciate any help with this issue.

Update


Thanks for your help. I did indeed make some syntax mistakes in the code: the column names should be quoted with ''. However, I still have the same issue with a more complex function such as:

def my_test(a):
    cum_diff = 0
    for ix in df.index():
        cum_diff = cum_diff + (a - df['a'][ix])
    return cum_diff 

Thank you


4 Solutions

#1


234  

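It seems you forgot the quotes around the column names. Without them, row[a] looks up a variable named a instead of the column 'a', which is why Python raises a NameError.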

In [43]: df['Value'] = df.apply(lambda row: my_test(row['a'], row['c']), axis=1)

In [44]: df
Out[44]:
                    a    b         c     Value
          0 -1.674308  foo  0.343801  0.044698
          1 -2.163236  bar -2.046438 -0.116798
          2 -0.199115  foo -0.458050 -0.199115
          3  0.918646  bar -0.007185 -0.001006
          4  1.336830  foo  0.534292  0.268245
          5  0.976844  bar -0.773630 -0.570417

By the way, in my opinion the following way is more elegant:

In [53]: def my_test2(row):
....:     return row['a'] % row['c']
....:     

In [54]: df['Value'] = df.apply(my_test2, axis=1)
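For the more complex function in the question's update, a likely culprit is `df.index()`: `index` is an attribute, not a method, so calling it raises a TypeError. The sketch below is one possible repair, not the asker's confirmed intent. It assumes the goal is, for each value in column 'a', the sum of its differences from every value in that column. The seeded `rng` and the names `looped` and `vectorized` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Seed chosen arbitrarily so the example is reproducible (an assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.standard_normal(6),
                   'b': ['foo', 'bar'] * 3,
                   'c': rng.standard_normal(6)})

def my_test(a):
    # Sum of the differences between `a` and every value in column 'a'.
    cum_diff = 0
    for ix in df.index:  # df.index is an attribute; df.index() would fail
        cum_diff = cum_diff + (a - df['a'][ix])
    return cum_diff

# Only one column is involved, so apply over that Series instead of over rows.
looped = df['a'].apply(my_test)

# The same quantity in closed form: n * a - sum(a), for all rows at once.
vectorized = len(df) * df['a'] - df['a'].sum()

print(np.allclose(looped, vectorized))
```

The closed form avoids the nested loop entirely, which matters as the frame grows, since the looped version does work proportional to the square of the number of rows.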

#2


22  

If you just want to compute (column a) % (column c), you don't need apply; just do it directly:

In [7]: df['a'] % df['c']
Out[7]:
0   -1.132022
1   -0.939493
2    0.201931
3    0.511374
4   -0.694647
5   -0.023486
Name: a
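As a quick check that the direct expression gives the same values as the row-wise apply, here is a minimal sketch. The seeded data and the names `via_apply` and `direct` are assumptions made for illustration, so the numbers will differ from the output shown above.

```python
import numpy as np
import pandas as pd

# Seed chosen arbitrarily so the example is reproducible (an assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.standard_normal(6),
                   'b': ['foo', 'bar'] * 3,
                   'c': rng.standard_normal(6)})

# Row-wise version: the lambda is called once per row.
via_apply = df.apply(lambda row: row['a'] % row['c'], axis=1).astype(float)

# Column-wise version: one operation over the two whole columns.
direct = df['a'] % df['c']

print(np.allclose(via_apply, direct))
```

Both forms follow Python's modulo convention, where the result takes the sign of the divisor.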

#3


9  

Let's say we want to apply a function add5 to the numeric columns 'a' and 'c' of DataFrame df (column 'b' holds strings, so adding 5 to it would fail):

def add5(x):
    return x+5

df[['a', 'c']].apply(add5)
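A self-contained sketch of this column-wise pattern follows. Since add5 only makes sense for numbers and column 'b' of the question's frame holds strings, the sketch selects 'a' and 'c'. The seeded data and the names `numeric`, `applied` and `direct` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Seed chosen arbitrarily so the example is reproducible (an assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.standard_normal(6),
                   'b': ['foo', 'bar'] * 3,
                   'c': rng.standard_normal(6)})

def add5(x):
    return x + 5

# Select only the numeric columns before applying.
numeric = df[['a', 'c']]

# With the default axis=0, add5 receives each column as a whole Series.
applied = numeric.apply(add5)

# For simple arithmetic, operating on the sub-frame directly is equivalent.
direct = numeric + 5

print(applied.equals(direct))
```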

#4


0  

All of the suggestions above work, but if you want your computations to be more efficient, you should take advantage of numpy vector operations.

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.random.randn(6),
                   'b': ['foo', 'bar'] * 3,
                   'c': np.random.randn(6)})


#######
# pandas.apply()
%%timeit

def my_test2(row):
    return row['a'] % row['c']

df['Value'] = df.apply(my_test2, axis=1)

The slowest run took 7.49 times longer than the fastest. This could mean that an intermediate result is being cached. 1000 loops, best of 3: 481 µs per loop


############
# vectorized pandas Series operation (replaces pandas.apply())
%%timeit

df['a'] % df['c']

The slowest run took 458.85 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 70.9 µs per loop


#############
# vectorize numpy arrays
%%timeit

df['a'].values % df['c'].values

The slowest run took 7.98 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 6.39 µs per loop


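So vectorizing with numpy arrays improved the speed by almost two orders of magnitude compared with apply (481 µs versus 6.39 µs per loop, roughly 75 times faster), and by about a factor of 11 compared with the pandas Series operation (70.9 µs).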
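The %%timeit cells above only run inside IPython or Jupyter. As a rough sketch, the same comparison can be reproduced in a plain script with the standard-library timeit module. The frame size `n`, the repetition count, and the helper names `with_apply`, `with_pandas` and `with_numpy` are assumptions made for illustration. Absolute timings depend on the machine and library versions, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Seed and size chosen arbitrarily for illustration (assumptions).
rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({'a': rng.standard_normal(n),
                   'b': ['foo', 'bar'] * (n // 2),
                   'c': rng.standard_normal(n)})

def with_apply():
    # Row-wise: calls the lambda once per row.
    return df.apply(lambda row: row['a'] % row['c'], axis=1)

def with_pandas():
    # Vectorized on pandas Series (includes index alignment overhead).
    return df['a'] % df['c']

def with_numpy():
    # Vectorized on the underlying numpy arrays.
    return df['a'].to_numpy() % df['c'].to_numpy()

repeats = 10
for fn in (with_apply, with_pandas, with_numpy):
    seconds = timeit.timeit(fn, number=repeats) / repeats
    print(f"{fn.__name__}: {seconds * 1e6:.1f} µs per call")
```

Note that the numpy version returns a bare array without the index, so it suits cases where the result is assigned straight back into the same frame.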
