I have a pandas DataFrame, st, containing multiple columns:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 53732 entries, 1993-01-07 12:23:58 to 2012-12-02 20:06:23
Data columns:
Date(dd-mm-yy)_Time(hh-mm-ss) 53732 non-null values
Julian_Day 53732 non-null values
AOT_1020 53716 non-null values
AOT_870 53732 non-null values
AOT_675 53188 non-null values
AOT_500 51687 non-null values
AOT_440 53727 non-null values
AOT_380 51864 non-null values
AOT_340 52852 non-null values
Water(cm) 51687 non-null values
%TripletVar_1020 53710 non-null values
%TripletVar_870 53726 non-null values
%TripletVar_675 53182 non-null values
%TripletVar_500 51683 non-null values
%TripletVar_440 53721 non-null values
%TripletVar_380 51860 non-null values
%TripletVar_340 52846 non-null values
440-870Angstrom 53732 non-null values
380-500Angstrom 52253 non-null values
440-675Angstrom 53732 non-null values
500-870Angstrom 53732 non-null values
340-440Angstrom 53277 non-null values
Last_Processing_Date(dd/mm/yyyy) 53732 non-null values
Solar_Zenith_Angle 53732 non-null values
dtypes: datetime64[ns](1), float64(22), object(1)
I want to create two new columns for this DataFrame by applying a function to each row. I don't want to call the function more than once per row (e.g. by doing two separate apply calls), as it is rather computationally intensive. I have tried doing this in two ways, and neither of them works:
Using apply:
I have written a function which takes a Series and returns a tuple of the values I want:
def calculate(s):
    a = s['path'] + 2*s['row']  # Simple calc for example
    b = s['path'] * 0.153
    return (a, b)
Trying to apply this to the DataFrame gives an error:
st.apply(calculate, axis=1)
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-248-acb7a44054a7> in <module>()
----> 1 st.apply(calculate, axis=1)
C:\Python27\lib\site-packages\pandas\core\frame.pyc in apply(self, func, axis, broadcast, raw, args, **kwds)
4191 return self._apply_raw(f, axis)
4192 else:
-> 4193 return self._apply_standard(f, axis)
4194 else:
4195 return self._apply_broadcast(f, axis)
C:\Python27\lib\site-packages\pandas\core\frame.pyc in _apply_standard(self, func, axis, ignore_failures)
4274 index = None
4275
-> 4276 result = self._constructor(data=results, index=index)
4277 result.rename(columns=dict(zip(range(len(res_index)), res_index)),
4278 inplace=True)
C:\Python27\lib\site-packages\pandas\core\frame.pyc in __init__(self, data, index, columns, dtype, copy)
390 mgr = self._init_mgr(data, index, columns, dtype=dtype, copy=copy)
391 elif isinstance(data, dict):
--> 392 mgr = self._init_dict(data, index, columns, dtype=dtype)
393 elif isinstance(data, ma.MaskedArray):
394 mask = ma.getmaskarray(data)
C:\Python27\lib\site-packages\pandas\core\frame.pyc in _init_dict(self, data, index, columns, dtype)
521
522 return _arrays_to_mgr(arrays, data_names, index, columns,
--> 523 dtype=dtype)
524
525 def _init_ndarray(self, values, index, columns, dtype=None,
C:\Python27\lib\site-packages\pandas\core\frame.pyc in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
5411
5412 # consolidate for now
-> 5413 mgr = BlockManager(blocks, axes)
5414 return mgr.consolidate()
5415
C:\Python27\lib\site-packages\pandas\core\internals.pyc in __init__(self, blocks, axes, do_integrity_check)
802
803 if do_integrity_check:
--> 804 self._verify_integrity()
805
806 self._consolidate_check()
C:\Python27\lib\site-packages\pandas\core\internals.pyc in _verify_integrity(self)
892 "items")
893 if block.values.shape[1:] != mgr_shape[1:]:
--> 894 raise AssertionError('Block shape incompatible with manager')
895 tot_items = sum(len(x.items) for x in self.blocks)
896 if len(self.items) != tot_items:
AssertionError: Block shape incompatible with manager
I was then going to assign the values returned from apply to two new columns using the method shown in this question. However, I can't even get to this point! This all works fine if I just return one value.
Using a loop:
I first created two new columns of the DataFrame and set them to None:
st['a'] = None
st['b'] = None
Then I looped over all of the indices and tried to modify these None values, but the modifications didn't seem to work. That is, no error was generated, but the DataFrame didn't seem to be modified.
for i in st.index:
    # do calc here
    st.ix[i]['a'] = a
    st.ix[i]['b'] = b
I thought that both of these methods would work, but neither of them did. So, what am I doing wrong here? And what is the best, most 'pythonic' and 'pandaonic' way to do this?
4 Answers
#1
25
To make the first approach work, try returning a Series instead of a tuple (apply is throwing an exception because it doesn't know how to glue the rows back together as the number of columns doesn't match the original frame).
import pandas as pd

def calculate(s):
    a = s['path'] + 2*s['row']  # Simple calc for example
    b = s['path'] * 0.153
    # The keys of the returned Series become the column names of the result
    return pd.Series(dict(col1=a, col2=b))
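A minimal sketch of how the returned Series could then be attached as two new columns with a single apply call. The small frame and the names demo, a and b are made up for illustration; path and row are the column names used in the question's example function.

```python
import pandas as pd

def calculate(s):
    # Same example calculation as in the question
    a = s['path'] + 2 * s['row']
    b = s['path'] * 0.153
    # Returning a Series makes apply build one output column per key
    return pd.Series({'a': a, 'b': b})

# Hypothetical stand-in for the question's DataFrame
demo = pd.DataFrame({'path': [1.0, 2.0, 3.0], 'row': [10.0, 20.0, 30.0]})

# calculate runs once per row; result is a DataFrame with columns 'a' and 'b'
result = demo.apply(calculate, axis=1)

# Attach the new columns, aligned on the index
demo = demo.join(result)
```

join aligns on the index, so the new columns end up next to the rows they were computed from.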
The second approach should work if you replace:
st.ix[i]['a'] = a
with:
st.ix[i, 'a'] = a
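The chained form st.ix[i]['a'] = a first pulls out row i, which is often a copy, and then assigns into that temporary object, so the original frame is left unchanged. Note also that .ix was removed in pandas 1.0. A rough equivalent of the loop for current pandas, assuming the index labels are unique, could use .loc with a single (row, column) indexer. The frame demo and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the question's DataFrame
demo = pd.DataFrame({'path': [1.0, 2.0], 'row': [10.0, 20.0]}, index=['x', 'y'])

# Create the target columns up front, filled with NaN
demo['a'] = float('nan')
demo['b'] = float('nan')

for i in demo.index:
    # Do the calculation once per row
    a = demo.loc[i, 'path'] + 2 * demo.loc[i, 'row']
    b = demo.loc[i, 'path'] * 0.153
    # One indexing operation per assignment, so the frame itself is modified
    demo.loc[i, 'a'] = a
    demo.loc[i, 'b'] = b
```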
#2
12
I always use lambdas and the built-in map() function to create new columns by combining other columns:
# list() keeps this working on Python 3, where map() returns a lazy iterator
st['a'] = list(map(lambda path, row: path + 2 * row, st['path'], st['row']))
It might be slightly more complicated than necessary for linear combinations of numerical columns. On the other hand, I feel it's a good convention to adopt, as it can be used for more complicated combinations of columns (e.g. working with strings) or for filling missing data in a column using functions of the other columns.
For example, let's say you have a table with columns gender and title, and some of the titles are missing. You can fill them with a function as follows:
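import pandas as pd

title_dict = {'male': 'mr.', 'female': 'ms.'}
# Keep an existing title; otherwise look one up from the gender column.
# pd.notnull also catches NaN, which pandas commonly uses for missing values.
table['title'] = list(map(
    lambda title, gender: title if pd.notnull(title) else title_dict[gender],
    table['title'], table['gender']))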
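The same idea can be stretched to the original question, where one expensive call should yield two columns. This is only a sketch under assumptions: calculate here is a hypothetical two-argument variant of the question's function, and demo is a made-up frame.

```python
import pandas as pd

def calculate(path, row):
    # Hypothetical two-argument version of the question's function
    return path + 2 * row, path * 0.153

demo = pd.DataFrame({'path': [1.0, 2.0, 3.0], 'row': [10.0, 20.0, 30.0]})

# Call the function exactly once per row, collecting (a, b) pairs
pairs = [calculate(p, r) for p, r in zip(demo['path'], demo['row'])]

# Transpose the list of pairs into one sequence per new column
a_values, b_values = zip(*pairs)
demo['a'] = list(a_values)
demo['b'] = list(b_values)
```

Because the values are assigned positionally, this relies on the pairs being in the same order as the rows of the frame.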
#3
5
This was solved here: Apply pandas function to column to create multiple new columns?
Applied to your question, this should work:
import pandas as pd

def calculate(s):
    a = s['path'] + 2*s['row']  # Simple calc for example
    b = s['path'] * 0.153
    return pd.Series({'col1': a, 'col2': b})

# Apply once per row, then attach the resulting columns by index
df = df.merge(df.apply(calculate, axis=1), left_index=True, right_index=True)
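In pandas 0.23 and later, apply also accepts result_type='expand', which turns a tuple returned by the function into separate columns, so the question's original tuple-returning function can be kept as is. A small sketch, with demo and the column names a and b made up for illustration:

```python
import pandas as pd

def calculate(s):
    a = s['path'] + 2 * s['row']
    b = s['path'] * 0.153
    return a, b  # plain tuple, as in the question

demo = pd.DataFrame({'path': [1.0, 2.0, 3.0], 'row': [10.0, 20.0, 30.0]})

# Each returned tuple is expanded into columns labelled 0 and 1
expanded = demo.apply(calculate, axis=1, result_type='expand')
expanded.columns = ['a', 'b']

# Attach the new columns next to the original ones
demo = pd.concat([demo, expanded], axis=1)
```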
#4
0
Yet another solution based on Assigning New Columns in Method Chains:
st.assign(a = st['path'] + 2*st['row'], b = st['path'] * 0.153)
Be aware that assign always returns a copy of the data, leaving the original DataFrame untouched.