Suppose I have a Python/pandas DataFrame called df1 with columns a and b, each with only one record (a = 1 and b = 2). I want to create a third column, c, whose value equals a + b, i.e. 3.
Using Pandas, I'd write:
df1['c'] = df1['a'] + df1['b']
I'd prefer just to write something simpler and easier to read, like the following:
with df1:
    c = a + b
SAS allows this simpler syntax in its "data step". I would love it if Python/Pandas had something similar.
Thanks a lot! Sean
2 Answers
#1
Short answer: no. pandas is constrained by Python's syntax rules. The expression c = a + b requires a, b, and c to be names in the global namespace, and it is not a good idea for a library to modify the global namespace like that (what if you already have those names? What happens if there is a conflict?). That rules out the "no quotes" part.
With quotes, you have some options. For adding a new column, you can use eval:
df.eval('c = a + b')
The eval method evaluates the expression passed as a string. In this case, it returns a copy of the original DataFrame with the new column added; the original is left unchanged unless you pass inplace=True. eval is quite limited, though; see the docs for its usage and limitations.
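To illustrate the copy behaviour, here is a minimal sketch. The frame and the variable name with_c are made up for the example and are not from the question.

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})

# By default, eval returns a new DataFrame; df itself is not modified.
with_c = df.eval("c = a + b")

print(list(df.columns))      # original columns only
print(list(with_c.columns))  # includes the new column c

# To keep the column, rebind the name (or pass inplace=True instead).
df = df.eval("c = a + b")
```

Rebinding the name keeps the call chainable, while inplace=True modifies the existing object and returns None.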
For adding a new column, another option is assign. It is designed to add new columns on the fly, but since it allows callables, you can also write things like:
very_long_data_frame_name.assign(new_column=lambda x: x['col1'] + x['col2'])
This is an alternative to the following:
very_long_data_frame_name['col1'] + very_long_data_frame_name['col2']
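A small runnable sketch of the assign approach, with invented data and a hypothetical result name (out). Like eval, assign returns a new DataFrame rather than changing the one it is called on.

```python
import pandas as pd

# Illustrative data; the column names follow the snippet above.
very_long_data_frame_name = pd.DataFrame({"col1": [1, 2], "col2": [10, 20]})

# The lambda receives the DataFrame, so the long name is not repeated.
out = very_long_data_frame_name.assign(
    new_column=lambda x: x["col1"] + x["col2"]
)

print(out)
print("new_column" in very_long_data_frame_name.columns)  # original unchanged
```

Because the callable is handed the frame at call time, this also works in the middle of a method chain where the intermediate result has no name at all.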
pandas also exposes column names as attributes of the DataFrame if the column name is a valid Python identifier. That allows dot notation, as juanpa.arrivillaga also mentioned:
df['c'] = df.a + df.b
Note that for non-existing columns you still have to use the brackets (see the left-hand side of the assignment). If you already have a column named c, you can use df.c on the left side too.
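The difference between the two cases can be seen in a short sketch. The column name d is hypothetical, chosen only to show what happens when the column does not exist yet.

```python
import warnings
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})

# Attribute assignment to a name that is not yet a column does NOT create
# a column; pandas sets a plain Python attribute and emits a UserWarning.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.d = df.a + df.b
print("d" in df.columns)  # False

# Create the column with brackets first...
df["c"] = df.a + df.b
# ...after which dot notation on the left-hand side updates that column.
df.c = df.c * 10
print(df["c"].tolist())
```

For this reason, brackets are the safer habit for assignment, with dot notation reserved for reading.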
Similar to eval, there is a query method for selection. It doesn't add a new column but queries the DataFrame by parsing the string passed to it. The string, again, should be a valid Python expression.
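As a hedged illustration of query, the sketch below uses invented data and a hypothetical local variable named threshold. Local variables are referenced inside the string with an @ prefix.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 3], "b": [2, 1, 3]})

# Column names are used bare inside the string, as with eval.
selected = df.query("a > b")

# The equivalent boolean-mask selection, for comparison.
same = df[df["a"] > df["b"]]

# Local Python variables are referenced with an @ prefix.
threshold = 2
above = df.query("a > @threshold")

print(selected)
print(above)
```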
#2
Use the DataFrame.eval() method:
Demo:
In [17]: df = pd.DataFrame({'a':[1], 'b':[2]})
In [18]: df
Out[18]:
   a  b
0  1  2
In [19]: df.eval("c = a + b", inplace=True)
In [20]: df
Out[20]:
   a  b  c
0  1  2  3