When creating a new column in a Python/Pandas DataFrame, is there a way to avoid typing the DataFrame name, brackets, and quotes?

Suppose I have a Python/Pandas DataFrame called df1 with columns a and b, each holding a single record (a = 1 and b = 2). I want to create a third column, c, whose value equals a + b, that is, 3.

Using Pandas, I'd write:

df1['c'] = df1['a'] + df1['b']

I'd prefer just to write something simpler and easier to read, like the following:

with df1:
    c = a + b

SAS allows this simpler syntax in its "data step". I would love it if Python/Pandas had something similar.

Thanks a lot! Sean

2 answers

#1

Short answer: no. pandas is constrained by Python's syntax rules. The statement c = a + b requires a and b to be names in the current namespace and binds c there as well, and it is not a good idea for a library to modify the global namespace like that (what if you already have those names? What happens if there is a conflict?). That rules out the "no quotes" part.

With quotes, you have some options. For adding a new column, you can use eval:

df.eval('c = a + b')

The eval method evaluates the expression passed as a string. In this case, it returns a copy of the original DataFrame with the new column added; the original is left unchanged unless inplace=True is passed. eval is quite limited, though; see the docs for its usage and limitations.
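
As a minimal sketch of that copy behaviour, assuming the one-row df1 from the question (the name result is only illustrative):

```python
import pandas as pd

# One-row frame matching the question's example.
df1 = pd.DataFrame({"a": [1], "b": [2]})

# With the default arguments, eval returns a new DataFrame
# and leaves df1 itself untouched.
result = df1.eval("c = a + b")

print(list(df1.columns))     # original columns only
print(list(result.columns))  # original columns plus c
```

To keep the new column you either rebind the name (df1 = df1.eval(...)) or pass inplace=True.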

For adding a new column, another option is assign. It is designed to add new columns on the fly but since it allows callables, you can also write things like:

very_long_data_frame_name.assign(new_column=lambda x: x['col1'] + x['col2'])

This is an alternative to the following:

very_long_data_frame_name['col1'] + very_long_data_frame_name['col2']
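
A small runnable sketch of the assign form, assuming a frame with the hypothetical columns col1 and col2 used above. Like eval, assign returns a new DataFrame rather than changing the one it is called on:

```python
import pandas as pd

# Hypothetical data; only the column names come from the answer.
very_long_data_frame_name = pd.DataFrame({"col1": [1], "col2": [2]})

# The lambda receives the DataFrame as x, so the long name is typed once.
result = very_long_data_frame_name.assign(
    new_column=lambda x: x["col1"] + x["col2"]
)

print(result)
```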

pandas also adds column names as attributes to the DataFrame if the column name is a valid Python identifier. That allows using the dot notation as juanpa.arrivillaga also mentioned:

df['c'] = df.a + df.b

Note that for non-existing columns you still have to use the brackets (see the left hand side of the assignment). If you already have a column named c, you can use df.c on the left side too.
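
The sketch below illustrates why the brackets are needed on the left-hand side. It assumes the one-row frame from the question; the names df_existing and created_by_attribute are only illustrative. Assigning to an attribute that is not yet a column sets a plain Python attribute (pandas warns about this), while assigning to an attribute that already names a column updates that column:

```python
import warnings

import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})

with warnings.catch_warnings():
    # pandas emits a UserWarning for this pattern; silenced for the demo.
    warnings.simplefilter("ignore")
    df.c = df.a + df.b  # sets an attribute on the object, not a column

created_by_attribute = "c" in df.columns

# Separate frame for the "column already exists" case.
df_existing = pd.DataFrame({"a": [1], "b": [2]})
df_existing["c"] = df_existing.a + df_existing.b  # brackets create the column
df_existing.c = df_existing.c * 10                # now dot assignment updates it

print(created_by_attribute)
print(df_existing)
```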

Similar to eval, there is a query method for selection. It doesn't add a new column but queries the DataFrame by parsing the string passed to it. The string, again, should be a valid Python expression.
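
For illustration, a minimal query example on a hypothetical two-row frame (the data and the name selected are assumptions, not part of the question):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5], "b": [2, 3]})

# Column names are used bare inside the string, as with eval.
# query returns the matching rows and leaves df unchanged.
selected = df.query("a < b")

print(selected)
```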

#2

Use the DataFrame.eval() method:

Demo:

In [16]: import pandas as pd

In [17]: df = pd.DataFrame({'a':[1], 'b':[2]})

In [18]: df
Out[18]:
   a  b
0  1  2

In [19]: df.eval("c = a + b", inplace=True)

In [20]: df
Out[20]:
   a  b  c
0  1  2  3
