Adding a new column to an existing DataFrame in Python pandas

Time: 2021-09-26 08:12:05

I have the following indexed DataFrame with named columns and non-continuous row numbers:

          a         b         c         d
2  0.671399  0.101208 -0.181532  0.241273
3  0.446172 -0.243316  0.051767  1.577318
5  0.614758  0.075793 -0.451460 -0.012493

I would like to add a new column, 'e', to the existing data frame without changing anything else in it (i.e., the new column always has the same length as the DataFrame).

0   -0.335485
1   -1.166658
2   -0.385571
dtype: float64

I tried different versions of join, append, and merge, but I did not get the result I wanted, only errors at best. How can I add column e to the above example?

20 answers

#1


612  

Use the original df1 index to create the Series:

df1['e'] = Series(np.random.randn(sLength), index=df1.index)


Edit 2015
Some users reported getting the SettingWithCopyWarning with this code.
However, the code still runs fine with the current pandas version, 0.16.1.

>>> sLength = len(df1['a'])
>>> df1
          a         b         c         d
6 -0.269221 -0.026476  0.997517  1.294385
8  0.917438  0.847941  0.034235 -0.448948

>>> df1['e'] = p.Series(np.random.randn(sLength), index=df1.index)
>>> df1
          a         b         c         d         e
6 -0.269221 -0.026476  0.997517  1.294385  1.757167
8  0.917438  0.847941  0.034235 -0.448948  2.228131

>>> p.version.short_version
'0.16.1'

The SettingWithCopyWarning is meant to flag a possibly invalid assignment on a copy of the DataFrame. It doesn't necessarily mean you did something wrong (it can trigger false positives), but since 0.13.0 it lets you know that there are more suitable methods for the same purpose. So if you get the warning, just follow its advice: Try using .loc[row_index,col_indexer] = value instead

>>> df1.loc[:,'f'] = p.Series(np.random.randn(sLength), index=df1.index)
>>> df1
          a         b         c         d         e         f
6 -0.269221 -0.026476  0.997517  1.294385  1.757167 -0.050927
8  0.917438  0.847941  0.034235 -0.448948  2.228131  0.006109
>>> 

In fact, this is currently the more efficient method, as described in the pandas docs.



Edit 2017

As indicated in the comments and by @Alexander, currently the best method to add the values of a Series as a new column of a DataFrame may be to use assign:

df1 = df1.assign(e=p.Series(np.random.randn(sLength)).values)
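
The three spellings discussed in this answer can be put side by side in a minimal sketch. It assumes a small frame that mimics the question's non-continuous index; the data and the names new_values and via_assign are made up for illustration.

```python
import numpy as np
import pandas as pd

# Frame with a non-continuous index, like the one in the question.
df1 = pd.DataFrame({"a": [0.67, 0.45, 0.61], "b": [0.10, -0.24, 0.08]},
                   index=[2, 3, 5])
new_values = np.array([-0.335485, -1.166658, -0.385571])  # illustrative data

# 1. Plain column assignment with a Series that reuses df1's index.
df1["e"] = pd.Series(new_values, index=df1.index)

# 2. The .loc spelling suggested by the warning message.
df1.loc[:, "f"] = pd.Series(new_values, index=df1.index)

# 3. assign returns a new frame; .values drops the Series index.
via_assign = df1.assign(g=pd.Series(new_values).values)

print(via_assign)
```

All three columns end up with the same values, because in each case the index of the new data either matches df1.index or is discarded.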

#2


129  

This is the simple way of adding a new column: df['e'] = e
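
One caveat is worth sketching here: if e is a Series, this assignment aligns on index labels rather than on position. The snippet rebuilds part of the question's data (only columns a and b, for brevity) and compares label-based and positional assignment; the names aligned and positional are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": [0.671399, 0.446172, 0.614758],
                   "b": [0.101208, -0.243316, 0.075793]},
                  index=[2, 3, 5])
e = pd.Series([-0.335485, -1.166658, -0.385571])  # default index 0, 1, 2

# Aligned on labels: only label 2 exists on both sides, so rows 3 and 5 get NaN.
aligned = df.assign(e=e)["e"]

# Positional: .values discards e's index, so every row gets a value.
positional = df.assign(e=e.values)["e"]

print(aligned)
print(positional)
```

So df['e'] = e is only "simple" when e shares the DataFrame's index, or when e is a plain list or array.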

#3


81  

I would like to add a new column, 'e', to the existing data frame and do not change anything in the data frame. (The series always has the same length as the dataframe.)

I assume that the index values in e match those in df1.

The easiest way to create a new column named e and assign it the values from your series e:

df['e'] = e.values

assign (Pandas 0.16.0+)

As of Pandas 0.16.0, you can also use assign, which assigns new columns to a DataFrame and returns a new object (a copy) with all the original columns in addition to the new ones.

df1 = df1.assign(e=e.values)
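
To illustrate the "returns a new object" part, here is a small sketch with made-up data. The name result is an assumption for the example; the original frame is left untouched unless you rebind it as in the line above.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[2, 3, 5])
e = pd.Series([10.0, 20.0, 30.0])

result = df1.assign(e=e.values)   # new DataFrame with the extra column

print("e" in df1.columns)     # False: the original is untouched
print("e" in result.columns)  # True
```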

As per this example (which also includes the source code of the assign function), you can also include more than one column:

>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df.assign(mean_a=df.a.mean(), mean_b=df.b.mean())
   a  b  mean_a  mean_b
0  1  3     1.5     3.5
1  2  4     1.5     3.5

In context with your example:

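import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])
# Flag every value below -0.7, then drop the rows that contain any of them.
mask = df1 < -0.7
df1 = df1[~mask.any(axis=1)]  # ~ negates booleans; unary minus is not supported on them in recent versions
sLength = len(df1['a'])
e = pd.Series(np.random.randn(sLength))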

>>> df1
          a         b         c         d
0  1.764052  0.400157  0.978738  2.240893
2 -0.103219  0.410599  0.144044  1.454274
3  0.761038  0.121675  0.443863  0.333674
7  1.532779  1.469359  0.154947  0.378163
9  1.230291  1.202380 -0.387327 -0.302303

>>> e
0   -1.048553
1   -1.420018
2   -1.706270
3    1.950775
4   -0.509652
dtype: float64

df1 = df1.assign(e=e.values)

>>> df1
          a         b         c         d         e
0  1.764052  0.400157  0.978738  2.240893 -1.048553
2 -0.103219  0.410599  0.144044  1.454274 -1.420018
3  0.761038  0.121675  0.443863  0.333674 -1.706270
7  1.532779  1.469359  0.154947  0.378163  1.950775
9  1.230291  1.202380 -0.387327 -0.302303 -0.509652

The description of this new feature when it was first introduced can be found here.

#4


32  

Doing this directly via NumPy will be the most efficient:

df1['e'] = np.random.randn(sLength)

Note my original (very old) suggestion was to use map (which is much slower):

df1['e'] = df1['a'].map(lambda x: np.random.random())
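
As a hedged illustration of what assigning a NumPy array does: the values are placed row by row, the DataFrame's index plays no role, and the only requirement is that the length matches. The column name too_short and the variable length_error are made up for this sketch.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[2, 3, 5])

df1["e"] = np.arange(len(df1))       # positional, the index plays no role

try:
    df1["too_short"] = np.arange(2)  # wrong length for a 3-row frame
    length_error = None
except ValueError as exc:
    length_error = exc

print(df1)
print(length_error)
```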

#5


25  

It seems that in recent Pandas versions the way to go is to use df.assign:

df1 = df1.assign(e=np.random.randn(sLength))

It doesn't produce SettingWithCopyWarning.

#6


13  

I got the dreaded SettingWithCopyWarning, and it wasn't fixed by using the iloc syntax. My DataFrame was created by read_sql from an ODBC source. Using a suggestion by lowtech above, the following worked for me:

df.insert(len(df.columns), 'e', pd.Series(np.random.randn(sLength),  index=df.index))

This worked fine to insert the column at the end. I don't know if it is the most efficient, but I don't like warning messages. I think there is a better solution, but I can't find it, and I think it depends on some aspect of the index.
Note: this only works once and will give an error message if you try to overwrite an existing column.
Note: as above, from 0.16.0 onward assign is the best solution. See the documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html#pandas.DataFrame.assign It works well for data-flow style code where you don't overwrite your intermediate values.
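
The "only works once" behaviour can be sketched as follows. The frame, the variable values, and second_insert_error are made up for illustration; the point is that a second insert under the same name raises instead of overwriting.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=[6, 8])
values = pd.Series(np.random.randn(len(df)), index=df.index)

df.insert(len(df.columns), "e", values)      # appends "e" as the last column

try:
    df.insert(len(df.columns), "e", values)  # same name again
    second_insert_error = None
except ValueError as exc:
    second_insert_error = exc

print(df.columns.tolist())
print(second_insert_error)
```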

#7


13  

Super simple column assignment

A pandas dataframe is implemented as an ordered dict of columns.

This means that the __getitem__ [] can not only be used to get a certain column, but __setitem__ [] = can be used to assign a new column.

For example, this dataframe can have a column added to it by simply using the [] accessor

    size      name color
0    big      rose   red
1  small    violet  blue
2  small     tulip   red
3  small  harebell  blue

df['protected'] = ['no', 'no', 'no', 'yes']

    size      name color protected
0    big      rose   red        no
1  small    violet  blue        no
2  small     tulip   red        no
3  small  harebell  blue       yes

Note that this works even if the index of the dataframe is off.

df.index = [3,2,1,0]
df['protected'] = ['no', 'no', 'no', 'yes']
    size      name color protected
3    big      rose   red        no
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue       yes

[]= is the way to go, but watch out!

However, if you have a pd.Series and try to assign it to a dataframe whose index does not match, you will run into trouble. See this example:

df['protected'] = pd.Series(['no', 'no', 'no', 'yes'])
    size      name color protected
3    big      rose   red       yes
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue        no

This is because a pd.Series by default has an index enumerated from 0 to n-1, and the pandas [] = method tries to be "smart".

What actually is going on.

When you use the [] = method as in df['column'] = series, pandas quietly aligns the right-hand series to the index of the left-hand dataframe, much like a left join on the index: values are matched by label rather than by position, and labels missing from the series become NaN.
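
A small sketch of that alignment, under the assumption that it behaves like a reindex of the series onto the dataframe's index (the frame is reduced to one column, and expected is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({"name": ["rose", "violet", "tulip", "harebell"]},
                  index=[3, 2, 1, 0])
series = pd.Series(["no", "no", "no", "yes"])   # default index 0..3

df["protected"] = series
expected = series.reindex(df.index)             # look up each label of df

print(df)
print(df["protected"].tolist() == expected.tolist())
```

The "yes" that sits at label 3 of the series lands in the first row, because that row of the dataframe carries label 3.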

Side note

This quickly causes cognitive dissonance, since the []= method tries to do a lot of different things depending on the input, and the outcome cannot be predicted unless you already know how pandas works. I would therefore advise against []= in code bases, but when exploring data in a notebook, it is fine.

Going around the problem

If you have a pd.Series and want it assigned from top to bottom, or if you are writing production code and are not sure of the index order, it is worth safeguarding against this kind of issue.

You could convert the pd.Series to a np.ndarray or a list; this will do the trick.

df['protected'] = pd.Series(['no', 'no', 'no', 'yes']).values

or

df['protected'] = list(pd.Series(['no', 'no', 'no', 'yes']))

But this is not very explicit.

Some coder may come along and say "Hey, this looks redundant, I'll just optimize this away".

Explicit way

Setting the index of the pd.Series to be the index of the df is explicit.

df['protected'] = pd.Series(['no', 'no', 'no', 'yes'], index=df.index)

Or more realistically, you probably have a pd.Series already available.

protected_series = pd.Series(['no', 'no', 'no', 'yes'])
protected_series.index = df.index

3     no
2     no
1     no
0    yes

It can now be assigned:

df['protected'] = protected_series

    size      name color protected
3    big      rose   red        no
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue       yes
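
If this pattern comes up often, the explicit approach could be wrapped in a small function. The helper below, add_column_by_position, is hypothetical (it is not part of pandas) and is only a sketch of one way to make the intent and the length check visible.

```python
import pandas as pd


def add_column_by_position(df, name, values):
    """Hypothetical helper: attach values row by row, ignoring their index."""
    if len(values) != len(df):
        raise ValueError(f"expected {len(df)} values, got {len(values)}")
    # Re-label the values with df's index so the intent is visible to readers.
    return df.assign(**{name: pd.Series(list(values), index=df.index)})


df = pd.DataFrame({"name": ["rose", "violet", "tulip", "harebell"]},
                  index=[3, 2, 1, 0])
protected_series = pd.Series(["no", "no", "no", "yes"])

result = add_column_by_position(df, "protected", protected_series)
print(result)
```

The helper returns a new frame instead of mutating df, which is a design choice of this sketch rather than a requirement.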

Alternative way with df.reset_index()

Since the index mismatch is the problem, if you feel that the index of the dataframe should not dictate things, you can simply drop the index. This should be faster, but it is not very clean, since your function now probably does two things.

# reset_index returns a new object, so the result has to be assigned back.
df = df.reset_index(drop=True)
protected_series = protected_series.reset_index(drop=True)
df['protected'] = protected_series

    size      name color protected
0    big      rose   red        no
1  small    violet  blue        no
2  small     tulip   red        no
3  small  harebell  blue       yes
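
One detail that is easy to miss with this approach: reset_index returns a new object by default and does not modify the original. A minimal sketch, with a made-up series index and the illustrative name unchanged_index:

```python
import pandas as pd

df = pd.DataFrame({"name": ["rose", "violet", "tulip", "harebell"]},
                  index=[3, 2, 1, 0])
protected_series = pd.Series(["no", "no", "no", "yes"], index=[10, 11, 12, 13])

df.reset_index(drop=True)            # result discarded, df keeps its index
unchanged_index = df.index.tolist()

df = df.reset_index(drop=True)       # rebinding keeps the new 0..n-1 index
protected_series = protected_series.reset_index(drop=True)
df["protected"] = protected_series   # both sides now share labels 0..3

print(unchanged_index)
print(df)
```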

Note on df.assign

While df.assign makes it more explicit what you are doing, it actually has all the same problems as []= above:

df.assign(protected=pd.Series(['no', 'no', 'no', 'yes']))
    size      name color protected
3    big      rose   red       yes
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue        no

With df.assign, just make sure that your column is not called self. It will cause errors. This makes df.assign smelly, since there are these kinds of artifacts in the function.

df.assign(self=pd.Series(['no', 'no', 'no', 'yes']))
TypeError: assign() got multiple values for keyword argument 'self'

You may say, "Well, I'll just not use self then". But who knows how this function changes in the future to support new arguments. Maybe your column name will be an argument in a new update of pandas, causing problems with upgrading.

#8


7  

If you want to set the whole new column to an initial base value (e.g. None), you can do this: df1['e'] = None

This actually assigns the "object" dtype to the column. So later you're free to put complex data types, like lists, into individual cells.
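
A short sketch of that idea, with made-up data. Using .at to set the single cell is an assumption of this example (the answer does not say which accessor to use); it targets exactly one row and column, so the list is stored as one value.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[2, 3, 5])
df1["e"] = None                 # every cell of the new column holds None

print(df1["e"].dtype)

df1.at[3, "e"] = [1, 2, 3]      # .at sets one cell; the list is stored as-is
print(df1)
```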

#9


6  

Foolproof:

df.loc[:, 'NewCol'] = 'New_Val'

Example:

df = pd.DataFrame(data=np.random.randn(20, 4), columns=['A', 'B', 'C', 'D'])

df

           A         B         C         D
0  -0.761269  0.477348  1.170614  0.752714
1   1.217250 -0.930860 -0.769324 -0.408642
2  -0.619679 -1.227659 -0.259135  1.700294
3  -0.147354  0.778707  0.479145  2.284143
4  -0.529529  0.000571  0.913779  1.395894
5   2.592400  0.637253  1.441096 -0.631468
6   0.757178  0.240012 -0.553820  1.177202
7  -0.986128 -1.313843  0.788589 -0.707836
8   0.606985 -2.232903 -1.358107 -2.855494
9  -0.692013  0.671866  1.179466 -1.180351
10 -1.093707 -0.530600  0.182926 -1.296494
11 -0.143273 -0.503199 -1.328728  0.610552
12 -0.923110 -1.365890 -1.366202 -1.185999
13 -2.026832  0.273593 -0.440426 -0.627423
14 -0.054503 -0.788866 -0.228088 -0.404783
15  0.955298 -1.430019  1.434071 -0.088215
16 -0.227946  0.047462  0.373573 -0.111675
17  1.627912  0.043611  1.743403 -0.012714
18  0.693458  0.144327  0.329500 -0.655045
19  0.104425  0.037412  0.450598 -0.923387


df.drop([3, 5, 8, 10, 18], inplace=True)

df

           A         B         C         D
0  -0.761269  0.477348  1.170614  0.752714
1   1.217250 -0.930860 -0.769324 -0.408642
2  -0.619679 -1.227659 -0.259135  1.700294
4  -0.529529  0.000571  0.913779  1.395894
6   0.757178  0.240012 -0.553820  1.177202
7  -0.986128 -1.313843  0.788589 -0.707836
9  -0.692013  0.671866  1.179466 -1.180351
11 -0.143273 -0.503199 -1.328728  0.610552
12 -0.923110 -1.365890 -1.366202 -1.185999
13 -2.026832  0.273593 -0.440426 -0.627423
14 -0.054503 -0.788866 -0.228088 -0.404783
15  0.955298 -1.430019  1.434071 -0.088215
16 -0.227946  0.047462  0.373573 -0.111675
17  1.627912  0.043611  1.743403 -0.012714
19  0.104425  0.037412  0.450598 -0.923387

df.loc[:, 'NewCol'] = 0

df
           A         B         C         D  NewCol
0  -0.761269  0.477348  1.170614  0.752714       0
1   1.217250 -0.930860 -0.769324 -0.408642       0
2  -0.619679 -1.227659 -0.259135  1.700294       0
4  -0.529529  0.000571  0.913779  1.395894       0
6   0.757178  0.240012 -0.553820  1.177202       0
7  -0.986128 -1.313843  0.788589 -0.707836       0
9  -0.692013  0.671866  1.179466 -1.180351       0
11 -0.143273 -0.503199 -1.328728  0.610552       0
12 -0.923110 -1.365890 -1.366202 -1.185999       0
13 -2.026832  0.273593 -0.440426 -0.627423       0
14 -0.054503 -0.788866 -0.228088 -0.404783       0
15  0.955298 -1.430019  1.434071 -0.088215       0
16 -0.227946  0.047462  0.373573 -0.111675       0
17  1.627912  0.043611  1.743403 -0.012714       0
19  0.104425  0.037412  0.450598 -0.923387       0

#10


5  

Let me just add that, just as for hum3, .loc didn't solve the SettingWithCopyWarning and I had to resort to df.insert(). In my case the false positive was generated by "fake" chained indexing dict['a']['e'], where 'e' is the new column and dict['a'] is a DataFrame coming from a dictionary.

Also note that if you know what you are doing, you can switch off the warning using pd.options.mode.chained_assignment = None and then use one of the other solutions given here.

#11


4  

Before assigning a new column, if you have indexed data, you need to sort the index. At least in my case I had to:

data.set_index(['index_column'], inplace=True)
# If the index is unsorted, assignment of a new column will fail.
data.sort_index(inplace=True)
data.loc['index_value1', 'column_y'] = np.random.randn(data.loc['index_value1', 'column_x'].shape[0])

#12


4  

One thing to note, though, is that if you do

df1['e'] = Series(np.random.randn(sLength), index=df1.index)

this will effectively be a left join on df1.index. So if you want an outer join effect, my (probably imperfect) solution is to create a dataframe with index values covering the universe of your data, and then use the code above. For example,

data = pd.DataFrame(index=all_possible_values)
df1['e'] = Series(np.random.randn(sLength), index=df1.index)
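
One possible way to get the outer-join effect, sketched under the assumption that "all possible values" means the union of both indexes. This swaps in reindex on the existing frame rather than the empty frame shown above; the names all_labels and data and the numbers are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [0.671399, 0.446172, 0.614758]}, index=[2, 3, 5])
e = pd.Series([-0.335485, -1.166658, -0.385571])     # index 0, 1, 2

all_labels = df1.index.union(e.index)   # every label seen on either side
data = df1.reindex(all_labels)          # rows missing from df1 become NaN
data["e"] = e                           # aligned on the enlarged index

print(data)
```

Rows that exist on only one side keep NaN in the columns coming from the other side, which is what an outer join would produce.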

#13


4  

The following is what I did... But I'm pretty new to pandas and really Python in general, so no promises.

df = pd.DataFrame([[1, 2], [3, 4], [5,6]], columns=list('AB'))

newCol = [3,5,7]
newName = 'C'

values = np.insert(df.values,df.shape[1],newCol,axis=1)
header = df.columns.values.tolist()
header.append(newName)

df = pd.DataFrame(values,columns=header)

#14


4  

If you get the SettingWithCopyWarning, an easy fix is to copy the DataFrame you are trying to add a column to.

df = df.copy()
df['col_name'] = values
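
The warning typically shows up when the frame you are adding to was itself obtained by filtering or slicing another frame. A hedged sketch of that situation, with made-up data and the illustrative name subset:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, -2, 3], "b": [4, 5, 6]})

subset = df[df["a"] > 0].copy()   # explicit copy, no longer tied to df
subset["e"] = [10, 30]            # safe to add columns here

print(subset)
print("e" in df.columns)          # False: df itself is unchanged
```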

#15


4  

If the data frame and Series object have the same index, pandas.concat also works here:

import pandas as pd
df
#          a            b           c           d
#0  0.671399     0.101208   -0.181532    0.241273
#1  0.446172    -0.243316    0.051767    1.577318
#2  0.614758     0.075793   -0.451460   -0.012493

e = pd.Series([-0.335485, -1.166658, -0.385571])    
e
#0   -0.335485
#1   -1.166658
#2   -0.385571
#dtype: float64

# here we need to give the series object a name which converts to the new  column name 
# in the result
df = pd.concat([df, e.rename("e")], axis=1)
df

#          a            b           c           d           e
#0  0.671399     0.101208   -0.181532    0.241273   -0.335485
#1  0.446172    -0.243316    0.051767    1.577318   -1.166658
#2  0.614758     0.075793   -0.451460   -0.012493   -0.385571

In case they don't have the same index:

e.index = df.index
df = pd.concat([df, e.rename("e")], axis=1)
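
To show why the index matters for concat, here is a sketch using the question's index values (2, 3, 5). The names misaligned and aligned are illustrative, and the series is re-labelled by building a new Series rather than mutating e.

```python
import pandas as pd

df = pd.DataFrame({"a": [0.671399, 0.446172, 0.614758]}, index=[2, 3, 5])
e = pd.Series([-0.335485, -1.166658, -0.385571], name="e")  # index 0, 1, 2

# Different indexes: concat keeps the union of labels and fills gaps with NaN.
misaligned = pd.concat([df, e], axis=1)

# Same index: rows line up one to one.
aligned = pd.concat([df, pd.Series(e.values, index=df.index, name="e")], axis=1)

print(misaligned)
print(aligned)
```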

#16


4  

  1. First create a Python list, list_of_e, that holds the relevant data.
  2. Use this: df['e'] = list_of_e

#17


4  

If the column you are trying to add is a Series variable, then just do this:

df["new_columns_name"]=series_variable_name #this will do it for you

This works well even if you are replacing an existing column. Just use the name of the column you want to replace as new_columns_name, and it will overwrite the existing column data with the new series data.

#18


3  

To add a new column, 'e', to the existing data frame

df1.loc[:, 'e'] = Series(np.random.randn(sLength), index=df1.index)

#19


3  

For the sake of completeness - yet another solution using DataFrame.eval() method:

Data:

In [44]: e
Out[44]:
0    1.225506
1   -1.033944
2   -0.498953
3   -0.373332
4    0.615030
5   -0.622436
dtype: float64

In [45]: df1
Out[45]:
          a         b         c         d
0 -0.634222 -0.103264  0.745069  0.801288
4  0.782387 -0.090279  0.757662 -0.602408
5 -0.117456  2.124496  1.057301  0.765466
7  0.767532  0.104304 -0.586850  1.051297
8 -0.103272  0.958334  1.163092  1.182315
9 -0.616254  0.296678 -0.112027  0.679112

Solution:

In [46]: df1.eval("e = @e.values", inplace=True)

In [47]: df1
Out[47]:
          a         b         c         d         e
0 -0.634222 -0.103264  0.745069  0.801288  1.225506
4  0.782387 -0.090279  0.757662 -0.602408 -1.033944
5 -0.117456  2.124496  1.057301  0.765466 -0.498953
7  0.767532  0.104304 -0.586850  1.051297 -0.373332
8 -0.103272  0.958334  1.163092  1.182315  0.615030
9 -0.616254  0.296678 -0.112027  0.679112 -0.622436

#20


2  

I was looking for a general way of adding a column of numpy.nans to a dataframe without getting the dumb SettingWithCopyWarning.

From the following:

  • the answers here
  • this question about passing a variable as a keyword argument
  • this method for generating a numpy array of NaNs in-line

I came up with this:

col = 'column_name'
df = df.assign(**{col:numpy.full(len(df), numpy.nan)})
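
For completeness, a self-contained version of the same idea with the import spelled out and a made-up frame. The dictionary unpacking is what lets the column name come from a variable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[2, 3, 5])

col = "column_name"                      # name only known at run time
df = df.assign(**{col: np.full(len(df), np.nan)})

print(df)
```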

#1


612  

Use the original df1 indexes to create the series:

使用原始的df1索引创建系列:

df1['e'] = Series(np.random.randn(sLength), index=df1.index)


Edit 2015
Some reported to get the SettingWithCopyWarning with this code.
However, the code still runs perfect with the current pandas version 0.16.1.

编辑2015年的一些报告,用此代码获得带有copywarning的SettingWithCopyWarning。然而,目前的熊猫版本0.16.1仍然是完美的。

>>> sLength = len(df1['a'])
>>> df1
          a         b         c         d
6 -0.269221 -0.026476  0.997517  1.294385
8  0.917438  0.847941  0.034235 -0.448948

>>> df1['e'] = p.Series(np.random.randn(sLength), index=df1.index)
>>> df1
          a         b         c         d         e
6 -0.269221 -0.026476  0.997517  1.294385  1.757167
8  0.917438  0.847941  0.034235 -0.448948  2.228131

>>> p.version.short_version
'0.16.1'

The SettingWithCopyWarning aims to inform of a possibly invalid assignment on a copy of the Dataframe. It doesn't necessarily say you did it wrong (it can trigger false positives) but from 0.13.0 it let you know there are more adequate methods for the same purpose. Then, if you get the warning, just follow its advise: Try using .loc[row_index,col_indexer] = value instead

带有copywarning的SettingWithCopyWarning旨在通知Dataframe副本上可能无效的赋值。它并不一定说你做错了(它会引发假阳性),但是从0.13.0开始,它会让你知道有更合适的方法来达到同样的目的。然后,如果您得到了警告,只需遵循它的建议:尝试使用.loc[row_index,col_indexer] = value

>>> df1.loc[:,'f'] = p.Series(np.random.randn(sLength), index=df1.index)
>>> df1
          a         b         c         d         e         f
6 -0.269221 -0.026476  0.997517  1.294385  1.757167 -0.050927
8  0.917438  0.847941  0.034235 -0.448948  2.228131  0.006109
>>> 

In fact, this is currently the more efficient method as described in pandas docs

事实上,这是目前更有效的方法,如熊猫档案所描述的



Edit 2017

编辑2017

As indicated in the comments and by @Alexander, currently the best method to add the values of a Series as a new column of a DataFrame could be using assign:

如评论和@Alexander所指出的,目前将一个系列的值作为一个DataFrame的新列添加的最佳方法是使用assign:

df1 = df1.assign(e=p.Series(np.random.randn(sLength)).values)

#2


129  

This is the simple way of adding a new column: df['e'] = e

这是添加新列的简单方法:df['e'] = e

#3


81  

I would like to add a new column, 'e', to the existing data frame and do not change anything in the data frame. (The series always got the same length as a dataframe.)

我想在现有的数据帧中添加一个新的列“e”,并且不更改数据帧中的任何内容。(这个系列的长度总是和dataframe一样。)

I assume that the index values in e match those in df1.

我假设e中的索引值与df1中的匹配。

The easiest way to initiate a new column named e, and assign it the values from your series e:

启动一个名为e的新列的最简单方法,并将其赋值为e系列中的值:

df['e'] = e.values

assign (Pandas 0.16.0+)

分配(熊猫0.16.0 +)

As of Pandas 0.16.0, you can also use assign, which assigns new columns to a DataFrame and returns a new object (a copy) with all the original columns in addition to the new ones.

对于panda 0.16.0,您还可以使用assign,它将新列分配给DataFrame并返回一个新对象(一个副本),除了新列之外,还有所有原始列。

df1 = df1.assign(e=e.values)

As per this example (which also includes the source code of the assign function), you can also include more than one column:

根据这个示例(它还包含赋值函数的源代码),您还可以包含多个列:

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df.assign(mean_a=df.a.mean(), mean_b=df.b.mean())
   a  b  mean_a  mean_b
0  1  3     1.5     3.5
1  2  4     1.5     3.5

In context with your example:

在你的例子中:

np.random.seed(0)
df1 = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])
mask = df1.applymap(lambda x: x <-0.7)
df1 = df1[-mask.any(axis=1)]
sLength = len(df1['a'])
e = pd.Series(np.random.randn(sLength))

>>> df1
          a         b         c         d
0  1.764052  0.400157  0.978738  2.240893
2 -0.103219  0.410599  0.144044  1.454274
3  0.761038  0.121675  0.443863  0.333674
7  1.532779  1.469359  0.154947  0.378163
9  1.230291  1.202380 -0.387327 -0.302303

>>> e
0   -1.048553
1   -1.420018
2   -1.706270
3    1.950775
4   -0.509652
dtype: float64

df1 = df1.assign(e=e.values)

>>> df1
          a         b         c         d         e
0  1.764052  0.400157  0.978738  2.240893 -1.048553
2 -0.103219  0.410599  0.144044  1.454274 -1.420018
3  0.761038  0.121675  0.443863  0.333674 -1.706270
7  1.532779  1.469359  0.154947  0.378163  1.950775
9  1.230291  1.202380 -0.387327 -0.302303 -0.509652

The description of this new feature when it was first introduced can be found here.

这个新特性的描述在第一次引入时可以在这里找到。

#4


32  

Doing this directly via NumPy will be the most efficient:

通过NumPy直接这样做是最有效的:

df1['e'] = np.random.randn(sLength)

Note my original (very old) suggestion was to use map (which is much slower):

注意,我最初(非常古老)的建议是使用map(速度要慢得多):

df1['e'] = df1['a'].map(lambda x: np.random.random())

#5


25  

It seems that in recent Pandas versions the way to go is to use df.assign:

在最近的熊猫版本中,我们应该使用df.assign:

df1 = df1.assign(e=np.random.randn(sLength))

df1 = df1.assign(e = np.random.randn(sLength))

It doesn't produce SettingWithCopyWarning.

它不会产生SettingWithCopyWarning。

#6


13  

I got the dreaded SettingWithCopyWarning, and it wasn't fixed by using the iloc syntax. My DataFrame was created by read_sql from an ODBC source. Using a suggestion by lowtech above, the following worked for me:

我得到了令人畏惧的设置copywarning,并且它不是通过使用iloc语法来修复的。我的DataFrame是由来自ODBC源的read_sql创建的。根据上面lowtech的建议,下面的建议对我起了作用:

df.insert(len(df.columns), 'e', pd.Series(np.random.randn(sLength),  index=df.index))

This worked fine to insert the column at the end. I don't know if it is the most efficient, but I don't like warning messages. I think there is a better solution, but I can't find it, and I think it depends on some aspect of the index.
Note. That this only works once and will give an error message if trying to overwrite and existing column.
Note As above and from 0.16.0 assign is the best solution. See documentation http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html#pandas.DataFrame.assign Works well for data flow type where you don't overwrite your intermediate values.

这样就可以在最后插入列了。我不知道它是否最有效,但我不喜欢警告信息。我认为有更好的解决方案,但我找不到,我认为这取决于指数的某些方面。请注意。这只工作一次,如果试图覆盖和现有列,将给出一个错误消息。如上所述,赋值是最好的解决方案。请参阅文档http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html# pandas.html #pandas.DataFrame.assign .assign对于不覆盖中间值的数据流类型非常有效。

#7


13  

Super simple column assignment

A pandas dataframe is implemented as an ordered dict of columns.

熊猫数据aframe被实现为一个有序的列命令。

This means that the __getitem__ [] can not only be used to get a certain column, but __setitem__ [] = can be used to assign a new column.

这意味着__getitem__[]不仅可以用于获取某个列,还可以使用__setitem__[] =分配一个新的列。

For example, this dataframe can have a column added to it by simply using the [] accessor

例如,这个dataframe可以通过使用[]访问器添加一个列

    size      name color
0    big      rose   red
1  small    violet  blue
2  small     tulip   red
3  small  harebell  blue

df['protected'] = ['no', 'no', 'no', 'yes']

    size      name color protected
0    big      rose   red        no
1  small    violet  blue        no
2  small     tulip   red        no
3  small  harebell  blue       yes

Note that this works even if the index of the dataframe is off.

注意,即使dataframe的索引是off的,这也可以工作。

df.index = [3,2,1,0]
df['protected'] = ['no', 'no', 'no', 'yes']
    size      name color protected
3    big      rose   red        no
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue       yes

[]= is the way to go, but watch out!

However, if you have a pd.Series and try to assign it to a dataframe where the indexes are off, you will run in to trouble. See example:

但是,如果你有pd的话。序列并尝试将它分配到索引关闭的dataframe,您会遇到麻烦。看到的例子:

df['protected'] = pd.Series(['no', 'no', 'no', 'yes'])
    size      name color protected
3    big      rose   red       yes
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue        no

This is because a pd.Series by default has an index enumerated from 0 to n. And the pandas [] = method tries to be "smart"

这是因为pd。在默认情况下,Series具有从0到n的枚举索引。

What actually is going on.

When you use the [] = method pandas is quietly performing an outer join or outer merge using the index of the left hand dataframe and the index of the right hand series. df['column'] = series

当您使用[]=方法时,使用左手dataframe的索引和右手系列的索引来安静地执行外部联接或外部合并。df系列(“列”)=

Side note

This quickly causes cognitive dissonance, since the []= method is trying to do a lot of different things depending on the input, and the outcome cannot be predicted unless you just know how pandas works. I would therefore advice against the []= in code bases, but when exploring data in a notebook, it is fine.

这很快就会导致认知失调,因为[]=方法试图根据输入做很多不同的事情,除非你只知道熊猫是如何工作的,否则结果无法预测。因此,我建议不要在代码库中使用[]=,但在笔记本中研究数据时,这是可以的。

Going around the problem

If you have a pd.Series and want it assigned from top to bottom, or if you are coding productive code and you are not sure of the index order, it is worth it to safeguard for this kind of issue.

如果你有pd。从上到下进行排序,或者如果您正在编写有效的代码,并且不确定索引的顺序,那么保护这种问题是值得的。

You could downcast the pd.Series to a np.ndarray or a list, this will do the trick.

你可以让警察失望。np系列。ndarray或列表,这就行了。

df['protected'] = pd.Series(['no', 'no', 'no', 'yes']).values

or

df['protected'] = list(pd.Series(['no', 'no', 'no', 'yes']))

But this is not very explicit.

但这并不是很明确。

Some coder may come along and say "Hey, this looks redundant, I'll just optimize this away".

有些程序员可能会说"嘿,这个看起来多余,我把它优化掉"

Explicit way

Setting the index of the pd.Series to be the index of the df is explicit.

设置pd的索引。系列作为df的索引是显式的。

df['protected'] = pd.Series(['no', 'no', 'no', 'yes'], index=df.index)

Or more realistically, you probably have a pd.Series already available.

或者更现实地说,你可能有一个pd。系列已经可用。

protected_series = pd.Series(['no', 'no', 'no', 'yes'])
protected_series.index = df.index

3     no
2     no
1     no
0    yes

Can now be assigned

现在可以分配

df['protected'] = protected_series

    size      name color protected
3    big      rose   red        no
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue       yes

Alternative way with df.reset_index()

Since the index dissonance is the problem, if you feel that the index of the dataframe should not dictate things, you can simply drop the index, this should be faster, but it is not very clean, since your function now probably does two things.

由于索引不一致是问题所在,如果您认为dataframe的索引不应该指定内容,那么您可以简单地删除索引,这应该会更快,但它不是很干净,因为您的函数现在可能要做两件事。

df.reset_index(drop=True)
protected_series.reset_index(drop=True)
df['protected'] = protected_series

    size      name color protected
0    big      rose   red        no
1  small    violet  blue        no
2  small     tulip   red        no
3  small  harebell  blue       yes

Note on df.assign

While df.assign makes it more explicit what you are doing, it actually has all the same problems as the []= above:

df.assign(protected=pd.Series(['no', 'no', 'no', 'yes']))
    size      name color protected
3    big      rose   red       yes
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue        no

Just watch out with df.assign that your column is not called self. It will cause errors. This makes df.assign smelly, since there is this kind of artifact in the function.

df.assign(self=pd.Series(['no', 'no', 'no', 'yes']))
TypeError: assign() got multiple values for keyword argument 'self'

You may say, "Well, I'll just not use self then". But who knows how this function changes in the future to support new arguments. Maybe your column name will be an argument in a new update of pandas, causing problems with upgrading.
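
If a column really has to carry such a name, plain item assignment does not have the restriction. A hedged sketch with a made-up frame follows; whether the assign call raises depends on how your pandas version declares the method, so the except branch is there as a guard rather than a guarantee.

```python
import pandas as pd

df = pd.DataFrame({"name": ["rose", "violet"]})

try:
    # Collides with the method's own first parameter when assign is
    # declared as assign(self, **kwargs).
    df.assign(self=["no", "yes"])
    raised = False
except TypeError:
    raised = True

# Item assignment accepts any column name, including "self".
df["self"] = ["no", "yes"]
print(raised, list(df.columns))
```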

#8


7  

If you want to set the whole new column to an initial base value (e.g. None), you can do this: df1['e'] = None

This actually assigns the "object" dtype to the new column. So later you're free to put complex data types, like list, into individual cells.
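
A small sketch of what that looks like in practice, using a made-up df1 with the same non-continuous index as in the question. The use of .at for setting a single cell is my choice here, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [0.67, 0.45, 0.61]}, index=[2, 3, 5])

df1["e"] = None               # every row gets None, column dtype is object
df1.at[3, "e"] = [1, 2, 3]    # .at sets one cell by row label and column label

print(df1)
print(df1["e"].dtype)
```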

#9


6  

Foolproof:

df.loc[:, 'NewCol'] = 'New_Val'

Example:

df = pd.DataFrame(data=np.random.randn(20, 4), columns=['A', 'B', 'C', 'D'])

df

           A         B         C         D
0  -0.761269  0.477348  1.170614  0.752714
1   1.217250 -0.930860 -0.769324 -0.408642
2  -0.619679 -1.227659 -0.259135  1.700294
3  -0.147354  0.778707  0.479145  2.284143
4  -0.529529  0.000571  0.913779  1.395894
5   2.592400  0.637253  1.441096 -0.631468
6   0.757178  0.240012 -0.553820  1.177202
7  -0.986128 -1.313843  0.788589 -0.707836
8   0.606985 -2.232903 -1.358107 -2.855494
9  -0.692013  0.671866  1.179466 -1.180351
10 -1.093707 -0.530600  0.182926 -1.296494
11 -0.143273 -0.503199 -1.328728  0.610552
12 -0.923110 -1.365890 -1.366202 -1.185999
13 -2.026832  0.273593 -0.440426 -0.627423
14 -0.054503 -0.788866 -0.228088 -0.404783
15  0.955298 -1.430019  1.434071 -0.088215
16 -0.227946  0.047462  0.373573 -0.111675
17  1.627912  0.043611  1.743403 -0.012714
18  0.693458  0.144327  0.329500 -0.655045
19  0.104425  0.037412  0.450598 -0.923387


df.drop([3, 5, 8, 10, 18], inplace=True)

df

           A         B         C         D
0  -0.761269  0.477348  1.170614  0.752714
1   1.217250 -0.930860 -0.769324 -0.408642
2  -0.619679 -1.227659 -0.259135  1.700294
4  -0.529529  0.000571  0.913779  1.395894
6   0.757178  0.240012 -0.553820  1.177202
7  -0.986128 -1.313843  0.788589 -0.707836
9  -0.692013  0.671866  1.179466 -1.180351
11 -0.143273 -0.503199 -1.328728  0.610552
12 -0.923110 -1.365890 -1.366202 -1.185999
13 -2.026832  0.273593 -0.440426 -0.627423
14 -0.054503 -0.788866 -0.228088 -0.404783
15  0.955298 -1.430019  1.434071 -0.088215
16 -0.227946  0.047462  0.373573 -0.111675
17  1.627912  0.043611  1.743403 -0.012714
19  0.104425  0.037412  0.450598 -0.923387

df.loc[:, 'NewCol'] = 0

df
           A         B         C         D  NewCol
0  -0.761269  0.477348  1.170614  0.752714       0
1   1.217250 -0.930860 -0.769324 -0.408642       0
2  -0.619679 -1.227659 -0.259135  1.700294       0
4  -0.529529  0.000571  0.913779  1.395894       0
6   0.757178  0.240012 -0.553820  1.177202       0
7  -0.986128 -1.313843  0.788589 -0.707836       0
9  -0.692013  0.671866  1.179466 -1.180351       0
11 -0.143273 -0.503199 -1.328728  0.610552       0
12 -0.923110 -1.365890 -1.366202 -1.185999       0
13 -2.026832  0.273593 -0.440426 -0.627423       0
14 -0.054503 -0.788866 -0.228088 -0.404783       0
15  0.955298 -1.430019  1.434071 -0.088215       0
16 -0.227946  0.047462  0.373573 -0.111675       0
17  1.627912  0.043611  1.743403 -0.012714       0
19  0.104425  0.037412  0.450598 -0.923387       0
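
The same thing on a frame small enough to check by eye. This is only a sketch with made-up integer data; the point is that a scalar is broadcast to every remaining row, whatever the index labels are.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["A", "B", "C"])
df = df.drop([1, 2])        # leaves index labels 0 and 3
df.loc[:, "NewCol"] = 0     # the scalar is broadcast to both remaining rows

print(df)
```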

#10


5  

Let me just add that, just like for hum3, .loc didn't solve the SettingWithCopyWarning and I had to resort to df.insert(). In my case a false positive was generated by "fake" chained indexing dict['a']['e'], where 'e' is the new column, and dict['a'] is a DataFrame coming from a dictionary.

Also note that if you know what you are doing, you can switch off the warning using pd.options.mode.chained_assignment = None and then use one of the other solutions given here.
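
For reference, the df.insert() route could look roughly like this. The dictionary and column names are made up for illustration.

```python
import pandas as pd

frames = {"a": pd.DataFrame({"x": [1, 2, 3]})}   # hypothetical dict of frames

target = frames["a"]
# insert(loc, column, value) modifies the frame in place;
# loc is the integer position of the new column.
target.insert(len(target.columns), "e", [0.1, 0.2, 0.3])

print(frames["a"])
```

Using len(target.columns) as loc appends the column at the end; loc=0 would put it first.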

#11


4  

Before assigning a new column, if you have indexed data, you need to sort the index. At least in my case I had to:

data.set_index(['index_column'], inplace=True)
# if the index is unsorted, assignment of a new column will fail
data.sort_index(inplace=True)
data.loc['index_value1', 'column_y'] = np.random.randn(data.loc['index_value1', 'column_x'].shape[0])

#12


4  

One thing to note, though, is that if you do

df1['e'] = Series(np.random.randn(sLength), index=df1.index)

this will effectively be a left join on the df1.index. So if you want to have an outer join effect, my probably imperfect solution is to create a dataframe with index values covering the universe of your data, and then use the code above. For example,

data = pd.DataFrame(index=all_possible_values)
df1['e'] = Series(np.random.randn(sLength), index=df1.index)
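
To make the left-join versus outer-join difference concrete, here is a sketch that swaps in reindex over the union of both indexes instead of a hand-built frame of all possible values. The data is made up.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[2, 3, 5])
e = pd.Series([10.0, 20.0, 30.0], index=[3, 5, 7])   # label 7 is not in df1

# Left-join behaviour: label 7 is dropped, label 2 gets NaN.
left = df1.copy()
left["e"] = e

# Outer-join behaviour: extend the frame to the union of labels, then assign.
outer = df1.reindex(df1.index.union(e.index))
outer["e"] = e

print(left)
print(outer)
```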

#13


4  

The following is what I did... But I'm pretty new to pandas and really Python in general, so no promises.

df = pd.DataFrame([[1, 2], [3, 4], [5,6]], columns=list('AB'))

newCol = [3,5,7]
newName = 'C'

values = np.insert(df.values,df.shape[1],newCol,axis=1)
header = df.columns.values.tolist()
header.append(newName)

df = pd.DataFrame(values,columns=header)

#14


4  

If you get the SettingWithCopyWarning, an easy fix is to copy the DataFrame you are trying to add a column to.

df = df.copy()
df['col_name'] = values
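
The warning typically shows up when the frame is a filtered slice of another frame. A sketch of that situation, with made-up data and column names:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, -2, 3], "b": [4, 5, 6]})

subset = df[df["a"] > 0].copy()   # explicit copy, independent of df
subset["e"] = subset["a"] * 10    # unambiguous: only subset gets the column

print(subset)
```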

#15


4  

If the data frame and Series object have the same index, pandas.concat also works here:

import pandas as pd
df
#          a            b           c           d
#0  0.671399     0.101208   -0.181532    0.241273
#1  0.446172    -0.243316    0.051767    1.577318
#2  0.614758     0.075793   -0.451460   -0.012493

e = pd.Series([-0.335485, -1.166658, -0.385571])    
e
#0   -0.335485
#1   -1.166658
#2   -0.385571
#dtype: float64

# here we need to give the series object a name which converts to the new  column name 
# in the result
df = pd.concat([df, e.rename("e")], axis=1)
df

#          a            b           c           d           e
#0  0.671399     0.101208   -0.181532    0.241273   -0.335485
#1  0.446172    -0.243316    0.051767    1.577318   -1.166658
#2  0.614758     0.075793   -0.451460   -0.012493   -0.385571

In case they don't have the same index:

e.index = df.index
df = pd.concat([df, e.rename("e")], axis=1)
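
To see why the index has to match first, here is a sketch with made-up values that runs both variants side by side. With mismatched labels, concat along axis=1 produces the union of the labels and fills the gaps with NaN.

```python
import pandas as pd

df = pd.DataFrame({"a": [0.67, 0.45, 0.61]}, index=[2, 3, 5])
e = pd.Series([-0.34, -1.17, -0.39])     # default index 0, 1, 2

# Labels differ: the result has one row per label from either object.
mismatched = pd.concat([df, e.rename("e")], axis=1)

# After copying the index over, the rows line up one to one.
e.index = df.index
matched = pd.concat([df, e.rename("e")], axis=1)

print(mismatched)
print(matched)
```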

#16


4  

  1. First create a Python list, list_of_e, that has the relevant data.
  2. Use this: df['e'] = list_of_e

#17


4  

If the column you are trying to add is a Series variable, then just:

df["new_columns_name"]=series_variable_name #this will do it for you

This works well even if you are replacing an existing column. Just use the name of the column you want to replace as new_columns_name. It will overwrite the existing column data with the new series data.

#18


3  

To add a new column, 'e', to the existing data frame

# pass index=df1.index so that the values are not aligned against 0..n-1
df1.loc[:, 'e'] = Series(np.random.randn(sLength), index=df1.index)

#19


3  

For the sake of completeness, here is yet another solution using the DataFrame.eval() method:

Data:

In [44]: e
Out[44]:
0    1.225506
1   -1.033944
2   -0.498953
3   -0.373332
4    0.615030
5   -0.622436
dtype: float64

In [45]: df1
Out[45]:
          a         b         c         d
0 -0.634222 -0.103264  0.745069  0.801288
4  0.782387 -0.090279  0.757662 -0.602408
5 -0.117456  2.124496  1.057301  0.765466
7  0.767532  0.104304 -0.586850  1.051297
8 -0.103272  0.958334  1.163092  1.182315
9 -0.616254  0.296678 -0.112027  0.679112

Solution:

In [46]: df1.eval("e = @e.values", inplace=True)

In [47]: df1
Out[47]:
          a         b         c         d         e
0 -0.634222 -0.103264  0.745069  0.801288  1.225506
4  0.782387 -0.090279  0.757662 -0.602408 -1.033944
5 -0.117456  2.124496  1.057301  0.765466 -0.498953
7  0.767532  0.104304 -0.586850  1.051297 -0.373332
8 -0.103272  0.958334  1.163092  1.182315  0.615030
9 -0.616254  0.296678 -0.112027  0.679112 -0.622436

#20


2  

I was looking for a general way of adding a column of numpy.nans to a dataframe without getting the dumb SettingWithCopyWarning.

From the following:

  • the answers here
  • this question about passing a variable as a keyword argument
  • this method for generating a numpy array of NaNs in-line

I came up with this:

col = 'column_name'
df = df.assign(**{col:numpy.full(len(df), numpy.nan)})
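
Wrapped as a function, the same idea might look like this. The name add_nan_column is made up, and the sketch assumes the column name is not one that assign cannot accept as a keyword.

```python
import numpy as np
import pandas as pd

def add_nan_column(df, col):
    """Return a new frame with a float column `col` filled with NaN."""
    # The dict is unpacked into keyword arguments, so `col` can be a variable.
    return df.assign(**{col: np.full(len(df), np.nan)})

df = pd.DataFrame({"a": [1, 2, 3]}, index=[2, 3, 5])
out = add_nan_column(df, "column_name")
print(out)
```

Since assign returns a new frame, the original df is left as it was.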