Adding values in a new column based on the index in pandas (Python)

Time: 2021-12-19 22:54:48

I'm just getting into pandas and I am trying to add a new column to an existing dataframe.

I have two dataframes, where the index of one dataframe links to a column in the other. Where these values are equal, I need to put the value of another column of the source dataframe into a new column of the destination dataframe.

The code section below illustrates what I mean. The commented part is what I need as an output.

I guess I need the .loc[] indexer.

Another, minor, question: is it bad practice to have a non-unique index?

import pandas as pd

d = {'key':['a',  'b', 'c'], 
     'bar':[1, 2, 3]}

d2 = {'key':['a', 'a', 'b'],
      'other_data':['10', '20', '30']}

df = pd.DataFrame(d)
df2 = pd.DataFrame(data = d2)
df2 = df2.set_index('key')

print(df2)

##    other_data  new_col
##key           
##a            10   1
##a            20   1
##b            30   2

5 answers

#1


2  

Rename the index using a Series:

df2['new'] = df2.rename(index=df.set_index('key')['bar']).index
print (df2)

    other_data  new
key                
a           10    1
a           20    1
b           30    2

Or use map:

df2['new'] = df2.index.to_series().map(df.set_index('key')['bar'])
print (df2)

    other_data  new
key                
a           10    1
a           20    1
b           30    2
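One thing worth checking with the map approach is what happens when a label in the destination index has no match in the source. The sketch below is a minimal illustration under assumed data: the label 'z' and the name `lookup` are made up for the example and are not part of the question.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'bar': [1, 2, 3]})
# Hypothetical destination frame: 'z' has no counterpart in df['key']
df2 = pd.DataFrame({'key': ['a', 'a', 'z'],
                    'other_data': ['10', '20', '30']}).set_index('key')

# Series mapping key -> bar, used as the lookup table
lookup = df.set_index('key')['bar']

# map() leaves NaN where a label is missing from the lookup table
df2['new'] = df2.index.to_series().map(lookup).to_numpy()
print(df2)
```

Because of the NaN, the new column is stored as float here rather than int. If that matters, the missing rows can be filled or dropped afterwards.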

For better performance, it is best to avoid duplicates in the index. Also, some functions, such as reindex, fail on an index that contains duplicates.
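A small sketch of the reindex problem, using a made-up Series `s` with a duplicated label. It only shows that an exception is raised; the exact message depends on the pandas version, so it is printed rather than asserted.

```python
import pandas as pd

# Illustrative Series whose index contains the label 'a' twice
s = pd.Series([10, 20, 30], index=['a', 'a', 'b'])

try:
    s.reindex(['a', 'b'])
    raised = False
except ValueError as exc:
    # pandas refuses to reindex an axis with duplicate labels
    raised = True
    print(exc)

print(raised)
```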

#2


2  

You can use join:

df2.join(df.set_index('key'))

    other_data  bar
key                
a           10    1
a           20    1
b           30    2

One way to rename the column in the process:

df2.join(df.set_index('key').bar.rename('new'))

    other_data  new
key                
a           10    1
a           20    1
b           30    2
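If you want pandas to confirm that the source side really has one row per key, merge accepts a validate argument. This is a sketch of an alternative to join, not what the answer above does; the names `lookup` and `result` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'bar': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'a', 'b'],
                    'other_data': ['10', '20', '30']}).set_index('key')

# One-column lookup frame indexed by key, with the column renamed up front
lookup = df.set_index('key')[['bar']].rename(columns={'bar': 'new'})

# validate='many_to_one' raises if the keys on the right side are not unique
result = df2.merge(lookup, left_index=True, right_index=True,
                   how='left', validate='many_to_one')
print(result)
```

With how='left', every row of df2 is kept even when its key is missing from the lookup frame.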

#3


1  

Using combine_first:

In [442]: df2.combine_first(df.set_index('key')).dropna()
Out[442]:
     bar other_data
key
a    1.0         10
a    1.0         20
b    2.0         30

Or, using map:

In [461]: df2.assign(bar=df2.index.to_series().map(df.set_index('key')['bar']))
Out[461]:
    other_data  bar
key
a           10    1
a           20    1
b           30    2

#4


1  

With the help of .loc:

df2['new'] = df.set_index('key').loc[df2.index, 'bar'].to_numpy()

Output:

   other_data  new
key                
a           10    1
a           20    1
b           30    2

#5


1  

Another, minor, question: is it bad practice to have a non-unique index?

It is not great practice, but it depends on your needs and can be fine in some circumstances.

Issue 1: join operations

A good place to start is to think about what makes an Index different from a standard DataFrame column. This raises the question: if your Index has duplicate values, does it really need to be an Index, or could it just be another column in a DataFrame with a default RangeIndex? If you have ever used SQL or any other DBMS and want to mimic join operations in pandas with functions such as .join or .merge, you lose the functionality of a primary key when the index has duplicate values. For each duplicated label, a merge pairs every matching row on the left with every matching row on the right, which is basically a Cartesian product and probably not what you are looking for.

For example:

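import numpy as np
import pandas as pd
# Ten rows of random values; each label 'a'..'e' appears twice in the index
df = pd.DataFrame(np.random.randn(10,2),
                  index=2*list('abcde'))
df2 = df.rename(columns={0: 'a', 1 : 'b'})
print(df.merge(df2, left_index=True, right_index=True).head(7))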
         0        1        a        b
a  0.73737  1.49073  0.73737  1.49073
a  0.73737  1.49073 -0.25562 -2.79859
a -0.25562 -2.79859  0.73737  1.49073
a -0.25562 -2.79859 -0.25562 -2.79859
b -0.93583  1.17583 -0.93583  1.17583
b -0.93583  1.17583 -1.77153 -0.69988
b -1.77153 -0.69988 -0.93583  1.17583
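The same effect is easier to count with fixed values. This is a small sketch with made-up frames `left` and `right`: the label 'a' appears twice on each side, so it contributes 2 x 2 rows to the result, while 'b' contributes one.

```python
import pandas as pd

# Illustrative frames sharing a non-unique index
left = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'a', 'b'])
right = pd.DataFrame({'y': [10, 20, 30]}, index=['a', 'a', 'b'])

merged = left.merge(right, left_index=True, right_index=True)
print(merged)
print(len(merged))            # 2*2 rows for 'a' plus 1 row for 'b'

# A quick check to run before merging on an index
print(left.index.is_unique)
```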

Issue 2: performance

Unique-valued indices make certain operations more efficient, as explained in this post:

When the index is unique, pandas uses a hash table to map a key to its value, O(1). When the index is non-unique and sorted, pandas uses binary search, O(log N). When the index is in random order, pandas needs to check all the keys in the index, O(N).
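You can see which of these three situations an index is in by inspecting it. The sketch below does not measure timing; it only prints is_unique, is_monotonic_increasing and the result of get_loc for three made-up indexes. The different return types (an integer, a slice, a boolean mask) are a visible hint that the three cases are handled differently.

```python
import pandas as pd

unique_idx = pd.Index(['a', 'b', 'c'])        # unique
sorted_dup_idx = pd.Index(['a', 'a', 'b'])    # non-unique, sorted
unsorted_dup_idx = pd.Index(['a', 'b', 'a'])  # non-unique, unsorted

for idx in (unique_idx, sorted_dup_idx, unsorted_dup_idx):
    print(idx.is_unique, idx.is_monotonic_increasing, idx.get_loc('a'))

unique_loc = unique_idx.get_loc('a')        # a single integer position
sorted_loc = sorted_dup_idx.get_loc('a')    # a slice over the run of 'a'
mask_loc = unsorted_dup_idx.get_loc('a')    # a boolean mask over all labels
```

If the duplicates have to stay, calling sort_index() first at least moves the index from the third case to the second.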

A word on .loc

Using .loc returns every row that carries the label. This can be a blessing or a curse, depending on your objective. For example:

df = pd.DataFrame(np.random.randn(10,2),
                  index=2*list('abcde'))
print(df.loc['a'])
         0        1
a  0.73737  1.49073
a -0.25562 -2.79859
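A related catch is that the type of the result changes with the number of matches. A minimal sketch with a made-up frame: a duplicated label gives back a DataFrame, a label that occurs once gives back a Series, and passing the label inside a list gives a DataFrame either way.

```python
import pandas as pd

# Illustrative frame: 'a' is duplicated, 'b' is not
df = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'a', 'b'])

dup = df.loc['a']             # two matching rows -> DataFrame
single = df.loc['b']          # one matching row -> Series
always_frame = df.loc[['b']]  # list of labels -> DataFrame

print(type(dup).__name__, type(single).__name__, type(always_frame).__name__)
```

Code that expects a single row is therefore safer with the list form, or with a check on the index's is_unique beforehand.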
