I'm just getting into pandas and I am trying to add a new column to an existing dataframe.
I have two dataframes where the index of one dataframe links to a column in the other dataframe. Where these values are equal, I need to put the value of another column of the source dataframe into a new column of the destination dataframe.
The code section below illustrates what I mean. The commented part is what I need as an output.
I guess I need the .loc[] indexer.
Another, minor, question: is it bad practice to have non-unique indexes?
import pandas as pd

d = {'key': ['a', 'b', 'c'],
     'bar': [1, 2, 3]}
d2 = {'key': ['a', 'a', 'b'],
      'other_data': ['10', '20', '30']}
df = pd.DataFrame(d)
df2 = pd.DataFrame(data=d2)
df2 = df2.set_index('key')
print(df2)
##     other_data  new_col
## key
## a           10        1
## a           20        1
## b           30        2
5 Solutions
#1
2
Use rename on the index, passing a Series as the mapping:
df2['new'] = df2.rename(index=df.set_index('key')['bar']).index
print (df2)
other_data new
key
a 10 1
a 20 1
b 30 2
Or use map:
df2['new'] = df2.index.to_series().map(df.set_index('key')['bar'])
print (df2)
other_data new
key
a 10 1
a 20 1
b 30 2
If you want better performance, it is best to avoid duplicates in the index. Also, some functions, such as reindex, fail on an index that contains duplicates.
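To make the point about duplicates concrete, here is a minimal sketch that reuses the frames from the question. The variable names index_is_unique, reindex_failed and lookup are only illustrative. It checks whether the index is unique, shows that reindex raises on the duplicated index, and then applies the map approach, which only needs the lookup Series to be unique.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'bar': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'a', 'b'],
                    'other_data': ['10', '20', '30']}).set_index('key')

# 'a' appears twice in df2's index, so the index is not unique.
index_is_unique = df2.index.is_unique

# reindex refuses to operate on an axis with duplicate labels.
try:
    df2.reindex(['a', 'b', 'c'])
    reindex_failed = False
except ValueError:
    reindex_failed = True

# map looks up each index label separately, so duplicates in df2's index
# are fine as long as the lookup Series (df indexed by key) is unique.
lookup = df.set_index('key')['bar']
df2['new'] = df2.index.to_series().map(lookup)
print(index_is_unique, reindex_failed)
print(df2)
```

If uniqueness matters for later steps, checking df2.index.is_unique up front is cheaper than debugging a failed reindex afterwards.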
#2
2
You can use join
df2.join(df.set_index('key'))
other_data bar
key
a 10 1
a 20 1
b 30 2
One way to rename the column in the process
df2.join(df.set_index('key').bar.rename('new'))
other_data new
key
a 10 1
a 20 1
b 30 2
#3
1
Using combine_first
In [442]: df2.combine_first(df.set_index('key')).dropna()
Out[442]:
bar other_data
key
a 1.0 10
a 1.0 20
b 2.0 30
Or, using map
In [461]: df2.assign(bar=df2.index.to_series().map(df.set_index('key')['bar']))
Out[461]:
other_data bar
key
a 10 1
a 20 1
b 30 2
#4
1
With the help of .loc
df2['new'] = df.set_index('key').loc[df2.index]
Output:
    other_data  new
key
a           10    1
a           20    1
b           30    2
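A slightly more explicit variant of the .loc idea, as a sketch rather than the answerer's exact code: select the bar column by name and assign the raw values, so the result does not depend on how pandas aligns a one-column DataFrame. The names lookup and picked are illustrative only. Note that .loc raises a KeyError if df2's index contains a key that is missing from df.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'bar': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'a', 'b'],
                    'other_data': ['10', '20', '30']}).set_index('key')

# Series mapping each key to its bar value.
lookup = df.set_index('key')['bar']

# .loc with a list-like of labels returns one entry per requested label,
# so the repeated 'a' in df2.index yields the value for 'a' twice.
picked = lookup.loc[df2.index]

# to_numpy() drops the index, making the assignment purely positional.
df2['new'] = picked.to_numpy()
print(df2)
```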
#5
1
Another, minor, question: is it bad practice to have non-unique indexes?
It is not great practice, but depends on your needs and can be okay in some circumstances.
Issue 1: join operations
A good place to start is to think about what makes an Index different from a standard DataFrame column. This raises the question: if your Index has duplicate values, does it really need to be an Index, or could it just be another column in a DataFrame with a default RangeIndex? If you've ever used SQL or any other DBMS and want to mimic join operations in pandas with functions such as .join or .merge, you'll lose the functionality of a primary key if you have duplicate index values. When a label is duplicated on both sides, a merge returns every pairing of the matching rows, which is basically a Cartesian product within that label and probably not what you're looking for.
For example:
import numpy as np
import pandas as pd
# Every label appears twice in the index
df = pd.DataFrame(np.random.randn(10, 2),
                  index=2 * list('abcde'))
df2 = df.rename(columns={0: 'a', 1: 'b'})
print(df.merge(df2, left_index=True, right_index=True).head(7))
0 1 a b
a 0.73737 1.49073 0.73737 1.49073
a 0.73737 1.49073 -0.25562 -2.79859
a -0.25562 -2.79859 0.73737 1.49073
a -0.25562 -2.79859 -0.25562 -2.79859
b -0.93583 1.17583 -0.93583 1.17583
b -0.93583 1.17583 -1.77153 -0.69988
b -1.77153 -0.69988 -0.93583 1.17583
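One way to guard against this kind of row multiplication is the validate argument of merge, which asks pandas to check the key relationship before merging. The sketch below uses small made-up frames (left, right) instead of random data, so the row counts are easy to verify by hand.

```python
import pandas as pd

# Hypothetical frames: each label occurs twice on both sides.
left = pd.DataFrame({'x': [1, 2, 3, 4]}, index=list('aabb'))
right = pd.DataFrame({'y': [10, 20, 30, 40]}, index=list('aabb'))

# Each label contributes 2 x 2 pairings, so 4 input rows become 8.
merged = left.merge(right, left_index=True, right_index=True)
print(len(merged))

# With validate='one_to_one', pandas raises instead of silently
# multiplying rows when the merge keys are not unique.
try:
    left.merge(right, left_index=True, right_index=True,
               validate='one_to_one')
    caught = False
except pd.errors.MergeError:
    caught = True
print(caught)
```

For the lookup in the original question, validate='many_to_one' would be the matching choice, since duplicates are expected on the left side only.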
Issue 2: performance
Unique-valued indices make certain operations efficient, as explained in this post.
When the index is unique, pandas uses a hash table to map keys to values, which is O(1). When the index is non-unique but sorted, pandas uses binary search, which is O(log N). When the index is non-unique and in random order, pandas needs to check all the keys in the index, which is O(N).
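The three cases can be observed through Index.get_loc, whose return type depends on whether the index is unique or sorted. This is a small sketch, not a benchmark, and the three example indexes are made up for illustration.

```python
import pandas as pd

unique_idx = pd.Index(list('abc'))        # unique labels
monotonic_idx = pd.Index(list('abbc'))    # duplicates, but sorted
unsorted_idx = pd.Index(list('abcb'))     # duplicates, not sorted

# Unique index: a single integer position.
loc_unique = unique_idx.get_loc('b')

# Sorted index with duplicates: a contiguous slice.
loc_monotonic = monotonic_idx.get_loc('b')

# Unsorted index with duplicates: a boolean mask over every entry.
loc_unsorted = unsorted_idx.get_loc('b')

print(loc_unique, loc_monotonic, loc_unsorted)
print(unique_idx.is_unique, monotonic_idx.is_monotonic_increasing)
```

Calling sort_index() on a frame with a duplicated, unsorted index moves it from the third case to the second.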
A word on .loc
Using .loc will return all instances of the label. This can be a blessing or a curse depending on what your objective is. For example,
df = pd.DataFrame(np.random.randn(10,2),
index=2*list('abcde'))
print(df.loc['a'])
0 1
a 0.73737 1.49073
a -0.25562 -2.79859
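A practical consequence is that the return type of .loc depends on how often the label occurs. The sketch below uses a small made-up frame with fixed values to show the difference, along with the list form that always returns a DataFrame.

```python
import pandas as pd

# Hypothetical frame: 'a' occurs twice, 'b' once.
df = pd.DataFrame({'val': [1, 2, 3]}, index=list('aab'))

dup = df.loc['a']       # duplicated label -> DataFrame with two rows
single = df.loc['b']    # unique label     -> Series for that one row
framed = df.loc[['b']]  # list of labels   -> always a DataFrame

print(type(dup).__name__, type(single).__name__, type(framed).__name__)
```

Passing a list of labels is a simple way to keep downstream code independent of whether a label happens to be duplicated.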