前面给大家分享了pandas.merge用法详解,这节分享pandas数据合并处理的姊妹篇,pandas.concat用法详解,参考利用Python进行数据分析与pandas官网进行整理。
pandas.merge参数列表如下图,其中只有objs是必须得参数,另外常用参数包括objs、axis、join、keys、ignore_index。
1.pd.concat([df1,df2,df3])
, 默认axis=0,在0轴上合并。
2.pd.concat([df1,df4],axis=1)
–在1轴上合并
3.pd.concat([df1,df2,df3],keys=[‘x', ‘y', ‘z'])
–合并时便于区分建立层次化索引。
4.pd.concat([df1, df4], axis=1, join=‘inner')
–采用内连接合并,join默认为outer外连接。
5.pd.concat([df1, df4], ignore_index=true)
–当原来dataframe的索引没有意义的时候,concat之后可以不需要原来的索引。
补充:python3:pandas(合并concat和merge)
pandas处理多组数据的时候往往会要用到数据的合并处理,其中有三种方式,concat、append和merge。
1、concat
用concat是一种基本的合并方式。而且concat中有很多参数可以调整,合并成你想要的数据形式。axis来指明合并方向。axis=0是预设值,因此未设定任何参数时,函数默认axis=0。(0表示上下合并,1表示左右合并)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
|
import pandas as pd
import numpy as np
#定义资料集
df1 = pd.dataframe(np.ones(( 3 , 4 )) * 0 , columns = [ 'a' , 'b' , 'c' , 'd' ])
df2 = pd.dataframe(np.ones(( 3 , 4 )) * 1 , columns = [ 'a' , 'b' , 'c' , 'd' ])
df3 = pd.dataframe(np.ones(( 3 , 4 )) * 2 , columns = [ 'a' , 'b' , 'c' , 'd' ])
#concat纵向合并
res = pd.concat([df1, df2, df3], axis = 0 )
#打印结果
print (res)
'''
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
'''
|
上述index为0,1,2,0,1,2形式。为什么会出现这样的情况,其实是仍然按照合并前的index组合起来的。若希望递增,请看下面示例:
ignore_index (重置 index)
重置后的index为0,1,……8
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
res = pd.concat([df1, df2, df3], axis = 0 , ignore_index = true) # 将ignore_index设置为true
print (res) #打印结果
'''
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 2.0 2.0 2.0 2.0
7 2.0 2.0 2.0 2.0
8 2.0 2.0 2.0 2.0
'''
|
join (合并方式)
join='outer'为预设值,因此未设定任何参数时,函数默认join='outer'。此方式是依照column来做纵向合并,有相同的column上下合并在一起,其他独自的column个自成列,原本没有值的位置皆以nan填充。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
|
import pandas as pd
import numpy as np
#定义资料集
df1 = pd.dataframe(np.ones(( 3 , 4 )) * 0 , columns = [ 'a' , 'b' , 'c' , 'd' ], index = [ 1 , 2 , 3 ])
df2 = pd.dataframe(np.ones(( 3 , 4 )) * 1 , columns = [ 'b' , 'c' , 'd' , 'e' ], index = [ 2 , 3 , 4 ])
res = pd.concat([df1, df2], axis = 0 , join = 'outer' ) #纵向"外"合并df1与df2
print (res)
'''
a b c d e
1 0.0 0.0 0.0 0.0 nan
2 0.0 0.0 0.0 0.0 nan
3 0.0 0.0 0.0 0.0 nan
2 nan 1.0 1.0 1.0 1.0
3 nan 1.0 1.0 1.0 1.0
4 nan 1.0 1.0 1.0 1.0
'''
res = pd.concat([df1, df2], axis = 0 , join = 'inner' ) #纵向"内"合并df1与df2
#打印结果
print (res)
'''
b c d
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
2 1.0 1.0 1.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
'''
|
join_axes (依照 axes 合并)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
|
import pandas as pd
import numpy as np
#定义资料集
df1 = pd.dataframe(np.ones(( 3 , 4 )) * 0 , columns = [ 'a' , 'b' , 'c' , 'd' ], index = [ 1 , 2 , 3 ])
df2 = pd.dataframe(np.ones(( 3 , 4 )) * 1 , columns = [ 'b' , 'c' , 'd' , 'e' ], index = [ 2 , 3 , 4 ])
#依照`df1.index`进行横向合并
res = pd.concat([df1, df2], axis = 1 , join_axes = [df1.index])
#打印结果
print (res)
# a b c d b c d e
# 1 0.0 0.0 0.0 0.0 nan nan nan nan
# 2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
# 3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
|
上述脚本中,join_axes=[df1.index]表明按照df1的index来合并,可以看到结果中去掉了df2中出现但df1中没有的index=4这一行。
2、append (添加数据)
append只有纵向合并,没有横向合并。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
|
import pandas as pd
import numpy as np
#定义资料集
df1 = pd.dataframe(np.ones(( 3 , 4 )) * 0 , columns = [ 'a' , 'b' , 'c' , 'd' ])
df2 = pd.dataframe(np.ones(( 3 , 4 )) * 1 , columns = [ 'a' , 'b' , 'c' , 'd' ])
df3 = pd.dataframe(np.ones(( 3 , 4 )) * 1 , columns = [ 'a' , 'b' , 'c' , 'd' ])
s1 = pd.series([ 1 , 2 , 3 , 4 ], index = [ 'a' , 'b' , 'c' , 'd' ])
#将df2合并到df1的下面,以及重置index,并打印出结果
res = df1.append(df2, ignore_index = true)
print (res)
# a b c d
# 0 0.0 0.0 0.0 0.0
# 1 0.0 0.0 0.0 0.0
# 2 0.0 0.0 0.0 0.0
# 3 1.0 1.0 1.0 1.0
# 4 1.0 1.0 1.0 1.0
# 5 1.0 1.0 1.0 1.0
#合并多个df,将df2与df3合并至df1的下面,以及重置index,并打印出结果
res = df1.append([df2, df3], ignore_index = true)
print (res)
# a b c d
# 0 0.0 0.0 0.0 0.0
# 1 0.0 0.0 0.0 0.0
# 2 0.0 0.0 0.0 0.0
# 3 1.0 1.0 1.0 1.0
# 4 1.0 1.0 1.0 1.0
# 5 1.0 1.0 1.0 1.0
# 6 1.0 1.0 1.0 1.0
# 7 1.0 1.0 1.0 1.0
# 8 1.0 1.0 1.0 1.0
#合并series,将s1合并至df1,以及重置index,并打印出结果
res = df1.append(s1, ignore_index = true)
print (res)
# a b c d
# 0 0.0 0.0 0.0 0.0
# 1 0.0 0.0 0.0 0.0
# 2 0.0 0.0 0.0 0.0
# 3 1.0 2.0 3.0 4.0
|
3、merge
根据两组数据中的关键字key来合并(key在两组数据中是完全一致的)。
3.1依据一组key合并
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
|
import pandas as pd
#定义资料集并打印出
left = pd.dataframe({ 'key' : [ 'k0' , 'k1' , 'k2' , 'k3' ],
'a' : [ 'a0' , 'a1' , 'a2' , 'a3' ],
'b' : [ 'b0' , 'b1' , 'b2' , 'b3' ]})
right = pd.dataframe({ 'key' : [ 'k0' , 'k1' , 'k2' , 'k3' ],
'c' : [ 'c0' , 'c1' , 'c2' , 'c3' ],
'd' : [ 'd0' , 'd1' , 'd2' , 'd3' ]})
print (left)
# a b key
# 0 a0 b0 k0
# 1 a1 b1 k1
# 2 a2 b2 k2
# 3 a3 b3 k3
print (right)
# c d key
# 0 c0 d0 k0
# 1 c1 d1 k1
# 2 c2 d2 k2
# 3 c3 d3 k3
#依据key column合并,并打印出
res = pd.merge(left, right, on = 'key' )
print (res)
a b key c d
# 0 a0 b0 k0 c0 d0
# 1 a1 b1 k1 c1 d1
# 2 a2 b2 k2 c2 d2
# 3 a3 b3 k3 c3 d3
|
3.2 根据两组key合并
合并时有4种方法how = ['left', 'right', 'outer', 'inner'],预设值how='inner'。
inner:按照关键字组合之后,去掉组合中有合并项为nan的行。
outer :保留所有组合
left:仅保留左边合并项为nan的行
right:仅保留右边合并项为nan的行
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
|
import pandas as pd
import numpy as np
#定义资料集并打印出
left = pd.dataframe({ 'key1' : [ 'k0' , 'k0' , 'k1' , 'k2' ],
'key2' : [ 'k0' , 'k1' , 'k0' , 'k1' ],
'a' : [ 'a0' , 'a1' , 'a2' , 'a3' ],
'b' : [ 'b0' , 'b1' , 'b2' , 'b3' ]})
right = pd.dataframe({ 'key1' : [ 'k0' , 'k1' , 'k1' , 'k2' ],
'key2' : [ 'k0' , 'k0' , 'k0' , 'k0' ],
'c' : [ 'c0' , 'c1' , 'c2' , 'c3' ],
'd' : [ 'd0' , 'd1' , 'd2' , 'd3' ]})
print (left)
'''
key1 key2 a b
0 k0 k0 a0 b0
1 k0 k1 a1 b1
2 k1 k0 a2 b2
3 k2 k1 a3 b3
'''
print (right)
'''
key1 key2 c d
0 k0 k0 c0 d0
1 k1 k0 c1 d1
2 k1 k0 c2 d2
3 k2 k0 c3 d3
'''
#依据key1与key2 columns进行合并,并打印出四种结果['left', 'right', 'outer', 'inner']
res = pd.merge(left, right, on = [ 'key1' , 'key2' ], how = 'inner' )
print (res)
'''
key1 key2 a b c d
0 k0 k0 a0 b0 c0 d0
1 k1 k0 a2 b2 c1 d1
2 k1 k0 a2 b2 c2 d2
'''
res = pd.merge(left, right, on = [ 'key1' , 'key2' ], how = 'outer' )
print (res)
'''
key1 key2 a b c d
0 k0 k0 a0 b0 c0 d0
1 k0 k1 a1 b1 nan nan
2 k1 k0 a2 b2 c1 d1
3 k1 k0 a2 b2 c2 d2
4 k2 k1 a3 b3 nan nan
5 k2 k0 nan nan c3 d3
'''
res = pd.merge(left, right, on = [ 'key1' , 'key2' ], how = 'left' )
print (res)
'''
key1 key2 a b c d
0 k0 k0 a0 b0 c0 d0
1 k0 k1 a1 b1 nan nan
2 k1 k0 a2 b2 c1 d1
3 k1 k0 a2 b2 c2 d2
4 k2 k1 a3 b3 nan nan
'''
res = pd.merge(left, right, on = [ 'key1' , 'key2' ], how = 'right' )
print (res)
'''
key1 key2 a b c d
0 k0 k0 a0 b0 c0 d0
1 k1 k0 a2 b2 c1 d1
2 k1 k0 a2 b2 c2 d2
3 k2 k0 nan nan c3 d3
'''
|
3.3 indicator
indicator=true会将合并的记录放在新的一列。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
|
import pandas as pd
#定义资料集并打印出
df1 = pd.dataframe({ 'col1' :[ 0 , 1 ], 'col_left' :[ 'a' , 'b' ]})
df2 = pd.dataframe({ 'col1' :[ 1 , 2 , 2 ], 'col_right' :[ 2 , 2 , 2 ]})
print (df1)
# col1 col_left
# 0 0 a
# 1 1 b
print (df2)
# col1 col_right
# 0 1 2
# 1 2 2
# 2 2 2
# 依据col1进行合并,并启用indicator=true,最后打印出
res = pd.merge(df1, df2, on = 'col1' , how = 'outer' , indicator = true)
print (res)
# col1 col_left col_right _merge
# 0 0.0 a nan left_only
# 1 1.0 b 2.0 both
# 2 2.0 nan 2.0 right_only
# 3 2.0 nan 2.0 right_only
# 自定indicator column的名称,并打印出
res = pd.merge(df1, df2, on = 'col1' , how = 'outer' , indicator = 'indicator_column' )
print (res)
# col1 col_left col_right indicator_column
# 0 0.0 a nan left_only
# 1 1.0 b 2.0 both
# 2 2.0 nan 2.0 right_only
# 3 2.0 nan 2.0 right_only
|
3.4 依据index合并
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
|
import pandas as pd
#定义资料集并打印出
left = pd.dataframe({ 'a' : [ 'a0' , 'a1' , 'a2' ],
'b' : [ 'b0' , 'b1' , 'b2' ]},
index = [ 'k0' , 'k1' , 'k2' ])
right = pd.dataframe({ 'c' : [ 'c0' , 'c2' , 'c3' ],
'd' : [ 'd0' , 'd2' , 'd3' ]},
index = [ 'k0' , 'k2' , 'k3' ])
print (left)
# a b
# k0 a0 b0
# k1 a1 b1
# k2 a2 b2
print (right)
# c d
# k0 c0 d0
# k2 c2 d2
# k3 c3 d3
#依据左右资料集的index进行合并,how='outer',并打印出
res = pd.merge(left, right, left_index = true, right_index = true, how = 'outer' )
print (res)
# a b c d
# k0 a0 b0 c0 d0
# k1 a1 b1 nan nan
# k2 a2 b2 c2 d2
# k3 nan nan c3 d3
#依据左右资料集的index进行合并,how='inner',并打印出
res = pd.merge(left, right, left_index = true, right_index = true, how = 'inner' )
print (res)
# a b c d
# k0 a0 b0 c0 d0
# k2 a2 b2 c2 d2
|
3.5 解决overlapping的问题
下面脚本中,boys和girls均有属性age,但是两者值不同,因此需要在合并时加上后缀suffixes,以示区分。
1
2
3
4
5
6
7
8
9
10
11
12
|
import pandas as pd
#定义资料集
boys = pd.dataframe({ 'k' : [ 'k0' , 'k1' , 'k2' ], 'age' : [ 1 , 2 , 3 ]})
girls = pd.dataframe({ 'k' : [ 'k0' , 'k0' , 'k3' ], 'age' : [ 4 , 5 , 6 ]})
#使用suffixes解决overlapping的问题
res = pd.merge(boys, girls, on = 'k' , suffixes = [ '_boy' , '_girl' ], how = 'inner' )
print (res)
# age_boy k age_girl
# 0 1 k0 4
# 1 1 k0 5
|
以上是pandas中有关于合并的一些操作。当然,如果练习的多了,几个方法也是大同小异。希望大家多多支持服务器之家。如有错误或未考虑完全的地方,望不吝赐教。
原文链接:https://blog.csdn.net/Asher117/article/details/84799845