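pandas Series and DataFrame: A Comprehensive Study

Date: 2021-12-22 04:26:10

Comprehensive study and analysis

Index objects

pandas Index objects are responsible for holding the axis labels and other metadata (such as axis names).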

from pandas import Series

obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
print(index) # Index(['a', 'b', 'c'], dtype='object')
print(index[1:]) # Index(['b', 'c'], dtype='object')

An Index is immutable: users cannot modify it.

index[1] = 'd'
# Traceback (most recent call last):
# File "E:/pandas_study/comone/a.py", line 8, in <module>
# index[1] = 'd'
# File "C:\Python36\lib\site-packages\pandas\core\indexes\base.py", line 1724, in __setitem__
# raise TypeError("Index does not support mutable operations")
# TypeError: Index does not support mutable operations

Immutability matters because it lets an Index object be shared safely among multiple data structures.

from pandas import Series
import pandas as pd
import numpy as np

index = pd.Index(np.arange(3))
obj = Series([1.5, -2.5, 0], index=index)

print(index is obj.index)
print(obj.index is index)
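
As a small sketch of what immutability means in practice (the variable names here are illustrative, not from the original post): item assignment fails, while operations that look like modifications, such as append, return a new Index and leave the shared one untouched.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# Item assignment is rejected, so the labels cannot change underneath
# any Series or DataFrame that shares this Index.
try:
    index[1] = 'd'
    mutated = True
except TypeError:
    mutated = False

# "Modifying" operations return a new Index and leave the original alone.
longer = index.append(pd.Index(['d']))

print(mutated)       # False
print(list(index))   # ['a', 'b', 'c']
print(list(longer))  # ['a', 'b', 'c', 'd']
```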

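Basic functionality

This section covers the basic mechanics of working with the data held in a Series or DataFrame.

1 Reindexing
reindex creates a new object whose data conforms to a new index.
The example below compares three cases: no index specified, an index specified at construction, and reindexing into a new order.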

from pandas import Series
import pandas as pd
import numpy as np

data = {"a": -5.3, "c": 3.6, "b": 7.2, 'd': 4.5}
# obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj = Series(data)
print(obj)
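# (pandas 0.23+ on Python 3.6+ keeps the dict's insertion order;
#  older versions sorted the keys and printed a, b, c, d)
# a -5.3
# c 3.6
# b 7.2
# d 4.5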
# dtype: float64
print("=================")
obj2 = Series(data, index=['d', 'b', 'a', 'c'])
print(obj2)
# d 4.5
# b 7.2
# a -5.3
# c 3.6
# dtype: float64
print("=================")
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print(obj2)
# a -5.3
# b 7.2
# c 3.6
# d 4.5
# e NaN
# dtype: float64

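If an index value is not already present, a missing value (NaN) is introduced for it.

Use fill_value to supply a value for those missing entries instead.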

obj3 = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=10)
print(obj3)
# a -5.3
# b 7.2
# c 3.6
# d 4.5
# e 10.0
# dtype: float64

Reindexing sometimes needs interpolation or filling of values. The method option handles this; ffill carries the last valid value forward.

obj = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
print(obj)
# 0 blue
# 2 purple
# 4 yellow
# dtype: object
print("====")
obj2 = obj.reindex(range(6), method='ffill')
print(obj2)
# 0 blue
# 1 blue
# 2 purple
# 3 purple
# 4 yellow
# 5 yellow
# dtype: object


ffill: fill forward (carry the previous value forward)
bfill: fill backward (carry the next value backward)
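
For comparison with the ffill example, here is a minimal sketch of bfill on the same Series. The name filled is only illustrative.

```python
import pandas as pd

obj = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# bfill pulls the next valid value backward into each new label.
# Label 5 has no later value to borrow from, so it stays missing.
filled = obj.reindex(range(6), method='bfill')
print(filled)
```

Labels 1 and 3 take the values of labels 2 and 4 respectively, which is the mirror image of what ffill does.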

Modifying the index

For a DataFrame, reindex can change the row index, the columns, or both. If only a sequence is passed, the rows are reindexed.

from pandas import Series, DataFrame
import numpy as np

frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                  columns=['yang', 'xiao', 'dong']
                  )
print(frame)
# yang xiao dong
# a 0 1 2
# c 3 4 5
# d 6 7 8
print("=========")
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
print(frame2)
# yang xiao dong
# a 0.0 1.0 2.0
# b NaN NaN NaN
# c 3.0 4.0 5.0
# d 6.0 7.0 8.0
print("==============")
state = ['yang', 'yan', 'dong']
frame3 = frame.reindex(columns=state)
print(frame3)
# yang yan dong
# a 0 NaN 2
# c 3 NaN 5
# d 6 NaN 8
print("=============")

Rows and columns can be reindexed at the same time, but filling is applied only along the rows (axis 0).

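# Reindex rows and columns together. Forward-fill the rows first, then select the columns:
# recent pandas applies method to both axes and can raise on unsorted column labels.
frame.reindex(['a', 'b', 'c', 'd'], method='ffill').reindex(columns=state)

Older pandas offered a more concise label-indexing shorthand for the same task: frame.ix[['a', 'b', 'c', 'd'], state]. The ix indexer was deprecated in pandas 0.20 and removed in 1.0, and its replacement loc raises KeyError when a requested label is missing, so reindex is the way to do this today.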

Parameters of the reindex function
[Image: table of reindex parameters]

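Dropping entries from an axis

Because it involves some data munging and set logic, the drop method returns a new object with the specified values removed from the given axis.
Note that a new object is returned; the original is left unchanged.

Dropping from a Series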

from pandas import Series
import pandas as pd
import numpy as np

obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
print(obj)
# a 0.0
# b 1.0
# c 2.0
# d 3.0
# e 4.0
# dtype: float64
print("=============")
new_obj = obj.drop("c")
print(new_obj)
# a 0.0
# b 1.0
# d 3.0
# e 4.0
# dtype: float64
print("===============")
new_obj2 = obj.drop(['a', 'b'])
print(new_obj2)
# c 2.0
# d 3.0
# e 4.0
# dtype: float64
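
To make the "returns a new object" point concrete, here is a short sketch that checks the original after a drop. The label 'z' and the name same are only for illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')

# The original still contains 'c'; only the returned object lacks it.
print('c' in obj.index)      # True
print('c' in new_obj.index)  # False

# Dropping a label that is not present raises KeyError by default;
# errors='ignore' suppresses that and returns the data unchanged.
same = obj.drop('z', errors='ignore')
print(len(same))  # 5
```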

Dropping from a DataFrame

Understanding axis=0 and axis=1

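0: across rows, moving vertically down the rows
1: across columns, extending horizontally along the columns

When dropping labels, axis=1 targets the columns and axis=0 targets the rows.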

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

frame = DataFrame(np.arange(16).reshape((4,4)),
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two', 'three', 'four']
                  )

print(frame)
# one two three four
# a 0 1 2 3
# b 4 5 6 7
# c 8 9 10 11
# d 12 13 14 15
print("==============")
frame2 = frame.drop(['a', 'b'])
print(frame2)
# one two three four
# c 8 9 10 11
# d 12 13 14 15
print("======")
frame3 = frame.drop('two', axis=1)
print(frame3)
# one three four
# a 0 2 3
# b 4 6 7
# c 8 10 11
# d 12 14 15
print("============")
frame4 = frame.drop(['two', 'four'], axis=1)
print(frame4)
# one three
# a 0 2
# b 4 6
# c 8 10
# d 12 14

The default is axis=0.
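
If the axis numbers are hard to remember, drop also accepts the axis by name and has index= and columns= keywords. The sketch below assumes a reasonably recent pandas (the keywords were added in 0.21); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two', 'three', 'four'])

# Three spellings of "drop these columns"
by_number = frame.drop(['two', 'four'], axis=1)
by_name = frame.drop(['two', 'four'], axis='columns')
by_keyword = frame.drop(columns=['two', 'four'])

# And the row version using the index= keyword
rows_dropped = frame.drop(index=['a', 'b'])

print(by_number.equals(by_name) and by_name.equals(by_keyword))  # True
print(list(rows_dropped.index))                                  # ['c', 'd']
```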

Indexing, selection, and filtering

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print(obj)
# a 0.0
# b 1.0
# c 2.0
# d 3.0
# dtype: float64
print("==")
print(obj['b'])
print(obj.b)
print(obj.iloc[1])  # by position; plain obj[1] is deprecated on a label index
print(obj.iloc[3])
# 1.0
# 1.0
# 1.0
# 3.0
print("============")
print(obj[2:4])
print(obj[['b', 'c', 'd']])
print(obj.iloc[[1, 3]])
print(obj[obj < 2])

# c 2.0
# d 3.0
# dtype: float64
# b 1.0
# c 2.0
# d 3.0
# dtype: float64
# b 1.0
# d 3.0
# dtype: float64
# a 0.0
# b 1.0
# dtype: float64

Slicing with labels behaves differently from ordinary Python slicing: the endpoint is inclusive.

print(obj['b':'c'])
#b 1.0
#c 2.0
#dtype: float64

Assigning to a slice sets the values at those positions.

obj['b':'c'] = 5
print(obj)
# a 0.0
# b 5.0
# c 5.0
# d 3.0
# dtype: float64

Indexing into a DataFrame retrieves one or more columns.
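
A minimal sketch of column selection, using the same 4x4 frame as the other examples (the names col and sub are illustrative): a single label gives a Series, a list of labels gives a DataFrame with the columns in the requested order.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two', 'three', 'four'])

col = frame['two']             # one label -> Series
sub = frame[['three', 'one']]  # list of labels -> DataFrame

print(col.tolist())       # [1, 5, 9, 13]
print(list(sub.columns))  # ['three', 'one']
```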

Special cases in indexing: a slice or a boolean array selects rows instead

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

frame = DataFrame(np.arange(16).reshape((4, 4)),
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two', 'three', 'four']
                  )

print(frame[:2])
# one two three four
#a 0 1 2 3
#b 4 5 6 7
print("========")
print(frame[frame['three'] > 5])
# one two three four
#b 4 5 6 7
#c 8 9 10 11
#d 12 13 14 15

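The ix indexing field

ix was introduced for label-based indexing on the rows of a DataFrame. It lets you select a subset of rows and columns from a DataFrame using NumPy-style notation together with axis labels. Note that ix was deprecated in pandas 0.20 and removed in 1.0; the examples below need an older pandas, and on current versions loc (labels) and iloc (positions) do the same job.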

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

frame = DataFrame(np.arange(16).reshape((4, 4)),
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two', 'three', 'four']
                  )
print(frame)
# one two three four
# a 0 1 2 3
# b 4 5 6 7
# c 8 9 10 11
# d 12 13 14 15
print(frame.ix['a', ['two', 'three']])
# two 1
# three 2
# Name: a, dtype: int32
print("=======")
print(frame.ix[['b', 'c'], [3, 0, 1]])
# four one two
# b 7 4 5
# c 11 8 9
print(frame.ix[['b', 'c'], ["four", "one", "two"]])
# four one two
# b 7 4 5
# c 11 8 9
print("=======")
print(frame.ix[2])
# one 8
# two 9
# three 10
# four 11
# Name: c, dtype: int32
print(frame.ix[:'c', 'two'])
# a 1
# b 5
# c 9
# Name: two, dtype: int32
print("=========")
print(frame.ix[frame.three > 5, :3])
# one two three
# b 4 5 6
# c 8 9 10
# d 12 13 14
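
Since ix no longer exists in pandas 1.0 and later, here is one possible translation of the calls above into loc and iloc. This is a sketch, not the only way to write them; where the original mixed labels and positions in a single call, the selection is split into a label step and a position step. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two', 'three', 'four'])

# frame.ix['a', ['two', 'three']]      -> pure labels, so loc
row_a = frame.loc['a', ['two', 'three']]

# frame.ix[['b', 'c'], [3, 0, 1]]      -> labels for rows, positions for columns
mixed = frame.loc[['b', 'c']].iloc[:, [3, 0, 1]]

# frame.ix[2]                          -> pure position, so iloc
third = frame.iloc[2]

# frame.ix[:'c', 'two']                -> label slice (endpoint inclusive)
upto_c = frame.loc[:'c', 'two']

# frame.ix[frame.three > 5, :3]        -> boolean mask for rows, positions for columns
masked = frame.loc[frame['three'] > 5].iloc[:, :3]

print(mixed)
print(masked)
```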

There are many ways to select and rearrange the data in pandas objects.
A summary follows.
[Image: summary table of indexing options]

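Arithmetic and data alignment

An important pandas feature is arithmetic between objects that have different indexes. When objects are added, if their indexes differ, the index of the result is the union of the two indexes.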

s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s3 = s1 + s2
print(s1)
#a 7.3
#c -2.5
#d 3.4
#e 1.5
#dtype: float64
print(s2)
#a -2.1
#c 3.6
#e -1.5
#f 4.0
#g 3.1
#dtype: float64
print(s3)
#a 5.2
#c 1.1
#d NaN
#e 0.0
#f NaN
#g NaN
#dtype: float64

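Automatic data alignment introduces NA values at the index labels that do not overlap. Missing values then propagate through further arithmetic.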
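
The two claims above (the result index is the union, and missing values propagate) can be checked directly. A small sketch, reusing s1 and s2 from the example; total is an illustrative name.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
total = s1 + s2

# The result index is the union of both indexes.
print(list(total.index) == list(s1.index.union(s2.index)))  # True

# Labels present in only one operand ('d', 'f', 'g') become NaN ...
print(int(total.isna().sum()))  # 3

# ... and stay NaN through later arithmetic.
print(int((total * 2 + 1).isna().sum()))  # 3
```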

For a DataFrame, alignment happens on both the rows and the columns.

df = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
               index=['one', 'two', 'three']
               )

df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                index=['five', 'one', 'two', 'six']
                )

print(df)
# b c d
# one 0.0 1.0 2.0
# two 3.0 4.0 5.0
# three 6.0 7.0 8.0
print(df2)
# b d e
# five 0.0 1.0 2.0
# one 3.0 4.0 5.0
# two 6.0 7.0 8.0
# six 9.0 10.0 11.0
print(df + df2)
# b c d e
# five NaN NaN NaN NaN
# one 3.0 NaN 6.0 NaN
# six NaN NaN NaN NaN
# three NaN NaN NaN NaN
# two 9.0 NaN 12.0 NaN

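The result above contains many NaN values, which we now want to fill in.
Use the add method with fill_value. The rule: when a (row, column) cell exists in only one of the two objects, the missing side is treated as fill_value before the addition. When a cell is missing from both objects, the result there is still NaN.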

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                index=['one', 'two', 'three']
                )

df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                index=['five', 'one', 'two', 'six']
                )
print(df1 + df2)
# b c d e
# five NaN NaN NaN NaN
# one 3.0 NaN 6.0 NaN
# six NaN NaN NaN NaN
# three NaN NaN NaN NaN
# two 9.0 NaN 12.0 NaN
print(df1)
# b c d
# one 0.0 1.0 2.0
# two 3.0 4.0 5.0
# three 6.0 7.0 8.0
print(df2)
# b d e
# five 0.0 1.0 2.0
# one 3.0 4.0 5.0
# two 6.0 7.0 8.0
# six 9.0 10.0 11.0
df3 = df1.add(df2, fill_value=0)
print(df3)
# b c d e
# five 0.0 NaN 1.0 2.0
# one 3.0 1.0 6.0 5.0
# six 9.0 NaN 10.0 11.0
# three 6.0 7.0 8.0 NaN
# two 9.0 4.0 12.0 8.0
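
The same fill_value idea applies beyond add. The sketch below uses sub, and also shows reindex with fill_value as a related way to conform one frame to another's columns. The names diff and conformed are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['one', 'two', 'three'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['five', 'one', 'two', 'six'])

# sub accepts fill_value just like add does.
diff = df1.sub(df2, fill_value=0)
print(diff.loc['one', 'e'])   # df1 has no 'e', so 0 - 5.0 = -5.0
print(diff.loc['five', 'c'])  # missing from both inputs, so NaN

# reindex can also take fill_value when adopting another frame's columns.
conformed = df1.reindex(columns=df2.columns, fill_value=0)
print(conformed)
```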


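Operations between DataFrame and Series

Operations between the two are broadcast. We first look at broadcasting between NumPy arrays, then switch to operations between a DataFrame and a Series.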

import numpy as np

arr = np.arange(12.).reshape((3, 4))
print(arr)
#[[ 0. 1. 2. 3.]
# [ 4. 5. 6. 7.]
# [ 8. 9. 10. 11.]]
print(arr[0]) # [0. 1. 2. 3.]
print("=====")
arr2 = arr - arr[0]
print(arr2)
#[[0. 0. 0. 0.]
# [4. 4. 4. 4.]
# [8. 8. 8. 8.]]
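
The subtraction above broadcasts a row down every row. As a complementary sketch (not from the original post), subtracting a column from every column in NumPy needs the column reshaped to (3, 1) so that it broadcasts across the columns instead.

```python
import numpy as np

arr = np.arange(12.).reshape((3, 4))

first_col = arr[:, 0]                   # shape (3,)
# Add a trailing axis so the shape is (3, 1); it then broadcasts across columns.
arr3 = arr - first_col[:, np.newaxis]
print(arr3)
# [[0. 1. 2. 3.]
#  [0. 1. 2. 3.]
#  [0. 1. 2. 3.]]
```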

Now let's look at operations between a DataFrame and a Series.

frame = DataFrame(np.arange(12.).reshape((4,3)), columns=list('bde'),
                  index=['one', 'two', 'three', 'four']
                  )
series = frame.iloc[0]  # first row by position (ix was removed in pandas 1.0)
print(series)
series2 = frame.loc["one"]  # the same row, selected by label
print(series2)

aa = frame - series
print(aa)

 # b d e
#one 0.0 0.0 0.0
#two 3.0 3.0 3.0
#three 6.0 6.0 6.0
#four 9.0 9.0 9.0

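By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame, then broadcasts down the rows.

If an index value is not found in either the DataFrame's columns or the Series's index, the two objects are reindexed to form the union.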

series = Series(range(3), index=['b', 'e', 'f'])
print(frame - series)
# b d e f
# one 0.0 NaN 1.0 NaN
# two 3.0 NaN 4.0 NaN
# three 6.0 NaN 7.0 NaN
# four 9.0 NaN 10.0 NaN

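Note that the example above broadcasts down the rows. Broadcasting across the columns needs extra care, so take note: you have to use one of the arithmetic methods.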

series = frame['d']
print(series)
# one 1.0
# two 4.0
# three 7.0
# four 10.0
# Name: d, dtype: float64
print(frame.sub(series, axis=0))
# b d e
# one -1.0 0.0 1.0
# two -1.0 0.0 1.0
# three -1.0 0.0 1.0
# four -1.0 0.0 1.0

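The axis number you pass is the axis to match on. In this example the goal is to match on the DataFrame's row index and broadcast across the columns.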
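
A common use of this pattern is centering each row on its own mean. The sketch below is an illustration under that assumption rather than part of the original example; it passes the axis by name, axis='index', which is equivalent to axis=0.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['one', 'two', 'three', 'four'])

# One mean per row, indexed by the row labels.
row_means = frame.mean(axis=1)

# Match the Series on the row index and broadcast across the columns.
centered = frame.sub(row_means, axis='index')
print(centered)
```

Every row of centered comes out as -1.0, 0.0, 1.0, because each row of frame is three consecutive numbers.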

Function application and mapping