Python Data Analysis: A Complete Guide to Pandas Operations

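I typed all of this by hand, and every example in it was run in PyCharm. The biggest benefit of organizing my own notes is that I can build the structure around my own way of thinking, so that when I need them later I can understand and apply them as quickly as possible =_=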

Note: for convenience, throughout this chapter s denotes an instance of pandas.core.series.Series and df denotes an instance of pandas.core.frame.DataFrame.

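1. Introduction to Pandas

Pandas is a Python data analysis library built on NumPy. It was originally developed as a tool for financial data analysis, which is why it offers strong support for time series analysis. The name comes from panel data and Python data analysis. Panel data is an econometrics term for multi-dimensional data sets, and Pandas used to provide a Panel data type as well (note: this data type was removed in pandas 0.25).

Official site: https://pandas.pydata.org/

Pandas is built on top of NumPy and the examples below use both, so import NumPy together with pandas: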

import numpy as np

import pandas as pd

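2. The three data structures in Pandas

Pandas has historically offered three data structures: Series, DataFrame and Panel.

Data structure	Dimensions	Description
Series	1	an index plus one column of values
DataFrame	2	a container of Series
Panel	3	a container of DataFrames (note: Panel has been removed in recent versions of Pandas)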

3. Series

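A Series is a one-dimensional data structure made of an index and a matching column of values. Every Series is an instance of pandas.core.series.Series.

This section only covers Series with a single-level index. For Series with a multi-level index, see "13. Hierarchical indexing" in this chapter.

(1) Creating a Series

Syntax: pd.Series(data=None, index=None, dtype=None, name=None)

Parameters:

data: the data. It can be a one-dimensional list, a dict, a range(), or a one-dimensional numpy.ndarray. The default is None, which creates an empty Series.
index: the index. The default is None (when data is not a dict, the default index is 0, 1, 2, ...; when data is a dict, the default index is the dict's keys).

A custom index can be given as a list, tuple, range() or numpy.ndarray. Notes:

① When data is a list or ndarray, the length of index must equal the length of data, otherwise an error is raised.

② When data is a dict, a custom index does not rename the keys. Pandas looks up every index label among the dict's keys: labels that match a key keep that key's value, and labels without a matching key get NaN. An index that shares no label with the keys therefore produces a Series of all NaN (see the incorrect definition in the dict example below). To relabel the data, create the Series first and then assign s.index = [...].

③ The default index (0, 1, 2, ...) is called position here, and a custom index is called label. Without a custom index, values are looked up by position. With a custom index, values can be looked up by label (s['a'] or s.loc['a']) and by position (s.iloc[0]). Plain s[0] on a labelled Series also worked positionally in the pandas version used for these notes, but recent releases deprecate that fallback, so s.iloc[0] is the safer spelling.
dtype: the data type
name: the name of the Series

Note: unlike DataFrame, Series has no columns parameter!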

# 通过list和numpy.ndarray创建Series

import numpy as np

import pandas as pd

s1 = pd.Series([1,2,3])

s2 = pd.Series(np.array([1,2,3]), index=['a','b','c'], name='MySeries')

print(s1); print('===========')

print(type(s1)); print('===========')

print(s2); print('===========')

print(type(s2))

执行结果：

0    1

1    2

2    3

dtype: int64

===========

<class 'pandas.core.series.Series'>

===========

a    1

b    2

c    3

Name: MySeries, dtype: int32

===========

<class 'pandas.core.series.Series'>

# 通过dict创建Series

import numpy as np

import pandas as pd

s1 = pd.Series({'a':1,'b':2,'c':3})

s2 = pd.Series({'a':1,'b':2,'c':3}, index=['A','B','C'])		# incorrect: none of the index labels matches a dict key, so every value becomes NaN

print(s1); print('===========')

print(type(s1)); print('===========')

print(s2); print('===========')

print(type(s2))

执行结果：

a    1

b    2

c    3

dtype: int64

===========

<class 'pandas.core.series.Series'>

===========

A   NaN

B   NaN

C   NaN

dtype: float64

===========

<class 'pandas.core.series.Series'>
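The all-NaN result above comes from label matching, not from the index overwriting the keys. The following is a minimal sketch of that behaviour; the variable names `partial` and `renamed` are made up for illustration.

```python
import pandas as pd

data = {'a': 1, 'b': 2, 'c': 3}

# Each requested label is looked up among the dict keys.
# 'c' and 'a' are found; 'x' is not, so it becomes NaN
# (and the dtype becomes float64 to hold the NaN).
partial = pd.Series(data, index=['c', 'a', 'x'])
print(partial)

# To relabel rather than select, build the Series first,
# then replace the whole index.
renamed = pd.Series(data)
renamed.index = ['A', 'B', 'C']
print(renamed)
```

Passing index together with a dict is therefore a way to select and reorder keys, while assigning s.index afterwards is the way to rename them.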

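(2) Vectorization and broadcasting with Series

① Vectorization: arithmetic between a Series and a one-dimensional object

The precondition is that the one-dimensional object (a list, a numpy.ndarray, ...) has the same length as the Series; if it does not, an error is raised. When the lengths match, the operation is applied element by element: the items at the same position are combined under the given operator, and the result is a Series with the same index and the same length as the original Series.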

import numpy as np

import pandas as pd

s = pd.Series([10,11,12],index=['a','b','c'])

arr = np.array([9,11,13])

print(s + arr)

print('===========')

print(s > arr)

执行结果：

a    19

b    22

c    25

dtype: int64

===========

a     True

b    False

c    False

dtype: bool
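The example above pairs elements by position because the right-hand operand is a plain ndarray. When both operands are Series, pandas pairs them by index label instead. A small sketch of the difference, using made-up data:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['c', 'b', 'a'])

# Series + Series: values are matched by label ('a' with 'a', ...).
aligned = s1 + s2
print(aligned)

# Series + ndarray: values are matched by position, labels of s2 are ignored.
positional = s1 + s2.to_numpy()
print(positional)
```

If position-wise arithmetic between two Series is what is wanted, converting one side with to_numpy() makes that intent explicit.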

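② Broadcasting: arithmetic between a Series and a number

For operations between a Series and a number, such as +, -, *, /, **, //, %, >, <, >=, <=, == and !=, every value in the Series is combined with that number, and the results form a new Series with the same structure as the original.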

import numpy as np

import pandas as pd

s = pd.Series([10,11,12],index=['a','b','c'])

print(s+2)

print('===========')

print(s>11)

执行结果：

a    12

b    13

c    14

dtype: int64

===========

a    False

b    False

c     True

dtype: bool

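③ Arithmetic between a Series and a multi-dimensional object (not supported)

A Series cannot be combined with a multi-dimensional object; in other words, a Series is not broadcast against a multi-dimensional numpy.ndarray.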

import numpy as np

import pandas as pd

arr = np.array([[9,11,13],[8,15,10],[7,6,16]])

s = pd.Series([10,11,12])

print(arr + s)

print(arr > s)

Output: an error is raised

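(3) Indexing and slicing a Series

s[0]: position-based indexing (on a labelled Series, recent pandas releases deprecate this positional fallback; s.iloc[0] is the explicit form)

s['a']: label-based indexing

s[1:3]: position-based slice; the start is included and the end is excluded

s['b':'d']: label-based slice; both ends are included

s[s>5]: broadcasting first produces a boolean Series, which is then used to keep only the items whose value is True (similar to boolean indexing)

s[[3,1,2]], s[['e','b','d']]: non-contiguous selection with a list (similar to fancy indexing)

s.loc[]: label-based selection, used like df.loc[]

s.iloc[]: position-based selection, used like df.iloc[]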

import numpy as np

import pandas as pd

s = pd.Series(range(10,15),index=['a','b','c','d','e'])

print(s); print('===========')

print(s[2]); print('===========')

print(s['c']); print('===========')

print(s[1:3]); print('===========')			# 顾前不顾后

print(s['b':'d']); print('===========')		# 前后都包含

print(s[s>12]); print('===========')

print(s[[3,1,2]]); print('===========')

print(s[['e','b','d']])

执行结果：

a    10

b    11

c    12

d    13

e    14

dtype: int64

===========

12

===========

12

===========

b    11

c    12

dtype: int64

===========

b    11

c    12

d    13

dtype: int64

===========

d    13

e    14

dtype: int64

===========

d    13

b    11

c    12

dtype: int64

===========

e    14

b    11

d    13

dtype: int64
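The bracket forms above decide between position and label based on the key type. A sketch of the same selections written with the explicit accessors, which behave the same way regardless of the index type (variable names are illustrative only):

```python
import pandas as pd

s = pd.Series(range(10, 15), index=['a', 'b', 'c', 'd', 'e'])

by_pos = s.iloc[2]            # third item by position
by_label = s.loc['c']         # item labelled 'c'
pos_slice = s.iloc[1:3]       # positions 1 and 2 (end excluded)
label_slice = s.loc['b':'d']  # labels 'b', 'c', 'd' (end included)
fancy_pos = s.iloc[[3, 1, 2]] # non-contiguous positions, in the given order

print(by_pos, by_label)
print(pos_slice)
print(label_slice)
print(fancy_pos)
```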

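(4) Common Series attributes

Note: Series has no columns attribute!

s.values: returns all values of the Series as a numpy.ndarray

s.index: returns the index of the Series; the type is pandas.core.indexes.base.Index for a custom label index and pandas.core.indexes.range.RangeIndex for the default index

s.name: the name of the Series (can be changed by assignment)

s.index.name: the name of the index (can be changed by assignment)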

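(5) Common Series methods

s.__len__() and len(s): return the length of s (an int)

s.apply(func) and s.map(func): pass each element of s to func, call func(), and collect the return values into a new Series with the same structure, which becomes the return value of s.apply() or s.map(). See Example 2 below. For a comparison of apply(), applymap() and map(), see "II. The Pandas module - 10. DataFrame methods and Pandas module functions - (5) Other important methods - ② df.applymap()" in this chapter

s1.corr(s2): computes the Pearson correlation coefficient of two Series and returns a float

s1.cov(s2): computes the covariance of two Series and returns a float

s.dropna(): returns s without its NaN entries

s.head(n): returns a Series made of at most the first n index/value pairs of s; n defaults to 5. It is meant for quick previews and does not modify s

s.idxmin() and s.idxmax(): return the index label of the minimum (maximum) value of s. Note: before pandas 1.0, s.argmin() and s.argmax() were deprecated aliases of these two methods; since pandas 1.0 they are no longer deprecated and return the integer position of the minimum (maximum) instead

s.isin(list): tests whether each element of s is in list and returns a boolean Series with the same structure as s

s.isna() and s.isnull(): analogous to df.isna() and df.isnull()

s.notna(): analogous to df.notna()

s.ptp(): computes the range of s (maximum minus minimum). This method was removed in pandas 1.0; on current versions use s.max() - s.min() or np.ptp(s.to_numpy()) (note: DataFrame has no such method!)

s.replace('old value','new value',inplace=False): replaces values in s. Several replacements can be made at once by pairing old and new values in a dict, that is, s.replace({'old1':'new1','old2':'new2'...},inplace=False)

s.sort_values(): sorts by value, like df.sort_values(); since a Series has only one column, no by= argument is needed

s.str.<string method>(): applies the given string method to every string in s and returns the results as a new Series. See Example 3 below

s.tail(n): returns a Series made of at most the last n index/value pairs of s; n defaults to 5. It is meant for quick previews and does not modify s

s.tolist(): converts s to a list (s itself is not modified, so assign the result to a variable) (note: DataFrame has no such method!)

s.unstack(): reshapes a hierarchically indexed Series (converting between row labels and column labels). See "13. Hierarchical indexing - (3) Reshaping a hierarchically indexed Series with unstack(), stack() and DataFrame" in this chapter

s.value_counts(): counts how often each value occurs in s and returns a Series (note: numpy.ndarray has no such method; DataFrame gained its own value_counts() in pandas 1.1)

s.var(): computes the variance of the Series (the sample variance, ddof=1 by default) and returns a float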

# 例1

import numpy as np

import pandas as pd

s = pd.Series([10,12,11,11,12],index=['a','b','c','d','e'],name='旧名字')

s.name='新名字'

s.index.name = '索引'

print(s); print('===========')

print(s.values,type(s.values)); print('===========')

print(s.index,type(s.index)); print('===========')

print(s.name); print('===========')

print(s.index.name); print('===========')

print(s.head(2)); print('===========')

print(s.tail(2)); print('===========')

print(s.__len__(),len(s),type(len(s))); print('===========')

print(s.tolist(),type(s.tolist())); print('===========')

print(s.value_counts(),type(s.value_counts())); print('===========')

print(s.isin([5,6,7,11,15,16,17]),type(s.isin([5,6,7,11,15,16,17]))); print('===========')

print(s.max() - s.min())      # range of s; s.ptp() was removed in pandas 1.0

执行结果：

索引

a    10

b    12

c    11

d    11

e    12

Name: 新名字, dtype: int64

===========

[10 12 11 11 12] <class 'numpy.ndarray'>

===========

Index(['a', 'b', 'c', 'd', 'e'], dtype='object', name='索引') <class 'pandas.core.indexes.base.Index'>

===========

新名字

===========

索引

===========

索引

a    10

b    12

Name: 新名字, dtype: int64

===========

索引

d    11

e    12

Name: 新名字, dtype: int64

===========

5 5 <class 'int'>

===========

[10, 12, 11, 11, 12] <class 'list'>

===========

12    2

11    2

10    1

Name: 新名字, dtype: int64 <class 'pandas.core.series.Series'>

===========

索引

a    False

b    False

c     True

d     True

e    False

Name: 新名字, dtype: bool <class 'pandas.core.series.Series'>

===========

2
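On pandas 1.0 and later two of the methods used in this area behave differently from older releases: Series.ptp() no longer exists, and argmax()/argmin() return positions instead of labels. A short sketch with the same data as Example 1 (variable names are illustrative, and pandas 1.0 or later is assumed):

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, 11, 12], index=['a', 'b', 'c', 'd', 'e'])

# Range (max minus min) without Series.ptp()
value_range = s.max() - s.min()
value_range_np = np.ptp(s.to_numpy())

# idxmax() returns the label of the first maximum,
# argmax() returns its integer position (pandas >= 1.0).
label_of_max = s.idxmax()
position_of_max = s.argmax()

print(value_range, value_range_np, label_of_max, position_of_max)
```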

# 例2：Series.apply(func)和Series.map(func)

import numpy as np

import pandas as pd

s1 = pd.Series([10,20,30], index=['t1','t2','t3'])

s2 = s1.apply(lambda x:x+1)

s3 = s1.map(lambda x:x+2)

print(s1); print('===========')

print(s2); print('===========')

print(s3)

执行结果：

t1    10

t2    20

t3    30

dtype: int64

===========

t1    11

t2    21

t3    31

dtype: int64

===========

t1    12

t2    22

t3    32

dtype: int64
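With a plain function, apply() and map() give the same result, as the example above shows. They differ in what else they accept. A minimal sketch of two differences I am confident about: map() also accepts a dict as a lookup table, and apply() can forward extra positional arguments through args. The names `lookup`, `labels` and `scaled` are made up for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['t1', 't2', 't3'])

# map() with a dict: each value is looked up as a key;
# values without a matching key become NaN.
lookup = {10: 'low', 20: 'mid'}
labels = s.map(lookup)
print(labels)

# apply() with extra arguments forwarded to the function.
scaled = s.apply(lambda x, factor: x * factor, args=(2,))
print(scaled)
```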

# 例3：s.str.字符串方法()

import numpy as np

import pandas as pd

s1 = pd.Series(['a_b','c_d'],index=['t1','t2'])

s2 = s1.str.replace('_','')

s3 = s1.str.startswith('a')

print(s1); print('===========')

print(s2); print('===========')

print(s3)

执行结果：

t1    a_b

t2    c_d

dtype: object

===========

t1    ab

t2    cd

dtype: object

===========

t1     True

t2    False

dtype: bool

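4. Creating a DataFrame

A DataFrame is a two-dimensional data structure with one index and several columns of values. Every DataFrame is an instance of pandas.core.frame.DataFrame.

This section only covers DataFrames with a single-level index. For DataFrames with a multi-level index, see "13. Hierarchical indexing" in this chapter.

Syntax: pd.DataFrame(data=None, index=None, columns=None, dtype=None)

Parameters:

data: the data. It can be a dict, a one- or two-dimensional list, or a one- or two-dimensional numpy.ndarray. When data is a one-dimensional list or ndarray of length n, pd.DataFrame() turns it into a column vector with n rows and 1 column (see the example under "5. DataFrame attributes - (6) df.shape"). The default is None, which creates an empty DataFrame.
index: the row index. The default is None (the default row index is 0, 1, 2, ...).

A custom row index can be given as a list. Notes:

① The length of index must equal the number of rows in data, otherwise an error is raised.

② When data is a dict, an index passed inside pd.DataFrame() is applied positionally to values that are lists or ndarrays, but values that are Series are aligned to it by label. A Series whose own index shares no label with the requested index therefore becomes a column of NaN. There are two ways to avoid this: assign df.index=[...] on a separate line after df has been created, or make every dict value a Series and give each of them the same index=[...].

③ The default index (0, 1, 2, ...) is called position here, and a custom index is called label. Without a custom index, rows are selected by position. With a custom index, rows can be selected by position (df.iloc[]) or by label (df.loc[]).
columns: the column index. The default is None (when data is not a dict, the default column index is 0, 1, 2, ...; when data is a dict, the default column index is the dict's keys).

A custom column index can be given as a list. Notes:

① When data is not a dict, the length of columns must equal the number of columns in data, otherwise an error is raised.

② When data is a dict, columns does not rename anything: it selects which dict keys to keep, and in which order. Names that match no key bring in no data, and in the incorrect demonstration below the result is an empty DataFrame. Because the dict keys already serve as the column labels, rename them afterwards with df.columns=[...] if different labels are needed.

③ The default columns (0, 1, 2, ...) are called position here, and custom columns are called label. Without custom columns, columns are selected by position. With custom columns, they can be selected by position (df.iloc[]) or by label (df.loc[] or df[]).
dtype: the data type. NumPy type codes are accepted: 'f' means float (float32) and 'i' means int (int32 on most platforms)

Note: unlike Series, DataFrame has no name parameter!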

# 通过list和numpy.ndarray创建DataFrame

import numpy as np

import pandas as pd

li = [[44, 55, 66],[77, 88, 99]]

df1 = pd.DataFrame(li,columns=['c1','c2','c3'],index=['t1','t2'])

arr = np.array([[44, 55, 66],[77, 88, 99]])

df2 = pd.DataFrame(arr,columns=['c1','c2','c3'],index = ['t1','t2'])

print(df1); print('===========')

print(df2); print('==========='); print(type(df2))

执行结果：

    c1  c2  c3

t1  44  55  66

t2  77  88  99

===========

    c1  c2  c3

t1  44  55  66

t2  77  88  99

===========

<class 'pandas.core.frame.DataFrame'>

# 通过dict创建DataFrame

import numpy as np

import pandas as pd

dic1 = {

    'A': [30,32],

    'B': np.array([42,38]),

    'C': pd.Series([55,56]),

}

dic2 = {

    'A': pd.Series([30,32], index=['t1','t2']),

    'B': pd.Series([42,38], index=['t1','t2']),

    'C': pd.Series([55,56], index=['t1','t2']),

}

# 正确的创建方式一

df1 = pd.DataFrame(dic1)

df1.index=['t1','t2']

# 正确的创建方式二

df2 = pd.DataFrame(dic2)

# 错误的创建方式三

df3 = pd.DataFrame(dic1,index=['t1','t2'])

# 错误的创建方式四

df4 = pd.DataFrame(dic1,columns=['c1','c2'])

print('正确的创建方式一\n',df1); print('===========')

print('正确的创建方式二\n',df2); print('===========')

print('错误的创建方式三\n',df3); print('===========')

print('错误的创建方式四\n',df4)

执行结果：

正确的创建方式一

      A   B   C

t1  30  42  55

t2  32  38  56

===========

正确的创建方式二

      A   B   C

t1  30  42  55

t2  32  38  56

===========

错误的创建方式三

      A   B   C

t1  30  42 NaN

t2  32  38 NaN

===========

错误的创建方式四

 Empty DataFrame

Columns: [c1, c2]

Index: []
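The two "incorrect" results above follow from how the constructor treats dict input: Series values are aligned by label, and columns selects keys. A minimal sketch that isolates both effects (the variable names are made up for illustration):

```python
import pandas as pd

data = {
    'A': [30, 32],              # list: assigned row by row
    'C': pd.Series([55, 56]),   # Series with its own index 0, 1
}

# The Series is aligned by label. Labels 't1' and 't2' do not
# exist in it, so column C is NaN while column A is filled.
misaligned = pd.DataFrame(data, index=['t1', 't2'])
print(misaligned)

# Giving the Series the same labels keeps its values.
aligned = pd.DataFrame(
    {'A': [30, 32], 'C': pd.Series([55, 56], index=['t1', 't2'])},
    index=['t1', 't2'],
)
print(aligned)

# columns= selects dict keys and their order; it does not rename them.
selected = pd.DataFrame({'A': [30, 32], 'B': [42, 38]}, columns=['B', 'A'])
print(selected)
```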

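5. DataFrame attributes

(1) df.columns

Returns the column index of df:

If df.columns has never been customized, the returned type is pandas.core.indexes.range.RangeIndex
If df.columns has been customized, the returned type is pandas.core.indexes.base.Index

df.columns supports indexing and slicing:

When a single element of df.columns is indexed (no colon inside the brackets of df.columns[]):
- if df.columns has not been customized, the position is returned as an int
- if df.columns has been customized, the label is returned as a str (or whatever type the label has)
When df.columns is sliced (a colon inside the brackets of df.columns[], whatever the length of the slice, even a single item):
- if df.columns has not been customized, positions are returned as a pandas.core.indexes.range.RangeIndex
- if df.columns has been customized, labels are returned as a pandas.core.indexes.base.Index

df.columns can be reassigned as a whole, but its individual elements cannot be modified (doing so raises an error)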

import numpy as np

import pandas as pd

arr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])

df = pd.DataFrame(arr,index=['t1','t2','t3','t4'])		# 定义df时未定义columns

print('df\n',df); print('===========')

print('df.columns\n',df.columns,'\n',type(df.columns)); print('===========')

print('df.columns[2]\n',df.columns[2],'\n',type(df.columns[2])); print('===========')

print('df.columns[2:3]\n',df.columns[2:3],'\n',type(df.columns[2:3])); print('===========')

# 可以对df.columns整体进行重新赋值

df.columns=['c1','c2','c3','c4']

print('df\n',df,'\n',type(df)); print('===========')

print('df.columns\n',df.columns,'\n',type(df.columns)); print('===========')

print('df.columns[2]\n',df.columns[2],'\n',type(df.columns[2])); print('===========')

print('df.columns[2:3]\n',df.columns[2:3],'\n',type(df.columns[2:3])); print('===========')

# 不可以对df.columns中的元素进行修改（会报错）

df.columns[1]='New'

print(df)

执行结果：

df

      0   1   2   3

t1   1   2   3   4

t2   5   6   7   8

t3   9  10  11  12

t4  13  14  15  16

===========

df.columns

 RangeIndex(start=0, stop=4, step=1)

 <class 'pandas.core.indexes.range.RangeIndex'>

===========

df.columns[2]

 2

 <class 'int'>

===========

df.columns[2:3]

 RangeIndex(start=2, stop=3, step=1)

 <class 'pandas.core.indexes.range.RangeIndex'>

===========

df

     c1  c2  c3  c4

t1   1   2   3   4

t2   5   6   7   8

t3   9  10  11  12

t4  13  14  15  16

 <class 'pandas.core.frame.DataFrame'>

===========

df.columns

 Index(['c1', 'c2', 'c3', 'c4'], dtype='object')

 <class 'pandas.core.indexes.base.Index'>

===========

df.columns[2]

 c3

 <class 'str'>

===========

df.columns[2:3]

 Index(['c3'], dtype='object')

 <class 'pandas.core.indexes.base.Index'>

===========

报错（TypeError: Index does not support mutable operations）
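Since an Index is immutable, changing a single column label has to go through a method that builds a new index. A sketch of two common ways, assuming the same 4x4 data as above (the names `renamed` and `relabelled` are illustrative):

```python
import numpy as np
import pandas as pd

arr = np.arange(1, 17).reshape(4, 4)
df = pd.DataFrame(arr, columns=['c1', 'c2', 'c3', 'c4'],
                  index=['t1', 't2', 't3', 't4'])

# Option 1: rename() maps old labels to new ones and returns a new DataFrame.
renamed = df.rename(columns={'c2': 'New'})
print(renamed.columns)

# Option 2: copy the labels into a list, edit the list,
# then reassign df.columns as a whole.
labels = df.columns.tolist()
labels[1] = 'New'
relabelled = df.copy()
relabelled.columns = labels
print(relabelled.columns)
```

The same pattern works for row labels with df.rename(index={...}) or by reassigning df.index.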

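(2) df.index

Returns the row index of df:

If df.index has never been customized, the returned type is pandas.core.indexes.range.RangeIndex
If df.index has been customized, the returned type is pandas.core.indexes.base.Index

df.index supports indexing and slicing:

When a single element of df.index is indexed (no colon inside the brackets of df.index[]):
- if df.index has not been customized, the position is returned as an int
- if df.index has been customized, the label is returned as a str (or whatever type the label has)
When df.index is sliced (a colon inside the brackets of df.index[], whatever the length of the slice, even a single item):
- if df.index has not been customized, positions are returned as a pandas.core.indexes.range.RangeIndex
- if df.index has been customized, labels are returned as a pandas.core.indexes.base.Index

df.index can be reassigned as a whole, but its individual elements cannot be modified (doing so raises an error)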

import numpy as np

import pandas as pd

arr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])

df = pd.DataFrame(arr,columns=['c1','c2','c3','c4'])		# 定义df时未定义index

print('df\n',df); print('===========')

print('df.index\n',df.index,'\n',type(df.index)); print('===========')

print('df.index[2]\n',df.index[2],'\n',type(df.index[2])); print('===========')

print('df.index[2:3]\n',df.index[2:3],'\n',type(df.index[2:3])); print('===========')

# 可以对df.index整体进行重新赋值

df.index=['t1','t2','t3','t4']

print('df\n',df,'\n',type(df)); print('===========')

print('df.index\n',df.index,'\n',type(df.index)); print('===========')

print('df.index[2]\n',df.index[2],'\n',type(df.index[2])); print('===========')

print('df.index[2:3]\n',df.index[2:3],'\n',type(df.index[2:3])); print('===========')

# 不可以对df.index中的元素进行修改（会报错）

df.index[2]='New'

print(df)

执行结果：

df

    c1  c2  c3  c4

0   1   2   3   4

1   5   6   7   8

2   9  10  11  12

3  13  14  15  16

===========

df.index

 RangeIndex(start=0, stop=4, step=1)

 <class 'pandas.core.indexes.range.RangeIndex'>

===========

df.index[2]

 2

 <class 'int'>

===========

df.index[2:3]

 RangeIndex(start=2, stop=3, step=1)

 <class 'pandas.core.indexes.range.RangeIndex'>

===========

df

     c1  c2  c3  c4

t1   1   2   3   4

t2   5   6   7   8

t3   9  10  11  12

t4  13  14  15  16

 <class 'pandas.core.frame.DataFrame'>

===========

df.index

 Index(['t1', 't2', 't3', 't4'], dtype='object')

 <class 'pandas.core.indexes.base.Index'>

===========

df.index[2]

 t3

 <class 'str'>

===========

df.index[2:3]

 Index(['t3'], dtype='object')

 <class 'pandas.core.indexes.base.Index'>

===========

报错（TypeError: Index does not support mutable operations）

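(3) df.index.name and df.index.names

For a df with a single-level index, df.index.name and df.index.names refer to the same thing, and assigning to one overwrites the other. See Examples 1 and 2.

For a df with a hierarchical index, df.index.name is a scalar (the name of the hierarchical index as a whole), while df.index.names is a list with one name per index level. The names in df.index.names are the ones printed above the index levels. See Example 3.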

# 例1：单一索引的df.index.names覆盖df.index.name

import numpy as np

import pandas as pd

df = pd.DataFrame([1,2,3])

df.columns = ['c1']

df.index = ['t1','t2','t3']

df.index.name = 'my_index'

df.index.names = ['t']      # 将前面的df.index.name覆盖掉了

print(df); print('===========')

print(df.index); print('===========')

print(df.index.name); print('===========')

print(df.index.names)

执行结果：

    c1

t

t1   1

t2   2

t3   3

===========

Index(['t1', 't2', 't3'], dtype='object', name='t')

===========

t

===========

['t']

# 例2：单一索引的df.index.name覆盖df.index.names

import numpy as np

import pandas as pd

df = pd.DataFrame([1,2,3])

df.columns = ['c1']

df.index = ['t1','t2','t3']

df.index.names = ['t']

df.index.name = 'my_index'      # 将前面的df.index.names覆盖掉了

print(df); print('===========')

print(df.index); print('===========')

print(df.index.name); print('===========')

print(df.index.names)

执行结果：

          c1

my_index

t1         1

t2         2

t3         3

===========

Index(['t1', 't2', 't3'], dtype='object', name='my_index')

===========

my_index

===========

['my_index']

# Example 3: with a hierarchical index, df.index.name and df.index.names are independent of each other

import numpy as np

import pandas as pd

df = pd.DataFrame([1,2,3,4,5,6,7,8])

df.columns = ['c1']

df.index = [['A','A','B','B','C','C','D','D'],

            ['e','f','e','g','f','h','g','h']]

df.index.name = 'my_multi_index'

df.index.names = ['i1','i2']

print(df); print('===========')

print(df.index); print('===========')

print(df.index.name); print('===========')

print(df.index.names)

执行结果：

       c1

i1 i2

A  e    1

   f    2

B  e    3

   g    4

C  f    5

   h    6

D  g    7

   h    8

===========

MultiIndex([('A', 'e'),

            ('A', 'f'),

            ('B', 'e'),

            ('B', 'g'),

            ('C', 'f'),

            ('C', 'h'),

            ('D', 'g'),

            ('D', 'h')],

           name='my_multi_index')

===========

my_multi_index

===========

['i1', 'i2']
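For a hierarchical index, the level names can also be given when the index is built, or changed later without touching the data. A minimal sketch using pd.MultiIndex.from_arrays() and DataFrame.rename_axis(); the level names 'outer' and 'inner' are made up for illustration.

```python
import pandas as pd

# Name the levels while building the index.
index = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], ['e', 'f', 'e', 'g']],
    names=['i1', 'i2'],
)
df = pd.DataFrame({'c1': [1, 2, 3, 4]}, index=index)
print(df.index.names)

# rename_axis() returns a new DataFrame with different level names.
df2 = df.rename_axis(index=['outer', 'inner'])
print(df2.index.names)
```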

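(4) df.index.levels

Returns the levels of df's hierarchical index as a list-like object with one entry per level; each entry holds the unique labels of that level

Note: when no row index has been set, df.index is an instance of pandas.core.indexes.range.RangeIndex; with a single-level index, df.index is an instance of pandas.core.indexes.base.Index. In both cases df.index.levels raises an AttributeError, because these types have no such attribute. Only an index with two or more levels (an instance of pandas.core.indexes.multi.MultiIndex) has the levels attribute.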

# 一层index

import numpy as np

import pandas as pd

df = pd.DataFrame([1,2,3])

df.columns = ['c1']

df.index = ['t1','t2','t3']

print(type(df.index))

print(df.index.levels)

执行结果：

<class 'pandas.core.indexes.base.Index'>

报错

# 两层index

import numpy as np

import pandas as pd

df = pd.DataFrame([1,2,3])

df.columns = ['c1']

df.index = [['A','B','C'],['t1','t2','t3']]

print(type(df.index))

print(df.index.levels)

执行结果：

<class 'pandas.core.indexes.multi.MultiIndex'>

[['A', 'B', 'C'], ['t1', 't2', 't3']]

# 三层index

import numpy as np

import pandas as pd

df = pd.DataFrame([1,2,3])

df.columns = ['c1']

df.index = [[10,20,30],['A','B','C'],['t1','t2','t3']]

print(type(df.index))

print(df.index.levels)

执行结果：

<class 'pandas.core.indexes.multi.MultiIndex'>

[[10, 20, 30], ['A', 'B', 'C'], ['t1', 't2', 't3']]
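In the examples above every label occurs once, so levels looks identical to the index itself. With repeated labels the difference becomes visible: levels holds the unique labels of each level, while get_level_values() returns one label per row. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {'c1': range(8)},
    index=[['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
           ['e', 'f', 'e', 'g', 'f', 'h', 'g', 'h']],
)

# Unique labels per level
level0 = list(df.index.levels[0])
level1 = list(df.index.levels[1])
print(level0, level1)

# One label per row for the first level
per_row0 = list(df.index.get_level_values(0))
print(per_row0)

# Number of levels
print(df.index.nlevels)
```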

(5) df.dtypes

Returns the data type of each column, that is, the dtype of each column Series in df

The return value as a whole is a Series

import numpy as np

import pandas as pd

df = pd.DataFrame([[1,'a',3+4j],[2,'b',5+6j]],columns=['c1','c2','c3'],index=['t1','t2'])

print(df.dtypes)

print('==============')

print(type(df.dtypes))

执行结果：

c1         int64

c2        object

c3    complex128

dtype: object

==============

<class 'pandas.core.series.Series'>

(6) df.shape

Returns a tuple whose two items are the number of rows and the number of columns of df

import numpy as np

import pandas as pd

df1 = pd.DataFrame()

df2 = pd.DataFrame([10,11])				# pd.DataFrame()将一维列表变为了列向量

df3 = pd.DataFrame(np.array([10,11]))	# pd.DataFrame()将一维numpy.ndarray变为了列向量

df4 = pd.DataFrame([[10,11]])

df5 = pd.DataFrame([[10],[11]])

df6 = pd.DataFrame([[10,11],[12,13]])

print(df1.shape,type(df1.shape))

print(df2.shape,type(df2.shape))

print(df3.shape,type(df3.shape))

print(df4.shape,type(df4.shape))

print(df5.shape,type(df5.shape))

print(df6.shape,type(df6.shape))

执行结果：

(0, 0) <class 'tuple'>

(2, 1) <class 'tuple'>

(2, 1) <class 'tuple'>

(1, 2) <class 'tuple'>

(2, 1) <class 'tuple'>

(2, 2) <class 'tuple'>

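(7) df.values

Returns all values of df, without the row index and column index, as a numpy.ndarray. For example code, see "6. Selecting data from a DataFrame - (5) Selecting with df attributes and methods"

(8) df.<column label>

Returns one column of df as a Series. Note that the column label is written without quotes. This form only works when the label is a valid Python identifier that does not clash with an existing DataFrame attribute or method; otherwise use df['column label']. For example code, see "6. Selecting data from a DataFrame - (5) Selecting with df attributes and methods"

(9) df.empty

Tests whether df is empty and returns a bool. When df=pd.DataFrame(), df is empty and True is returned. As long as df contains entries, even if all of them are NaN, df is not empty (False is returned).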
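The distinction between "no rows" and "rows that hold only NaN" can be checked directly. A minimal sketch; the three DataFrames are made up for illustration:

```python
import numpy as np
import pandas as pd

no_data = pd.DataFrame()                           # no rows, no columns
only_columns = pd.DataFrame(columns=['c1', 'c2'])  # columns but no rows
all_nan = pd.DataFrame({'c1': [np.nan, np.nan]})   # two rows, all NaN

print(no_data.empty)       # empty
print(only_columns.empty)  # empty: it has no rows
print(all_nan.empty)       # not empty: the rows exist

# To test for "contains nothing but NaN", combine isna() with all().
only_missing = bool(all_nan.isna().all().all())
print(only_missing)
```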

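6. Selecting data from a DataFrame

(1) Selecting with df[]

df[] supports the following operations:

Use df['column label'] to get one column (Series)
Use df[['column label','column label']] to get one or more non-contiguous columns (fancy indexing) (DataFrame)
Use df['row label':'row label'], df[:'row label'] or df['row label':] to get one or more contiguous rows (DataFrame)
Use df[row position:row position], df[:row position] or df[row position:] to get one or more contiguous rows (DataFrame)
When the row index of df is a pandas.core.indexes.datetimes.DatetimeIndex or pandas.core.indexes.period.PeriodIndex, partial-string forms such as df['year-month'], df['year.month'] or df['year'] can also be used to get the matching rows (DataFrame); recent pandas versions require df.loc['year-month'] for this, because the bare df[...] form with a single partial string has been removed. See "14. Time-related formats and methods in Pandas - (1) Time formats and special indexing and slicing methods in Pandas" in this chapter

Operations that df[] does not support include, among others:

Slicing contiguous columns
Selecting non-contiguous rows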

# 例1：前四种索引方式

import numpy as np

import pandas as pd

arr=np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])

df = pd.DataFrame(arr,columns=['c1','c2','c3','c4'],index=['t1','t2','t3','t4'])

print("查看df\n",df); print('===========')

# 使用 df['列标签'] 获取某一列（Series类型）

print("获取'c3'列\n",df['c3'],'\n',type(df['c3'])); print('===========')

# 使用 df[['列标签','列标签']] 获取不连续的若干列（花式索引）（DataFrame类型）

print("获取'c3','c1'列\n",df[['c3','c1']],'\n',type(df[['c3','c1']])); print('===========')

# 使用 df['行标签':'行标签']、df[:'行标签']、df['行标签':] 获取连续的一行或多行（DataFrame类型）

print("获取't3'行\n",df['t3':'t3'],'\n',type(df['t3':'t3'])); print('===========')

print("获取第一行到't3'行\n",df[:'t3'],'\n',type(df[:'t3'])); print('===========')

# 使用 df[行索引号:行索引号]、df[:行索引号]、df[行索引号:] 获取连续的一行或多行（DataFrame类型）

print("获取第三行\n",df[2:3],'\n',type(df[2:3])); print('===========')

print("获取第一行到第三行\n",df[:3],'\n',type(df[:3]))

执行结果：

查看df

     c1  c2  c3  c4

t1   1   2   3   4

t2   5   6   7   8

t3   9  10  11  12

t4  13  14  15  16

===========

获取'c3'列

 t1     3

t2     7

t3    11

t4    15

Name: c3, dtype: int32

 <class 'pandas.core.series.Series'>

===========

获取'c3','c1'列

     c3  c1

t1   3   1

t2   7   5

t3  11   9

t4  15  13

 <class 'pandas.core.frame.DataFrame'>

===========

获取't3'行

     c1  c2  c3  c4

t3   9  10  11  12

 <class 'pandas.core.frame.DataFrame'>

===========

获取第一行到't3'行

     c1  c2  c3  c4

t1   1   2   3   4

t2   5   6   7   8

t3   9  10  11  12

 <class 'pandas.core.frame.DataFrame'>

===========

获取第三行

     c1  c2  c3  c4

t3   9  10  11  12

 <class 'pandas.core.frame.DataFrame'>

===========

获取第一行到第三行

     c1  c2  c3  c4

t1   1   2   3   4

t2   5   6   7   8

t3   9  10  11  12

 <class 'pandas.core.frame.DataFrame'>

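(2) Label-based selection with df.loc[]

loc is short for location

df.loc[] always selects by label. When df has no custom row (or column) labels, the default labels are the integers 0, 1, 2, ..., which coincide with the positions, so integers can be used in that case. The combinations are:

	df has column labels	df has no column labels
df has row labels	row labels and column labels work; row positions and column positions do not	row labels and column integers work; row positions and column labels do not
df has no row labels	row integers and column labels work; row labels and column positions do not	row integers and column integers work; row labels and column labels do not

In one sentence: where custom labels exist, only those labels can be used; integers work only where the labels are the default 0, 1, 2, ...

Note that a slice with a colon inside df.loc[] includes both endpoints, and this also holds for the default integer labels: df.loc[0:2] returns the rows labelled 0, 1 and 2. Only position-based slices (df[0:2], df.iloc[0:2]) exclude the end. Also remember that the default integer labels start at 0

Any rows and any columns can be selected with df.loc[]. If both the row and the column are given as single labels, the element itself is returned in its own data type; if only one of them is given as a single label, a Series is returned; if both are given as slices or lists, a DataFrame is returned

For label-based indexing with pd.date_range(), see "10. DataFrame methods and Pandas module functions - (4) Time-related methods - ① pd.date_range()" in this chapter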

import numpy as np

import pandas as pd

arr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])

df = pd.DataFrame(arr,columns=['c1','c2','c3','c4'],index=['t1','t2','t3','t4'])

print("查看df\n",df); print('===========')

# 使用 df.loc['行标签'] 获取某一行（Series类型）

print("获取't2'行\n",df.loc['t2'],'\n',type(df.loc['t2'])); print('===========')

# Use df.loc['row label':'row label'] to get several contiguous rows (both ends included) (DataFrame)

print("获取't2'至't3'行（前后都包含）\n",df.loc['t2':'t3'],'\n',type(df.loc['t2':'t3'])); print('===========')

# Use df.loc[['row label','row label']] to get several non-contiguous rows (fancy indexing) (DataFrame)

print("获取't3'和't1'行\n",df.loc[['t3','t1']],'\n',type(df.loc[['t3','t1']])); print('===========')

# Use df.loc[:,'column label'] to get one column (Series)

print("获取'c2'列\n",df.loc[:,'c2'],'\n',type(df.loc[:,'c2'])); print('===========')

# Use df.loc[:,'column label':'column label'] to get several contiguous columns (both ends included) (DataFrame)

print("获取'c2'至'c3'列（前后都包含）\n",df.loc[:,'c2':'c3'],'\n',type(df.loc[:,'c2':'c3'])); print('===========')

# Use df.loc[:,['column label','column label']] to get several non-contiguous columns (fancy indexing) (DataFrame)

print("获取'c3'和'c1'列\n",df.loc[:,['c3','c1']],'\n',type(df.loc[:,['c3','c1']])); print('===========')

# Use df.loc['row label','column label'] to get a single element (in the element's own data type)

print("获取't4'行,'c4'列位置的元素\n",df.loc['t4','c4'],'\n',type(df.loc['t4','c4'])); print('===========')

# Combine the forms above to get some rows and some columns (single row label and single column label: the element's own type; only one of them a single label: Series; slices or lists for both: DataFrame)

print("获取't2'至't3'行,'c2'至'c3'列\n",df.loc['t2':'t3','c2':'c3'],'\n',type(df.loc['t2':'t3','c2':'c3'])); print('===========')

print("获取't4'行,'c4'和'c1'列\n",df.loc['t4',['c4','c1']],'\n',type(df.loc['t4',['c4','c1']]))

Output:

查看df

     c1  c2  c3  c4

t1   1   2   3   4

t2   5   6   7   8

t3   9  10  11  12

t4  13  14  15  16

===========

获取't2'行

 c1    5

c2    6

c3    7

c4    8

Name: t2, dtype: int32

 <class 'pandas.core.series.Series'>

===========

获取't2'至't3'行（前后都包含）

     c1  c2  c3  c4

t2   5   6   7   8

t3   9  10  11  12

 <class 'pandas.core.frame.DataFrame'>

===========

获取't3'和't1'行

     c1  c2  c3  c4

t3   9  10  11  12

t1   1   2   3   4

 <class 'pandas.core.frame.DataFrame'>

===========

获取'c2'列

 t1     2

t2     6

t3    10

t4    14

Name: c2, dtype: int32

 <class 'pandas.core.series.Series'>

===========

获取'c2'至'c3'列（前后都包含）

     c2  c3

t1   2   3

t2   6   7

t3  10  11

t4  14  15

 <class 'pandas.core.frame.DataFrame'>

===========

获取'c3'和'c1'列

     c3  c1

t1   3   1

t2   7   5

t3  11   9

t4  15  13

 <class 'pandas.core.frame.DataFrame'>

===========

获取't4'行,'c4'列位置的元素

 16

 <class 'numpy.int32'>

===========

获取't2'至't3'行,'c2'至'c3'列

     c2  c3

t2   6   7

t3  10  11

 <class 'pandas.core.frame.DataFrame'>

===========

获取't4'行,'c4'和'c1'列

 c4    16

c1    13

Name: t4, dtype: int32

 <class 'pandas.core.series.Series'>
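One detail that is easy to miss with df.loc[]: integers passed to it are labels, not positions, even when the index is the default 0, 1, 2, .... The following sketch compares df.loc[] and df.iloc[] on integer-labelled data; the DataFrame `shifted` with labels 10, 20, 30, 40 is made up to make the difference obvious.

```python
import numpy as np
import pandas as pd

arr = np.arange(1, 17).reshape(4, 4)

# Default index: labels 0..3 coincide with positions 0..3.
df = pd.DataFrame(arr)
by_label = df.loc[0:2]    # labels 0, 1, 2 (end included)
by_pos = df.iloc[0:2]     # positions 0, 1 (end excluded)
print(len(by_label), len(by_pos))

# Integer labels that differ from the positions.
shifted = pd.DataFrame(arr, index=[10, 20, 30, 40])
label_slice = shifted.loc[10:20]   # labels 10 and 20
pos_slice = shifted.iloc[0:2]      # first two rows
print(label_slice.index.tolist(), pos_slice.index.tolist())
```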

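(3) Position-based selection with df.iloc[]

iloc is short for integer location

Even when a DataFrame has custom columns and index, df.iloc[] can still be used to select by position

Unlike df.loc[], df.iloc[] only supports position indexes, whether or not df has row labels or column labels. Position slices with a colon include the start and exclude the end, and positions start at 0

Any rows and any columns can be selected with df.iloc[]. If both the row and the column are given as single positions, the element itself is returned in its own data type; if only one of them is given as a single position, a Series is returned; if both are given as slices or lists, a DataFrame is returned

In particular, df.iloc[] can be used to reverse the order:

df.iloc[::-1,:]: all rows in reverse order
df.iloc[:,::-1]: all columns in reverse order
df.iloc[::-1,::-1]: all rows and all columns in reverse order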

import numpy as np

import pandas as pd

arr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])

df = pd.DataFrame(arr,columns=['c1','c2','c3','c4'],index=['t1','t2','t3','t4'])

print("查看df\n",df); print('===========')

# 使用 df.iloc[行索引] 获取某一行（Series类型）

print("获取最后一行\n",df.iloc[-1],'\n',type(df.iloc[-1])); print('===========')

# 使用 df.iloc[行索引:行索引] 获取连续的若干行（顾前不顾后）（DataFrame类型）

print("获取第二行至倒数第二行\n",df.iloc[1:-1],'\n',type(df.iloc[1:-1])); print('===========')

# 使用 df.iloc[[行索引,行索引]] 获取不连续的若干行（花式索引）（DataFrame类型）

print("获取倒数第二行和第一行\n",df.iloc[[-2,0]],'\n',type(df.iloc[[-2,0]])); print('===========')

# 使用 df.iloc[:,列索引] 获取某一列（Series类型）

print("获取最后一列\n",df.iloc[:,-1],'\n',type(df.iloc[:,-1])); print('===========')

# 使用 df.iloc[:,列索引:列索引] 获取连续的若干列（顾前不顾后）（DataFrame类型）

print("获取第二列至倒数第二列\n",df.iloc[:,1:-1],'\n',type(df.iloc[:,1:-1])); print('===========')

# 使用 df.iloc[:,[列索引,列索引]] 获取不连续的若干列（花式索引）（DataFrame类型）

print("获取倒数第二列和第一列\n",df.iloc[:,[-2,0]],'\n',type(df.iloc[:,[-2,0]])); print('===========')

# 使用 df.iloc[行索引,列索引] 获取某一个元素（该位置元素本身的数据类型）

print("获取最后一行,最后一列的元素\n",df.iloc[-1,-1],'\n',type(df.iloc[-1,-1])); print('===========')

# 使用上述方法的各种组合获取某几行、某几列（若选取的行数、列数都为1，则返回该位置元素本身的数据类型；若行数、列数只有一个为1，则返回Series类型；若行数、列数都不为1，则返回DataFrame类型）

print("获取第一行至倒数第二行,第三列至最后一列\n",df.iloc[:-1,2:],'\n',type(df.iloc[:-1,2:])); print('===========')

print("获取第四行,最后一列和倒数第三列\n",df.iloc[3,[-1,-3]],'\n',type(df.iloc[3,[-1,-3]])); print('===========')

# 使用df.iloc[]实现倒序排

print("倒序排所有行\n",df.iloc[::-1,:]); print('===========')

print("倒序排所有列\n",df.iloc[:,::-1]); print('===========')

print("倒序排所有行和所有列\n",df.iloc[::-1,::-1])

执行结果：

查看df

     c1  c2  c3  c4

t1   1   2   3   4

t2   5   6   7   8

t3   9  10  11  12

t4  13  14  15  16

===========

获取最后一行

 c1    13

c2    14

c3    15

c4    16

Name: t4, dtype: int32

 <class 'pandas.core.series.Series'>

===========

获取第二行至倒数第二行

     c1  c2  c3  c4

t2   5   6   7   8

t3   9  10  11  12

 <class 'pandas.core.frame.DataFrame'>

===========

获取倒数第二行和第一行

     c1  c2  c3  c4

t3   9  10  11  12

t1   1   2   3   4

 <class 'pandas.core.frame.DataFrame'>

===========

获取最后一列

 t1     4

t2     8

t3    12

t4    16

Name: c4, dtype: int32

 <class 'pandas.core.series.Series'>

===========

获取第二列至倒数第二列

     c2  c3

t1   2   3

t2   6   7

t3  10  11

t4  14  15

 <class 'pandas.core.frame.DataFrame'>

===========

获取倒数第二列和第一列

     c3  c1

t1   3   1

t2   7   5

t3  11   9

t4  15  13

 <class 'pandas.core.frame.DataFrame'>

===========

获取最后一行,最后一列的元素

 16

 <class 'numpy.int32'>

===========

获取第一行至倒数第二行,第三列至最后一列

     c3  c4

t1   3   4

t2   7   8

t3  11  12

 <class 'pandas.core.frame.DataFrame'>

===========

获取第四行,最后一列和倒数第三列

 c4    16

c2    14

Name: t4, dtype: int32

 <class 'pandas.core.series.Series'>

===========

倒序排所有行

     c1  c2  c3  c4

t4  13  14  15  16

t3   9  10  11  12

t2   5   6   7   8

t1   1   2   3   4

===========

倒序排所有列

     c4  c3  c2  c1

t1   4   3   2   1

t2   8   7   6   5

t3  12  11  10   9

t4  16  15  14  13

===========

倒序排所有行和所有列

     c4  c3  c2  c1

t4  16  15  14  13

t3  12  11  10   9

t2   8   7   6   5

t1   4   3   2   1

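(4) Mixed indexing with df.ix[] (deprecated in pandas 0.20 and removed in pandas 1.0)

df.ix[] accepted both position indexes and label indexes. Position slices with a colon included the start and excluded the end, while label slices with a colon included both ends

Any rows and any columns could be selected with df.ix[]. If both the row and the column were given as single keys, the element itself was returned in its own data type; if only one of them was a single key, a Series was returned; otherwise a DataFrame was returned

For label-based indexing with pd.date_range(), see "10. DataFrame methods and Pandas module functions - (4) Time-related methods - ① pd.date_range()" in this chapter

Notes:

When there is no comma inside the brackets of df.ix[], the key is treated as a row position or a row label
In df.ix[,], one indexing style can be used for the rows and another for the columns
A position and a label cannot be used on the two sides of the same colon
Positions and labels cannot be mixed inside one fancy-indexing list []
df.ix[] emitted a deprecation warning on pandas 0.20 to 0.25 and no longer exists from pandas 1.0 on, where the code below raises AttributeError; use df.loc[] and df.iloc[] instead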

import numpy as np

import pandas as pd

import warnings; warnings.simplefilter('ignore') # 忽略可能会出现的警告信息；警告并不是错误，可以忽略；可能出现警告的场景包括：df.ix[]、pd.concat()

arr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])

df = pd.DataFrame(arr,columns=['c1','c2','c3','c4'],index=['t1','t2','t3','t4'])

print("查看df\n",df); print('===========')

# 正确的示例

print("获取最后一行\n",df.ix[-1]); print('===========')

print("获取第一行至't3'行\n",df.ix[:'t3']); print('===========')

print("获取第一行至倒数第三行,'c4'列和'c3'列\n",df.ix[:-2, ['c4','c3']])

# 错误的示例

# print("不可以在冒号:两边分别使用位置索引和标签索引，会报错\n",df.ix[1:'t1', 2])

# print("不可以在花式索引列表[]中同时出现位置索引和标签索引，会报错\n",df.ix[1:, [1,'c3']])

执行结果：

查看df

     c1  c2  c3  c4

t1   1   2   3   4

t2   5   6   7   8

t3   9  10  11  12

t4  13  14  15  16

===========

获取最后一行

 c1    13

c2    14

c3    15

c4    16

Name: t4, dtype: int32

===========

获取第一行至't3'行

     c1  c2  c3  c4

t1   1   2   3   4

t2   5   6   7   8

t3   9  10  11  12

===========

获取第一行至倒数第三行,'c4'列和'c3'列

     c4  c3

t1   4   3

t2   8   7
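On pandas 1.0 and later, the mixed selection shown above (rows by position, columns by label) has to be written with df.loc[] or df.iloc[]. A sketch of two equivalent spellings, assuming the same df as in the example; `via_loc` and `via_iloc` are illustrative names.

```python
import numpy as np
import pandas as pd

arr = np.arange(1, 17).reshape(4, 4)
df = pd.DataFrame(arr, columns=['c1', 'c2', 'c3', 'c4'],
                  index=['t1', 't2', 't3', 't4'])

# Option 1: turn the row positions into labels, then use loc.
via_loc = df.loc[df.index[:-2], ['c4', 'c3']]

# Option 2: turn the column labels into positions, then use iloc.
col_positions = df.columns.get_indexer(['c4', 'c3'])
via_iloc = df.iloc[:-2, col_positions]

print(via_loc)
print(via_iloc)
```

Both spellings return rows t1 and t2 with columns c4 and c3, which matches the last df.ix[] result above.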

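(5) Selecting with df attributes and methods

This covers the following attributes and methods:

df.<column label>: gets one column as a Series; note that the column label is written without quotes
df.values: gets all values, without the row index and column index, as a numpy.ndarray
df.head(n=5): gets the first n rows as a DataFrame; n defaults to 5
df.tail(n=5): gets the last n rows as a DataFrame; n defaults to 5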

import numpy as np

import pandas as pd

arr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])

df = pd.DataFrame(arr,columns=['c1','c2','c3','c4'],index=['t1','t2','t3','t4'])

print("查看df\n",df); print('===========')

# 使用 df.列标签 获取某一列（Series类型）

print("获取'c3'列\n",df.c3,'\n',type(df.c3)); print('===========')

# 无法使用 df.列索引号 获取某一列

# 无法使用 df.行标签 获取某一行

# 无法使用 df.行索引号 获取某一行

# 使用 df.values 获取所有值，不含行索引、列索引（numpy.ndarray类型）

print("获取df所有值\n",df.values,'\n',type(df.values))

# 使用 df.head(n) 获取前n行（DataFrame类型）

print("获取前2行\n",df.head(2),'\n',type(df.head(2))); print('===========')

# 使用 df.tail(n) 获取后n行（DataFrame类型）

print("获取后2行\n",df.tail(2),'\n',type(df.tail(2)))

执行结果：

查看df

     c1  c2  c3  c4

t1   1   2   3   4

t2   5   6   7   8

t3   9  10  11  12

t4  13  14  15  16

===========

获取'c3'列

 t1     3

t2     7

t3    11

t4    15

Name: c3, dtype: int32

 <class 'pandas.core.series.Series'>

===========

获取df所有值

 [[ 1  2  3  4]

 [ 5  6  7  8]

 [ 9 10 11 12]

 [13 14 15 16]]

 <class 'numpy.ndarray'>

获取前2行

     c1  c2  c3  c4

t1   1   2   3   4

t2   5   6   7   8

 <class 'pandas.core.frame.DataFrame'>

===========

获取后2行

     c1  c2  c3  c4

t3   9  10  11  12

t4  13  14  15  16

 <class 'pandas.core.frame.DataFrame'>

（6）使用布尔值索引筛选满足条件的行

如果需要根据某列的值是否满足给定的条件，筛选出满足条件的整行数据（或这些行指定字段的数据），可以使用下面的方法：

# 用于筛选满足条件的行

df[由布尔值组成的list]

df[由“df.列标签”组成的条件]

df[由“df['列标签']”组成的条件]

df.loc[由“df.列标签”组成的条件, :]

df.loc[由“df['列标签']”组成的条件, :]

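# select the rows that satisfy a condition, keeping only the listed columns

df.loc[condition built from df.column_label, ['column_label', 'column_label', ...]]

df.loc[condition built from df['column_label'], ['column_label', 'column_label', ...]]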

条件之间的逻辑运算符有|、&、~、np.logical_or()、np.logical_and()、np.logical_not()，每个运算符的详细介绍见“第五章 Python编程进阶 - 一、NumPy模块 - 8. ndarray对象的方法和NumPy模块的方法 - （2）二元通用函数 - ③ 基本逻辑运算”

可以根据布尔值的特性（True=1，False=0），把条件*1并用+连接，以便对满足条件的数量进行筛选

import numpy as np

import pandas as pd

df = pd.DataFrame([[10,8,6],[8,15,13],[13,7,14],[9,9,11]],columns=['c1','c2','c3'],index=['t1','t2','t3','t4'])

print(df); print('===========')

print(df.c1>9); print('===========')						# 写成df['c1']>9也行

print(df[df.c1>9]); print('===========')					# 写成df[df['c1']>9]也行

print(df[(df.c1>9) & (df.c3>9)]); print('===========')      # 筛选两个条件都满足的（且）

print(df[(df.c1>9) | (df.c3>9)]); print('===========')      # 筛选满足任意一个条件的（或）

print(df[(df.c1>9)*1 + (df.c2>9)*1 + (df.c3>9)*1 >=2 ])     # 筛选三个条件中至少满足两个的（布尔值特性）

print('===========')

print(df.loc[df.c1.isin([8,9,22,33]),['c2','c3']])          # 筛选'c1'列的值在给定列表里的行的'c2'和'c3'列

执行结果：

    c1  c2  c3

t1  10   8   6

t2   8  15  13

t3  13   7  14

t4   9   9  11

===========

t1     True

t2    False

t3     True

t4    False

Name: c1, dtype: bool

===========

    c1  c2  c3

t1  10   8   6

t3  13   7  14

===========

    c1  c2  c3

t3  13   7  14

===========

    c1  c2  c3

t1  10   8   6

t2   8  15  13

t3  13   7  14

t4   9   9  11

===========

    c1  c2  c3

t2   8  15  13

t3  13   7  14

===========

    c2  c3

t2  15  13

t4   9  11
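The operators ~ and np.logical_and() mentioned above are not used in the example, so here is a small sketch of them on the same data. It also shows another way to write the "at least two conditions" filter, by counting True values per row; this works here only because all three conditions use the same threshold. The variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[10, 8, 6], [8, 15, 13], [13, 7, 14], [9, 9, 11]],
                  columns=['c1', 'c2', 'c3'], index=['t1', 't2', 't3', 't4'])

# NOT: rows where c1 is not greater than 9
not_gt9 = df[~(df.c1 > 9)]

# np.logical_and gives the same mask as the & operator
both = df[np.logical_and(df.c1 > 9, df.c3 > 9)]

# "at least two of the three columns are > 9": count the True values per row
at_least_two = df[(df > 9).sum(axis=1) >= 2]

print(not_gt9, both, at_least_two, sep='\n===========\n')
```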

（7）使用df.query()筛选满足条件的行

语法：df.query('列标签组成的str格式表达式',inplace=False)

通过列标签组成的str格式表达式筛选满足条件的行，返回DataFrame格式的筛选结果，注意：

当表达式中含有变量时，需要在变量名称前加@符号
当表达式中含有带空格的列标签时，需要在此列标签的两侧加`符号

import numpy as np

import pandas as pd

df = pd.DataFrame([[10,8,6],[8,15,13],[13,7,14],[9,9,11]],

                  columns=['c1','c2','c 3'],

                  index=['t1','t2','t3','t4'])

var = 9

print(df); print('===========')

print(df.query('c1 > 9')); print('===========')

print(df.query('c1 > @var')); print('===========')

# print(df.query('c1 > c 3')); print('===========')			# 报错

print(df.query('c1 > `c 3`'))

执行结果：

    c1  c2  c 3

t1  10   8    6

t2   8  15   13

t3  13   7   14

t4   9   9   11

===========

    c1  c2  c 3

t1  10   8    6

t3  13   7   14

===========

    c1  c2  c 3

t1  10   8    6

t3  13   7   14

===========

    c1  c2  c 3

t1  10   8    6
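Query expressions can also combine several conditions with and / or and test membership with in. The sketch below reuses the same df; low and allowed are hypothetical local variables introduced only to show the @ prefix.

```python
import pandas as pd

df = pd.DataFrame([[10, 8, 6], [8, 15, 13], [13, 7, 14], [9, 9, 11]],
                  columns=['c1', 'c2', 'c 3'], index=['t1', 't2', 't3', 't4'])

low, allowed = 9, [8, 9]   # hypothetical local variables, referenced with @

r1 = df.query('c1 > @low and c2 < 9')       # both conditions must hold
r2 = df.query('c1 in @allowed')             # membership test, similar to isin()
r3 = df.query('c1 > @low or `c 3` > 12')    # either condition may hold

print(r1, r2, r3, sep='\n===========\n')
```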

（8）循环遍历df每一行数据

可以使用df.iterrows()返回的生成器实现，见本章“10. DataFrame对象的方法和Pandas模块的方法 - （5）其他重要方法 - ② df.iterrows()”
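As a quick reminder of what the loop looks like, here is a minimal sketch. It also shows df.itertuples(), which is not covered in this section but is a real pandas method that yields namedtuples. The names row_sums and products are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['c1', 'c2'], index=['t1', 't2'])

# iterrows() yields (row label, row as a Series) pairs
row_sums = {}
for label, row in df.iterrows():
    row_sums[label] = row['c1'] + row['c2']

# itertuples() yields namedtuples; the row label is stored in the Index field
products = [(t.Index, t.c1 * t.c2) for t in df.itertuples()]

print(row_sums)
print(products)
```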

7. DataFrame的向量化、对齐和广播

（1）向量化和对齐

DataFrame的向量化是一种比numpy.ndarray和Series更为广义的向量化：

算数运算（+、-、*、/、**、//、%）
- 当两个DataFrame的shape、行标签、列标签都完全相同时，它们的算数运算就是对应项的运算，结果也是一个shape相同的DataFrame
- 当两个DataFrame的shape、行标签、列标签不完全相同时，它们之间也可以进行算数运算，此时行标签、列标签都相同的项才会执行元素级别的计算，不同的项则返回NaN。最终得到的DataFrame的shape会大于两个参与运算的DataFrame，因为前者的行标签是后两者行标签的并集，前者的列标签也是后两者列标签的并集。上述规则称为DataFrame的对齐。
- DataFrame可以和一个shape相同的二维numpy.ndarray进行算数运算，返回一个shape、行标签、列标签都相同的DataFrame
- DataFrame不能和一个shape不同的二维numpy.ndarray进行算数运算，也不能和任何二维list进行算数运算（哪怕二者shape相同）（报错）
Comparison operations (>, <, >=, <=, ==, !=)
- 当两个DataFrame的shape、行标签、列标签都完全相同时，它们之间可以执行比较运算，返回一个shape相同的DataFrame，值为True或False
- 当两个DataFrame的shape、行标签、列标签不完全相同时，它们之间不能执行比较运算（报错），即此时DataFrame无法对齐
- DataFrame可以和一个shape相同的二维numpy.ndarray进行比较运算，返回一个shape、行标签、列标签都相同的DataFrame，值为True或False
- DataFrame不能和一个shape不同的二维numpy.ndarray进行比较运算，也不能和任何二维list进行比较运算（哪怕二者shape相同）（报错）

关于对齐的总结：如果两个DataFrame的shape、行标签、列标签不完全相同，进行算数运算时可以实现对齐，进行比较运算时无法实现对齐（只能报错）

For how to handle the NaN values produced by alignment, see section 9 of this chapter, "DataFrame的空值（NaN）处理" (NaN handling).

上面两类运算均未提到DataFrame和Series之间的计算，因为DataFrame必然是二维的，Series必然是一维的，因此它们二者之间只能是广播的关系，不属于向量化的范畴，其规则详见本章“广播”部分

# DataFrame的对齐

import numpy as np

import pandas as pd

df1 = pd.DataFrame([[10,20,30,40],[50,60,70,80]],columns=['c1','c2','c3','c4'],index=['t1','t2'])

df2 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]],columns=['c0','c3','c1'],index=['t0','t2','t1'])

print('df1\n',df1); print('===========')

print('df2\n',df2); print('===========')

print('df1 + df2\n',df1 + df2); print('===========')

print('df1 // df2\n',df1 // df2); print('===========')

print('df1 > df2\n',df1 > df2)

执行结果：

df1

     c1  c2  c3  c4

t1  10  20  30  40

t2  50  60  70  80

===========

df2

     c0  c3  c1

t0   1   2   3

t2   4   5   6

t1   7   8   9

===========

df1 + df2

     c0    c1  c2    c3  c4

t0 NaN   NaN NaN   NaN NaN

t1 NaN  19.0 NaN  38.0 NaN

t2 NaN  56.0 NaN  75.0 NaN

===========

df1 // df2

     c0   c1  c2    c3  c4

t0 NaN  NaN NaN   NaN NaN

t1 NaN  1.0 NaN   3.0 NaN

t2 NaN  8.0 NaN  14.0 NaN

===========

报错（ValueError: Can only compare identically-labeled DataFrame objects）
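Two ways around the limitations shown above, as a sketch on the same df1 and df2: df.add() with fill_value reduces the number of NaN cells after alignment, and df.align() with join='inner' makes the two frames identically labelled so that a comparison no longer raises. Both are real pandas methods that this section does not otherwise cover; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[10, 20, 30, 40], [50, 60, 70, 80]],
                   columns=['c1', 'c2', 'c3', 'c4'], index=['t1', 't2'])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                   columns=['c0', 'c3', 'c1'], index=['t0', 't2', 't1'])

# add() with fill_value: a cell missing in only one frame is treated as 0;
# a cell missing in both frames is still NaN
total = df1.add(df2, fill_value=0)

# align() with join='inner' keeps only the shared row and column labels,
# so the two results can be compared without a ValueError
a, b = df1.align(df2, join='inner')
greater = a > b

print(total); print('===========')
print(greater)
```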

# The comparison operators >, <, >=, <=, ==, != only work when the two DataFrames have exactly the same shape, row labels and column labels

import numpy as np

import pandas as pd

df1 = pd.DataFrame([[3,4],[5,6]],columns=['c1','c2'],index=['t1','t2'])

df2 = pd.DataFrame([[1,2],[7,8]],columns=['c1','c2'],index=['t1','t2'])

print('df1\n',df1); print('===========')

print('df2\n',df2); print('===========')

print('df1 < df2\n',df1 < df2)

执行结果：

df1

     c1  c2

t1   3   4

t2   5   6

===========

df2

     c1  c2

t1   1   2

t2   7   8

===========

df1 < df2

        c1     c2

t1  False  False

t2   True   True

（2）广播

① DataFrame与数字进行计算

对于DataFrame与数字进行的+、-、*、/、**、//、%、>、<、>=、<=、==、!=等运算，会将这个DataFrame中的每一个元素均与这个数字进行计算，并用这些结果组成一个与原DataFrame结构、行标签、列标签都相同的DataFrame

import numpy as np

import pandas as pd

df = pd.DataFrame([[1,2],[3,4]],columns=['c1','c2'],index=['t1','t2'])

print(df); print('===========')

print(df // 2); print('===========')

print(df < 2)

执行结果：

    c1  c2

t1   1   2

t2   3   4

===========

    c1  c2

t1   0   1

t2   1   2

===========

       c1     c2

t1   True  False

t2  False  False

② DataFrame与一维对象进行计算

DataFrame和DataFrame：

DataFrame和DataFrame之间不存在广播的概念，因为DataFrame本身是二维的，因此无论是几行几列，两个DataFrame之间的计算都属于向量化的范畴，详见本章的“向量化与对齐”的部分

DataFrame和一维list：

DataFrame和一维list之间可以广播的前提：DataFrame的列数等于一维list的长度。广播时，按行广播。

DataFrame和一维numpy.ndarray：

DataFrame和一维numpy.ndarray之间可以广播的前提：DataFrame的列数等于一维numpy.ndarray的长度。广播时，按行广播。

DataFrame和Series：

DataFrame和Series之间可以广播的前提：DataFrame的列数等于Series的长度，且Series的index与DataFrame的columns一一对应（顺序可以不同）。广播时，按行广播，标签相同的项执行对应的计算。

# DataFrame 与 一维list 的广播示例

import numpy as np

import pandas as pd

df = pd.DataFrame([[10,20,30,40],[50,60,70,80]],columns=['c1','c2','c3','c4'],index=['t1','t2'])

li = [1,2,100,100]   # length equals the number of columns of df, so broadcasting works

print('查看df\n',df); print('===========')

print('查看li\n',li); print('===========')

print('df + li\n',df + li); print('===========')

print('df > li\n',df > li)

执行结果：

查看df

     c1  c2  c3  c4

t1  10  20  30  40

t2  50  60  70  80

===========

查看li

 [1, 2, 100, 100]

===========

df + li

     c1  c2   c3   c4

t1  11  22  130  140

t2  51  62  170  180

===========

df > li

       c1    c2     c3     c4

t1  True  True  False  False

t2  True  True  False  False

# DataFrame 与 一维numpy.ndarray 的广播示例

import numpy as np

import pandas as pd

df = pd.DataFrame([[10,20,30,40],[50,60,70,80]],columns=['c1','c2','c3','c4'],index=['t1','t2'])

arr = np.array([1,2,100,100])   # length equals the number of columns of df, so broadcasting works

print('查看df\n',df); print('===========')

print('查看arr\n',arr); print('===========')

print('df + arr\n',df + arr); print('===========')

print('df > arr\n',df > arr)

执行结果：

查看df

     c1  c2  c3  c4

t1  10  20  30  40

t2  50  60  70  80

===========

查看arr

 [  1   2 100 100]

===========

df + arr

     c1  c2   c3   c4

t1  11  22  130  140

t2  51  62  170  180

===========

df > arr

       c1    c2     c3     c4

t1  True  True  False  False

t2  True  True  False  False

# DataFrame 与 Series 的广播示例

import numpy as np

import pandas as pd

df = pd.DataFrame([[10,20,30,40],[50,60,70,80]],columns=['c1','c2','c3','c4'],index=['t1','t2'])

s1 = pd.Series([1,2,100,100])                   # 未定义符合条件的index，结果错误

s2 = pd.Series([100, 100, 1, 2], index=['c3','c4','c1','c2'])  # 顺序可以不同，结果依然正确

print('查看df\n',df); print('===========')

print('查看s1\n',s1); print('===========')

print('df + s1（结果错误）\n',df + s1); print('===========')

print('df > s1（结果错误）\n',df > s1); print('===========')

print('查看df\n',df); print('===========')

print('查看s2\n',s2); print('===========')

print('df + s2（结果正确）\n',df + s2); print('===========')

print('df > s2（结果正确）\n',df > s2)

执行结果：

查看df

     c1  c2  c3  c4

t1  10  20  30  40

t2  50  60  70  80

===========

查看s1

 0      1

1      2

2    100

3    100

dtype: int64

===========

df + s1（结果错误）

     c1  c2  c3  c4   0   1   2   3

t1 NaN NaN NaN NaN NaN NaN NaN NaN

t2 NaN NaN NaN NaN NaN NaN NaN NaN

===========

df > s1（结果错误）

        c1     c2     c3     c4      0      1      2      3

t1  False  False  False  False  False  False  False  False

t2  False  False  False  False  False  False  False  False

===========

查看df

     c1  c2  c3  c4

t1  10  20  30  40

t2  50  60  70  80

===========

查看s2

 c3    100

c4    100

c1      1

c2      2

dtype: int64

===========

df + s2（结果正确）

     c1  c2   c3   c4

t1  11  22  130  140

t2  51  62  170  180

===========

df > s2（结果正确）

       c1    c2     c3     c4

t1  True  True  False  False

t2  True  True  False  False
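All of the examples above broadcast row by row, matching the Series against the column labels. To broadcast in the other direction (one value per row), the method forms of the operators accept an axis argument. A minimal sketch, where s_rows and the thresholds are hypothetical values chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [50, 60, 70, 80]],
                  columns=['c1', 'c2', 'c3', 'c4'], index=['t1', 't2'])
s_rows = pd.Series([1, 2], index=['t1', 't2'])   # indexed by ROW labels

# axis=0 matches the Series index against df.index instead of df.columns
by_rows = df.add(s_rows, axis=0)
cmp_rows = df.gt(pd.Series([15, 65], index=['t1', 't2']), axis=0)

print(by_rows); print('===========')
print(cmp_rows)
```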

8. DataFrame的修改、变形、转换

（1）增加一行、增加一列

① 基于df.loc[]增加一行、增加一列

语法：

df.loc['新行标签'] = data			  # 增加一行

df.loc[:,'新列标签'] = data			  # 增加一列

df.loc['新行标签','旧列标签'] = data	# 增加一行（仅对部分元素赋值，未赋值元素的是NaN）

df.loc['旧行标签','新列标签'] = data	# 增加一列（仅对部分元素赋值，未赋值元素的是NaN）

df.loc['新行标签','新列标签'] = data	# 增加一行和一列（仅对部分元素赋值，未赋值元素的是NaN）

代码示例：

import numpy as np

import pandas as pd

df = pd.DataFrame([[1,2],[3,4]],columns=['c1','c2'],index=['t1','t2'])

print(df); print('==============')

df.loc['t3'] = [5, 6]

print(df); print('==============')

df.loc[:,'c3'] = [7, 8, 9]

print(df); print('==============')

df.loc['t4','c2'] = 10

print(df); print('==============')

df.loc['t2','c4'] = 11

print(df); print('==============')

df.loc['t5','c5'] = 12

print(df); print('==============')

print(df)

执行结果：

    c1  c2

t1   1   2

t2   3   4

==============

    c1  c2

t1   1   2

t2   3   4

t3   5   6

==============

    c1  c2  c3

t1   1   2   7

t2   3   4   8

t3   5   6   9

==============

     c1    c2   c3

t1  1.0   2.0  7.0

t2  3.0   4.0  8.0

t3  5.0   6.0  9.0

t4  NaN  10.0  NaN

==============

     c1    c2   c3    c4

t1  1.0   2.0  7.0   NaN

t2  3.0   4.0  8.0  11.0

t3  5.0   6.0  9.0   NaN

t4  NaN  10.0  NaN   NaN

==============

     c1    c2   c3    c4    c5

t1  1.0   2.0  7.0   NaN   NaN

t2  3.0   4.0  8.0  11.0   NaN

t3  5.0   6.0  9.0   NaN   NaN

t4  NaN  10.0  NaN   NaN   NaN

t5  NaN   NaN  NaN   NaN  12.0

==============

     c1    c2   c3    c4    c5

t1  1.0   2.0  7.0   NaN   NaN

t2  3.0   4.0  8.0  11.0   NaN

t3  5.0   6.0  9.0   NaN   NaN

t4  NaN  10.0  NaN   NaN   NaN

t5  NaN   NaN  NaN   NaN  12.0

② 基于df[]增加一列

语法：df['列标签'] = data

data的长度必须等于df的行数，data可以是list、tuple、range()、numpy.ndarray、Series、DataFrame

注意：当data为Series或DataFrame时，必须为其定义与df相对应的index（顺序可以不同，Pandas会自动根据标签进行匹配），如果不写index，由于DataFrame的自动对齐，会导致新增的值都是NaN

import numpy as np

import pandas as pd

df = pd.DataFrame([[1,3],[2,4]],columns=['c1','c2'],index=['t1','t2'])

print(df); print('==============')

print('list、tuple、range()的情况略'); print('==============')

df['c3'] = np.array([5,6])

print(df); print('==============')

df['c4'] = pd.Series([7,8],index=['t2','t1'])	# 给Series定义正确的index，顺序无所谓

print(df); print('==============')

df['c5'] = pd.Series([9,10])					# 未给Series定义index，错误

print(df); print('==============')

df['c6'] = pd.DataFrame([11,12],index=['t2','t1'])	# 给DataFrame定义正确的index，顺序无所谓

print(df); print('==============')

df['c7'] = pd.DataFrame([13,14])					# 未给DataFrame定义index，错误

print(df)

执行结果：

    c1  c2

t1   1   3

t2   2   4

==============

list、tuple、range()的情况略

==============

    c1  c2  c3

t1   1   3   5

t2   2   4   6

==============

    c1  c2  c3  c4

t1   1   3   5   8

t2   2   4   6   7

==============

    c1  c2  c3  c4  c5

t1   1   3   5   8 NaN

t2   2   4   6   7 NaN

==============

    c1  c2  c3  c4  c5  c6

t1   1   3   5   8 NaN  12

t2   2   4   6   7 NaN  11

==============

    c1  c2  c3  c4  c5  c6  c7

t1   1   3   5   8 NaN  12 NaN

t2   2   4   6   7 NaN  11 NaN
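If the Series really has no meaningful index and should simply be written in order, the alignment can be bypassed. A small sketch of two options, with made-up column names:

```python
import pandas as pd

df = pd.DataFrame([[1, 3], [2, 4]], columns=['c1', 'c2'], index=['t1', 't2'])
s = pd.Series([9, 10])            # default index 0, 1: does not match 't1', 't2'

df['aligned'] = s                 # label alignment finds no match: all NaN
df['by_position'] = s.to_numpy()  # plain ndarray: assigned by position
df['relabelled'] = pd.Series(s.to_numpy(), index=df.index)  # same values, df's labels

print(df)
```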

③ 基于df.append()增加一行

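Syntax: df.append(series); see 二、Pandas模块 - 12. DataFrame的合并 - （2）df.append() in this chapter. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0.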
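On pandas versions where df.append() is not available, a row can be added with pd.concat(). A minimal sketch, with new_row and df_new as illustrative names:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['c1', 'c2'], index=['t1', 't2'])

# the new row as a one-row DataFrame with the same columns
new_row = pd.DataFrame([[5, 6]], columns=df.columns, index=['t3'])

# concat returns a new DataFrame; df itself is unchanged
df_new = pd.concat([df, new_row])

print(df_new)
```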

④ 基于df.assign()增加一列

语法：df = df.assign(新列标签索引=表达式)

常用于根据现有的列进行表达式计算，产生新的列

import numpy as np

import pandas as pd

df = pd.DataFrame([[1,2],[3,4]],columns=['c1','c2'],index=['t1','t2'])

print(df); print('==============')

df = df.assign(c3 = df['c1']/df['c2'])

print(df)

执行结果：

    c1  c2

t1   1   2

t2   3   4

==============

    c1  c2    c3

t1   1   2  0.50

t2   3   4  0.75
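assign() also accepts several keyword arguments and callables. A callable receives the DataFrame as it is being built, so a later column can depend on an earlier one from the same call. A sketch, with c4 as a made-up extra column:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['c1', 'c2'], index=['t1', 't2'])

# c4 is computed from c3, which was created earlier in the same assign() call
df2 = df.assign(c3=lambda d: d['c1'] / d['c2'],
                c4=lambda d: d['c3'] * 2)

print(df2)
```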

（2）Deleting rows and columns

① 仅删除一列：del

语法：

当列标签为label index（列标签索引）时，只能使用label index删：del df['列标签索引']
当列标签为position index（列位置索引）时，只能使用position index删：del df[列位置索引]

注意：del方法只能用于删除一列，不能使用索引、切片的方法删除多列

import numpy as np

import pandas as pd

df = pd.DataFrame([[1,3,5,7,9],[2,4,6,8,10]])

print(df); print('==============')

del df[1]

print(df); print('==============')

df.columns=['c1','c2','c3','c4']

df.index=['t1','t2']

print(df); print('==============')

del df['c2']

print(df)

执行结果：

   0  1  2  3   4

0  1  3  5  7   9

1  2  4  6  8  10

==============

   0  2  3   4

0  1  5  7   9

1  2  6  8  10

==============

    c1  c2  c3  c4

t1   1   5   7   9

t2   2   6   8  10

==============

    c1  c3  c4

t1   1   7   9

t2   2   8  10

② 删除若干行、删除若干列：df.drop()

语法：

当标签为label index（标签索引）时，只能使用label index删：

df.drop('标签索引'或['标签索引','标签索引',...], axis=0, inplace=False)

当标签为position index（位置索引）时，只能使用position index删：
```
df.drop(位置索引或[位置索引,位置索引,...], axis=0, inplace=False)
```

参数：

axis：默认为0，按行删；axis=1时，按列删
inplace：默认值为False，不对df本身进行修改；inplace=True时，直接对df进行修改

import numpy as np

import pandas as pd

df = pd.DataFrame(np.arange(100,149).reshape(7, 7))

print(df); print('==============')

df = df.drop(0)						# 删除第0行

df = df.drop([1,2])					# 删除第1、2行

print(df); print('==============')

df = df.drop(0,axis=1)				# 删除第0列

df = df.drop([1,2],axis=1)			# 删除第1、2列

print(df); print('==============')

df.columns=['c1','c2','c3','c4']

df.index=['t1','t2','t3','t4']

print(df); print('==============')

df = df.drop('t1')					# 删除't1'行

df = df.drop(['t2','t3'])			# drop rows 't2' and 't3'

print(df); print('==============')

df = df.drop('c1',axis=1)			# drop column 'c1'

df = df.drop(['c2','c3'],axis=1)	# 删除'c2'、'c3'列

print(df)

执行结果：

     0    1    2    3    4    5    6

0  100  101  102  103  104  105  106

1  107  108  109  110  111  112  113

2  114  115  116  117  118  119  120

3  121  122  123  124  125  126  127

4  128  129  130  131  132  133  134

5  135  136  137  138  139  140  141

6  142  143  144  145  146  147  148

==============

     0    1    2    3    4    5    6

3  121  122  123  124  125  126  127

4  128  129  130  131  132  133  134

5  135  136  137  138  139  140  141

6  142  143  144  145  146  147  148

==============

     3    4    5    6

3  124  125  126  127

4  131  132  133  134

5  138  139  140  141

6  145  146  147  148

==============

     c1   c2   c3   c4

t1  124  125  126  127

t2  131  132  133  134

t3  138  139  140  141

t4  145  146  147  148

==============

     c1   c2   c3   c4

t4  145  146  147  148

==============

     c4

t4  148
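Instead of the axis argument, df.drop() also accepts index= and columns= keywords, which makes it harder to mix up rows and columns. The sketch below also uses errors='ignore', a real parameter not covered above that skips labels which do not exist. Variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(100, 116).reshape(4, 4),
                  columns=['c1', 'c2', 'c3', 'c4'], index=['t1', 't2', 't3', 't4'])

# index= and columns= say directly which axis each label belongs to
smaller = df.drop(index=['t1', 't2'], columns=['c1'])

# errors='ignore' skips missing labels instead of raising KeyError
same = df.drop(columns=['no_such_column'], errors='ignore')

print(smaller)
```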

（3）修改DataFrame数据的值

First select the data with df[], df.loc[] or df.iloc[] (df.ix[] only on pandas versions before 1.0), then assign the new value with =.

特别地，可以对布尔值索引的筛选结果进行赋值修改，如：

# 将所有负数转化为正数

df[df<0] = -df

# 将所有PE小于0的PE数据赋值为1000

df.loc[df['PE']<0,'PE'] = 1000

# 新增一列new_column，并且令PE为负的行的new_column列为0（PE非负的行的new_column列的值将是NaN）

df.loc[df['PE']<0,'new_column'] = 0
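The three snippets above can be tried on a small frame. The data here (columns PE and PB, rows s1 to s4, and the frame vals) is hypothetical and only serves to make the statements runnable.

```python
import pandas as pd

# flip the sign of negative entries only
vals = pd.DataFrame({'a': [-1, 2], 'b': [3, -4]})
vals[vals < 0] = -vals

# hypothetical data: a PE column containing some negative values
df = pd.DataFrame({'PE': [-5.0, 12.0, -1.0, 30.0], 'PB': [1.0, 2.0, 3.0, 4.0]},
                  index=['s1', 's2', 's3', 's4'])

negative_pe = df['PE'] < 0               # build the mask once, before PE changes

df.loc[negative_pe, 'new_column'] = 0    # other rows of new_column become NaN
df.loc[negative_pe, 'PE'] = 1000         # overwrite only the negative PE values

print(vals); print('===========')
print(df)
```

Storing the mask in a variable matters here: once PE has been overwritten, df['PE']<0 would no longer select the same rows.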

（4）对DataFrame数据的值进行替换：df.replace()

语法：

df.replace('替换前的值', '替换后的值', inplace=False)			# 单个值的替换

df.replace({'旧1':'新1','旧2':'新2'...}, inplace=False)		# 多个值的替换

若要同时进行多个值的替换，可以使用字典将替换前的值、替换后的值组成键值对

参数inplace：默认值为False，不对df本身进行修改；inplace=True时，直接对df进行修改

注意：仅对值进行替换，不会对标签索引或位置索引进行替换

import numpy as np

import pandas as pd

df = pd.DataFrame([[1,2],[3,4]])

df.columns=[3,4]

print(df); print('===========')

df.replace(3,33,inplace=True)

print(df); print('===========')

df.replace({2:22,4:44},inplace=True)

print(df)

执行结果：

   3  4

0  1  2

1  3  4

===========

    3  4

0   1  2

1  33  4

===========

    3   4

0   1  22

1  33  44
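df.replace() has a few more input forms that are handy in practice. The sketch below shows two of them, a nested dict that limits the replacement to one column and a pair of lists; the data and variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3], 'c2': [1, 2, 3]})

# nested dict {column: {old: new}} restricts the replacement to one column
only_c1 = df.replace({'c1': {1: 100}})

# two lists of equal length: old values and their replacements, in order
by_lists = df.replace([1, 2], [10, 20])

print(only_c1); print('===========')
print(by_lists)
```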

（5）按指定列或行的数据值排序：df.sort_values()

语法：df.sort_values(by, axis=0, ascending=True, inplace=False)

参数：

by：可以是列标签或行标签（单列排序），也可以是列标签组成的list或行标签组成的list（多列排序），注意by与axis的对应关系（axis=0时by='列标签'或['列标签1','列标签2',...]，axis=1时by='行标签'或['行标签1','行标签2',...]）
axis：默认值为0，按指定列的数据值排序；axis=1时，按指定行的数据值排序
ascending：默认值为True，升序排列；ascending=False时，降序排列
inplace：默认值为False，不对df本身进行修改；inplace=True时，直接对df进行修改

import numpy as np

import pandas as pd

np.random.seed(1)

arr = np.random.randint(1,100,(4,4))

df = pd.DataFrame(arr,columns=['c1','c2','c3','c4'])

df.index = pd.date_range('2019-1-31', periods=4, freq='M')

print('df\n',df); print('===========')

# 按c1列的值的升序排列

df = df.sort_values(by='c1')

# df.sort_values(by='c1',inplace=True)   # 这样写也行

print('按c1列的值的升序排列\n',df); print('===========')

# 按2019-04-30行的值的降序排列

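df = df.sort_values(by=pd.Timestamp('2019-04-30'),axis=1,ascending=False)   # pd.datetime is no longer available in recent pandas; pd.Timestamp works on old and new versions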

# df.sort_values(by=pd.Timestamp('2019-04-30'),axis=1,ascending=False,inplace=True) # this form works as well

print('按2019-04-30行的值的降序排列\n',df)

执行结果：

df

             c1  c2  c3  c4

2019-01-31  38  13  73  10

2019-02-28  76   6  80  65

2019-03-31  17   2  77  72

2019-04-30   7  26  51  21

===========

按c1列的值的升序排列

             c1  c2  c3  c4

2019-04-30   7  26  51  21

2019-03-31  17   2  77  72

2019-01-31  38  13  73  10

2019-02-28  76   6  80  65

===========

按2019-04-30行的值的降序排列

             c3  c2  c4  c1

2019-04-30  51  26  21   7

2019-03-31  77   2  72  17

2019-01-31  73  13  10  38

2019-02-28  80   6  65  76
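The multi-column case described for the by parameter is not shown above, so here is a minimal sketch. Passing a list to ascending (one entry per key) is standard pandas behaviour; the data is made up so that c1 contains ties.

```python
import pandas as pd

df = pd.DataFrame({'c1': [2, 1, 2, 1], 'c2': [5, 6, 7, 8]},
                  index=['t1', 't2', 't3', 't4'])

# sort by c1 ascending; ties in c1 are ordered by c2 descending
out = df.sort_values(by=['c1', 'c2'], ascending=[True, False])

print(out)
```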

（6）按DataFrame的标签排序：df.sort_index()

语法：df.sort_index(axis=0, ascending=True, inplace=False)

参数：

axis：默认值为0，按行标签排序；axis=1时，按列标签排序
ascending：默认值为True，升序排列；ascending=False时，降序排列
inplace：默认值为False，不对df本身进行修改；inplace=True时，直接对df进行修改

import numpy as np

import pandas as pd

np.random.seed(1)

arr = np.random.randint(1,100,(4,4))

df = pd.DataFrame(arr,columns=['c1','c2','c3','c4'])

df.index = pd.date_range('2019-1-31', periods=4, freq='M')

print('df\n',df); print('===========')

# 按行标签降序排列

df = df.sort_index(ascending=False)

# df.sort_index(ascending=False,inplace=True)   # 这样写也行

print('按行标签降序排列\n',df); print('===========')

# 按列标签降序排列

df = df.sort_index(axis=1, ascending=False)

# df.sort_index(axis=1, ascending=False, inplace=True)  # 这样写也行

print('按列标签降序排列\n',df)

执行结果：

df

             c1  c2  c3  c4

2019-01-31  38  13  73  10

2019-02-28  76   6  80  65

2019-03-31  17   2  77  72

2019-04-30   7  26  51  21

===========

按行标签降序排列

             c1  c2  c3  c4

2019-04-30   7  26  51  21

2019-03-31  17   2  77  72

2019-02-28  76   6  80  65

2019-01-31  38  13  73  10

===========

按列标签降序排列

             c4  c3  c2  c1

2019-04-30  21  51  26   7

2019-03-31  72  77   2  17

2019-02-28  65  80   6  76

2019-01-31  10  73  13  38

（7）自定义列标签和行标签：df.columns和df.index

这两个属性都支持对其整体进行重新赋值，但不支持对其中的元素进行修改（会报错），详见本章“DataFrame对象的属性”

（8）将某一列的值设置为行标签索引：df.set_index()

语法：df.set_index('选中列的列标签', inplace=False)

将某一列的值设置为行标签索引（注意：指定列是被“剪切”到最左边当索引的，不是“复制”）

参数inplace：默认值为False（不替换原df，须使用一个新的变量进行接收）；值为True时，直接在原df上修改

import numpy as np

import pandas as pd

df = pd.DataFrame([[1,'a',3+4j],[2,'b',5+6j]],columns=['c1','c2','c3'],index=['t1','t2'])

print(df); print('===========')

df.set_index('c2', inplace = True)

print(df)

执行结果：

    c1 c2        c3

t1   1  a  3.0+4.0j

t2   2  b  5.0+6.0j

===========

    c1        c3

c2

a    1  3.0+4.0j

b    2  5.0+6.0j

（9）重置行索引：df.reset_index()

语法：df.reset_index(inplace=False)

将行标签索引删除，并将其保存为DataFrame数据的第一列（该列对应的列标签索引为df.index.name），df.index则被重置为0、1、2……的位置索引

参数inplace：默认值为False（不替换原df，须使用一个新的变量进行接收）；值为True时，直接在原df上修改

import numpy as np

import pandas as pd

df = pd.DataFrame([[1,3,5,7],[2,4,6,8]],columns=['c1','c2','c3','c4'],index=['t1','t2'])

df.index.name = 'myindex'	# give the row index a name (df.index is an Index object, not a Series)

print(df); print('===========')

df.reset_index(inplace=True)

print(df)

执行结果：

         c1  c2  c3  c4

myindex

t1        1   3   5   7

t2        2   4   6   8

===========

  myindex  c1  c2  c3  c4

0      t1   1   3   5   7

1      t2   2   4   6   8
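Both set_index() and reset_index() take a drop parameter that changes the "cut and paste" behaviour described above. A small sketch with made-up variable names:

```python
import pandas as pd

df = pd.DataFrame([[1, 'a'], [2, 'b']], columns=['c1', 'c2'], index=['t1', 't2'])

kept = df.set_index('c2', drop=False)           # c2 becomes the index AND stays a column
dropped = df.reset_index(drop=True)             # old index is discarded, not saved as a column
round_trip = df.set_index('c2').reset_index()   # c2 comes back as the first column

print(kept, dropped, round_trip, sep='\n===========\n')
```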

（10）对列标签进行重命名：df.rename()

语法：df.rename(columns={'原列标签':'新列标签',...},inplace=False)

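Renames column labels. With the columns= argument only column labels are changed; row labels are renamed by passing a dict to the index= argument instead. Keys that match no existing column label are silently ignored (see 'c5' and 't2' in the example).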

参数：

columns：一个字典，里面是原列标签和新列标签组成的键值对
inplace：默认值为False（不替换原df，须使用一个新的变量进行接收）；值为True时，直接在原df上修改

import numpy as np

import pandas as pd

arr = np.array([[1,2,3,4],[5,6,7,8]])

df1 = pd.DataFrame(arr,columns=['c1','c2','c3','c4'],index=['t1','t2'])

print("查看df1\n",df1); print('===========')

df2 = df1.rename(columns={'c2':'哈','c5':'嘿','t2':'哼'})

print("查看df2\n",df2)

执行结果：

查看df1

     c1  c2  c3  c4

t1   1   2   3   4

t2   5   6   7   8

===========

查看df2

     c1  哈  c3  c4

t1   1  2   3   4

t2   5  6   7   8
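For completeness, a sketch of renaming row labels with index= and of passing a function instead of a dict. The function is applied to every label of that axis; str.upper is used here purely as an example.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [5, 6]], columns=['c1', 'c2'], index=['t1', 't2'])

# index= renames row labels; columns= here receives a function, not a dict
out = df.rename(index={'t2': 'row2'}, columns=str.upper)

print(out)
```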

（11）修改某列的数据类型：astype()

语法：df['列标签'] = df['列标签'].astype('新的数据类型')

由于astype()不会对原对象本身进行修改，因此只能通过这样的赋值操作来实现修改

import numpy as np

import pandas as pd

df = pd.DataFrame([[1,2.9],[3,4.9]],columns=['c1','c2'],index=['t1','t2'])

print(df); print('==============')

print(df.dtypes); print('==============')

df['c1'] = df['c1'].astype('float')     	# 将c1列变为float

df['c2'] = df['c2'].astype('int')       	# 将c2列变为int

print(df); print('==============')

print(df.dtypes)

执行结果：

    c1   c2

t1   1  2.9

t2   3  4.9

==============

c1      int64

c2    float64

dtype: object

==============

     c1  c2

t1  1.0   2

t2  3.0   4

==============

c1    float64

c2      int32

dtype: object
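Two related conversions, sketched with a made-up frame: df.astype() accepts a dict to convert several columns in one call, and pd.to_numeric(errors='coerce') turns strings that are not numbers into NaN instead of raising. Explicit 'float64' / 'int64' are used so that the result does not depend on the platform default.

```python
import pandas as pd

df = pd.DataFrame({'c1': [1, 3], 'c2': [2.9, 4.9], 'c3': ['7', 'x']})

# a dict converts several columns in one call (float -> int truncates)
df2 = df.astype({'c1': 'float64', 'c2': 'int64'})

# strings that cannot be parsed become NaN instead of raising an error
c3_num = pd.to_numeric(df['c3'], errors='coerce')

print(df2.dtypes); print('==============')
print(c3_num)
```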

（12）删除值重复的行：df.drop_duplicates()

语法：df.drop_duplicates(subset=None, keep='first', inplace=False)

删除值重复的行，返回一个DataFrame

参数：

subset：子集，默认为None，此时两行的所有列的值都相等，才认为这两行重复；当subset='列标签'或subset=['列标签','列标签',...]时，只要指定的列的值相等，就认为这两行重复
keep：重复时保留哪一行，默认为'first'，保留第一行；当keep='last'时，保留最后一行；当keep=False时，不保留
inplace：默认值为False（不替换原df，须使用一个新的变量进行接收）；值为True时，直接在原df上修改

import numpy as np

import pandas as pd

df = pd.DataFrame([[1,2,3],[1,2,3],[1,2,3],[1,2,4]],columns=['c1','c2','c3'],index=['t1','t2','t3','t4'])

print(df); print('===========')

print(df.drop_duplicates()); print('===========')

print(df.drop_duplicates(keep='last')); print('===========')

print(df.drop_duplicates(keep=False)); print('===========')

print(df.drop_duplicates(subset=['c1','c2']))

执行结果：

    c1  c2  c3

t1   1   2   3

t2   1   2   3

t3   1   2   3

t4   1   2   4

===========

    c1  c2  c3

t1   1   2   3

t4   1   2   4

===========

    c1  c2  c3

t3   1   2   3

t4   1   2   4

===========

    c1  c2  c3

t4   1   2   4

===========

    c1  c2  c3

t1   1   2   3
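To inspect duplicates before deleting them, df.duplicated() returns the boolean mask that drop_duplicates() works from. A sketch on the same data, with made-up variable names:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 4]],
                  columns=['c1', 'c2', 'c3'], index=['t1', 't2', 't3', 't4'])

dup_first = df.duplicated()              # True for repeats after the first occurrence
dup_any = df.duplicated(keep=False)      # True for every member of a duplicate group

# filtering with the inverted mask reproduces drop_duplicates()
same = df[~dup_first].equals(df.drop_duplicates())

print(dup_first); print('===========')
print(dup_any); print('===========')
print(same)
```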

（13）将数据的值整体平移若干行（或列）：df.shift()

语法：df.shift(periods=1,axis=0)

返回一个DataFrame，其行标签、列标签与df均相同，值向某方向平移了若干行（或列），平移导致空缺的行（或列）使用NaN填充

参数：

periods：值被平移的行数或列数，默认为1（int）
axis：默认值为0，纵向平移；axis=1时横向平移

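Notes:

periods and axis together determine the direction of the shift:

	axis=0	axis=1
periods<0	shift up	shift left
periods=0	no shift	no shift
periods>0	shift down	shift right

Only the values of the DataFrame are shifted; the row labels and column labels stay in place.
df/df.shift()-1 is commonly used to compute the daily return of asset prices, as in the example below: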

import numpy as np

import pandas as pd

df = pd.DataFrame({'600001': [10,11,12,13,14],'600002': [20,21,22,23,24],'600003':[30,31,32,33,34]},index=['t1','t2','t3','t4','t5'])

print('df\n',df); print('-----------')

print('向下平移1行：df.shift()\n',df.shift()); print('-----------')

print('每日收益率df/df.shift()-1\n',df/df.shift()-1); print('===========')

print('向上平移1行：df.shift(-1)\n',df.shift(-1)); print('-----------')

print('向右平移1行：df.shift(1,axis=1)\n',df.shift(1,axis=1)); print('-----------')

print('向左平移1行：df.shift(-1,axis=1)\n',df.shift(-1,axis=1))

执行结果：

df

     600001  600002  600003

t1      10      20      30

t2      11      21      31

t3      12      22      32

t4      13      23      33

t5      14      24      34

-----------

向下平移1行：df.shift()

     600001  600002  600003

t1     NaN     NaN     NaN

t2    10.0    20.0    30.0

t3    11.0    21.0    31.0

t4    12.0    22.0    32.0

t5    13.0    23.0    33.0

-----------

每日收益率df/df.shift()-1

       600001    600002    600003

t1       NaN       NaN       NaN

t2  0.100000  0.050000  0.033333

t3  0.090909  0.047619  0.032258

t4  0.083333  0.045455  0.031250

t5  0.076923  0.043478  0.030303

===========

向上平移1行：df.shift(-1)

     600001  600002  600003

t1    11.0    21.0    31.0

t2    12.0    22.0    32.0

t3    13.0    23.0    33.0

t4    14.0    24.0    34.0

t5     NaN     NaN     NaN

-----------

向右平移1行：df.shift(1,axis=1)

     600001  600002  600003

t1     NaN    10.0    20.0

t2     NaN    11.0    21.0

t3     NaN    12.0    22.0

t4     NaN    13.0    23.0

t5     NaN    14.0    24.0

-----------

向左平移1行：df.shift(-1,axis=1)

     600001  600002  600003

t1    20.0    30.0     NaN

t2    21.0    31.0     NaN

t3    22.0    32.0     NaN

t4    23.0    33.0     NaN

t5    24.0    34.0     NaN
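The return formula df/df.shift()-1 has a built-in counterpart, df.pct_change(). The sketch below checks that the two agree and shows a two-period return. fill_method=None is passed so that no NaN pre-filling takes place; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'600001': [10, 11, 12, 13, 14], '600002': [20, 21, 22, 23, 24]},
                  index=['t1', 't2', 't3', 't4', 't5'])

manual = df / df.shift() - 1               # the formula used above
builtin = df.pct_change(fill_method=None)  # same calculation as a method call

# two-period return: compare each row with the row two steps earlier
two_period = df / df.shift(2) - 1

print(np.allclose(manual.to_numpy(), builtin.to_numpy(), equal_nan=True))
print(two_period)
```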

（14）将DataFrame转换为字典：df.to_dict()

语法：df.to_dict(orient='dict')

将DataFrame转换为字典

参数orient：转换成的字典的类型，默认值为'dict'，还可以是'list'、'series'、'split'、'records'、'index'，每个参数的效果见示例代码：

import numpy as np

import pandas as pd

df = pd.DataFrame([[1,2],[3,4]],columns=['c1','c2'],index=['t1','t2'])

print(df); print('===========')

print(df.to_dict(orient='dict')); print('===========')

print(df.to_dict(orient='list')); print('===========')

print(df.to_dict(orient='series')); print('===========')

print(df.to_dict(orient='split')); print('===========')

print(df.to_dict(orient='records')); print('===========')

print(df.to_dict(orient='index'))

执行结果：

    c1  c2

t1   1   2

t2   3   4

===========

{'c1': {'t1': 1, 't2': 3}, 'c2': {'t1': 2, 't2': 4}}

===========

{'c1': [1, 3], 'c2': [2, 4]}

===========

{'c1': t1    1

t2    3

Name: c1, dtype: int64, 'c2': t1    2

t2    4

Name: c2, dtype: int64}

===========

{'index': ['t1', 't2'], 'columns': ['c1', 'c2'], 'data': [[1, 2], [3, 4]]}

===========

[{'c1': 1, 'c2': 2}, {'c1': 3, 'c2': 4}]

===========

{'t1': {'c1': 1, 'c2': 2}, 't2': {'c1': 3, 'c2': 4}}

（15）使用DataFrame中的若干列构建数据透视表：df.pivot()

语法：df.pivot(index='列标签', columns='列标签', values='列标签'或['列标签','列标签'...])

返回一个数据透视表形式的DataFrame

参数：

index：数据透视表的y轴（str）
columns：数据透视表的x轴（str）
values：数据透视表的数据值（str或list）

import numpy as np

import pandas as pd

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two','two'],

                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],

                   'baz': [1, 2, 3, 4, 5, 6],

                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})

print(df); print('===========')

df1 = df.pivot(index='foo', columns='bar', values='baz')

print(df1); print(type(df1)); print('===========')

df2 = df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])

print(df2); print(type(df2))

执行结果：

   foo bar  baz zoo

0  one   A    1   x

1  one   B    2   y

2  one   C    3   z

3  two   A    4   q

4  two   B    5   w

5  two   C    6   t

===========

bar  A  B  C

foo

one  1  2  3

two  4  5  6

<class 'pandas.core.frame.DataFrame'>

===========

    baz       zoo

bar   A  B  C   A  B  C

foo

one   1  2  3   x  y  z

two   4  5  6   q  w  t

<class 'pandas.core.frame.DataFrame'>
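df.pivot() only reshapes and needs every (index, columns) pair to be unique. When pairs repeat, it raises ValueError, and df.pivot_table() (a different, real pandas method that aggregates) can be used instead. A sketch with hypothetical data in which ('one', 'A') occurs twice:

```python
import pandas as pd

# hypothetical data in which the pair ('one', 'A') occurs twice
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two'],
                   'bar': ['A', 'A', 'B', 'A'],
                   'baz': [1, 2, 3, 4]})

try:
    df.pivot(index='foo', columns='bar', values='baz')
    pivot_failed = False
except ValueError:
    pivot_failed = True          # duplicate (foo, bar) pairs cannot be reshaped

# pivot_table aggregates the duplicates instead (here: sum)
table = df.pivot_table(index='foo', columns='bar', values='baz', aggfunc='sum')

print(pivot_failed)
print(table)
```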

（16）对层次化索引的DataFrame进行变形（行标签与列标签的转换）：df.unstack()和df.stack()

详见本章“13.层次化索引 - （3）使用unstack()和stack()对层次化索引的Series和DataFrame进行变形（行标签与列标签的转换）”

9. DataFrame的空值（NaN）处理

（1）手动输入NaN的方法

输入numpy.nan
将None作为list中的一个元素，并使用此list创建DataFrame，则相应位置的None会自动变为NaN

（2）df.isnull(), df.isna(), df.notna()

语法1：df.isnull(), df.isna()

Checks whether each element of df is NaN (True where the element is NaN) and returns a boolean DataFrame with the same structure as df.

语法2：df.notna()

Checks whether each element of df is not NaN (True where the element is not NaN) and returns a boolean DataFrame with the same structure as df.

import numpy as np

import pandas as pd

df = pd.DataFrame([[np.nan,2],[3,4]],columns=['c1','c2'],index=['t1','t2'])

print(df); print('===========')

print(df.isnull()); print('===========')

print(df.isna()); print('===========')

print(df.notna())

执行结果：

     c1  c2

t1  NaN   2

t2  3.0   4

===========

       c1     c2

t1   True  False

t2  False  False

===========

       c1     c2

t1   True  False

t2  False  False

===========

       c1    c2

t1  False  True

t2   True  True

（3）df.dropna()

语法：df.dropna(axis=0, how='any', inplace=False)

删除df中含有NaN的行（或列）

参数：

axis：默认值为0，按行删除；axis=1时按列删除
how：默认值为'any'，只要这一行（或列）有一个元素是NaN，就删除整行（或列）；当how='all'时，必须这一行（或列）所有元素都是NaN，才删除整行（或列）
inplace：默认值为False，此时不会对df本身进行修改，需要额外定义一个变量来接收结果；而当inplace=True时，则会直接对df本身进行修改

import numpy as np

import pandas as pd

df = pd.DataFrame([[np.nan,2],[3,4]],columns=['c1','c2'],index=['t1','t2'])

print(df); print('===========')

df.dropna(inplace=True)

print(df)

执行结果：

     c1  c2

t1  NaN   2

t2  3.0   4

===========

     c1  c2

t2  3.0   4
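The example above only uses the default how='any'. The sketch below shows how='all' together with subset and thresh, two real dropna() parameters that the list above does not cover. Data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, np.nan, np.nan],
                   [1, np.nan, 3],
                   [4, 5, 6]],
                  columns=['c1', 'c2', 'c3'], index=['t1', 't2', 't3'])

all_nan = df.dropna(how='all')        # drops t1 only: every value in it is NaN
by_subset = df.dropna(subset=['c1'])  # only NaN in c1 counts: drops t1
at_least_3 = df.dropna(thresh=3)      # keep rows with at least 3 non-NaN values

print(all_nan, by_subset, at_least_3, sep='\n===========\n')
```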

（4）df.fillna()、df.ffill()、df.bfill()

对df中的NaN按照一定的规则进行替换

语法：df.fillna(value=None, method=None, axis=0, inplace=False)，它和df.ffill(axis=)与df.bfill(axis=)的关系见下面“注意”中的表格

参数：

value可以有两种形式：
- value可以是一个固定的值，表示将df中的所有NaN统一替换成这个值
- value可以是一个字典，字典的键是df的列标签，字典的值是将该列中NaN替换成的值（可以实现对不同的列中的NaN替换成不同的值）
- value还可以是一个Series或DataFrame，此时遇到NaN会寻找对应位置的元素进行填充
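method is the fill method, default None. method='ffill' or method='pad' fills forward and is equivalent to df.ffill(axis=); method='bfill' fills backward and is equivalent to df.bfill(axis=). The method argument of fillna() is deprecated as of pandas 2.1, so df.ffill() and df.bfill() are the safer choice on recent versions.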
axis：坐标轴方向，默认值为0，纵向；axis=1时为横向
inplace默认值为False，此时不会对df本身进行修改，需要额外定义一个变量来接收结果；而当inplace=True时，则会直接对df本身进行修改

注意：

value和method两个参数能且只能输入一个，即二者不能同时为None，也不能同时不为None，否则报错

当使用method时，method和axis两个参数共同决定了寻找填充值的方向：

	method='ffill'	method='bfill'
axis=0	取上方单元格的值填充NaN 等价于`df.ffill()`	取下方单元格的值填充NaN 等价于`df.bfill()`
axis=1	取左侧单元格的值填充NaN 等价于`df.ffill(axis=1)`	取右侧单元格的值填充NaN 等价于`df.bfill(axis=1)`

当指向的单元格仍为NaN时，继续越过该单元格向相同方向寻找填充值，若直到DataFrame的边界都是NaN，则不再进行填充，保留这些NaN

import numpy as np

import pandas as pd

df = pd.DataFrame([[np.nan,101,102,np.nan],

                   [110,np.nan,112,113],

                   [120,121,np.nan,123],

                   [np.nan,131,132,np.nan]],

                  columns=['c1','c2','c3','c4'],index=['t1','t2','t3','t4'])

df0 = df.fillna(0)

df1 = df.fillna({'c1':-1,'c4':-4})

df2 = df.fillna(method='ffill')

df3 = df.fillna(method='bfill')

df4 = df.fillna(method='ffill', axis=1)

df5 = df.fillna(method='bfill', axis=1)

df6 = df.ffill()

df7 = df.bfill()

df8 = df.ffill(axis=1)

df9 = df.bfill(axis=1)

print('df\n',df); print('===========')

print('df0(替换为0)\n',df0); print('===========')

print('df1(基于字典)\n',df1); print('===========')

print('df2(向上)\n',df2); print('===========')

print('df3(向下)\n',df3); print('===========')

print('df4(向左)\n',df4); print('===========')

print('df5(向右)\n',df5); print('===========')

print('df6(向上)\n',df6); print('===========')

print('df7(向下)\n',df7); print('===========')

print('df8(向左)\n',df8); print('===========')

print('df9(向右)\n',df9)

执行结果：

df

        c1     c2     c3     c4

t1    NaN  101.0  102.0    NaN

t2  110.0    NaN  112.0  113.0

t3  120.0  121.0    NaN  123.0

t4    NaN  131.0  132.0    NaN

===========

df0(替换为0)

        c1     c2     c3     c4

t1    0.0  101.0  102.0    0.0

t2  110.0    0.0  112.0  113.0

t3  120.0  121.0    0.0  123.0

t4    0.0  131.0  132.0    0.0

===========

df1(基于字典)

        c1     c2     c3     c4

t1   -1.0  101.0  102.0   -4.0

t2  110.0    NaN  112.0  113.0

t3  120.0  121.0    NaN  123.0

t4   -1.0  131.0  132.0   -4.0

===========

df2(向上)

        c1     c2     c3     c4

t1    NaN  101.0  102.0    NaN

t2  110.0  101.0  112.0  113.0

t3  120.0  121.0  112.0  123.0

t4  120.0  131.0  132.0  123.0

===========

df3(向下)

        c1     c2     c3     c4

t1  110.0  101.0  102.0  113.0

t2  110.0  121.0  112.0  113.0

t3  120.0  121.0  132.0  123.0

t4    NaN  131.0  132.0    NaN

===========

df4(向左)

        c1     c2     c3     c4

t1    NaN  101.0  102.0  102.0

t2  110.0  110.0  112.0  113.0

t3  120.0  121.0  121.0  123.0

t4    NaN  131.0  132.0  132.0

===========

df5(向右)

        c1     c2     c3     c4

t1  101.0  101.0  102.0    NaN

t2  110.0  112.0  112.0  113.0

t3  120.0  121.0  123.0  123.0

t4  131.0  131.0  132.0    NaN

===========

df6(向上)

        c1     c2     c3     c4

t1    NaN  101.0  102.0    NaN

t2  110.0  101.0  112.0  113.0

t3  120.0  121.0  112.0  123.0

t4  120.0  131.0  132.0  123.0

===========

df7(向下)

        c1     c2     c3     c4

t1  110.0  101.0  102.0  113.0

t2  110.0  121.0  112.0  113.0

t3  120.0  121.0  132.0  123.0

t4    NaN  131.0  132.0    NaN

===========

df8(向左)

        c1     c2     c3     c4

t1    NaN  101.0  102.0  102.0

t2  110.0  110.0  112.0  113.0

t3  120.0  121.0  121.0  123.0

t4    NaN  131.0  132.0  132.0

===========

df9(向右)

        c1     c2     c3     c4

t1  101.0  101.0  102.0    NaN

t2  110.0  112.0  112.0  113.0

t3  120.0  121.0  123.0  123.0

t4  131.0  131.0  132.0    NaN
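The output above shows that a single fill direction leaves NaN at one edge of the table. As a minimal sketch (chaining the two calls is my own illustration, not something the notes prescribe), a forward fill followed by a backward fill removes the remaining edge NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 101, 102, np.nan],
                   [110, np.nan, 112, 113],
                   [120, 121, np.nan, 123],
                   [np.nan, 131, 132, np.nan]],
                  columns=['c1', 'c2', 'c3', 'c4'],
                  index=['t1', 't2', 't3', 't4'])

# Forward fill alone cannot fill NaN in the first row: nothing lies above it.
forward_only = df.ffill()

# A second, backward pass fills whatever the forward pass left behind.
both_directions = df.ffill().bfill()

print(forward_only.isna().sum().sum())     # NaN count after ffill only
print(both_directions.isna().sum().sum())  # NaN count after ffill + bfill
```

Whether filling an edge value from the opposite direction is acceptable depends on the data; for price series it would leak later values into earlier rows.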

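（5）df.interpolate()

interpolate: v. to estimate values that lie between known data points

Syntax: df.interpolate(method='linear', axis=0, inplace=False)

Replaces the NaN values in df using interpolation

Parameters:

method: the interpolation method. The default is 'linear' (linear interpolation); other methods can also be chosen
axis: the interpolation direction. The default 0 interpolates vertically (down each column); axis=1 interpolates horizontally (along each row)
inplace: the default is False, in which case df itself is not modified and a separate variable is needed to receive the result; with inplace=True, df itself is modified directly

How NaN in the boundary rows (or boundary columns) of df is handled under linear interpolation with the default settings:

If axis=0, NaN in the top boundary row is left unchanged, and NaN in the bottom boundary row is replaced with the nearest valid value above it (filled from above)
If axis=1, NaN in the left boundary column is left unchanged, and NaN in the right boundary column is replaced with the nearest valid value to its left (filled from the left)

For interpolation when converting a time series from low to high frequency (upsampling), see "14. Time-Related Formats and Methods in Pandas - （9）df.resample() - Converting low frequency to high frequency (upsampling): via linear interpolation" in this chapter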

import numpy as np

import pandas as pd

df = pd.DataFrame([[10,np.nan,30],[np.nan,50,np.nan],[70,np.nan,90]],columns=['c1','c2','c3'],index=['t1','t2','t3'])

df1 = df.interpolate()

df2 = df.interpolate(axis=1)

print('df\n',df); print('===========')

print('df1\n',df1); print('===========')

print('df2\n',df2)

执行结果：

df

       c1    c2    c3

t1  10.0   NaN  30.0

t2   NaN  50.0   NaN

t3  70.0   NaN  90.0

===========

df1

       c1    c2    c3

t1  10.0   NaN  30.0

t2  40.0  50.0  60.0

t3  70.0  50.0  90.0

===========

df2

       c1    c2    c3

t1  10.0  20.0  30.0

t2   NaN  50.0  50.0

t3  70.0  80.0  90.0
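The boundary behaviour described above is the default (forward) direction. As a small sketch, assuming the same sample data, the `limit_direction='both'` argument of `interpolate()` also fills the leading NaN; the variable names are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[10, np.nan, 30], [np.nan, 50, np.nan], [70, np.nan, 90]],
                  columns=['c1', 'c2', 'c3'], index=['t1', 't2', 't3'])

# Default: interior gaps are interpolated, trailing NaN is filled,
# leading NaN (t1 of c2) is kept.
default_fill = df.interpolate()

# limit_direction='both' also fills the leading NaN with the nearest valid value.
both_fill = df.interpolate(limit_direction='both')

print(default_fill['c2'].tolist())
print(both_fill['c2'].tolist())
```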

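10. Methods of DataFrame Objects and of the Pandas Module

（0）Unary universal functions

Because Pandas is built on NumPy, most of the universal functions in the numpy module also work on df, for example:

Square root: np.sqrt(df)

Rounding: df.round(n), np.round(df,n) (note that exact halves are rounded to the nearest even digit, not always upward)

（1）More unary functions

① Testing whether each element of df is in a given list

Syntax: df.isin(list)

Returns a DataFrame of booleans with the same structure as df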

import numpy as np

import pandas as pd

df = pd.DataFrame([[1,2],[3,4]],columns=['c1','c2'],index=['t1','t2'])

print(df.isin([2,3,5,7,11,13,17]))

执行结果

         c1     c2

t1  False   True

t2   True  False
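A common use of isin() is on a single column, where the boolean result serves as a row filter. A minimal sketch on the same small table (the list of allowed values is arbitrary):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['c1', 'c2'], index=['t1', 't2'])

# Series.isin() returns a boolean Series aligned with the rows of df.
mask = df['c1'].isin([2, 3, 5, 7])

# Boolean indexing keeps only the rows where the mask is True.
selected = df[mask]
print(selected)
```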

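② Computing the relative percentage change down each column

Syntax: df.pct_change(periods=1)

Returns a DataFrame in which every entry is a relative change, i.e.:

\[\text{value at row } m\text{, column } n \text{ of the new DataFrame}=\frac{\text{value at row } m\text{, column } n \text{ of df}}{\text{value at row } (m-\text{periods})\text{, column } n \text{ of df}}-1
\]

periods is the number of rows between the numerator and the denominator; the default is 1. When periods>0, each cell is divided by a cell above it and 1 is subtracted; when periods<0, each cell is divided by a cell below it and 1 is subtracted

df.pct_change() is commonly used to compute the daily returns of asset prices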

import numpy as np

import pandas as pd

df = pd.DataFrame({'600001': [10,11,12,13,14],'600002': [20,21,22,23,24],'600003':[30,31,32,33,34]},index=['t1','t2','t3','t4','t5'])

print('df\n',df); print('===========')

print(df.pct_change()); print('===========')	# 常见的收益率计算方式

print(df.pct_change(-1))

执行结果：

df

     600001  600002  600003

t1      10      20      30

t2      11      21      31

t3      12      22      32

t4      13      23      33

t5      14      24      34

===========

      600001    600002    600003

t1       NaN       NaN       NaN

t2  0.100000  0.050000  0.033333

t3  0.090909  0.047619  0.032258

t4  0.083333  0.045455  0.031250

t5  0.076923  0.043478  0.030303

===========

      600001    600002    600003

t1 -0.090909 -0.047619 -0.032258

t2 -0.083333 -0.045455 -0.031250

t3 -0.076923 -0.043478 -0.030303

t4 -0.071429 -0.041667 -0.029412

t5       NaN       NaN       NaN
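To make the definition concrete, the sketch below recomputes the result by hand with df.shift() and compares it with pct_change(). The column names follow the example above; the comparison itself is my addition:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'600001': [10, 11, 12, 13, 14],
                   '600002': [20, 21, 22, 23, 24]},
                  index=['t1', 't2', 't3', 't4', 't5'])

by_method = df.pct_change()
# shift(1) moves every value one row down, so each row is divided by the row above.
by_hand = df / df.shift(1) - 1

backward = df.pct_change(-1)
# shift(-1) moves every value one row up, so each row is divided by the row below.
by_hand_backward = df / df.shift(-1) - 1

print(np.allclose(by_method, by_hand, equal_nan=True))
print(np.allclose(backward, by_hand_backward, equal_nan=True))
```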

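（2）Binary universal functions

Because Pandas is built on NumPy, most of the universal functions in the numpy module also work on df

（3）Statistical methods

① df.min() and df.idxmin()

Syntax: df.min(axis=0)

Computes the minimum of the given DataFrame by column (or by row) and returns a Series

Parameter axis:

The default is 0, i.e. the minimum of each column. For a DataFrame with m rows and n columns, the returned Series has length n; its index is the DataFrame's column labels and its values are the minimum of each column
If axis=1, the minimum of each row is computed. For a DataFrame with m rows and n columns, the returned Series has length m; its index is the DataFrame's row labels and its values are the minimum of each row

Syntax: df.idxmin(axis=0)

Returns, by column (or by row), the row label (or column label) at which the minimum of the given DataFrame occurs, as a Series

Parameter axis: the default 0 looks for the minimum of each column; axis=1 looks for the minimum of each row

Note: DataFrame has no argmin() method; use df.idxmin() to get the label of the minimum. (Series.argmin() does exist, but in current pandas it returns the integer position rather than the label.)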

import numpy as np

import pandas as pd

arr = np.array([[1,2],[4,3]])

df = pd.DataFrame(arr,columns=['c1','c2'],index=['t1','t2'])

print(df); print('===========')

print(df.min()); print('-----------')       # 寻找每一列最小的值

print(df.min(axis=1)); print('===========') # 寻找每一行最小的值

print(df.idxmin()); print('-----------')    # 寻找每一列最小的值对应的行标签

print(df.idxmin(axis=1))                    # column label of the minimum in each row

执行结果：

    c1  c2

t1   1   2

t2   4   3

===========

c1    1

c2    2

dtype: int32

-----------

t1    1

t2    3

dtype: int32

===========

c1    t1

c2    t1

dtype: object

-----------

t1    c1

t2    c2

dtype: object

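② df.max() and df.idxmax()

Syntax: df.max(axis=0)

Computes the maximum of the given DataFrame by column (or by row) and returns a Series

Parameter axis:

The default is 0, i.e. the maximum of each column. For a DataFrame with m rows and n columns, the returned Series has length n; its index is the DataFrame's column labels and its values are the maximum of each column
If axis=1, the maximum of each row is computed. For a DataFrame with m rows and n columns, the returned Series has length m; its index is the DataFrame's row labels and its values are the maximum of each row

Syntax: df.idxmax(axis=0)

Returns, by column (or by row), the row label (or column label) at which the maximum of the given DataFrame occurs, as a Series

Parameter axis: the default 0 looks for the maximum of each column; axis=1 looks for the maximum of each row

Note: DataFrame has no argmax() method; use df.idxmax() to get the label of the maximum. (Series.argmax() does exist, but in current pandas it returns the integer position rather than the label.)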

import numpy as np

import pandas as pd

arr = np.array([[1,2],[4,3]])

df = pd.DataFrame(arr,columns=['c1','c2'],index=['t1','t2'])

print(df); print('===========')

print(df.max()); print('-----------')       # 寻找每一列最大的值

print(df.max(axis=1)); print('===========') # 寻找每一行最大的值

print(df.idxmax()); print('-----------')    # 寻找每一列最大的值对应的行标签

print(df.idxmax(axis=1))                    # column label of the maximum in each row

执行结果：

    c1  c2

t1   1   2

t2   4   3

===========

c1    4

c2    3

dtype: int32

-----------

t1    2

t2    4

dtype: int32

===========

c1    t2

c2    t2

dtype: object

-----------

t1    c2

t2    c1

dtype: object
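df.max() and df.idxmax() work per column or per row. To locate the single largest cell of the whole table, the two can be combined. This is one possible sketch, not the only way to do it:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 3]], columns=['c1', 'c2'], index=['t1', 't2'])

# Step 1: df.max() gives the maximum of each column; idxmax() on that
# Series returns the label of the column holding the overall maximum.
col_of_max = df.max().idxmax()

# Step 2: inside that column, idxmax() returns the row label of the maximum.
row_of_max = df[col_of_max].idxmax()

largest = df.loc[row_of_max, col_of_max]
print(row_of_max, col_of_max, largest)
```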

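③ df.sum()

Syntax: df.sum(axis=0)

Sums the data of the given DataFrame by column (or by row) and returns a Series

Parameter axis:

The default is 0, i.e. the sum of each column. For a DataFrame with m rows and n columns, the returned Series has length n; its index is the DataFrame's column labels and its values are the sum of all data in each column
If axis=1, the sum of each row is computed. For a DataFrame with m rows and n columns, the returned Series has length m; its index is the DataFrame's row labels and its values are the sum of all data in each row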

import numpy as np

import pandas as pd

arr = np.array([[1,2],[3,4]])

df = pd.DataFrame(arr,columns=['c1','c2'],index=['t1','t2'])

print(df); print('===========')

print(df.sum(),type(df.sum())); print('===========')    # 默认按列求和

print(df.sum(axis=1))                    				# 按行求和

执行结果：

    c1  c2

t1   1   2

t2   3   4

===========

c1    4

c2    6

dtype: int64 <class 'pandas.core.series.Series'>

===========

t1    3

t2    7

dtype: int64

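④ df.mean()

Syntax: df.mean(axis=0)

Computes the arithmetic mean of the data of the given DataFrame by column (or by row) and returns a Series

Parameter axis:

The default is 0, i.e. the mean of each column. For a DataFrame with m rows and n columns, the returned Series has length n; its index is the DataFrame's column labels and its values are the arithmetic mean of all data in each column
If axis=1, the mean of each row is computed. For a DataFrame with m rows and n columns, the returned Series has length m; its index is the DataFrame's row labels and its values are the arithmetic mean of all data in each row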

import numpy as np

import pandas as pd

arr = np.array([[1,2],[3,4]])

df = pd.DataFrame(arr,columns=['c1','c2'],index=['t1','t2'])

print(df); print('===========')

print(df.mean(),type(df.mean())); print('===========')    # 默认按列求平均

print(df.mean(axis=1),type(df.mean()))                    # 按行求平均

执行结果：

    c1  c2

t1   1   2

t2   3   4

===========

c1    2.0

c2    3.0

dtype: float64 <class 'pandas.core.series.Series'>

===========

t1    1.5

t2    3.5

dtype: float64 <class 'pandas.core.series.Series'>

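⑤ df.count()

Syntax: df.count(axis=0)

Counts the non-missing values of the given DataFrame by column (or by row) and returns a Series

Parameter axis:

The default is 0, i.e. counting by column. For a DataFrame with m rows and n columns, the returned Series has length n; its index is the DataFrame's column labels and its values are the number of non-missing values in each column
If axis=1, counting is done by row. For a DataFrame with m rows and n columns, the returned Series has length m; its index is the DataFrame's row labels and its values are the number of non-missing values in each row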

import numpy as np

import pandas as pd

arr = np.array([[np.nan,2],[3,4]])							# 手动输入NaN的方式

df = pd.DataFrame(arr,columns=['c1','c2'],index=['t1','t2'])

print(df); print('===========')

print(df.count(),type(df.count())); print('===========')    # 默认按列计数

print(df.count(axis=1))                    					# 按行计数

执行结果：

     c1   c2

t1  NaN  2.0

t2  3.0  4.0

===========

c1    1

c2    2

dtype: int64 <class 'pandas.core.series.Series'>

===========

t1    1

t2    2

dtype: int64
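Since count() ignores NaN, it helps to see how the other statistics treat missing values. A short sketch, assuming the same two-by-two table: sum() and mean() skip NaN by default, and the `skipna=False` argument turns that off.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2], [3, 4]], columns=['c1', 'c2'], index=['t1', 't2'])

mean_skip = df.mean()                # NaN is ignored: c1 is averaged over one value
mean_strict = df.mean(skipna=False)  # any NaN in a column makes its mean NaN

# Number of missing values per column = total rows - non-missing count.
missing = len(df) - df.count()

print(mean_skip, mean_strict, missing, sep='\n')
```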

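⑥ Cumulative calculations: df.cumsum(), df.cumprod(), df.cummax(), df.cummin()

Syntax: df.cumXXX(axis=0)

Computes the cumulative sum, cumulative product, cumulative maximum or cumulative minimum of the given DataFrame by column (or by row) and returns a DataFrame

Parameter axis: the default 0 computes down each column; axis=1 computes along each row

Notes:

The cumulative maximum is the maximum over the range from the first value up to the current value; the cumulative minimum is defined in the same way
numpy.ndarray has no cummax() or cummin() method (the NumPy equivalents are np.maximum.accumulate and np.minimum.accumulate)
The cumulative maximum is often used to compute the maximum drawdown
Unlike numpy.ndarray, a DataFrame does not flatten multi-dimensional data to one dimension before computing cumulative values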

import numpy as np

import pandas as pd

df = pd.DataFrame({'c1':[1,-2,3,4,-5,6],'c2':[10,20,-30,-40,50,-60]},index=['t1','t2','t3','t4','t5','t6'])

df1 = df.cumsum()

df2 = df.cumsum(axis=1)

df3 = df.cumprod()

df4 = df.cummax()

df5 = df.cummin()

print('df\n', df); print('===========')

print('df1\n', df1); print('===========')

print('df2\n', df2); print('===========')

print('df3\n', df3); print('===========')

print('df4\n', df4); print('===========')

print('df5\n', df5)

执行结果：

df

     c1  c2

t1   1  10

t2  -2  20

t3   3 -30

t4   4 -40

t5  -5  50

t6   6 -60

===========

df1

     c1  c2

t1   1  10

t2  -1  30

t3   2   0

t4   6 -40

t5   1  10

t6   7 -50

===========

df2

     c1  c2

t1   1  11

t2  -2  18

t3   3 -27

t4   4 -36

t5  -5  45

t6   6 -54

===========

df3

      c1         c2

t1    1         10

t2   -2        200

t3   -6      -6000

t4  -24     240000

t5  120   12000000

t6  720 -720000000

===========

df4

     c1  c2

t1   1  10

t2   1  20

t3   3  20

t4   4  20

t5   4  50

t6   6  50

===========

df5

     c1  c2

t1   1  10

t2  -2  10

t3  -2 -30

t4  -2 -40

t5  -5 -40

t6  -5 -60
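The note on maximum drawdown can be sketched as follows. The price series is made up for illustration, and drawdown is taken here as the relative fall from the running peak, which is one common definition:

```python
import pandas as pd

# Hypothetical price path, used only for illustration.
prices = pd.Series([100.0, 120.0, 90.0, 110.0, 80.0, 130.0])

# Highest price seen so far at each point in time.
running_max = prices.cummax()

# Relative distance below the running peak (0 at a new high, negative otherwise).
drawdown = prices / running_max - 1

max_drawdown = drawdown.min()   # the deepest fall from a previous peak
trough = drawdown.idxmin()      # position at which that fall bottoms out

print(max_drawdown, trough)
```

Here the peak of 120 is followed by a low of 80, so the maximum drawdown is 80/120 - 1, about -33.3%.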

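⑦ df.corr(): correlation matrix

Syntax: df.corr()

Returns the matrix of pairwise correlation coefficients between the columns of df (as a DataFrame)

Note: the corr() method of a Series can also be used directly to compute the correlation coefficient of two Series, as s1.corr(s2)

# Correlation matrix of the SSE Composite, SZSE Component and CSI 300 indices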

import numpy as np

import pandas as pd

import tushare as ts

df_close = pd.DataFrame({

    'sh':ts.get_k_data('sh', start='2019-01-01',end='2019-06-30')['close'],

    'sz':ts.get_k_data('sz', start='2019-01-01',end='2019-06-30')['close'],

    'hs300':ts.get_k_data('hs300', start='2019-01-01',end='2019-06-30')['close'],

})

df_return = df_close.pct_change().fillna(0)

df_corr = df_return.corr()

print(df_return.head()); print('===========')

print(df_corr); print(type(df_corr))

执行结果：

         sh        sz     hs300

0  0.000000  0.000000  0.000000

1 -0.000377 -0.008369 -0.001583

2  0.020496  0.027562  0.023957

3  0.007245  0.015836  0.006071

4 -0.002617 -0.001155 -0.002161

===========

             sh        sz     hs300

sh     1.000000  0.953529  0.976994

sz     0.953529  1.000000  0.941665

hs300  0.976994  0.941665  1.000000

<class 'pandas.core.frame.DataFrame'>
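The example above needs network access through tushare. As an offline sketch with made-up return series (column names 'a', 'b', 'c' are hypothetical), the same two calls can be checked against cases whose correlation is known in advance:

```python
import pandas as pd

# Made-up returns: 'b' is exactly 2 * 'a', 'c' is exactly -1 * 'a'.
returns = pd.DataFrame({'a': [0.01, 0.02, -0.01, 0.03],
                        'b': [0.02, 0.04, -0.02, 0.06],
                        'c': [-0.01, -0.02, 0.01, -0.03]})

corr_matrix = returns.corr()               # DataFrame of pairwise correlations
pair = returns['a'].corr(returns['b'])     # single coefficient between two Series

print(corr_matrix)
print(pair)
```

A perfectly proportional pair gives a coefficient of 1, and a sign-flipped pair gives -1, up to floating-point error.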

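⑧ df.describe()

Syntax: df.describe(percentiles=None)

Computes descriptive statistics for each column of the given DataFrame (count, mean, std, min, the 50% percentile, max and the requested percentiles) and returns a DataFrame

Parameter percentiles: the list of percentiles to include, given as a list of floats between 0 and 1. The default is None, in which case the 25% and 75% percentiles are reported.

Note: this method has no axis parameter, so it cannot compute statistics by row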

import numpy as np

import pandas as pd

arr = np.array([[np.nan,2],[3,4],[5,6]])

df = pd.DataFrame(arr,columns=['c1','c2'],index=['t1','t2','t3'])

print(df); print('===========')

print(df.describe(),'\n',type(df.describe())); print('===========')

print(df.describe(percentiles=[0.05,0.95]),'\n',type(df.describe(percentiles=[0.05,0.95])))

执行结果：

     c1   c2

t1  NaN  2.0

t2  3.0  4.0

t3  5.0  6.0

===========

             c1   c2

count  2.000000  3.0

mean   4.000000  4.0

std    1.414214  2.0

min    3.000000  2.0

25%    3.500000  3.0

50%    4.000000  4.0

75%    4.500000  5.0

max    5.000000  6.0

 <class 'pandas.core.frame.DataFrame'>

===========

             c1   c2

count  2.000000  3.0

mean   4.000000  4.0

std    1.414214  2.0

min    3.000000  2.0

5%     3.100000  2.2

50%    4.000000  4.0

95%    4.900000  5.8

max    5.000000  6.0

 <class 'pandas.core.frame.DataFrame'>
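Because describe() has no axis parameter, one workaround for row-wise statistics is to transpose first. This is a sketch of that idea on the same data, not something the notes themselves propose:

```python
import numpy as np
import pandas as pd

arr = np.array([[np.nan, 2], [3, 4], [5, 6]])
df = pd.DataFrame(arr, columns=['c1', 'c2'], index=['t1', 't2', 't3'])

# After transposing, the original rows t1..t3 become columns,
# so describe() summarises each original row.
row_stats = df.T.describe()
print(row_stats)
```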

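⑨ df.resample(): resampling

See "14. Time-Related Formats and Methods in Pandas - （9）df.resample() - Converting low frequency to high frequency (upsampling): via linear interpolation" in this chapter

In addition, for df.interpolate(), the interpolation method for filling missing values in a DataFrame, see "9. Handling Missing Values (NaN) in a DataFrame - （5）df.interpolate()" in this chapter

⑩ df.rolling(): rolling time window

See "14. Time-Related Formats and Methods in Pandas - （10）df.rolling(): rolling time window" in this chapter

（4）Saving a DataFrame to a local file

There are two families of methods:

The df.to_xxx() family of methods
HDF5-based storage

For details see "AQF Notes - Part 2 - Chapter 7 - Processing Financial Data Sources - II. Storing Financial Data"

（5）Other important methods

① df.apply()

Syntax: df.apply(func, axis=0)

Passes df to func one column (or one row) at a time, each as a Series, and calls func() on it. When func returns a scalar, the return values are assembled into a Series, which is the return value of df.apply() as a whole.

Parameters:

func: the name of a function that has already been defined, or an anonymous function
axis: the default 0 passes one column at a time; axis=1 passes one row at a time

The group object group_obj created when grouping and aggregating a DataFrame has a similar apply() method (one difference is that group_obj.apply(func) has no axis parameter). For how it works and how it is used, see "11. Grouping and Aggregating a DataFrame - （3）Using the group object group_obj - ⑥ Implementing custom logic with group_obj.apply()" in this chapter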

import numpy as np

import pandas as pd

# 从CSV文件读取数据，进行处理后，保留前5行的'code','name','roe'三列

data = pd.read_csv('2019Q1.csv')

data = data.sort_values('code').reset_index()

data = data[['code','name','roe']].head(5)

print(data); print('===========')

# 按ROE对股票进行分类的函数

def map_func(x):

    print('----map_func内部开始----')

    print(x)

    print(type(x))

    print('----map_func内部结束----')

    if x['roe'] > 4:

        return '高成长'

    elif x['roe'] >= 0:

        return '低成长'

    elif x['roe'] < 0:

        return '亏损'

# 执行data.apply()，axis=1代表按行取Series传给map_func()

result = data.apply(map_func, axis=1)

print('===========')

print(result)

print(type(result))

print('===========')

# 根据 ROE 数据计算“成长性”，并将此列添加到data

data['成长性'] = result    # 相当于data['成长性'] = data.apply(map_func, axis=1)

print(data)

执行结果：

   code  name   roe

0     1  平安银行  2.96

1     2   万科A  0.71

2     4  国农科技  4.66

3     5  世纪星源 -1.20

4     6  深振业A  1.75

===========

----map_func内部开始----

code       1

name    平安银行

roe     2.96

Name: 0, dtype: object

<class 'pandas.core.series.Series'>

----map_func内部结束----

----map_func内部开始----

code       2

name     万科A

roe     0.71

Name: 1, dtype: object

<class 'pandas.core.series.Series'>

----map_func内部结束----

----map_func内部开始----

code       4

name    国农科技

roe     4.66

Name: 2, dtype: object

<class 'pandas.core.series.Series'>

----map_func内部结束----

----map_func内部开始----

code       5

name    世纪星源

roe     -1.2

Name: 3, dtype: object

<class 'pandas.core.series.Series'>

----map_func内部结束----

----map_func内部开始----

code       6

name    深振业A

roe     1.75

Name: 4, dtype: object

<class 'pandas.core.series.Series'>

----map_func内部结束----

===========

0    低成长

1    低成长

2    高成长

3     亏损

4    低成长

dtype: object

<class 'pandas.core.series.Series'>

===========

   code  name   roe  成长性

0     1  平安银行  2.96  低成长

1     2   万科A  0.71  低成长

2     4  国农科技  4.66  高成长

3     5  世纪星源 -1.20   亏损

4     6  深振业A  1.75  低成长
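The example above depends on a local CSV file. The sketch below repeats the idea on an in-memory table so it can be run as is. The column values are copied from the output above, the English labels and the helper name `classify` are my own, and it also shows that when only one column is needed, Series.apply() gives the same result as a row-wise DataFrame.apply():

```python
import pandas as pd

data = pd.DataFrame({'code': [1, 2, 4, 5, 6],
                     'roe': [2.96, 0.71, 4.66, -1.20, 1.75]})

def classify(roe):
    """Hypothetical helper: label a single ROE value."""
    if roe > 4:
        return 'high growth'
    elif roe >= 0:
        return 'low growth'
    elif roe < 0:
        return 'loss'
    return None  # reached only for NaN

# axis=1: each row arrives as a Series, from which the 'roe' item is read.
row_wise = data.apply(lambda row: classify(row['roe']), axis=1)

# Series.apply(): each element of the 'roe' column arrives as a scalar.
column_only = data['roe'].apply(classify)

print(row_wise.equals(column_only))
```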

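② df.applymap()

Syntax: df.applymap(func)

Passes every element of df to func one at a time and calls func(); the return values are assembled into a new DataFrame with the same structure, which is the return value of df.applymap() as a whole.

Parameter: func: the name of a function that has already been defined, or an anonymous function

Note: from pandas 2.1 onward, df.applymap() is deprecated and the same method is available under the name df.map()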

import numpy as np

import pandas as pd

df1 = pd.DataFrame([[10,20],[30,40]],columns=['c1','c2'],index=['t1','t2'])

df2 = df1.applymap(lambda x:x+1)

print(df1); print('===========')

print(df2)

执行结果：

    c1  c2

t1  10  20

t2  30  40

===========

    c1  c2

t1  11  21

t2  31  41
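In pandas 2.1 and later the element-wise method is named DataFrame.map(), with applymap() kept as a deprecated alias. A hedged sketch that runs on both older and newer versions (the `hasattr` check is my own workaround, not an official recipe):

```python
import pandas as pd

df1 = pd.DataFrame([[10, 20], [30, 40]], columns=['c1', 'c2'], index=['t1', 't2'])

# Pick whichever element-wise method this pandas version provides.
elementwise = df1.map if hasattr(df1, 'map') else df1.applymap

df2 = elementwise(lambda x: x + 1)

# For simple arithmetic, the vectorized expression gives the same values
# without calling a Python function per element.
df3 = df1 + 1

print(df2)
print((df2.values == df3.values).all())
```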

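Summary of apply(), applymap() and map():

	apply()	applymap()	map()
Python built-in function	NA	NA	Iterates over every element
Series method	Iterates over every element	NA	Iterates over every element
DataFrame method	Iterates over rows or columns	Iterates over every element	NA (from pandas 2.1, the new name of applymap())

③ df.iterrows()

Syntax: df.iterrows()

Returns a generator that yields one tuple per row of df. Item 0 of the tuple is the row's index in df, and item 1 is a Series holding that row's data (the values of the row become the values of the Series, and the column index of df becomes the index of the Series). For both indexes, labels are used where they exist, otherwise positions. The two items are usually received by unpacking into two variables.

df.iterrows() is typically used to loop over every row of df.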

import numpy as np

import pandas as pd

df = pd.DataFrame([[10,20,30],[40,50,60],[70,80,90]],columns=['c1','c2','c3'],index=['t1','t2','t3'])

print(df); print('===========')

print(df.iterrows()); print('===========')

for i,j in df.iterrows():

    print(i); print('-----------')

    print(type(i)); print('-----------')

    print(j); print('-----------')

    print(type(j)); print('===========')

执行结果：

    c1  c2  c3

t1  10  20  30

t2  40  50  60

t3  70  80  90

===========

<generator object DataFrame.iterrows at 0x000000001431ECA8>

===========

t1

-----------

<class 'str'>

-----------

c1    10

c2    20

c3    30

Name: t1, dtype: int64

-----------

<class 'pandas.core.series.Series'>

===========

t2

-----------

<class 'str'>

-----------

c1    40

c2    50

c3    60

Name: t2, dtype: int64

-----------

<class 'pandas.core.series.Series'>

===========

t3

-----------

<class 'str'>

-----------

c1    70

c2    80

c3    90

Name: t3, dtype: int64

-----------

<class 'pandas.core.series.Series'>

===========
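One caveat worth a sketch: each row from iterrows() is a single Series, so columns of different types are converted to a common dtype. The pandas documentation suggests itertuples() when dtypes should be preserved. The table below is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 40, 70], 'c2': [0.5, 1.5, 2.5]},
                  index=['t1', 't2', 't3'])

# iterrows(): the int column and the float column share one row Series,
# so the row's dtype is float64.
for label, row in df.iterrows():
    first_row_dtype = row.dtype
    break

# itertuples(): each row is a namedtuple; the index is available as .Index
# and each column as an attribute named after the column.
totals = {}
for t in df.itertuples():
    totals[t.Index] = t.c1 + t.c2

print(first_row_dtype)
print(totals)
```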

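④ df.all()

Syntax: df.all(axis=0)

Returns a Series of booleans

Parameter axis:

The default is 0, evaluated by column: the index of the Series is the column index of df, and an entry is True when every value in that column of df is truthy, otherwise False
If axis=1, evaluated by row: the index of the Series is the row index of df, and an entry is True when every value in that row of df is truthy, otherwise False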

import numpy as np

import pandas as pd

a = pd.DataFrame([[0,0],[1,1]],columns=['c1','c2'],index=['t1','t2'])

b = pd.DataFrame([[1,1],[1,1]],columns=['c1','c2'],index=['t1','t2'])

print(a); print('-----------')

print(a.all()); print('-----------')

print(a.all(axis=1)); print('===========')

print(b); print('-----------')

print(b.all()); print('-----------')

print(b.all(axis=1))

执行结果：

    c1  c2

t1   0   0

t2   1   1

-----------

c1    False

c2    False

dtype: bool

-----------

t1    False

t2     True

dtype: bool

===========

    c1  c2

t1   1   1

t2   1   1

-----------

c1    True

c2    True

dtype: bool

-----------

t1    True

t2    True

dtype: bool

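⑤ df.any()

Syntax: df.any(axis=0)

Returns a Series of booleans

Parameter axis:

The default is 0, evaluated by column: the index of the Series is the column index of df, and an entry is True when at least one value in that column of df is truthy, otherwise False
If axis=1, evaluated by row: the index of the Series is the row index of df, and an entry is True when at least one value in that row of df is truthy, otherwise False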

import numpy as np

import pandas as pd

a = pd.DataFrame([[0,0],[1,1]],columns=['c1','c2'],index=['t1','t2'])

b = pd.DataFrame([[0,0],[0,0]],columns=['c1','c2'],index=['t1','t2'])

print(a); print('-----------')

print(a.any()); print('-----------')

print(a.any(axis=1)); print('===========')

print(b); print('-----------')

print(b.any()); print('-----------')

print(b.any(axis=1))

执行结果：

    c1  c2

t1   0   0

t2   1   1

-----------

c1    True

c2    True

dtype: bool

-----------

t1    False

t2     True

dtype: bool

===========

    c1  c2

t1   0   0

t2   0   0

-----------

c1    False

c2    False

dtype: bool

-----------

t1    False

t2    False

dtype: bool
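In practice all() and any() are usually applied to the boolean table produced by a comparison or by isna(), rather than to raw numbers. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, -2], [3, np.nan]], columns=['c1', 'c2'], index=['t1', 't2'])

# Is every value in each column positive? (NaN > 0 evaluates to False.)
all_positive = (df > 0).all()

# Does each column contain at least one missing value?
any_missing = df.isna().any()

# Which rows contain at least one missing value?
rows_with_missing = df.isna().any(axis=1)

print(all_positive, any_missing, rows_with_missing, sep='\n')
```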

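⑥ len(df)

Returns the length of df (an int), which equals the number of rows of df, i.e. df.shape[0]. Note that a DataFrame has no len() method (len(df) calls the special method df.__len__()), and that df.shape is an attribute, not a method, so it is written without parentheses

⑦ df.head()

Syntax: df.head(n=5)

Returns the first n rows of df as a DataFrame; n defaults to 5. For sample code see "6. Selecting Data from a DataFrame - （5）Selecting data with the attributes and methods of df"

This method is mainly used to preview a DataFrame with many rows quickly

⑧ df.tail()

Syntax: df.tail(n=5)

Returns the last n rows of df as a DataFrame; n defaults to 5. For sample code see "6. Selecting Data from a DataFrame - （5）Selecting data with the attributes and methods of df"

This method is mainly used to preview a DataFrame with many rows quickly

⑨ df.info()

Prints some basic information about df to the screen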

import numpy as np

import pandas as pd

df = pd.DataFrame([[1,2],[3,4]], columns=['c1','c2'], index=['t1','t2'])

df.info()

执行结果：

<class 'pandas.core.frame.DataFrame'>

Index: 2 entries, t1 to t2

Data columns (total 2 columns):

c1    2 non-null int64

c2    2 non-null int64

dtypes: int64(2)

memory usage: 48.0+ bytes

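⑩ df.duplicated()

Syntax: df.duplicated(subset=None, keep='first')

Tests whether each row of df is a duplicate and returns a Series of booleans

Parameters:

subset: the subset of columns to compare. The default is None, in which case two rows are duplicates only if all of their columns are equal; with subset='column label' or subset=['column label','column label',...], two rows are duplicates as soon as the listed columns are equal
keep: how duplicates are marked. The default is 'first'
- 'first': among duplicated rows, the first one is marked False and all the others True
- 'last': among duplicated rows, the last one is marked False and all the others True
- False: among duplicated rows, every row is marked True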

import numpy as np

import pandas as pd

df = pd.DataFrame([[1,3],[2,3],[2,3],[1,4],[2,4]])

df.columns = ['c1','c2']

df.index = ['t1','t2','t3','t4','t5']

print(df)

print('---\n',df.duplicated())                      # 所有列值都相等视为重复

print('---\n',df.duplicated(subset=['c1']))         # 'c1'列值相等视为重复

print('---\n',df.duplicated(subset=['c2']))         # rows with equal 'c2' values count as duplicates

print('---\n',df.duplicated(subset=['c1','c2']))    # 'c1','c2'列值都相等视为重复

执行结果：

    c1  c2

t1   1   3

t2   2   3

t3   2   3

t4   1   4

t5   2   4

---

 t1    False

t2    False

t3     True

t4    False

t5    False

dtype: bool

---

 t1    False

t2    False

t3     True

t4     True

t5     True

dtype: bool

---

 t1    False

t2     True

t3     True

t4    False

t5     True

dtype: bool

---

 t1    False

t2    False

t3     True

t4    False

t5    False

dtype: bool
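The boolean Series returned by duplicated() is typically used as a mask. The sketch below, on the same data, shows that negating it reproduces df.drop_duplicates(), and that keep=False selects every copy of a repeated row:

```python
import pandas as pd

df = pd.DataFrame([[1, 3], [2, 3], [2, 3], [1, 4], [2, 4]],
                  columns=['c1', 'c2'],
                  index=['t1', 't2', 't3', 't4', 't5'])

# Keep only the rows that are not marked as duplicates.
unique_rows = df[~df.duplicated()]
same_as_drop = unique_rows.equals(df.drop_duplicates())

# keep=False marks every member of a duplicated group, including the first.
all_copies = df[df.duplicated(keep=False)]

print(unique_rows)
print(same_as_drop)
print(all_copies)
```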

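⑾ df.rank()

Syntax:

df.rank(axis=0,method='average',numeric_only=None,na_option='keep',ascending=True,pct=False)

Returns a DataFrame with the same shape as df, in which each value is its rank among all the data in its column (or row)

Parameters:

axis: the default 0 ranks within each column; axis=1 ranks within each row
method: how ranks are assigned when there are ties. The default is 'average'; the allowed values are 'average', 'first', 'min', 'max' and 'dense'. Suppose the data ranked in ascending order are 100, 150, 150, 200:
- average: average rank. Tied items receive the mean of their ordinal ranks (1, 2.5, 2.5, 4)
- first: ordinal rank. Among tied items, the one that appears earlier in the DataFrame receives the smaller rank (1, 2, 3, 4). Note that method='first' does not support ranking non-numeric data
- min: minimum rank. Tied items receive the smallest of their ordinal ranks (1, 2, 2, 4)
- max: maximum rank. Tied items receive the largest of their ordinal ranks (1, 3, 3, 4)
- dense: dense rank. Each rank is either equal to the previous one or larger by one, with no gaps (1, 2, 2, 3)
numeric_only: bool, whether to rank only the numeric columns
na_option: whether and how NaN takes part in the ranking. The default is 'keep'; the allowed values are 'keep', 'top' and 'bottom':
- 'keep': the rank of NaN is NaN
- 'top': NaN is placed first in the ranking
- 'bottom': NaN is placed last in the ranking
ascending: bool, whether to rank in ascending order; the default is True
pct: bool, whether to express ranks as a fraction of the total; the default is False

Note: df.rank() ranks each field separately and cannot rank on several fields jointly; that requires df.groupby(['sort field 1','sort field 2',...]).ngroup()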

import numpy as np

import pandas as pd

df = pd.DataFrame({'animal': ['cat', 'penguin', 'dog','spider', 'snake'],

                   'legs': [4, 2, 4, 8, np.nan]})

print("df\n",df,'\n----')

print("df.rank(method='average')\n",df.rank(method='average'),'\n----')

print("df.legs.rank(method='first')\n",df.legs.rank(method='first'),'\n----')

print("df.rank(method='min')\n",df.rank(method='min'),'\n----')

print("df.rank(method='max')\n",df.rank(method='max'),'\n----')

print("df.rank(method='dense')\n",df.rank(method='dense'),'\n====')

print("df.rank(method='min',na_option='top')\n",df.rank(method='min',na_option='top'),'\n----')

print("df.rank(method='min',na_option='bottom')\n",df.rank(method='min',na_option='bottom'),'\n====')

print("df.rank(method='min',pct=True)\n",df.rank(method='min',pct=True))

执行结果：

df

     animal  legs

0      cat   4.0

1  penguin   2.0

2      dog   4.0

3   spider   8.0

4    snake   NaN

----

df.rank(method='average')

    animal  legs

0     1.0   2.5

1     3.0   1.0

2     2.0   2.5

3     5.0   4.0

4     4.0   NaN

----

df.legs.rank(method='first')

 0    2.0

1    1.0

2    3.0

3    4.0

4    NaN

Name: legs, dtype: float64

----

df.rank(method='min')

    animal  legs

0     1.0   2.0

1     3.0   1.0

2     2.0   2.0

3     5.0   4.0

4     4.0   NaN

----

df.rank(method='max')

    animal  legs

0     1.0   3.0

1     3.0   1.0

2     2.0   3.0

3     5.0   4.0

4     4.0   NaN

----

df.rank(method='dense')

    animal  legs

0     1.0   2.0

1     3.0   1.0

2     2.0   2.0

3     5.0   3.0

4     4.0   NaN

====

df.rank(method='min',na_option='top')

    animal  legs

0     1.0   3.0

1     3.0   2.0

2     2.0   3.0

3     5.0   5.0

4     4.0   1.0

----

df.rank(method='min',na_option='bottom')

    animal  legs

0     1.0   2.0

1     3.0   1.0

2     2.0   2.0

3     5.0   4.0

4     4.0   5.0

====

df.rank(method='min',pct=True)

    animal  legs

0     0.2  0.50

1     0.6  0.25

2     0.4  0.50

3     1.0  1.00

4     0.8   NaN
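The tie-handling rules listed for the method parameter can be checked directly on the values 100, 150, 150, 200 used in the description. A small sketch (the dictionary of results is only a convenience for printing):

```python
import pandas as pd

s = pd.Series([100, 150, 150, 200])

# Rank the same data with each tie-breaking method.
ranks = {m: s.rank(method=m).tolist()
         for m in ['average', 'first', 'min', 'max', 'dense']}

# Descending 'min' ranks, as in a leaderboard where the largest value is first.
descending = s.rank(method='min', ascending=False).tolist()

for name, values in ranks.items():
    print(name, values)
print('descending min', descending)
```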

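11. Grouping and Aggregating a DataFrame

（1）Creating the group object group_obj

Grouping and aggregation of a DataFrame are both built on a group object group_obj (an instance of the class pandas.core.groupby.generic.DataFrameGroupBy), so the first step is to create it. The syntax is:

Grouping by one column: group_obj = df.groupby('column label')
Grouping by several columns jointly: group_obj = df.groupby(['column label','column label',...])

（2）Structure of the group object group_obj

Suppose the grouping rule defined when group_obj was created yields n groups. Iterating over group_obj then produces n tuples: item 0 of each tuple is the group key (the value of the grouping column, or a tuple of values when grouping by several columns), and item 1 is the DataFrame holding the rows of that group.

（3）Using the group object group_obj

① Viewing a single statistic of df with a specific method of group_obj

group_obj.size(): the number of records in each group, including records with missing values (Series)
group_obj.max(): the maximum of each group (DataFrame)
group_obj.min(): the minimum of each group (DataFrame)
group_obj.sum(): the sum of each group (DataFrame)
group_obj.mean(): the mean of each group (DataFrame)
group_obj.std(): the standard deviation of each group (DataFrame)
group_obj.count(): the number of non-missing records in each group (DataFrame)
group_obj.cumsum(): the running sum down the rows within each group (a DataFrame with as many rows as the original)
group_obj.cumprod(): the running product down the rows within each group (a DataFrame with as many rows as the original)
group_obj.cumcount(): the running count down the rows within each group, starting at 0 (a Series with as many rows as the original DataFrame)

② Viewing all statistics of df with group_obj.describe()

group_obj.describe(): wide layout; not recommended because the table becomes too long (DataFrame)
group_obj.describe().T: tall layout; recommended (DataFrame)

③ Viewing selected statistics with group_obj.agg()

group_obj.agg([np.mean, np.std]): the mean and standard deviation of every field (DataFrame)
group_obj.agg({'c1':np.mean, 'c2':np.std}): the mean of field 'c1' and the standard deviation of field 'c2' (DataFrame)
Note: recent pandas versions (2.1 and later) emit a FutureWarning when NumPy functions such as np.mean are passed to agg(); the string names 'mean' and 'std' are the equivalent spelling

④ Getting the data of one group with group_obj.get_group()

When grouping by one column: group_obj.get_group('a value of the grouping field')
When grouping by several columns jointly: group_obj.get_group(('a value of grouping field 1','a value of grouping field 2',...))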
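The methods listed above can be tried on a very small table before reading the longer examples. This is a minimal sketch with made-up data; it uses the string names 'mean' and 'std' in agg(), which I assume to be equivalent to the NumPy functions used in these notes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'g1': ['M', 'N', 'M', 'N', 'M'],
                   'c1': [1.0, 2.0, 3.0, np.nan, 5.0],
                   'c2': [10.0, 20.0, 30.0, 40.0, 50.0]})

group_obj = df.groupby('g1')

# Iterating yields (group key, sub-DataFrame) pairs.
keys = [key for key, sub_df in group_obj]

sizes = group_obj.size()      # rows per group, NaN included
counts = group_obj.count()    # non-missing values per group and column
stats = group_obj.agg({'c1': 'mean', 'c2': 'std'})
group_m = group_obj.get_group('M')

print(keys, sizes, counts, stats, group_m, sep='\n')
```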

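⑤ Joint ranking on several fields with group_obj.ngroup()

Syntax: group_obj.ngroup(ascending=True)

Returns a Series that numbers the groups. Records are compared on the grouping fields given when group_obj was created: the first field is compared first, the second field when the first is tied, the third when the second is tied, and so on. Two records receive the same rank only when all of their grouping fields are equal (note: numbering starts at 0)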

import numpy as np

import pandas as pd

# 创建DataFrame

data = [('a', 70, 5),

        ('b', 80, 4),

        ('c', 70, 4),

        ('d', 70, 5),

        ('e', 80, 5),

        ('f', 75, 4)]

df = pd.DataFrame(data, columns=['name', 'score', 'homework'])

df = df.sort_values(['score', 'homework'], ascending=False)

print(df, '\n----------')

s = df.groupby(['score', 'homework']).ngroup(ascending=False)

print(s, '\n----------\n', type(s), '\n----------')

df['ranking'] = s + 1

print(df)

执行结果：

  name  score  homework

4    e     80         5

1    b     80         4

5    f     75         4

0    a     70         5

3    d     70         5

2    c     70         4

----------

4    0

1    1

5    2

0    3

3    3

2    4

dtype: int64

----------

 <class 'pandas.core.series.Series'>

----------

  name  score  homework  ranking

4    e     80         5        1

1    b     80         4        2

5    f     75         4        3

0    a     70         5        4

3    d     70         5        4

2    c     70         4        5

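⑥ Implementing custom logic with group_obj.apply()

Syntax: group_obj.apply(func)

Parameter func: the name of a function that has already been defined, or an anonymous function

Passes the DataFrame of each group in group_obj to func and calls func(), then concatenates the return values vertically into a DataFrame or Series and adds the group keys as the outermost level of its row index (several levels when grouping by several columns jointly). The result is the return value of group_obj.apply() as a whole. For group_obj.apply() in use, see Example 3 below.

A DataFrame object has a similar apply() method (one difference is that df.apply(func,axis=0) takes an axis parameter). For how it works and how it is used, see "10. Methods of DataFrame Objects and of the Pandas Module - （5）Other important methods - ① df.apply()" in this chapter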
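Before the longer examples, here is a compact sketch of the apply() mechanics described above. To keep it independent of how different pandas versions treat the grouping column, it selects one column first, so each group reaches the function as a Series rather than a DataFrame; the data and lambdas are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'g1': ['M', 'N', 'M', 'N', 'M'],
                   'c1': [1, 2, 3, 4, 5]})

grouped = df.groupby('g1')['c1']

# func returns a scalar per group: the result is indexed by the group key.
spread = grouped.apply(lambda s: s.max() - s.min())

# func returns a Series per group: the pieces are stacked vertically and
# the group key becomes the outermost level of the row index.
top2 = grouped.apply(lambda s: s.nlargest(2))

print(spread)
print(top2)
```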

# 例1：本例仅用于研究分组对象group_obj的构成（group_obj的应用见例2）

import numpy as np

import pandas as pd

# 创建数据集

np.random.seed(0)

period = pd.date_range('2019-9-22', periods=1000, freq='D')

df = pd.DataFrame(np.random.randn(1000, 2), columns=['c1','c2'], index = period)

df['g1'] = np.random.choice(['M', 'N'], 1000)

df['g2'] = np.random.choice(['X', 'Y'], 1000)

for i in period:    # 随机产生空值

    if np.random.random() < 0.05: df.loc[i,'c1'] = np.nan

    if np.random.random() < 0.05: df.loc[i,'c2'] = np.nan

print('df.head()\n',df.head()); print('===========')

# 创建分组对象（由于这里是研究分组对象group_obj的构成，所以仅查看按照两列联合分组的结果）

# group_obj = df.groupby('g1')				# 按照一列进行分组

group_obj = df.groupby(['g1','g2'])		# 按照两列联合分组

for i in group_obj:

    print('开始一次group_obj的迭代\n',type(i)); print('-----------')

    for j in i:

        print(j); print('-----------')

        print(type(j)); print('-----------')

执行结果：

df.head()

                   c1        c2 g1 g2

2019-09-22       NaN  0.400157  N  X

2019-09-23  0.978738  2.240893  N  Y

2019-09-24  1.867558 -0.977278  M  X

2019-09-25  0.950088 -0.151357  N  X

2019-09-26 -0.103219  0.410599  M  Y

===========

开始一次group_obj的迭代

 <class 'tuple'>

-----------

('M', 'X')

-----------

<class 'tuple'>

-----------

                  c1        c2 g1 g2

2019-09-24  1.867558 -0.977278  M  X

...              ...       ... .. ..

2022-06-17 -1.141901 -1.310970  M  X

[260 rows x 4 columns]

-----------

<class 'pandas.core.frame.DataFrame'>

-----------

开始一次group_obj的迭代

 <class 'tuple'>

-----------

('M', 'Y')

-----------

<class 'tuple'>

-----------

                  c1        c2 g1 g2

2019-09-26 -0.103219  0.410599  M  Y

...              ...       ... .. ..

2022-06-15  0.197828  0.097751  M  Y

[245 rows x 4 columns]

-----------

<class 'pandas.core.frame.DataFrame'>

-----------

开始一次group_obj的迭代

 <class 'tuple'>

-----------

('N', 'X')

-----------

<class 'tuple'>

-----------

                  c1        c2 g1 g2

2019-09-22       NaN  0.400157  N  X

...              ...       ... .. ..

2022-06-14  1.315138 -0.323457  N  X

[256 rows x 4 columns]

-----------

<class 'pandas.core.frame.DataFrame'>

-----------

开始一次group_obj的迭代

 <class 'tuple'>

-----------

('N', 'Y')

-----------

<class 'tuple'>

-----------

                  c1        c2 g1 g2

2019-09-23  0.978738  2.240893  N  Y

...              ...       ... .. ..

2022-06-16  1.401523       NaN  N  Y

[239 rows x 4 columns]

-----------

<class 'pandas.core.frame.DataFrame'>

# 例2：group_obj的应用（group_obj.apply()见例3）

import numpy as np

import pandas as pd

# 创建数据集

np.random.seed(0)

period = pd.date_range('2019-9-22', periods=1000, freq='D')

df = pd.DataFrame(np.random.randn(1000, 2), columns=['c1','c2'], index = period)

df['g1'] = np.random.choice(['M', 'N'], 1000)

df['g2'] = np.random.choice(['X', 'Y'], 1000)

for i in period:    # 随机产生空值

    if np.random.random() < 0.05: df.loc[i,'c1'] = np.nan

    if np.random.random() < 0.05: df.loc[i,'c2'] = np.nan

print('df.head()\n',df.head()); print('===========')

# 创建分组对象

group_obj = df.groupby('g1')				# 按照一列进行分组

# group_obj = df.groupby(['g1','g2'])		# 按照两列联合分组

print('group_obj\n',group_obj); print('===========')

print('type(group_obj)\n',type(group_obj)); print('===========')

# 使用分组对象的指定方法查看一项统计信息

print('group_obj.size()\n',group_obj.size(),'\n',type(group_obj.size())); print('===========')

print('group_obj.max()\n',group_obj.max(),'\n',type(group_obj.max())); print('===========')

print('group_obj.min()\n',group_obj.min(),'\n',type(group_obj.min())); print('===========')

print('group_obj.sum()\n',group_obj.sum(),'\n',type(group_obj.sum())); print('===========')

print('group_obj.mean()\n',group_obj.mean(),'\n',type(group_obj.mean())); print('===========')

print('group_obj.std()\n',group_obj.std(),'\n',type(group_obj.std())); print('===========')

print('group_obj.count()\n',group_obj.count(),'\n',type(group_obj.count())); print('===========')

# 使用分组对象的describe()方法查看全部统计信息

print('group_obj.describe()\n',group_obj.describe(),'\n',type(group_obj.describe())); print('===========')

print('group_obj.describe().T\n',group_obj.describe().T,'\n',type(group_obj.describe().T)); print('===========')

# 使用分组对象的agg()方法查看自定义项目的统计信息

print('group_obj.agg([np.mean, np.std])\n',group_obj.agg([np.mean, np.std]),'\n',type(group_obj.agg([np.mean, np.std]))); print('===========')

print("group_obj.agg({'c1':np.mean, 'c2':np.std})\n",group_obj.agg({'c1':np.mean, 'c2':np.std}),'\n',type(group_obj.agg({'c1':np.mean, 'c2':np.std}))); print('===========')

# 使用分组对象的get_group()方法获取指定分组的数据

# 按照一列进行分组时获取数据

print("group_obj.get_group('M').head()\n",group_obj.get_group('M').head())

# 按照两列联合分组时获取数据

# print("group_obj.get_group(('M','X')).head()\n",group_obj.get_group(('M','X')).head())

# 例2的执行结果(按照一列进行分组)：

df.head()

                   c1        c2 g1 g2

2019-09-22       NaN  0.400157  N  X

2019-09-23  0.978738  2.240893  N  Y

2019-09-24  1.867558 -0.977278  M  X

2019-09-25  0.950088 -0.151357  N  X

2019-09-26 -0.103219  0.410599  M  Y

===========

group_obj

 <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000000003DEB518>

===========

type(group_obj)

 <class 'pandas.core.groupby.generic.DataFrameGroupBy'>

===========

group_obj.size()

 g1

M    505

N    495

dtype: int64

 <class 'pandas.core.series.Series'>

===========

group_obj.max()

           c1        c2 g2

g1

M   2.680571  2.642936  Y

N   3.170975  2.759355  Y

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.min()

           c1        c2 g2

g1

M  -2.802203 -2.772593  X

N  -2.994613 -3.046143  X

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.sum()

            c1         c2

g1

M  -21.519519 -23.540760

N    3.243500   1.208014

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.mean()

           c1        c2

g1

M  -0.046080 -0.048338

N   0.006901  0.002549

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.std()

           c1        c2

g1

M   0.989397  0.982953

N   0.963094  0.980751

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.count()

      c1   c2   g2

g1

M   467  487  505

N   470  474  495

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.describe()

        c1                                ...        c2

    count      mean       std       min  ...       25%       50%       75%       max

g1                                       ...

M   467.0 -0.046080  0.989397 -2.802203  ... -0.735622 -0.065488  0.549966  2.642936

N   470.0  0.006901  0.963094 -2.994613  ... -0.600193 -0.003481  0.656109  2.759355

[2 rows x 16 columns]

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.describe().T

 g1                 M           N

c1 count  467.000000  470.000000

   mean    -0.046080    0.006901

   std      0.989397    0.963094

   min     -2.802203   -2.994613

   25%     -0.726487   -0.669359

   50%     -0.056133    0.038123

   75%      0.603422    0.669524

   max      2.680571    3.170975

c2 count  487.000000  474.000000

   mean    -0.048338    0.002549

   std      0.982953    0.980751

   min     -2.772593   -3.046143

   25%     -0.735622   -0.600193

   50%     -0.065488   -0.003481

   75%      0.549966    0.656109

   max      2.642936    2.759355

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.agg([np.mean, np.std])

           c1                  c2

        mean       std      mean       std

g1

M  -0.046080  0.989397 -0.048338  0.982953

N   0.006901  0.963094  0.002549  0.980751

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.agg({'c1':np.mean, 'c2':np.std})

           c1        c2

g1

M  -0.046080  0.982953

N   0.006901  0.980751

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.get_group('M').head()

                   c1        c2 g1 g2

2019-09-24  1.867558 -0.977278  M  X

2019-09-26 -0.103219  0.410599  M  Y

2019-09-29  0.443863  0.333674  M  Y

2019-09-30  1.494079 -0.205158  M  X

2019-10-05  0.045759 -0.187184  M  X

# Output of Example 2 (grouping by two columns jointly):

df.head()

                   c1        c2 g1 g2

2019-09-22       NaN  0.400157  N  X

2019-09-23  0.978738  2.240893  N  Y

2019-09-24  1.867558 -0.977278  M  X

2019-09-25  0.950088 -0.151357  N  X

2019-09-26 -0.103219  0.410599  M  Y

===========

group_obj

 <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000000003F5B518>

===========

type(group_obj)

 <class 'pandas.core.groupby.generic.DataFrameGroupBy'>

===========

group_obj.size()

 g1  g2

M   X     260

    Y     245

N   X     256

    Y     239

dtype: int64

 <class 'pandas.core.series.Series'>

===========

group_obj.max()

              c1        c2

g1 g2

M  X   2.320800  2.642936

   Y   2.680571  2.380745

N  X   3.170975  2.412454

   Y   2.497200  2.759355

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.min()

              c1        c2

g1 g2

M  X  -2.802203 -2.534554

   Y  -2.437564 -2.772593

N  X  -2.582797 -2.739677

   Y  -2.994613 -3.046143

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.sum()

               c1         c2

g1 g2

M  X  -18.444141 -21.132778

   Y   -3.075378  -2.407982

N  X   20.090225  -1.593149

   Y  -16.846726   2.801164

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.mean()

              c1        c2

g1 g2

M  X  -0.077496 -0.083860

   Y  -0.013430 -0.010247

N  X   0.084413 -0.006611

   Y  -0.072615  0.012022

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.std()

              c1        c2

g1 g2

M  X   1.002794  0.976511

   Y   0.976399  0.990480

N  X   0.985893  0.900264

   Y   0.934578  1.059461

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.count()

         c1   c2

g1 g2

M  X   238  252

   Y   229  235

N  X   238  241

   Y   232  233

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.describe()

           c1                      ...        c2

       count      mean       std  ...       50%       75%       max

g1 g2                             ...

M  X   238.0 -0.077496  1.002794  ... -0.122370  0.553665  2.642936

   Y   229.0 -0.013430  0.976399  ...  0.024612  0.548398  2.380745

N  X   238.0  0.084413  0.985893  ...  0.001248  0.547481  2.412454

   Y   232.0 -0.072615  0.934578  ... -0.008210  0.823504  2.759355

[4 rows x 16 columns]

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.describe().T

 g1                 M                       N

g2                 X           Y           X           Y

c1 count  238.000000  229.000000  238.000000  232.000000

   mean    -0.077496   -0.013430    0.084413   -0.072615

   std      1.002794    0.976399    0.985893    0.934578

   min     -2.802203   -2.437564   -2.582797   -2.994613

   25%     -0.799725   -0.652409   -0.531745   -0.708426

   50%     -0.031588   -0.061743    0.039490    0.018710

   75%      0.601873    0.604137    0.721432    0.591636

   max      2.320800    2.680571    3.170975    2.497200

c2 count  252.000000  235.000000  241.000000  233.000000

   mean    -0.083860   -0.010247   -0.006611    0.012022

   std      0.976511    0.990480    0.900264    1.059461

   min     -2.534554   -2.772593   -2.739677   -3.046143

   25%     -0.698973   -0.740036   -0.575788   -0.680178

   50%     -0.122370    0.024612    0.001248   -0.008210

   75%      0.553665    0.548398    0.547481    0.823504

   max      2.642936    2.380745    2.412454    2.759355

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.agg([np.mean, np.std])

              c1                  c2

           mean       std      mean       std

g1 g2

M  X  -0.077496  1.002794 -0.083860  0.976511

   Y  -0.013430  0.976399 -0.010247  0.990480

N  X   0.084413  0.985893 -0.006611  0.900264

   Y  -0.072615  0.934578  0.012022  1.059461

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.agg({'c1':np.mean, 'c2':np.std})

              c1        c2

g1 g2

M  X  -0.077496  0.976511

   Y  -0.013430  0.990480

N  X   0.084413  0.900264

   Y  -0.072615  1.059461

 <class 'pandas.core.frame.DataFrame'>

===========

group_obj.get_group(('M','X')).head()

                   c1        c2 g1 g2

2019-09-24  1.867558 -0.977278  M  X

2019-09-30  1.494079 -0.205158  M  X

2019-10-05  0.045759 -0.187184  M  X

2019-10-12 -1.048553 -1.420018  M  X

2019-10-13       NaN  1.950775  M  X
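The agg() calls in Example 2 pass NumPy functions; the same statistics can also be requested by name. The sketch below is a minimal illustration on a small made-up frame (the values are hypothetical, not the 1000-row data set above). It selects the numeric columns explicitly, since recent pandas versions may raise an error when mean() or std() meets the text column g2.

```python
import numpy as np
import pandas as pd

# Small stand-in data set (hypothetical values)
df = pd.DataFrame({
    'c1': [1.0, 2.0, np.nan, 4.0],
    'c2': [10.0, 20.0, 30.0, 40.0],
    'g1': ['M', 'M', 'N', 'N'],
    'g2': ['X', 'Y', 'X', 'Y'],
})

# Select the numeric columns so the text column g2 is not aggregated
group_obj = df.groupby('g1')[['c1', 'c2']]

# String names select the built-in aggregations
stats = group_obj.agg(['mean', 'std'])

# Named aggregation: output_column=(input_column, function)
summary = df.groupby('g1').agg(c1_mean=('c1', 'mean'), c2_std=('c2', 'std'))

print(stats)
print(summary)
```

Named aggregation yields flat, self-describing column names, which is often easier to work with than the two-level columns produced by passing a list of functions.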

# Example 3: how group_obj.apply() works and how to use it

import numpy as np

import pandas as pd

# Create the data set

np.random.seed(0)

period = pd.date_range('2019-9-22', periods=1000, freq='D')

df = pd.DataFrame(np.random.randn(1000, 2), columns=['c1','c2'], index = period)

df['g1'] = np.random.choice(['M', 'N'], 1000)

df['g2'] = np.random.choice(['X', 'Y'], 1000)

for i in period:    # randomly insert missing values

    if np.random.random() < 0.05: df.loc[i,'c1'] = np.nan

    if np.random.random() < 0.05: df.loc[i,'c2'] = np.nan

print('df.head()\n',df.head()); print('===========')

# Only the case of grouping by two columns jointly is demonstrated

# group_obj = df.groupby('g1')			# group by one column

group_obj = df.groupby(['g1','g2'])		# group by two columns jointly

def group_func(df):

    print('---group_func内部开始---')

    print(df)

    print(type(df))

    print('---group_func内部结束---')

    # Sort df by column 'c1' in ascending order and return the first two rows

    return df.sort_values(['c1'], ascending=True)[:2]

result = group_obj.apply(group_func)

print('===========')

print(result); print('-----------')

print(type(result)); print('-----------')

print(result.index)

Output:

df.head()

                   c1        c2 g1 g2

2019-09-22       NaN  0.400157  N  X

2019-09-23  0.978738  2.240893  N  Y

2019-09-24  1.867558 -0.977278  M  X

2019-09-25  0.950088 -0.151357  N  X

2019-09-26 -0.103219  0.410599  M  Y

===========

---group_func内部开始---

                  c1        c2 g1 g2

2019-09-24  1.867558 -0.977278  M  X

...              ...       ... .. ..

2022-06-17 -1.141901 -1.310970  M  X

[260 rows x 4 columns]

<class 'pandas.core.frame.DataFrame'>

---group_func内部结束---

---group_func内部开始---

                  c1        c2 g1 g2

2019-09-26 -0.103219  0.410599  M  Y

...              ...       ... .. ..

2022-06-15  0.197828  0.097751  M  Y

[245 rows x 4 columns]

<class 'pandas.core.frame.DataFrame'>

---group_func内部结束---

---group_func内部开始---

                  c1        c2 g1 g2

2019-09-22       NaN  0.400157  N  X

...              ...       ... .. ..

2022-06-14  1.315138 -0.323457  N  X

[256 rows x 4 columns]

<class 'pandas.core.frame.DataFrame'>

---group_func内部结束---

---group_func内部开始---

                  c1        c2 g1 g2

2019-09-23  0.978738  2.240893  N  Y

...              ...       ... .. ..

2022-06-16  1.401523       NaN  N  Y

[239 rows x 4 columns]

<class 'pandas.core.frame.DataFrame'>

---group_func内部结束---

===========

                        c1        c2 g1 g2

g1 g2

M  X  2021-08-31 -2.802203 -1.188424  M  X

      2020-03-07 -2.659172  0.606320  M  X

   Y  2022-03-03 -2.437564  1.114925  M  Y

      2020-05-13 -2.288620  0.251484  M  Y

N  X  2020-11-20 -2.582797 -1.153950  N  X

      2020-06-12 -2.369587  0.864052  N  X

   Y  2021-09-14 -2.994613  0.880938  N  Y

      2021-06-11 -2.777359  1.151734  N  Y

-----------

<class 'pandas.core.frame.DataFrame'>

-----------

MultiIndex([('M', 'X', '2021-08-31'),

            ('M', 'X', '2020-03-07'),

            ('M', 'Y', '2022-03-03'),

            ('M', 'Y', '2020-05-13'),

            ('N', 'X', '2020-11-20'),

            ('N', 'X', '2020-06-12'),

            ('N', 'Y', '2021-09-14'),

            ('N', 'Y', '2021-06-11')],

           names=['g1', 'g2', None])
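Example 3 uses apply() to keep the two rows with the smallest c1 in every group. As a rough alternative, assuming only that per-group "top n" result is needed, the frame can be sorted once and groupby().head() applied. The sketch below uses a smaller random frame of 20 rows (an assumption made to keep the output short).

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(20, 2), columns=['c1', 'c2'])
df['g1'] = np.random.choice(['M', 'N'], 20)
df['g2'] = np.random.choice(['X', 'Y'], 20)

# Sort once, then keep the first two rows of every (g1, g2) group.
# The result keeps the original row labels instead of adding group levels.
smallest_two = df.sort_values('c1').groupby(['g1', 'g2']).head(2)

print(smallest_two.sort_values(['g1', 'g2', 'c1']))
```

Unlike the apply() version, the result here has no extra g1/g2 index levels; the group labels remain available as ordinary columns.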

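12. Merging DataFrames

DataFrames can be combined via the DataFrame constructor, append(), join(), concat(), or merge(). Of these, merge() is the most flexible and can reproduce the horizontal joins performed by the other methods

(1) Merging via the DataFrame constructor

The most basic way to merge: build a dictionary by hand, one column at a time, with the column labels as keys and the column data as values. Row labels are always combined with an outer join (their union)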

import numpy as np

import pandas as pd

# Define the source data

df1 = pd.DataFrame([[1,2],[3,4]], columns=['c2','c1'], index=['t2','t1'])

df2 = pd.DataFrame([[5,6],[7,8]], columns=['c3','c1'], index=['t3','t1'])

# Merge

df = pd.DataFrame({'C2': df1['c2'], 'C3': df2['c3']})

print('df1\n',df1); print('===========')

print('df2\n',df2); print('===========')

print('df\n',df)

Output:

df1

     c2  c1

t2   1   2

t1   3   4

===========

df2

     c3  c1

t3   5   6

t1   7   8

===========

df

      C2   C3

t1  3.0  7.0

t2  1.0  NaN

t3  NaN  5.0

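(2) df.append()

Syntax: df.append(obj, sort=???, ignore_index=False)

The overall effect is similar to pd.concat(): rows are stacked vertically, row labels are not merged and keep their original order, and column labels are merged. Note: df.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the examples in this subsection only run on older versions; on current versions use pd.concat() instead

Parameters:

obj: the object to append, either a DataFrame or a Series. A Series must satisfy one of the two conditions below (the appended row either carries its own name as its row label, or all row labels are discarded); otherwise an error is raised:
- the Series has a name attribute
- ignore_index=True is passed to df.append()
sort: boolean, whether the columns of the result are sorted by label. Note: in the pandas version used for these notes the default is None (which resolves to True or False depending on the situation) and omitting it triggers a warning, so it is safest to always pass sort explicitly. From pandas 1.0 onward the default is False
ignore_index: whether to discard all row labels in the result (i.e. reset them to the positions 0, 1, 2, ...); the default is False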

import numpy as np

import pandas as pd

import warnings; warnings.simplefilter('ignore') # suppress warnings that may appear; a warning is not an error and can be ignored here; calls that may warn include df.ix[] and pd.concat()

# Define the source data

df1 = pd.DataFrame([[1,2],[3,4]], columns=['c2','c1'], index=['t2','t1'])

df2 = pd.DataFrame([[5,6],[7,8]], columns=['c3','c1'], index=['t3','t1'])

s1 = pd.Series([9,10,11],index=['c2','c4','c3'], name='s1')

s2 = pd.Series([12,13,14],index=['c2','c4','c3'])

# Concatenate

df3 = df1.append(df2, sort=False)

df4 = df1.append(df2, sort=True)

df5 = df1.append(df2, sort=False, ignore_index=True)

df6 = df1.append(s1, sort=False)

df7 = df1.append(s2, sort=False, ignore_index=True)

print('df1\n',df1); print('===========')

print('df2\n',df2); print('===========')

print('df3\n',df3); print('===========')

print('df4\n',df4); print('===========')

print('df5\n',df5); print('===========')

print('df6\n',df6); print('===========')

print('df7\n',df7)

Output:

df1

     c2  c1

t2   1   2

t1   3   4

===========

df2

     c3  c1

t3   5   6

t1   7   8

===========

df3

      c2  c1   c3

t2  1.0   2  NaN

t1  3.0   4  NaN

t3  NaN   6  5.0

t1  NaN   8  7.0

===========

df4

     c1   c2   c3

t2   2  1.0  NaN

t1   4  3.0  NaN

t3   6  NaN  5.0

t1   8  NaN  7.0

===========

df5

     c2  c1   c3

0  1.0   2  NaN

1  3.0   4  NaN

2  NaN   6  5.0

3  NaN   8  7.0

===========

df6

      c2   c1    c3    c4

t2  1.0  2.0   NaN   NaN

t1  3.0  4.0   NaN   NaN

s1  9.0  NaN  11.0  10.0

===========

df7

      c2   c1    c3    c4

0   1.0  2.0   NaN   NaN

1   3.0  4.0   NaN   NaN

2  12.0  NaN  14.0  13.0
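Because df.append() no longer exists in pandas 2.0 and later, the calls above need a replacement there. The following is a minimal sketch of equivalent pd.concat() calls, reusing the same df1, df2, and s1; the result names (rows_df, rows_s, rows_reset) are made up for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['c2', 'c1'], index=['t2', 't1'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['c3', 'c1'], index=['t3', 't1'])
s1 = pd.Series([9, 10, 11], index=['c2', 'c4', 'c3'], name='s1')

# DataFrame + DataFrame: plays the role of df1.append(df2, sort=False)
rows_df = pd.concat([df1, df2], sort=False)

# DataFrame + Series: turn the Series into a one-row DataFrame first;
# its name ('s1') becomes the row label
rows_s = pd.concat([df1, s1.to_frame().T], sort=False)

# Counterpart of ignore_index=True in append(): discard the row labels
rows_reset = pd.concat([df1, df2], ignore_index=True, sort=False)

print(rows_df)
print(rows_s)
print(rows_reset)
```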

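(3) df.join()

Syntax: df1.join(df2, how='left', lsuffix='', rsuffix='')

Horizontal joining only. Column labels must not clash and are never merged (clashing names need a suffix); row labels can be combined with a left, right, inner, or outer join (left by default)

Parameters:

how: how the row labels are combined
- 'left' (default): left join, keeps all row labels of the left df
- 'right': right join, keeps all row labels of the right df
- 'inner': inner join, keeps the intersection of the row labels of df1 and df2
- 'outer': outer join, keeps the union of the row labels of df1 and df2
lsuffix: suffix appended to the left df's column label when a column name clashes; defaults to the empty string ''
rsuffix: suffix appended to the right df's column label when a column name clashes; defaults to the empty string ''

Note:

When column names clash, at least one of lsuffix and rsuffix must be non-empty; otherwise an error is raised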

import numpy as np

import pandas as pd

# Define the source data

df1 = pd.DataFrame([[1,2],[3,4]], columns=['c2','c1'], index=['t2','t1'])

df2 = pd.DataFrame([[5,6],[7,8]], columns=['c3','c1'], index=['t3','t1'])

# Merge

df3 = df1.join(df2, lsuffix='_l', rsuffix='_r')

df4 = df1.join(df2, how='left', lsuffix='_l', rsuffix='_r')

df5 = df1.join(df2, how='right', lsuffix='_l', rsuffix='_r')

df6 = df1.join(df2, how='inner', lsuffix='_l', rsuffix='_r')

df7 = df1.join(df2, how='outer', lsuffix='_l', rsuffix='_r')

print('df1\n',df1); print('===========')

print('df2\n',df2); print('===========')

print('df3\n',df3); print('===========')

print('df4\n',df4); print('===========')

print('df5\n',df5); print('===========')

print('df6\n',df6); print('===========')

print('df7\n',df7)

Output:

df1

     c2  c1

t2   1   2

t1   3   4

===========

df2

     c3  c1

t3   5   6

t1   7   8

===========

df3

     c2  c1_l   c3  c1_r

t2   1     2  NaN   NaN

t1   3     4  7.0   8.0

===========

df4

     c2  c1_l   c3  c1_r

t2   1     2  NaN   NaN

t1   3     4  7.0   8.0

===========

df5

      c2  c1_l  c3  c1_r

t3  NaN   NaN   5     6

t1  3.0   4.0   7     8

===========

df6

     c2  c1_l  c3  c1_r

t1   3     4   7     8

===========

df7

      c2  c1_l   c3  c1_r

t1  3.0   4.0  7.0   8.0

t2  1.0   2.0  NaN   NaN

t3  NaN   NaN  5.0   6.0
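The suffix rule for join() can be checked directly. The sketch below, using the same df1 and df2, first calls join() with no suffix at all (which raises ValueError because both frames contain c1) and then with only rsuffix set; the variable names overlap_error and joined are made up for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['c2', 'c1'], index=['t2', 't1'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['c3', 'c1'], index=['t3', 't1'])

# Both frames have a column named 'c1', so join() needs at least one suffix
try:
    df1.join(df2)
    overlap_error = None
except ValueError as exc:
    overlap_error = exc
print('join without suffixes failed:', overlap_error)

# One non-empty suffix is enough; the left column keeps its original name
joined = df1.join(df2, how='outer', rsuffix='_r')
print(joined)
```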

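(4) pd.concat()

Syntax: pd.concat(objs, axis=0, join='outer', ignore_index=False, sort=???, keys=None, names=None)

Concatenates vertically (along the y axis) or horizontally (along the x axis). Labels on the concatenation axis are not merged and keep their original order; labels on the other axis are merged

Parameters:

objs: an iterable of DataFrames (Series are accepted as well), such as (df1,df2) or [df1,df2,df3]
axis: default 0, vertical concatenation: row labels are not merged and keep their order, column labels are merged; with axis=1, horizontal concatenation: column labels are not merged and keep their order, row labels are merged.
join: default 'outer', outer join on the non-concatenation axis (union); with join='inner', inner join on the non-concatenation axis (intersection)

Note: join accepts only 'outer' or 'inner'
ignore_index: default False, keeps the labels on the concatenation axis; with ignore_index=True those labels are discarded and replaced with 0, 1, 2, ...
sort: when join='outer', whether the non-concatenation axis is sorted by label; in the pandas version used for these notes it has no effect when join='inner'. Note: in that version the default is True, and omitting sort with join='outer' triggers a warning announcing a future default of False, so it is safest to pass sort explicitly. From pandas 1.0 onward the default is False
keys: a list. With axis=0 it adds an outermost index level on the y axis, and each element of the list is the label of that level for one input, so the list must be as long as objs (each DataFrame in objs receives one outermost label); with axis=1 the same happens on the x axis. For a code example see "AQF笔记-第2部分-第7章-金融数据源处理实现-三、金融数据的处理-2.同时获取多只股价信息"
names: a list. With axis=0 its elements are the names of the index levels on the y axis (it sets the name of every level in one go); with axis=1 the same happens on the x axis. For a code example see "AQF笔记-第2部分-第7章-金融数据源处理实现-三、金融数据的处理-2.同时获取多只股价信息"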

import numpy as np

import pandas as pd

import warnings; warnings.simplefilter('ignore') # suppress warnings that may appear; a warning is not an error and can be ignored here; calls that may warn include df.ix[] and pd.concat()

# Define the source data

df1 = pd.DataFrame([[1,2],[3,4]], columns=['c2','c1'], index=['t2','t1'])

df2 = pd.DataFrame([[5,6],[7,8]], columns=['c3','c1'], index=['t3','t1'])

# Concatenate

df3 = pd.concat((df1,df2), sort=True)

df4 = pd.concat((df1,df2), sort=False)

df5 = pd.concat((df1,df2), sort=True, join='inner')

df6 = pd.concat((df1,df2), axis=1, sort=True)

df7 = pd.concat((df1,df2), ignore_index=True, sort=True)

df8 = pd.concat((df1,df2), axis=1, ignore_index=True, sort=True)

print('df1\n',df1); print('===========')

print('df2\n',df2); print('===========')

print('df3\n',df3); print('===========')

print('df4\n',df4); print('===========')

print('df5\n',df5); print('===========')

print('df6\n',df6); print('===========')

print('df7\n',df7); print('===========')

print('df8\n',df8)

Output:

df1

     c2  c1

t2   1   2

t1   3   4

===========

df2

     c3  c1

t3   5   6

t1   7   8

===========

df3

     c1   c2   c3

t2   2  1.0  NaN

t1   4  3.0  NaN

t3   6  NaN  5.0

t1   8  NaN  7.0

===========

df4

      c2  c1   c3

t2  1.0   2  NaN

t1  3.0   4  NaN

t3  NaN   6  5.0

t1  NaN   8  7.0

===========

df5

     c1

t2   2

t1   4

t3   6

t1   8

===========

df6

      c2   c1   c3   c1

t1  3.0  4.0  7.0  8.0

t2  1.0  2.0  NaN  NaN

t3  NaN  NaN  5.0  6.0

===========

df7

    c1   c2   c3

0   2  1.0  NaN

1   4  3.0  NaN

2   6  NaN  5.0

3   8  NaN  7.0

===========

df8

       0    1    2    3

t1  3.0  4.0  7.0  8.0

t2  1.0  2.0  NaN  NaN

t3  NaN  NaN  5.0  6.0
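The keys and names parameters are described above but not exercised in the example. The following minimal sketch, using the same df1 and df2, shows their effect; the labels 'first' and 'second' and the level names 'source' and 'label' are arbitrary choices made for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['c2', 'c1'], index=['t2', 't1'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['c3', 'c1'], index=['t3', 't1'])

# keys adds one outer index level (one label per input frame);
# names sets the names of the resulting index levels
stacked = pd.concat([df1, df2], keys=['first', 'second'],
                    names=['source', 'label'], sort=False)

print(stacked)
print(stacked.index.names)

# The outer label recovers the rows that came from one input
print(stacked.loc['second'])
```

With keys in place, the duplicated row label t1 is no longer ambiguous, because ('first', 't1') and ('second', 't1') are distinct index entries.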

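(5) pd.merge()

Horizontal joining only. Column labels never end up duplicated and are never merged (a suffix is added automatically when names clash). Rows can be combined with an inner, outer, left, or right join (inner by default). Depending on how the key is chosen, there are three call forms (matching Examples 1, 2, and 3 below):

Use the row labels as the key (rows with the same label are merged into one row); row labels are kept:
```
pd.merge(df1, df2, how='inner', left_index=True, right_index=True, sort=False, suffixes=('_x', '_y'))
```
Note: this form gives the same result as join(). Also, left_index and right_index both default to False, so they must be passed explicitly as keyword arguments to use this form
Use a column that df1 and df2 share by name as the key (rows with equal values in that column are merged into one row); row labels are lost:
```
pd.merge(df1, df2, how='inner', on='key column label', sort=False, suffixes=('_x', '_y'))
```
Note: when on is omitted, pandas uses every column name that df1 and df2 have in common as the key, not just the first one. This can silently select columns you did not intend, so it is best to pass on explicitly and state which column is the key

Use one column of df1 and a differently named column of df2 as the keys (rows with equal key values are merged into one row); row labels are lost:

pd.merge(df1, df2, how='inner', left_on='key column label of df1', right_on='key column label of df2', sort=False, suffixes=('_x','_y'))

Common parameters:

how: how rows are matched on the key
- 'inner' (default): inner join, the key column holds the intersection of the keys of df1 and df2
- 'outer': outer join, the key column holds the union of the keys of df1 and df2
- 'left': left join, the key column holds the keys of the left frame df1
- 'right': right join, the key column holds the keys of the right frame df2
sort: whether to sort the result by the key column; the default is False
suffixes: suffixes added to non-key columns that share a name; a tuple whose first item is used for df1 and whose second item is used for df2, default ('_x', '_y'). The two items must not both be empty, and they should differ, because identical suffixes produce duplicate column names, which recent pandas versions reject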

# Example 1: use the row labels as the key (rows with the same label are merged into one row)

import numpy as np

import pandas as pd

# Define the source data

df1 = pd.DataFrame([[1,2],[3,4]], columns=['c2','c1'], index=['t2','t1'])

df2 = pd.DataFrame([[5,6],[7,8]], columns=['c3','c1'], index=['t3','t1'])

# Merge

df3 = pd.merge(df1,df2,left_index=True,right_index=True)

df4 = pd.merge(df1,df2,how='outer',left_index=True,right_index=True)

df5 = pd.merge(df1,df2,how='outer',left_index=True,right_index=True,sort=True,suffixes=('_df1', '_df2'))

df6 = pd.merge(df1,df2,how='left',left_index=True,right_index=True)

df7 = pd.merge(df1,df2,how='right',left_index=True,right_index=True)

print('df1\n',df1); print('===========')

print('df2\n',df2); print('===========')

print('df3\n',df3); print('===========')

print('df4\n',df4); print('===========')

print('df5\n',df5); print('===========')

print('df6\n',df6); print('===========')

print('df7\n',df7)

Output:

df1

     c2  c1

t2   1   2

t1   3   4

===========

df2

     c3  c1

t3   5   6

t1   7   8

===========

df3

     c2  c1_x  c3  c1_y

t1   3     4   7     8

===========

df4

      c2  c1_x   c3  c1_y

t1  3.0   4.0  7.0   8.0

t2  1.0   2.0  NaN   NaN

t3  NaN   NaN  5.0   6.0

===========

df5

      c2  c1_df1   c3  c1_df2

t1  3.0     4.0  7.0     8.0

t2  1.0     2.0  NaN     NaN

t3  NaN     NaN  5.0     6.0

===========

df6

     c2  c1_x   c3  c1_y

t2   1     2  NaN   NaN

t1   3     4  7.0   8.0

===========

df7

      c2  c1_x  c3  c1_y

t3  NaN   NaN   5     6

t1  3.0   4.0   7     8

# Example 2: use the shared column 'c1' of df1 and df2 as the key (rows with equal 'c1' values are merged into one row); row labels are lost

import numpy as np

import pandas as pd

# Define the source data

df1 = pd.DataFrame([[1,'C'],[2,'B']], columns=['c2','c1'], index=['t2','t1'])

df2 = pd.DataFrame([[3,'C'],[4,'A']], columns=['c2','c1'], index=['t2','t1'])

# Merge

df3 = pd.merge(df1,df2)  # no on=...: all columns shared by df1 and df2 ('c2' and 'c1') form the key, so no row matches and the result is empty

df4 = pd.merge(df1,df2,how='inner',on='c1')

df5 = pd.merge(df1,df2,how='outer',on='c1')

df6 = pd.merge(df1,df2,how='outer',on='c1',sort=True,suffixes=('_df1', '_df2'))

df7 = pd.merge(df1,df2,how='left',on='c1')

df8 = pd.merge(df1,df2,how='right',on='c1')

print('df1\n',df1); print('===========')

print('df2\n',df2); print('===========')

print('df3\n',df3); print('===========')

print('df4\n',df4); print('===========')

print('df5\n',df5); print('===========')

print('df6\n',df6); print('===========')

print('df7\n',df7); print('===========')

print('df8\n',df8)

Output:

df1

     c2 c1

t2   1  C

t1   2  B

===========

df2

     c2 c1

t2   3  C

t1   4  A

===========

df3

 Empty DataFrame

Columns: [c2, c1]

Index: []

===========

df4

    c2_x c1  c2_y

0     1  C     3

===========

df5

    c2_x c1  c2_y

0   1.0  C   3.0

1   2.0  B   NaN

2   NaN  A   4.0

===========

df6

    c2_df1 c1  c2_df2

0     NaN  A     4.0

1     2.0  B     NaN

2     1.0  C     3.0

===========

df7

    c2_x c1  c2_y

0     1  C   3.0

1     2  B   NaN

===========

df8

    c2_x c1  c2_y

0   1.0  C     3

1   NaN  A     4
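The empty df3 in Example 2 follows from how merge() picks its key when on is omitted: all column names common to both frames are used together. The sketch below illustrates this with slightly altered data (an assumption made so that exactly one row matches on both columns).

```python
import pandas as pd

df1 = pd.DataFrame([[1, 'C'], [2, 'B']], columns=['c2', 'c1'])
df2 = pd.DataFrame([[3, 'C'], [2, 'B']], columns=['c2', 'c1'])

# No on=...: every shared column name ('c2' and 'c1') is part of the key
implicit = pd.merge(df1, df2)
explicit = pd.merge(df1, df2, on=['c2', 'c1'])

# Only the row (2, 'B') agrees in both columns
print(implicit)
print(implicit.equals(explicit))
```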

# Example 3: use column 'c1' of df1 and column 'c2' of df2 as the keys (rows with equal key values are merged into one row); row labels are lost

import numpy as np

import pandas as pd

# Define the source data

df1 = pd.DataFrame([[1,'C'],[2,'B']], columns=['c3','c1'], index=['t2','t1'])

df2 = pd.DataFrame([[3,'C'],[4,'A']], columns=['c3','c2'], index=['t2','t1'])

# Merge

df3 = pd.merge(df1,df2,left_on='c1',right_on='c2')

df4 = pd.merge(df1,df2,how='outer',left_on='c1',right_on='c2')

df5 = pd.merge(df1,df2,how='outer',left_on='c1',right_on='c2',sort=True,suffixes=('_df1','_df2'))

df6 = pd.merge(df1,df2,how='left',left_on='c1',right_on='c2')

df7 = pd.merge(df1,df2,how='right',left_on='c1',right_on='c2')

print('df1\n',df1); print('===========')

print('df2\n',df2); print('===========')

print('df3\n',df3); print('===========')

print('df4\n',df4); print('===========')

print('df5\n',df5); print('===========')

print('df6\n',df6); print('===========')

print('df7\n',df7)

Output:

df1

     c3 c1

t2   1  C

t1   2  B

===========

df2

     c3 c2

t2   3  C

t1   4  A

===========

df3

    c3_x c1  c3_y c2

0     1  C     3  C

===========

df4

    c3_x   c1  c3_y   c2

0   1.0    C   3.0    C

1   2.0    B   NaN  NaN

2   NaN  NaN   4.0    A

===========

df5

    c3_df1   c1  c3_df2   c2

0     NaN  NaN     4.0    A

1     2.0    B     NaN  NaN

2     1.0    C     3.0    C

===========

df6

    c3_x c1  c3_y   c2

0     1  C   3.0    C

1     2  B   NaN  NaN

===========

df7

    c3_x   c1  c3_y c2

0   1.0    C     3  C

1   NaN  NaN     4  A
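Examples 2 and 3 lose the row labels because merging on columns produces a fresh 0, 1, 2, ... index. One possible workaround, sketched below, is to move the row labels into ordinary columns before merging; the column names row_left and row_right are hypothetical names chosen for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 'C'], [2, 'B']], columns=['c3', 'c1'], index=['t2', 't1'])
df2 = pd.DataFrame([[3, 'C'], [4, 'A']], columns=['c3', 'c2'], index=['t2', 't1'])

# rename_axis() names the index; reset_index() turns it into a regular column
left = df1.rename_axis('row_left').reset_index()
right = df2.rename_axis('row_right').reset_index()

merged = pd.merge(left, right, how='outer', left_on='c1', right_on='c2')
print(merged)
```

Each output row now records which row of df1 and which row of df2 it came from, with NaN where one side had no match.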

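13. Hierarchical indexing of Series and DataFrame

Because the Panel data type has been removed from recent versions of pandas, hierarchical indexing can be used to achieve the effect of Panel indirectly

(1) Hierarchical indexing of Series

① Creating a Series with a hierarchical index

The syntax is the same as for an ordinary Series; just pass a multi-dimensional structure as index. Once a hierarchical index is defined, s.index is of type pandas.core.indexes.multi.MultiIndex

The index listed first (the uppercase letters in the example below) is the outer index, with level 0; indexes listed later (the lowercase letters in the example below) are inner indexes, whose level increases by one each time (level=1 in this example).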

import numpy as np

import pandas as pd

s = pd.Series([1,2,3,4,5,6,7,8],

              index=[

                  ['A','A','B','B','C','C','D','D'],

                  ['e','f','e','g','f','h','g','h']

              ])

print('s','\n',s); print('===========')

print('s.index','\n',s.index,'\n',type(s.index))

Output:

s

 A  e    1

   f    2

B  e    3

   g    4

C  f    5

   h    6

D  g    7

   h    8

dtype: int64

===========

s.index

 MultiIndex([('A', 'e'),

            ('A', 'f'),

            ('B', 'e'),

            ('B', 'g'),

            ('C', 'f'),

            ('C', 'h'),

            ('D', 'g'),

            ('D', 'h')],

           )

 <class 'pandas.core.indexes.multi.MultiIndex'>

② Indexing and slicing a Series with a hierarchical index

import numpy as np

import pandas as pd

s = pd.Series([1,2,3,4,5,6,7,8],

              index=[

                  ['A','A','B','B','C','C','D','D'],

                  ['e','f','e','g','f','h','g','h']

              ])

print('s','\n',s); print('===========')

print("s['A']",'\n',s['A'],'\n',type(s['A'])); print('===========')

print("s['A':'C']",'\n',s['A':'C'],'\n',type(s['A':'C'])); print('===========')

print("s[['A','C']]",'\n',s[['A','C']],'\n',type(s[['A','C']])); print('===========')

print("s[:,'f']",'\n',s[:,'f'],'\n',type(s[:,'f'])); print('===========')

print("s['A','e']",'\n',s['A','e'],'\n',type(s['A','e'])); print('===========')

# The following forms raised errors when tested:

# print(s['A':'C','f'])

# print(s[['A','C'],'f'])

# print(s[:,'e':'f'])

# print(s[:,['e','f']])

Output:

s

 A  e    1

   f    2

B  e    3

   g    4

C  f    5

   h    6

D  g    7

   h    8

dtype: int64

===========

s['A']

 e    1

f    2

dtype: int64

 <class 'pandas.core.series.Series'>

===========

s['A':'C']

 A  e    1

   f    2

B  e    3

   g    4

C  f    5

   h    6

dtype: int64

 <class 'pandas.core.series.Series'>

===========

s[['A','C']]

 A  e    1

   f    2

C  f    5

   h    6

dtype: int64

 <class 'pandas.core.series.Series'>

===========

s[:,'f']

 A    2

C    5

dtype: int64

 <class 'pandas.core.series.Series'>

===========

s['A','e']

 1

 <class 'numpy.int64'>
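The selections listed above as failing with plain [] indexing can still be expressed in other ways. The sketch below is one possible approach, not the only one: it builds boolean masks from get_level_values() and uses xs() for a single inner label. The result names are made up for this illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8],
              index=[['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
                     ['e', 'f', 'e', 'g', 'f', 'h', 'g', 'h']])

outer = s.index.get_level_values(0)   # labels of level 0, one per row
inner = s.index.get_level_values(1)   # labels of level 1, one per row

# Outer labels 'A' through 'C' combined with inner label 'f'
a_to_c_f = s[(outer >= 'A') & (outer <= 'C') & (inner == 'f')]

# Outer labels 'A' or 'C' combined with inner label 'f'
a_or_c_f = s[outer.isin(['A', 'C']) & (inner == 'f')]

# Any outer label combined with inner labels 'e' or 'f'
any_e_or_f = s[inner.isin(['e', 'f'])]

# xs() takes a cross-section on one level and drops that level from the result
f_only = s.xs('f', level=1)

print(a_to_c_f, a_or_c_f, any_e_or_f, f_only, sep='\n')
```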

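③ Grouped aggregation on a Series with a hierarchical index

s.sum(level=0) is equivalent to s.groupby(level=0).sum()

s.sum(level=1) is equivalent to s.groupby(level=1).sum()

Note: the level argument of sum() was deprecated in pandas 1.3 and removed in pandas 2.0; on current versions only the groupby(level=...) form works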

import numpy as np

import pandas as pd

s = pd.Series([1,2,3,4,5,6,7,8],

              index=[

                  ['A','A','B','B','C','C','D','D'],

                  ['e','f','e','g','f','h','g','h']

              ])

print('s','\n',s); print('===========')

s1 = s.sum(level=0)

s2 = s.sum(level=1)

s3 = s.groupby(level=0).sum()

s4 = s.groupby(level=1).sum()

print('s1','\n',s1,'\n',type(s1)); print('===========')

print('s2','\n',s2,'\n',type(s2)); print('===========')

print('s3','\n',s3,'\n',type(s3)); print('===========')

print('s4','\n',s4,'\n',type(s4))

Output:

s

 A  e    1

   f    2

B  e    3

   g    4

C  f    5

   h    6

D  g    7

   h    8

dtype: int64

===========

s1

 A     3

B     7

C    11

D    15

dtype: int64

 <class 'pandas.core.series.Series'>

===========

s2

 e     4

f     7

g    11

h    14

dtype: int64

 <class 'pandas.core.series.Series'>

===========

s3

 A     3

B     7

C    11

D    15

dtype: int64

 <class 'pandas.core.series.Series'>

===========

s4

 e     4

f     7

g    11

h    14

dtype: int64

 <class 'pandas.core.series.Series'>

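(2) Hierarchical indexing of DataFrame

① Creating a DataFrame with a hierarchical index

The syntax is the same as for an ordinary DataFrame; just pass a multi-dimensional structure as index. Once a hierarchical index is defined, df.index is of type pandas.core.indexes.multi.MultiIndex

For the difference between df.index.name and df.index.names on a hierarchical index, see "5. DataFrame attributes - (3) df.index.name and df.index.names" in this chapter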

import numpy as np

import pandas as pd

# Approach 1

df = pd.DataFrame([1,2,3,4,5,6,7,8])

df.columns = ['c1']

df.index = [['A','A','B','B','C','C','D','D'],

            ['e','f','e','g','f','h','g','h']]

df.index.name='my_index_name'

df.index.names = ['i1','i2']

# Approach 2 (equivalent to Approach 1)

"""

df = pd.DataFrame([['A','e',1],

                  ['A','f',2],

                  ['B','e',3],

                  ['B','g',4],

                  ['C','f',5],

                  ['C','h',6],

                  ['D','g',7],

                  ['D','h',8]])

df.columns=['i1','i2','c1']

df = df.set_index(['i1','i2'])

df.index.name='my_index_name'

"""

print('df\n',df,'\n',type(df)); print('===========')

print('df.index\n',df.index,'\n',type(df.index)); print('-----------')

print(df.index.name); print('-----------')

print(df.index.names); print('===========')

print('df.columns\n',df.columns,'\n',type(df.columns))

Output:

df

        c1

i1 i2

A  e    1

   f    2

B  e    3

   g    4

C  f    5

   h    6

D  g    7

   h    8

 <class 'pandas.core.frame.DataFrame'>

===========

df.index

 MultiIndex([('A', 'e'),

            ('A', 'f'),

            ('B', 'e'),

            ('B', 'g'),

            ('C', 'f'),

            ('C', 'h'),

            ('D', 'g'),

            ('D', 'h')],

           name='my_index_name')

 <class 'pandas.core.indexes.multi.MultiIndex'>

-----------

my_index_name

-----------

['i1', 'i2']

===========

df.columns

 Index(['c1'], dtype='object')

 <class 'pandas.core.indexes.base.Index'>

② Indexing and slicing a DataFrame with a hierarchical index

import numpy as np

import pandas as pd

df = pd.DataFrame([1,2,3,4,5,6,7,8],columns=['c1'],

              index=[

                  ['A','A','B','B','C','C','D','D'],

                  ['e','f','e','g','f','h','g','h']

              ])

print('df\n',df,'\n',type(df)); print('===========')

print("df.loc['A']",'\n',df.loc['A'],'\n',type(df.loc['A'])); print('===========')

print("df.loc['A':'C']",'\n',df.loc['A':'C'],'\n',type(df.loc['A':'C'])); print('===========')

print("df.loc[['A','C']]",'\n',df.loc[['A','C']],'\n',type(df.loc[['A','C']])); print('===========')

print("df.loc[('A','f')]",'\n',df.loc[('A','f')],'\n',type(df.loc[('A','f')])); print('===========')

print("df.loc[('A','f'),'c1']",'\n',df.loc[('A','f'),'c1'],'\n',type(df.loc[('A','f'),'c1'])); print('===========')

# The following forms raise an error:

# print("df.loc[:,'f']",'\n',df.loc[:,'f'],'\n',type(df.loc[:,'f'])); print('===========')

# print("df.loc[(:,'f')]",'\n',df.loc[(:,'f')],'\n',type(df.loc[(:,'f')])); print('===========')

# print("df.loc[('A','f':'h')]",'\n',df.loc[('A','f':'h')],'\n',type(df.loc[('A','f':'h')])); print('===========')

# print("df.loc[(['A','C'],'f')]",'\n',df.loc[(['A','C'],'f')],'\n',type(df.loc[(['A','C'],'f')])); print('===========')

# print("df.loc[('A',['f','h'])]",'\n',df.loc[('A',['f','h'])],'\n',type(df.loc[('A',['f','h'])])); print('===========')

Output:

df

      c1

A e   1

  f   2

B e   3

  g   4

C f   5

  h   6

D g   7

  h   8

 <class 'pandas.core.frame.DataFrame'>

===========

df.loc['A']

    c1

e   1

f   2

 <class 'pandas.core.frame.DataFrame'>

===========

df.loc['A':'C']

      c1

A e   1

  f   2

B e   3

  g   4

C f   5

  h   6

 <class 'pandas.core.frame.DataFrame'>

===========

df.loc[['A','C']]

      c1

A e   1

  f   2

C f   5

  h   6

 <class 'pandas.core.frame.DataFrame'>

===========

df.loc[('A','f')]

 c1    2

Name: (A, f), dtype: int64

 <class 'pandas.core.series.Series'>

===========

df.loc[('A','f'),'c1']

 2

 <class 'numpy.int64'>
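The commented-out forms above fail because .loc reads the second item of the tuple as a column selector (or because a bare slice inside parentheses is not valid Python). A minimal sketch of selecting on the inner level anyway, using pd.IndexSlice and xs(), is shown below; the result names are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3, 4, 5, 6, 7, 8]},
                  index=[['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
                         ['e', 'f', 'e', 'g', 'f', 'h', 'g', 'h']])
df = df.sort_index()          # slicers expect a lexsorted MultiIndex
idx = pd.IndexSlice

# The trailing ':' tells .loc that idx[...] addresses rows, not columns
inner_f = df.loc[idx[:, 'f'], :]                 # every outer label, inner 'f'
a_f_to_h = df.loc[idx['A', 'f':'h'], :]          # outer 'A', inner 'f' through 'h'
ac_fh = df.loc[idx[['A', 'C'], ['f', 'h']], :]   # outer A or C, inner f or h
f_cross = df.xs('f', level=1)                    # like inner_f, but drops level 1

print(inner_f, a_f_to_h, ac_fh, f_cross, sep='\n')
```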

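③ Grouped aggregation on a DataFrame with a hierarchical index

df.sum(level=0) is equivalent to df.groupby(level=0).sum()

df.sum(level=1) is equivalent to df.groupby(level=1).sum()

Note: the level argument of sum() was deprecated in pandas 1.3 and removed in pandas 2.0; on current versions only the groupby(level=...) form works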

import numpy as np

import pandas as pd

df = pd.DataFrame([1,2,3,4,5,6,7,8],columns=['c1'],

              index=[

                  ['A','A','B','B','C','C','D','D'],

                  ['e','f','e','g','f','h','g','h']

              ])

print('df\n',df,'\n',type(df)); print('===========')

df1 = df.sum(level=0)

df2 = df.groupby(level=0).sum()

df3 = df.sum(level=1)

df4 = df.groupby(level=1).sum()

print('df1','\n',df1,'\n',type(df1)); print('===========')

print('df2','\n',df2,'\n',type(df2)); print('===========')

print('df3','\n',df3,'\n',type(df3)); print('===========')

print('df4','\n',df4,'\n',type(df4))

Output:

df

      c1

A e   1

  f   2

B e   3

  g   4

C f   5

  h   6

D g   7

  h   8

 <class 'pandas.core.frame.DataFrame'>

===========

df1

    c1

A   3

B   7

C  11

D  15

 <class 'pandas.core.frame.DataFrame'>

===========

df2

    c1

A   3

B   7

C  11

D  15

 <class 'pandas.core.frame.DataFrame'>

===========

df3

    c1

e   4

f   7

g  11

h  14

 <class 'pandas.core.frame.DataFrame'>

===========

df4

    c1

e   4

f   7

g  11

h  14

 <class 'pandas.core.frame.DataFrame'>
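groupby(level=...) also accepts level names and lists of levels. The sketch below assumes the two index levels are named i1 and i2 (names borrowed from the creation example earlier in this section) and shows a few variations.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
     ['e', 'f', 'e', 'g', 'f', 'h', 'g', 'h']],
    names=['i1', 'i2'])
df = pd.DataFrame({'c1': [1, 2, 3, 4, 5, 6, 7, 8]}, index=index)

by_outer = df.groupby(level='i1').sum()           # same as level=0
by_inner = df.groupby(level='i2').sum()           # same as level=1
by_both = df.groupby(level=['i1', 'i2']).sum()    # one group per index entry

# Several statistics per inner label in one call
by_inner_stats = df.groupby(level='i2')['c1'].agg(['sum', 'mean'])

print(by_outer, by_inner, by_inner_stats, sep='\n')
```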

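④ Resetting a hierarchical index

df.reset_index(): moves every index level into ordinary columns and replaces the index with the default 0, 1, 2, ...

df.reset_index(level=0): moves only the level=0 index into a column

df.reset_index(level=1): moves only the level=1 index into a column

The labels in df.index.levels can also be replaced; see "5. DataFrame attributes - (4) df.index.levels" in this chapter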
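A short sketch of the reset_index() variants follows, assuming index levels named i1 and i2 as in the creation example. It also shows drop=True, which discards a level instead of turning it into a column, and set_index(), which reverses the operation.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
     ['e', 'f', 'e', 'g', 'f', 'h', 'g', 'h']],
    names=['i1', 'i2'])
df = pd.DataFrame({'c1': [1, 2, 3, 4, 5, 6, 7, 8]}, index=index)

flat = df.reset_index()                          # i1 and i2 become columns
outer_only = df.reset_index(level='i2')          # i2 becomes a column, i1 stays
dropped = df.reset_index(level='i2', drop=True)  # i2 is discarded entirely
restored = flat.set_index(['i1', 'i2'])          # back to the hierarchical index

print(flat.columns.tolist())        # ['i1', 'i2', 'c1']
print(outer_only.columns.tolist())  # ['i2', 'c1']
print(restored.index.names)
```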

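(3) Reshaping hierarchically indexed Series and DataFrame objects with unstack() and stack() (converting between row labels and column labels)

stack: to pile up; unstack: to take a pile apart

s.unstack(): pivots the innermost row-label level of s (vertical) into column labels (horizontal). For a Series with a hierarchical index the result is a DataFrame

s.stack(): raises an error. Series has no stack() method, because a Series has no column labels to pivot

df.unstack(): pivots the innermost row-label level of df (vertical) into column labels (horizontal). The result is a DataFrame, unless df.index is not hierarchical, in which case the result is a Series

df.stack(): pivots the column labels of df (horizontal) into the innermost row-label level (vertical). The result is a Series when df.columns has a single level, otherwise a DataFrame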

# Example: reshaping a Series

import numpy as np

import pandas as pd

s = pd.Series([1,2,3,4,5,6,7,8],

              index=[

                  ['A','A','B','B','C','C','D','D'],

                  ['e','f','e','g','f','h','g','h']

              ])

# s_s = s.stack()   				# raises an error: Series has no stack() method

s_u = s.unstack()

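s_us = s.unstack().stack()			# back to the layout of s (dtype becomes float64 because of the intermediate NaN values)

s_uu = s.unstack().unstack()		# swaps the inner and outer index levels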

print('s\n',s,'\n',type(s)); print('===========')

# print('s_s\n',s_s,'\n',type(s_s)); print('===========')	# raises an error

print('s_u\n',s_u,'\n',type(s_u)); print('===========')

print('s_us\n',s_us,'\n',type(s_us)); print('===========')

print('s_uu\n',s_uu,'\n',type(s_uu))

Output:

s

 A  e    1

   f    2

B  e    3

   g    4

C  f    5

   h    6

D  g    7

   h    8

dtype: int64

 <class 'pandas.core.series.Series'>

===========

s_u

      e    f    g    h

A  1.0  2.0  NaN  NaN

B  3.0  NaN  4.0  NaN

C  NaN  5.0  NaN  6.0

D  NaN  NaN  7.0  8.0

 <class 'pandas.core.frame.DataFrame'>

===========

s_us

 A  e    1.0

   f    2.0

B  e    3.0

   g    4.0

C  f    5.0

   h    6.0

D  g    7.0

   h    8.0

dtype: float64

 <class 'pandas.core.series.Series'>

===========

s_uu

 e  A    1.0

   B    3.0

   C    NaN

   D    NaN

f  A    2.0

   B    NaN

   C    5.0

   D    NaN

g  A    NaN

   B    4.0

   C    NaN

   D    7.0

h  A    NaN

   B    NaN

   C    6.0

   D    8.0

dtype: float64

 <class 'pandas.core.series.Series'>

# Example: reshaping a DataFrame

import numpy as np

import pandas as pd

df = pd.DataFrame([1,2,3,4,5,6,7,8],columns=['t1'],

              index=[

                  ['A','A','B','B','C','C','D','D'],

                  ['e','f','e','g','f','h','g','h']

              ])

df_s = df.stack()

df_u = df.unstack()

df_us = df.unstack().stack()		# back to the shape of df (the values are now float64)

df_uu = df.unstack().unstack()

print('df\n',df,'\n',type(df)); print('===========')

print('df_s\n',df_s,'\n',type(df_s)); print('===========')

print('df_u\n',df_u,'\n',type(df_u)); print('===========')

print('df_us\n',df_us,'\n',type(df_us)); print('===========')

print('df_uu\n',df_uu,'\n',type(df_uu))

Output:

df

      t1

A e   1

  f   2

B e   3

  g   4

C f   5

  h   6

D g   7

  h   8

 <class 'pandas.core.frame.DataFrame'>

===========

df_s

 A  e  t1    1

   f  t1    2

B  e  t1    3

   g  t1    4

C  f  t1    5

   h  t1    6

D  g  t1    7

   h  t1    8

dtype: int64

 <class 'pandas.core.series.Series'>

===========

df_u

     t1

     e    f    g    h

A  1.0  2.0  NaN  NaN

B  3.0  NaN  4.0  NaN

C  NaN  5.0  NaN  6.0

D  NaN  NaN  7.0  8.0

 <class 'pandas.core.frame.DataFrame'>

===========

df_us

       t1

A e  1.0

  f  2.0

B e  3.0

  g  4.0

C f  5.0

  h  6.0

D g  7.0

  h  8.0

 <class 'pandas.core.frame.DataFrame'>

===========

df_uu

 t1  e  A    1.0

       B    3.0

       C    NaN

       D    NaN

    f  A    2.0

       B    NaN

       C    5.0

       D    NaN

    g  A    NaN

       B    4.0

       C    NaN

       D    7.0

    h  A    NaN

       B    NaN

       C    6.0

       D    8.0

dtype: float64

 <class 'pandas.core.series.Series'>

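14. Time-related types and methods in Pandas

(1) Time types in Pandas and their special indexing and slicing

① pandas._libs.tslibs.timestamps.Timestamp: a timestamp (a single point in time)

② pandas.core.indexes.datetimes.DatetimeIndex: an index of timestamps

③ pandas._libs.tslibs.period.Period: a period (a span of time)

④ pandas.core.indexes.period.PeriodIndex: an index of periods

② is made up of ①, and ④ is made up of ③

When a DataFrame has row labels of type ② or ④, it supports the flexible indexing and slicing shown below. (Note: for a type ④ index with monthly frequency, any day of a month selects that month; for a type ④ index with yearly frequency, any day of a year selects that year.)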

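# Exact indexing: use df.loc[] (df.ix[] also worked until it was removed in pandas 1.0)

# df['2019-12-31']	# exact indexing does not work with df[]: it raises an error

# df['2019.12.31']	# exact indexing does not work with df[]: it raises an error

df.loc['20191231']

df.loc['2019-12-31']

df.loc[pd.datetime(2019,12,31)]	# pd.datetime was removed in pandas 2.0; use datetime.datetime or pd.Timestamp there

df.ix['20191231']		# pandas < 1.0 only

df.ix['2019-12-31']		# pandas < 1.0 only

df.ix[pd.datetime(2019,12,31)]	# pandas < 1.0 only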

...

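# Partial-string indexing: df.loc[] works in every version; df[] and df.ix[] work only in older versions

df['2019-12']		# selecting rows with df['...'] was deprecated in pandas 1.2 and removed in 2.0

df['2019.12']

df['2019']

df.loc['2019-12']

df.loc['2019.12']

df.loc['2019']

df.ix['2019-12']	# pandas < 1.0 only

df.ix['2019.12']	# pandas < 1.0 only

df.ix['2019']		# pandas < 1.0 only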

...

# Slicing that mixes exact and partial-string endpoints

df['2019-08':'2019-09-22']

df.loc['2019-08':'2019-09-22']

df.ix['2019-08':'2019-09-22']	# pandas < 1.0 only

...
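
A minimal sketch of the df.loc[] forms on a made-up daily DatetimeIndex; the frame, the column name and the variable names are assumptions for illustration. Only df.loc[] is used, since it behaves the same way across pandas versions.

```python
import pandas as pd

# Daily index from 2019-08-01 to 2019-12-31 (illustrative data)
idx = pd.date_range('2019-08-01', '2019-12-31', freq='D')
df = pd.DataFrame({'c1': range(len(idx))}, index=idx)

one_day = df.loc['2019-12-31']                  # exact match: one row, returned as a Series
same_day = df.loc[pd.Timestamp(2019, 12, 31)]   # the same row, selected with a Timestamp
one_month = df.loc['2019-12']                   # partial string: every row in December 2019
span = df.loc['2019-08':'2019-09-22']           # slice; both endpoints are included

print(one_day)
print(len(one_month), len(span))
```

Unlike positional slicing, the label slice includes its right endpoint, so the rows for 2019-09-22 are part of span.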

(2) pd.Timestamp()

Syntax: pd.Timestamp(n)

Returns a pandas._libs.tslibs.timestamps.Timestamp object

Parameter n: the number of nanoseconds (10^-9 seconds) elapsed since the epoch (1970-01-01 00:00:00)

import numpy as np

import pandas as pd

t1 = pd.Timestamp(0)

t2 = pd.Timestamp(1)

print(t1)

print(t2)

print(type(t2))

Output:

1970-01-01 00:00:00

1970-01-01 00:00:00.000000001

<class 'pandas._libs.tslibs.timestamps.Timestamp'>
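
Besides an integer count of nanoseconds, pd.Timestamp() also accepts other inputs. The sketch below shows a few of them; the variable names are illustrative, and the unit argument tells pandas how to interpret a numeric input.

```python
import pandas as pd

t_ns = pd.Timestamp(1)                       # 1 nanosecond after the epoch
t_s = pd.Timestamp(1, unit='s')              # 1 second after the epoch
t_str = pd.Timestamp('2019-09-22 08:30')     # parsed from a string
t_parts = pd.Timestamp(2019, 9, 22, 8, 30)   # year, month, day, hour, minute

print(t_ns)
print(t_s)
print(t_str == t_parts)
```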

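(3) pd.datetime()

Syntax: pd.datetime(year, month, day)

Returns a datetime.datetime object. pd.datetime is only an alias of the standard library's datetime.datetime; it was deprecated in pandas 1.0 and removed in 2.0, so on current versions import datetime directly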

import numpy as np

import pandas as pd

t = pd.datetime(2019,9,22)

print(t)

print(type(t))

Output:

2019-09-22 00:00:00

<class 'datetime.datetime'>
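
On pandas versions where pd.datetime is no longer available, the same object can be built from the standard library. A minimal sketch (the variable names are illustrative):

```python
from datetime import datetime

import pandas as pd

t = datetime(2019, 9, 22)   # the object that pd.datetime(2019, 9, 22) used to return
ts = pd.Timestamp(t)        # the pandas equivalent of the same moment

print(t, type(t))
print(ts, type(ts))
print(ts == t)              # a Timestamp compares equal to the matching datetime
```

pd.Timestamp is a subclass of datetime.datetime, so a Timestamp can generally be passed wherever a datetime is expected.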

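(4) pd.to_datetime()

Syntax: pd.to_datetime(a date-like value, or a list, ndarray or Series of such values)

Converts a date-like value to a pandas._libs.tslibs.timestamps.Timestamp. A list or ndarray is converted to a pandas.core.indexes.datetimes.DatetimeIndex, and a Series to a Series of dtype datetime64

A bare None is returned as None, but a None inside a list is converted to NaT (an instance of pandas._libs.tslibs.nattype.NaTType)

df['column label'] = pd.to_datetime(df['column label']) converts a column of df from str to a datetime type

df.index = pd.to_datetime(df.index) converts the row index of df from str to a datetime type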

import numpy as np

import pandas as pd

# print(pd.to_datetime('2019922'))  raises an error

print(pd.to_datetime('20190922'),type(pd.to_datetime('20190922'))); print('-----------')

print(pd.to_datetime(['2019-09-22','2019.09.23']))

print(pd.to_datetime(['2019-9-22','2019.9.23']))

print(pd.to_datetime(['Sept 22 2019','September 23rd, 2019'])); print('===========')

print(pd.to_datetime(None),type(pd.to_datetime(None))); print('-----------')

print(pd.to_datetime([None])); print(type(pd.to_datetime([None])));

print(pd.to_datetime([None])[0]); print(type(pd.to_datetime([None])[0])); print('===========')

print(pd.to_datetime(

    np.array(['20190922','20190923'])

)); print('-----------')

print(pd.to_datetime(

    pd.Series(['20190922','20190923'])

))

Output:

2019-09-22 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>

-----------

DatetimeIndex(['2019-09-22', '2019-09-23'], dtype='datetime64[ns]', freq=None)

DatetimeIndex(['2019-09-22', '2019-09-23'], dtype='datetime64[ns]', freq=None)

DatetimeIndex(['2019-09-22', '2019-09-23'], dtype='datetime64[ns]', freq=None)

===========

None <class 'NoneType'>

-----------

DatetimeIndex(['NaT'], dtype='datetime64[ns]', freq=None)

<class 'pandas.core.indexes.datetimes.DatetimeIndex'>

NaT

<class 'pandas._libs.tslibs.nattype.NaTType'>

===========

DatetimeIndex(['2019-09-22', '2019-09-23'], dtype='datetime64[ns]', freq=None)

-----------

0   2019-09-22

1   2019-09-23

dtype: datetime64[ns]
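
The column and index conversions described for pd.to_datetime() can be sketched on a small made-up frame; the column names and values below are assumptions for illustration only.

```python
import pandas as pd

# Dates stored as strings, both in a column and in the row index (illustrative data)
df = pd.DataFrame(
    {'date': ['20190922', '20190923', None], 'c1': [1, 2, 3]},
    index=['2019-01-01', '2019-01-02', '2019-01-03'],
)

df['date'] = pd.to_datetime(df['date'])   # str column -> datetime column; None -> NaT
df.index = pd.to_datetime(df.index)       # str index  -> DatetimeIndex

print(df.dtypes)
print(type(df.index))
```

Once the index is a DatetimeIndex, the date-string indexing and slicing from (1) become available on df.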

(5) pd.DatetimeIndex()

Takes a one-dimensional list of date-like values, converts each element to a pandas._libs.tslibs.timestamps.Timestamp, and returns the whole as a pandas.core.indexes.datetimes.DatetimeIndex

A None in the list is converted to NaT (an instance of pandas._libs.tslibs.nattype.NaTType)

df['column label'] = pd.DatetimeIndex(df['column label']) converts a column of df from str to a datetime type

import datetime

import numpy as np

import pandas as pd

dti1 = pd.DatetimeIndex(['20190101','20190102',None])

dti2 = pd.DatetimeIndex(['2019-01-01','2019-01-02'])

dti3 = pd.DatetimeIndex(['Jan 1,2019','January 2nd, 2019'])

dti4 = pd.DatetimeIndex([datetime.datetime(2019,1,1),datetime.datetime(2019,1,2)])

dti5 = pd.DatetimeIndex([pd.datetime(2019,1,1),pd.datetime(2019,1,2)])	# pd.datetime: pandas < 2.0 only

dti6 = pd.DatetimeIndex([pd.Timestamp(0),pd.Timestamp(1e18)])

print(dti1,'\n',type(dti1[0]))

print(dti2,'\n',type(dti2[0]))

print(dti3,'\n',type(dti3[0]))

print(dti4,'\n',type(dti4[0]))

print(dti5,'\n',type(dti5[0]))

print(dti6,'\n',type(dti6[0]))

Output:

DatetimeIndex(['2019-01-01', '2019-01-02', 'NaT'], dtype='datetime64[ns]', freq=None)

 <class 'pandas._libs.tslibs.timestamps.Timestamp'>

DatetimeIndex(['2019-01-01', '2019-01-02'], dtype='datetime64[ns]', freq=None)

 <class 'pandas._libs.tslibs.timestamps.Timestamp'>

DatetimeIndex(['2019-01-01', '2019-01-02'], dtype='datetime64[ns]', freq=None)

 <class 'pandas._libs.tslibs.timestamps.Timestamp'>

DatetimeIndex(['2019-01-01', '2019-01-02'], dtype='datetime64[ns]', freq=None)

 <class 'pandas._libs.tslibs.timestamps.Timestamp'>

DatetimeIndex(['2019-01-01', '2019-01-02'], dtype='datetime64[ns]', freq=None)

 <class 'pandas._libs.tslibs.timestamps.Timestamp'>

DatetimeIndex(['1970-01-01 00:00:00', '2001-09-09 01:46:40'], dtype='datetime64[ns]', freq=None)

 <class 'pandas._libs.tslibs.timestamps.Timestamp'>

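(6) pd.date_range()

Syntax: pd.date_range(start=None, end=None, periods=None, freq='D')

Generates a pandas.core.indexes.datetimes.DatetimeIndex made up of pandas._libs.tslibs.timestamps.Timestamp objects

Parameters:

start: start date
end: end date
periods: length (number of values)

freq: frequency (the interval between adjacent values); the default is one day, 'D'. Other forms include 30 seconds '30S', 5 minutes '5T', 2 hours '2H', 3 days '3D', 2 weeks '2W', last day of each month 'M', first day of each month 'MS', and last day of each year 'Y'. A frequency in units of 'B' means business days, which covers only Monday to Friday and ignores public holidays. (pandas 2.2 and later rename several of these aliases, for example 'ME' for month end, 'YE' for year end, and 'h', 'min', 's'; the older spellings used in these notes produce a deprecation warning there.) More complex values for this parameter: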

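Name	Description
W-MON	Weekly, on Mondays
WOM-1MON	Monthly, on the first Monday of each month
Q-JAN	Quarterly, on the last calendar day of the month, with one quarter ending in January (JAN can be replaced with FEB, MAR)
QS-JAN	Quarterly, on the first calendar day of the month, with one quarter starting in January (JAN can be replaced with FEB, MAR)
A-JAN	Yearly, on the last calendar day of January (JAN can be replaced with FEB, ..., DEC)
AS-JAN	Yearly, on the first calendar day of January (JAN can be replaced with FEB, ..., DEC)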
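
A short sketch of some of the anchored frequencies, using only aliases that keep the same spelling across pandas versions ('W-MON', 'WOM-1MON', 'QS-JAN' and 'B'). The start dates and variable names are chosen for illustration; 2019-09-01 is a Sunday and 2019-09-20 is a Friday.

```python
import pandas as pd

weekly_mon = pd.date_range('2019-09-01', periods=3, freq='W-MON')      # every Monday
first_mon = pd.date_range('2019-09-01', periods=3, freq='WOM-1MON')    # first Monday of each month
quarter_start = pd.date_range('2019-01-01', periods=4, freq='QS-JAN')  # first day of each quarter
business = pd.date_range('2019-09-20', periods=3, freq='B')            # business days, skipping the weekend

print(weekly_mon)
print(first_mon)
print(quarter_start)
print(business)
```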

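Notes:

freq defaults to 'D'; of the three parameters start, end and periods, at least two must be given, otherwise an error is raised
pd.date_range() is often assigned to df.index to build row labels, for example: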
```
...

df.index = pd.date_range('2019-9-22', periods=5, freq='M')
```
Elements of the DatetimeIndex produced by pd.date_range() can be read by position, for example:
```
...

t = pd.date_range('2019-9-22', periods=5, freq='M')

print(t[0])
```
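A string cannot be compared directly for equality with one of the returned dates; compare with pd.datetime() instead (datetime.datetime or pd.Timestamp on pandas 2.0 and later). For a DataFrame whose row labels come from pd.date_range(), however, both pd.datetime() objects and strings can be used as labels

Code example: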

import numpy as np

import pandas as pd

import warnings; warnings.simplefilter('ignore') # suppress warnings that may appear; a warning is not an error and can be ignored here; calls that may warn include df.ix[] and pd.concat()

t1 = pd.date_range('2019-9-22', periods=2, freq='3D')

t2 = pd.date_range('2019-9-22', periods=2, freq='2W')

t3 = pd.date_range('2019-9-22', periods=3, freq='M')

t4 = pd.date_range('2019-9-22', periods=3, freq='Y')

print('t1','\n',t1,'\n',type(t1),'\n',t1[0],'\n',type(t1[0])); print('-----------')

print('t2','\n',t2); print('-----------')

print('t3','\n',t3); print('-----------')

print('t4','\n',t4); print('===========')

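print(t1[0] == '2019-9-22')                     		        # wrong way to compare

print(t1[0] == '2019-09-22')                       		        # wrong way to compare

print(t1[0] == pd.datetime(2019,9,22))							# correct way to compare (pandas < 2.0; use pd.Timestamp(2019,9,22) on newer versions)

Output: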

t1

 DatetimeIndex(['2019-09-22', '2019-09-25'], dtype='datetime64[ns]', freq='3D')

 <class 'pandas.core.indexes.datetimes.DatetimeIndex'>

 2019-09-22 00:00:00

 <class 'pandas._libs.tslibs.timestamps.Timestamp'>

-----------

t2

 DatetimeIndex(['2019-09-22', '2019-10-06'], dtype='datetime64[ns]', freq='2W-SUN')

-----------

t3

 DatetimeIndex(['2019-09-30', '2019-10-31', '2019-11-30'], dtype='datetime64[ns]', freq='M')

-----------

t4

 DatetimeIndex(['2019-12-31', '2020-12-31', '2021-12-31'], dtype='datetime64[ns]', freq='A-DEC')

===========

False

False

True
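
A version-independent way to make the same comparison is to turn the string into a Timestamp first. The sketch below is illustrative; it also shows that comparing the whole DatetimeIndex (rather than a single element) with a string does parse the string.

```python
import pandas as pd

t1 = pd.date_range('2019-9-22', periods=2, freq='3D')

eq_timestamp = t1[0] == pd.Timestamp('2019-9-22')       # Timestamp against Timestamp
eq_to_datetime = t1[0] == pd.to_datetime('2019-09-22')  # to_datetime also returns a Timestamp
is_member = pd.Timestamp('2019-09-25') in t1            # membership test on the index
elementwise = t1 == '2019-09-22'                        # index against string, element by element

print(eq_timestamp, eq_to_datetime, is_member)
print(elementwise)
```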

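(7) pd.period_range()

Syntax: pd.period_range(start=None, end=None, periods=None, freq='D')

Generates a pandas.core.indexes.period.PeriodIndex made up of pandas._libs.tslibs.period.Period objects

Parameters:

start: start date
end: end date
periods: length (number of values)
freq: frequency (the interval between adjacent values); the default is one day, 'D'. Other forms include 30 seconds '30S', 5 minutes '5T', 2 hours '2H', 3 days '3D', every Monday 'W-Mon', 2 weeks '2W', 1 month 'M', and 1 year 'Y'. A frequency in units of 'B' means business days, which covers only Monday to Friday and ignores public holidays. With a monthly frequency the generated values contain only the year and month; with a yearly frequency they contain only the year.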

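Notes:

freq defaults to 'D'; of the three parameters start, end and periods, at least two must be given, otherwise an error is raised
pd.period_range() is often assigned to df.index to build row labels, for example: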
```
...

df.index = pd.period_range('2019-9-22', periods=5, freq='W')
```
Elements of the PeriodIndex produced by pd.period_range() can be read by position, for example:
```
...

p = pd.period_range('2019-9-22', periods=5, freq='M')

print(p[0])
```
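A string cannot be compared directly for equality with one of the returned periods, and neither can a pd.datetime() object. For a DataFrame whose row labels are a pandas.core.indexes.period.PeriodIndex, however, both pd.datetime() objects and strings can be used as labels (for a PeriodIndex with monthly frequency, any day of a month selects that month; for a PeriodIndex with yearly frequency, any day of a year selects that year)

Code example:

# A Period cannot be compared for equality with a string, nor with pd.datetime()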

import numpy as np

import pandas as pd

p1 = pd.period_range('2019-9-22', periods=2, freq='3D')

p2 = pd.period_range('2019-9-22', periods=2, freq='2W')

p3 = pd.period_range('2019-9-22', periods=3, freq='M')

p4 = pd.period_range('2019-9-22', periods=3, freq='Y')

print('p1','\n',p1,'\n',type(p1),'\n',p1[0],'\n',type(p1[0])); print('-----------')

print('p2','\n',p2); print('-----------')

print('p3','\n',p3); print('-----------')

print('p4','\n',p4); print('===========')

print(p1[0] == '2019-9-22')					# wrong way to compare

print(p1[0] == '2019-09-22')				# wrong way to compare

print(p1[0] == pd.datetime(2019,9,22))		# wrong way to compare

Output:

p1

 PeriodIndex(['2019-09-22', '2019-09-25'], dtype='period[3D]', freq='3D')

 <class 'pandas.core.indexes.period.PeriodIndex'>

 2019-09-22

 <class 'pandas._libs.tslibs.period.Period'>

-----------

p2

 PeriodIndex(['2019-09-16/2019-09-22', '2019-09-30/2019-10-06'], dtype='period[2W-SUN]', freq='2W-SUN')

-----------

p3

 PeriodIndex(['2019-09', '2019-10', '2019-11'], dtype='period[M]', freq='M')

-----------

p4

 PeriodIndex(['2019', '2020', '2021'], dtype='period[A-DEC]', freq='A-DEC')

===========

False

False

False
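
A comparison that does work is Period against Period with the same frequency. The following is a minimal sketch with a monthly PeriodIndex; the variable names are illustrative.

```python
import pandas as pd

p = pd.period_range('2019-9-22', periods=3, freq='M')

same_month = p[0] == pd.Period('2019-09', freq='M')       # Period against Period
any_day = p[0] == pd.Period('2019-09-22', freq='M')       # any day of the month maps to that month
is_member = pd.Period('2019-10', freq='M') in p           # membership test on the index

print(same_month, any_day, is_member)
print(p[0].start_time, p[0].end_time)                     # the span of time the period covers
```

start_time and end_time make the difference from a Timestamp visible: a Period stands for a whole interval, not for a single moment.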

# Indexing example with daily frequency

import numpy as np

import pandas as pd

np.random.seed(0)

arr = np.random.randn(5,2)

p = pd.period_range('2019-9-22', periods=5, freq='D')		# daily frequency

df = pd.DataFrame(arr, columns=['c1','c2'], index=p)

print(df); print('===========')

print(df.loc['2019-9-22']); print('===========')

print(df.loc[pd.datetime(2019,9,22)])		# pd.datetime: pandas < 2.0 only

Output:

                  c1        c2

2019-09-22  1.764052  0.400157

2019-09-23  0.978738  2.240893

2019-09-24  1.867558 -0.977278

2019-09-25  0.950088 -0.151357

2019-09-26 -0.103219  0.410599

===========

c1    1.764052

c2    0.400157

Name: 2019-09-22, dtype: float64

===========

c1    1.764052

c2    0.400157

Name: 2019-09-22, dtype: float64

# Indexing example with monthly frequency

import numpy as np

import pandas as pd

np.random.seed(0)

arr = np.random.randn(5,2)

p = pd.period_range('2019-9-22', periods=5, freq='M')      # monthly frequency

df = pd.DataFrame(arr, columns=['c1','c2'], index=p)

print(df); print('-----------')

print(df.loc['20190922']); print('-----------')

print(df.loc['2019-9-23']); print('-----------')

print(df.loc[pd.datetime(2019,9,22)]); print('-----------')		# pd.datetime: pandas < 2.0 only

print(df.loc[pd.datetime(2019,9,23)])

Output:

               c1        c2

2019-09  1.764052  0.400157

2019-10  0.978738  2.240893

2019-11  1.867558 -0.977278

2019-12  0.950088 -0.151357

2020-01 -0.103219  0.410599

-----------

c1    1.764052

c2    0.400157

Name: 2019-09, dtype: float64

-----------

c1    1.764052

c2    0.400157

Name: 2019-09, dtype: float64

-----------

c1    1.764052

c2    0.400157

Name: 2019-09, dtype: float64

-----------

c1    1.764052

c2    0.400157

Name: 2019-09, dtype: float64

# Indexing example with yearly frequency

import numpy as np

import pandas as pd

np.random.seed(0)

arr = np.random.randn(5,2)

p = pd.period_range('2019-9-22', periods=5, freq='Y')      # yearly frequency

df = pd.DataFrame(arr, columns=['c1','c2'], index=p)

print(df); print('-----------')

print(df.loc['20190922']); print('-----------')

print(df.loc['2019-9-23']); print('-----------')

print(df.loc[pd.datetime(2019,9,22)]); print('-----------')		# pd.datetime: pandas < 2.0 only

print(df.loc[pd.datetime(2019,9,23)])

Output:

            c1        c2

2019  1.764052  0.400157

2020  0.978738  2.240893

2021  1.867558 -0.977278

2022  0.950088 -0.151357

2023 -0.103219  0.410599

-----------

c1    1.764052

c2    0.400157

Name: 2019, dtype: float64

-----------

c1    1.764052

c2    0.400157

Name: 2019, dtype: float64

-----------

c1    1.764052

c2    0.400157

Name: 2019, dtype: float64

-----------

c1    1.764052

c2    0.400157

Name: 2019, dtype: float64

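(8) pd.date_range() versus pd.period_range()

Similarities:

The return value of either can serve as the row-label index of a DataFrame, and both support the special date-string indexing described in (1)
The return value of either can be used as a column of a DataFrame

Differences:

With freq='M' or freq='Y' the displayed values differ
The data types differ
Because the data types differ, a few attributes and methods differ as well (not covered here)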
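
The two index types can also be converted into each other. This is a minimal sketch using to_period() and to_timestamp(); the dates and variable names are chosen for illustration.

```python
import pandas as pd

dti = pd.date_range('2019-09-29', periods=4, freq='D')   # DatetimeIndex of Timestamps
pi = dti.to_period('M')                                  # PeriodIndex: each day mapped to its month
back = pi.to_timestamp()                                 # DatetimeIndex at the start of each period

print(dti)
print(pi)
print(back)
```

The round trip is lossy: after to_period('M') only the month is kept, so to_timestamp() returns the first day of each month rather than the original dates.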

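(9) df.resample(): resampling

resample: to sample again at a different frequency

Resamples (changes the frequency of) a DataFrame whose row index is a pandas.core.indexes.datetimes.DatetimeIndex or a pandas.core.indexes.period.PeriodIndex. The steps are:

First, obtain the DataFrame df
Next, call resample_obj = df.resample(rule,axis=0,closed=None) to get a resample object (of type pandas.core.resample.DatetimeIndexResampler)

Parameters:
- rule: the target frequency, e.g. 'S' (second), 'T' or 'min' (minute), 'H' (hour), 'D' (day), 'W' (week), 'M' (month), 'Q' (quarter), 'A' or 'Y' (year); a number can be put in front of the letter, e.g. '3D' (3 days)
- axis: default 0, which resamples along the row index so that each column is processed; axis=1 resamples along the column labels (usually no need to set it)
- closed: which side of each time bin is closed; 'left' closes the left edge, 'right' closes the right edge (usually no need to set it)
Finally, call a method of the resample object, for example:
- From high frequency to low frequency (downsampling):
  - resample_obj.mean(): aggregates with the mean of all values in each time bin
  - resample_obj.max(): aggregates with the maximum of all values in each time bin (commonly used for the high price)
  - resample_obj.min(): aggregates with the minimum of all values in each time bin (commonly used for the low price)
  - resample_obj.median(): aggregates with the median of all values in each time bin
  - resample_obj.sum(): aggregates with the sum of all values in each time bin (commonly used for trading volume)
  - resample_obj.prod(): aggregates with the product of all values in each time bin
  - resample_obj.std(): aggregates with the standard deviation of all values in each time bin
  - resample_obj.var(): aggregates with the variance of all values in each time bin
  - resample_obj.count(): aggregates with the count of non-null values in each time bin
  - resample_obj.first(): aggregates with the first value in each time bin (commonly used for the opening price)
  - resample_obj.last(): aggregates with the last value in each time bin (commonly used for the closing price)
  - resample_obj.nunique(): aggregates with the number of distinct values in each time bin
  - resample_obj.asfreq(): takes the value at the displayed date (for example, when daily data is downsampled to monthly data and the last day of each month is displayed, the value of that day is used; if that day is not a trading day and has no data, the result is NaN)
  - resample_obj.ohlc(): aggregates with the open, high, low and close of all values in each time bin
  - resample_obj.apply(<func>): applies a custom aggregation function; apply() is explained in "10. Methods of DataFrame objects and of the Pandas module - (5) Other methods" in this chapter, and sample code appears later in this section
- From low frequency to high frequency (upsampling):
  - resample_obj.ffill(): fills the gaps by forward fill
  - resample_obj.pad(): fills the gaps by forward fill (an older alias of ffill() that recent pandas versions no longer provide)
  - resample_obj.bfill(): fills the gaps by backward fill
  - resample_obj.fillna(): fills the gaps with the fillna() method
  - Linear interpolation:
    - resample_obj.interpolate(): fills the gaps between two values by linear interpolation; simple and recommended
    - df.interpolate(): without a resample object, it is also possible to insert the required number of empty rows between two values by hand and then fill them with the DataFrame's own interpolation method; see "9. Handling missing values (NaN) in a DataFrame - (5) df.interpolate()" in this chapter. This is too cumbersome and is not recommended
  - resample_obj.apply(<func>): applies a custom interpolation function; each call to func() receives the Series or DataFrame made up of the data in one time bin; see "10. Methods of DataFrame objects and of the Pandas module - (5) Other important methods - ① df.apply()" in this chapter

Note: the how parameter in df.resample('M', how='mean'), as used in the lecture notes, is obsolete; use the method calls listed above instead.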
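
The downsampling methods can be tried without a data file. The sketch below uses fourteen made-up daily values starting on Monday 2019-09-02 and resamples them to weekly bins with 'W' (weeks ending on Sunday); the data and variable names are assumptions for illustration.

```python
import pandas as pd

# Two full weeks of daily data, Monday 2019-09-02 to Sunday 2019-09-15 (illustrative values)
idx = pd.date_range('2019-09-02', periods=14, freq='D')
s = pd.Series(range(1, 15), index=idx, dtype='float64')

weekly = s.resample('W')                                     # weekly bins ending on Sunday

weekly_mean = weekly.mean()                                  # mean of each week
weekly_sum = weekly.sum()                                    # sum of each week
weekly_ohlc = weekly.ohlc()                                  # first, max, min, last of each week
weekly_range = weekly.apply(lambda x: x.max() - x.min())     # custom aggregation: max minus min

print(weekly_mean)
print(weekly_sum)
print(weekly_ohlc)
print(weekly_range)
```

One resample object can be reused for several aggregations, which is convenient when open, high, low, close and volume columns each need a different rule.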

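① From high frequency to low frequency (downsampling): done by aggregation

Note: each aggregated row is labelled with the end of its time bin (when daily data is resampled by month, the index becomes the last calendar day of each month, whether or not the original data has a row for that day)

# Example of resample_obj.mean() and resample_obj.apply(<func>) on returns

import pandas as pd

import numpy as np

# Read the local file '000001.csv'

data = pd.read_csv('000001.csv', index_col=0, parse_dates=True)

# Compute the daily gross return (today's close divided by the previous close)

data_return = data['close'] / data['close'].shift()

# Get a resample object with monthly frequency

resample_obj = data_return.resample('M')

print(resample_obj,'\n',type(resample_obj)); print('===========')

# Aggregate with the mean of all values in each month (the two forms are equivalent)

print(resample_obj.mean()); print('-----------')

print(resample_obj.apply(lambda x: x.mean())); print('===========')     # with a single column x is a Series; with several columns x is a DataFrame

# Check that the aggregated values are correct

print(data_return['2017-03'].mean().round(6)); print('===========')

print(data_return['2019-07'].mean().round(6))

Output: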

DatetimeIndexResampler [freq=<MonthEnd>, axis=0, closed=right, label=right, convention=start, base=0]

 <class 'pandas.core.resample.DatetimeIndexResampler'>

===========

date

2017-02-28    1.005302

2017-03-31    0.998575

...

2019-07-31    1.001199

2019-08-31    1.000752

Freq: M, Name: close, dtype: float64

-----------

date

2017-02-28    1.005302

2017-03-31    0.998575

...

2019-07-31    1.001199

2019-08-31    1.000752

Freq: M, Name: close, dtype: float64

===========

0.998575

===========

1.001199

# Example of resample_obj.ohlc() on closing prices

import pandas as pd

import numpy as np

# Read the local file '000001.csv'

data = pd.read_csv('000001.csv', index_col=0, parse_dates=True)

# Get the daily closing prices

data_close = data['close']

# Get a resample object with a frequency of 1 month

resample_obj = data_close.resample('M')

print(resample_obj,'\n',type(resample_obj)); print('===========')

# Aggregate with the open, high, low and close of all values in each time bin

print(resample_obj.ohlc()); print('===========')

# Check that the OHLC values are correct

print(data_close['2017-03-01'])

print(data_close['2017-04'].max())

print(data_close['2019-06'].min())

print(data_close['2019-07-31'])

Output:

DatetimeIndexResampler [freq=<MonthEnd>, axis=0, closed=right, label=right, convention=start, base=0]

 <class 'pandas.core.resample.DatetimeIndexResampler'>

===========

             open   high    low  close

date

2017-02-28   9.43   9.48   9.43   9.48

2017-03-31   9.49   9.52   9.08   9.17

2017-04-30   9.21   9.21   8.91   8.99

...

2019-06-30  11.90  13.80  11.85  13.78

2019-07-31  13.93  14.37  13.54  14.13

2019-08-31  14.10  15.12  13.35  14.25

===========

9.49

9.21

11.85

14.13

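② From low frequency to high frequency (upsampling): done by linear interpolation

Note: with resample_obj.interpolate(), each original row keeps its own timestamp, which is the first point of its bin in the new index (if daily data is resampled by hour, each day's actual value becomes the value at 00:00:00 of that day)

# Example of resample_obj.interpolate() on closing prices

import pandas as pd

import numpy as np

# Read the local file '000001.csv'

data = pd.read_csv('000001.csv', index_col=0, parse_dates=True)

# Take the closing prices of the second and third rows (positions 1 and 2)

data_close = data['close'].iloc[1:3]

print(data_close); print('===========')

# Get a resample object with a frequency of 1 hour

resample_obj = data_close.resample('H')

print(resample_obj,'\n',type(resample_obj)); print('===========')

# Fill the gaps between the two values by linear interpolation

print(resample_obj.interpolate())

Output: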

date

2017-02-28    9.48

2017-03-01    9.49

Name: close, dtype: float64

===========

DatetimeIndexResampler [freq=<Hour>, axis=0, closed=left, label=left, convention=start, base=0]

 <class 'pandas.core.resample.DatetimeIndexResampler'>

===========

date

2017-02-28 00:00:00    9.480000

2017-02-28 01:00:00    9.480417

2017-02-28 02:00:00    9.480833

2017-02-28 03:00:00    9.481250

...

2017-02-28 21:00:00    9.488750

2017-02-28 22:00:00    9.489167

2017-02-28 23:00:00    9.489583

2017-03-01 00:00:00    9.490000

Freq: H, Name: close, dtype: float64
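
The other upsampling methods behave differently on the same gap. The sketch below compares them on two made-up values three days apart; the data and variable names are assumptions for illustration.

```python
import pandas as pd

# Two observations three days apart (illustrative values)
s = pd.Series([1.0, 4.0], index=pd.to_datetime(['2019-09-22', '2019-09-25']))
daily = s.resample('D')            # upsample to daily frequency

as_is = daily.asfreq()             # new rows are left as NaN
forward = daily.ffill()            # each gap takes the previous value
backward = daily.bfill()           # each gap takes the next value
linear = daily.interpolate()       # gaps are filled on a straight line

print(as_is)
print(forward)
print(backward)
print(linear)
```

For prices, forward fill avoids using information from the future, while interpolation and backward fill both look ahead to the next observation.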

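(10) df.rolling(): rolling time window

Applies a rolling (sliding) window to a DataFrame or Series. The steps are:

First, obtain the DataFrame df (or the Series s)
Next, call rolling_obj = df.rolling(window, min_periods=None, center=False) to get a rolling object (of type pandas.core.window.Rolling). Parameters:
- window: the window size, i.e. how many values the window contains; required (int)
- min_periods: the minimum number of values needed for the window to produce a non-NaN result near the edges. The default is None, in which case the window must hold window values before a result is computed, and NaN is shown otherwise; once min_periods is set, a result is computed as soon as the window holds min_periods values.
- center: whether the window label is centred. The default is False, in which case the window is labelled with its last time point (the row label of the last row in the window); with center=True the window is labelled with its middle time point (the row label of the middle row in the window)
Finally, call a method of the rolling object, for example:
- rolling_obj.mean(): uses the mean of all values in the window as the value for the window label (simple moving average, SMA)
- rolling_obj.max(): uses the maximum of all values in the window as the value for the window label
- rolling_obj.min(): uses the minimum of all values in the window as the value for the window label
- rolling_obj.sum(): uses the sum of all values in the window as the value for the window label
- rolling_obj.std(): uses the standard deviation of all values in the window as the value for the window label
- rolling_obj.apply(<func>): passes all values in the window to a user-defined function func and uses the return value of func as the value for the window label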
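
The effect of min_periods, center and a custom function can be seen on a handful of made-up values. This is a minimal sketch; the data, the lambda and the variable names are assumptions for illustration, and raw=True passes each window to the function as a NumPy array.

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])   # illustrative values

# Custom function: range (max minus min) of each full window of 3 values
win_range = s.rolling(3).apply(lambda w: w.max() - w.min(), raw=True)

# min_periods=1: a result is produced even when the window holds fewer than 3 values
partial_mean = s.rolling(3, min_periods=1).mean()

# center=True: each result is labelled with the middle row of its window
centered_max = s.rolling(3, center=True).max()

print(win_range)
print(partial_mean)
print(centered_max)
```

With center=True the NaN values move from the first two rows to the first and last row, because a centred window of 3 is incomplete at both ends.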

# Rolling window example

import pandas as pd

import numpy as np

# Read the local file '000001.csv'

data = pd.read_csv('000001.csv', index_col=0, parse_dates=True)

# Get the daily closing prices

data_close = data['close']

# Build rolling objects with different parameters

rolling_obj3 = data_close.rolling(3)

rolling_obj32 = data_close.rolling(3, min_periods=2)

rolling_obj31 = data_close.rolling(3, min_periods=1)

rolling_obj3c = data_close.rolling(3, center=True)

# Apply the rolling objects

print('data_close\n',data_close); print('===========')

print(rolling_obj3,type(rolling_obj3)); print('===========')

print('rolling_obj3.mean()\n',rolling_obj3.mean()); print('===========')

print('rolling_obj32.mean()\n',rolling_obj32.mean()); print('===========')

print('rolling_obj31.mean()\n',rolling_obj31.mean()); print('===========')

print('rolling_obj3c.mean()\n',rolling_obj3c.mean())

Output:

data_close

 date

2017-02-27     9.43

2017-02-28     9.48

2017-03-01     9.49

2017-03-02     9.43

2017-03-03     9.40

              ...

2019-08-20    14.99

2019-08-21    14.45

2019-08-22    14.31

2019-08-23    14.65

2019-08-26    14.25

Name: close, Length: 613, dtype: float64

===========

Rolling [window=3,center=False,axis=0] <class 'pandas.core.window.Rolling'>

===========

rolling_obj3.mean()

 date

2017-02-27          NaN

2017-02-28          NaN

2017-03-01     9.466667

2017-03-02     9.466667

2017-03-03     9.440000

                ...

2019-08-20    14.936667

2019-08-21    14.786667

2019-08-22    14.583333

2019-08-23    14.470000

2019-08-26    14.403333

Name: close, Length: 613, dtype: float64

===========

rolling_obj32.mean()

 date

2017-02-27          NaN

2017-02-28     9.455000

2017-03-01     9.466667

2017-03-02     9.466667

2017-03-03     9.440000

                ...

2019-08-20    14.936667

2019-08-21    14.786667

2019-08-22    14.583333

2019-08-23    14.470000

2019-08-26    14.403333

Name: close, Length: 613, dtype: float64

===========

rolling_obj31.mean()

 date

2017-02-27     9.430000

2017-02-28     9.455000

2017-03-01     9.466667

2017-03-02     9.466667

2017-03-03     9.440000

                ...

2019-08-20    14.936667

2019-08-21    14.786667

2019-08-22    14.583333

2019-08-23    14.470000

2019-08-26    14.403333

Name: close, Length: 613, dtype: float64

===========

rolling_obj3c.mean()

 date

2017-02-27          NaN

2017-02-28     9.466667

2017-03-01     9.466667

2017-03-02     9.440000

2017-03-03     9.426667

                ...

2019-08-20    14.786667

2019-08-21    14.583333

2019-08-22    14.470000

2019-08-23    14.403333

2019-08-26          NaN

Name: close, Length: 613, dtype: float64