「Python 机器学习」Pandas 数据分析

时间:2021-02-15 01:21:07

Pandas 是一个开源的 Python 库,专为数据处理和分析任务而设计;它提供了高性能、易用的数据结构和数据分析工具,使得在 Python 中进行数据科学变得简单高效;Pandas 基于 NumPy,因此可以与许多其他基于 NumPy 的库(如 SciPy 和 scikit-learn)无缝集成;

Pandas 中的两个主要数据结构

  • Series,一个一维带标签的数组,可存储整数、浮点数、字符串等不同类型的数据;Series 具有索引(index),这使得它类似于 Python 字典,但具有更多的特性和功能;

  • DataFrame,一个二维带标签的数据结构,类似于表格或电子表格;它由一系列具有相同索引的列组成,每列可以具有不同的数据类型;DataFrame 提供了各种功能,如筛选、排序、分组、合并和聚合,以便在大型数据集上进行高效操作;

支持的数据处理和分析任务

  • 数据导入和导出
  • 数据清洗和预处理
  • 数据过滤和选择
  • 数据排序、排名和聚合
  • 缺失值处理
  • 分组操作
  • 数据透视表
  • 时间序列分析
  • 合并和连接多个数据集

Pandas 提供了丰富的功能,使得它成为 Python 数据科学生态系统中最受欢迎和广泛使用的库之一;

import pandas as pd
import numpy as np

1. Series

1. 构造与初始化

  1. Series 是一个一维数据结构,Pandas 会默认用 0 到 n 来作为 Series 的 index;
>>> s = pd.Series([1, 3, 'Beijing', 3.14, -123, 'Year!'])
>>> s
0          1
1          3
2    Beijing
3       3.14
4       -123
5      Year!
dtype: object
  1. 自行指定 index;
>>> s = pd.Series([1, 3, 'Beijing', 3.14, -123, 'Year!'], index=['A', 'B', 'C', 'D', 'E','G'])
>>> s
A          1
B          3
C    Beijing
D       3.14
E       -123
G      Year!
dtype: object
  1. 直接使用 dictionary 构造 Series, 因为 Series 本身就是 keyvalue pairs;
>>> cities = {'Beijing': 55000, 'Shanghai': 60000, 'Shenzhen': 50000,
>>>           'Hangzhou': 20000, 'Guangzhou': 25000, 'Suzhou': None}
>>> apts = pd.Series(cities)
>>> apts
Beijing      55000.0
Shanghai     60000.0
Shenzhen     50000.0
Hangzhou     20000.0
Guangzhou    25000.0
Suzhou           NaN
dtype: float64

2. 选择数据

  1. 通过 index 选择数据
>>> apts['Hangzhou']
20000.0

>>> apts[['Hangzhou', 'Beijing', 'Shenzhen']]
Hangzhou    20000.0
Beijing     55000.0
Shenzhen    50000.0
dtype: float64

>>> # boolean indexing
>>> apts[apts < 50000]
Hangzhou     20000.0
Guangzhou    25000.0
dtype: float64

>>> # boolean indexing 的工作方式
>>> less_than_50000 = apts < 50000
>>> less_than_50000
Beijing      False
Shanghai     False
Shenzhen     False
Hangzhou      True
Guangzhou     True
Suzhou       False
dtype: bool

>>> apts[less_than_50000]
Hangzhou     20000.0
Guangzhou    25000.0
dtype: float64

3. 元素赋值

>>> print("Old value: ", apts['Shenzhen'])
Old value:  50000.0

>>> apts['Shenzhen'] = 55000
>>> print("New value: ", apts['Shenzhen'])
New value:  55000.0

>>> print(apts[apts < 50000])
Hangzhou     20000.0
Guangzhou    25000.0
dtype: float64

>>> apts[apts <= 50000] = 40000
>>> print(apts[apts < 50000])
angzhou     40000.0
Guangzhou    40000.0
dtype: float64

4. 数学运算

>>> apts / 2
Beijing      27500.0
Shanghai     30000.0
Shenzhen     27500.0
Hangzhou     20000.0
Guangzhou    20000.0
Suzhou           NaN
dtype: float64

>>> np.square(apts)
Beijing      3.025000e+09
Shanghai     3.600000e+09
Shenzhen     3.025000e+09
Hangzhou     1.600000e+09
Guangzhou    1.600000e+09
Suzhou                NaN
dtype: float64

>>> cars = pd.Series({'Beijing': 300000, 'Shanghai': 400000, 'Shenzhen': 300000,
>>>                   'Tianjin': 200000, 'Guangzhou': 200000, 'Chongqing': 150000})
>>> cars
Beijing      300000
Shanghai     400000
Shenzhen     300000
Tianjin      200000
Guangzhou    200000
Chongqing    150000
dtype: int64

>>> # 按 index 运算
>>> cars + apts * 100
Beijing      5800000.0
Chongqing          NaN
Guangzhou    4200000.0
Hangzhou           NaN
Shanghai     6400000.0
Shenzhen     5800000.0
Suzhou             NaN
Tianjin            NaN
dtype: float64

5. 数据缺失

>>> print('Hangzhou' in apts)
True

>>> print('Hangzhou' in cars)
False

>>> apts.notnull()
Beijing       True
Shanghai      True
Shenzhen      True
Hangzhou      True
Guangzhou     True
Suzhou       False
dtype: bool

>>> print(apts.isnull())
Beijing      False
Shanghai     False
Shenzhen     False
Hangzhou     False
Guangzhou    False
Suzhou        True
dtype: bool

>>> print(apts[apts.isnull()])
Suzhou   NaN
dtype: float64

2. DataFrame

一个 DataFrame 就是一张表格,Series 表示的是一维数组,DataFrame 则是一个二维数组,可以类比成一张 excel 的表格;也可以把 DataFrame 当做一个一组 Series 的集合;

1. 构造和选择数据

  1. 由一个 dictionary 构造;
>>> data = {'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Hangzhou', 'Chongqing'],
>>>         'year': [2016, 2017, 2016, 2017, 2016, 2016],
>>>         'population': [2100, 2300, 1000, 700, 500, 500]}
>>> d1 = pd.DataFrame(data)
>>> print(d1)
        city  year  population
0    Beijing  2016        2100
1   Shanghai  2017        2300
2  Guangzhou  2016        1000
3   Shenzhen  2017         700
4   Hangzhou  2016         500
5  Chongqing  2016         500
  1. 全列遍历
>>> for row in d1.values:
>>>     print(row)
['Beijing' 2016 2100]
['Shanghai' 2017 2300]
['Guangzhou' 2016 1000]
['Shenzhen' 2017 700]
['Hangzhou' 2016 500]
['Chongqing' 2016 500]
  1. 选择列并合并遍历
>>> for row in zip(d1['city'], d1['year'], d1['population']):
>>>     print(row)
('Beijing', 2016, 2100)
('Shanghai', 2017, 2300)
('Guangzhou', 2016, 1000)
('Shenzhen', 2017, 700)
('Hangzhou', 2016, 500)
('Chongqing', 2016, 500)

>>> print(d1.columns)
Index(['city', 'year', 'population'], dtype='object')
  1. 列序重组
>>> print(pd.DataFrame(data, columns=['year', 'city', 'population']))
   year       city  population
0  2016    Beijing        2100
1  2017   Shanghai        2300
2  2016  Guangzhou        1000
3  2017   Shenzhen         700
4  2016   Hangzhou         500
5  2016  Chongqing         500
  1. 行与列索引
>>> frame2 = pd.DataFrame(data,
>>>                       columns=['year', 'city', 'population', 'debt'],
>>>                       index=['one', 'two', 'three', 'four', 'five', 'six'])
>>> print(frame2)
       year       city  population debt
one    2016    Beijing        2100  NaN
two    2017   Shanghai        2300  NaN
three  2016  Guangzhou        1000  NaN
four   2017   Shenzhen         700  NaN
five   2016   Hangzhou         500  NaN
six    2016  Chongqing         500  NaN

>>> print(frame2['city'])
one        Beijing
two       Shanghai
three    Guangzhou
four      Shenzhen
five      Hangzhou
six      Chongqing
Name: city, dtype: object

>>> print(frame2.year)
one      2016
two      2017
three    2016
four     2017
five     2016
six      2016
Name: year, dtype: int64

>>> # loc 取 label based indexing or iloc 取 positional indexing
>>> print(frame2.loc['three'])
year               2016
city          Guangzhou
population         1000
debt                NaN
Name: three, dtype: object

>>> print(frame2.iloc[2].copy())
year               2016
city          Guangzhou
population         1000
debt                NaN
Name: three, dtype: object

2. 元素赋值

  1. 整列赋值(单值);
>>> frame2['debt'] = 100
>>> print(frame2)
       year       city  population  debt
one    2016    Beijing        2100   100
two    2017   Shanghai        2300   100
three  2016  Guangzhou        1000   100
four   2017   Shenzhen         700   100
five   2016   Hangzhou         500   100
six    2016  Chongqing         500   100
  1. 整列赋值(列表值);
>>> frame2.debt = np.arange(6)
>>> print(frame2)
       year       city  population  debt
one    2016    Beijing        2100     0
two    2017   Shanghai        2300     1
three  2016  Guangzhou        1000     2
four   2017   Shenzhen         700     3
five   2016   Hangzhou         500     4
six    2016  Chongqing         500     5
  1. 用 Series 来指定需要修改的 index 以及相对应的 value,没有指定的默认用 NaN;
>>> val = pd.Series([100, 200, 300], index=['two', 'three', 'five'])
>>> frame2['debt'] = val
>>> print(frame2)
       year       city  population   debt
one    2016    Beijing        2100    NaN
two    2017   Shanghai        2300  100.0
three  2016  Guangzhou        1000  200.0
four   2017   Shenzhen         700    NaN
five   2016   Hangzhou         500  300.0
six    2016  Chongqing         500    NaN
  1. 用存在的列赋值(创建新列);
>>> frame2['western'] = (frame2.city == 'Chongqing')
>>> print(frame2)
       year       city  population   debt  western
one    2016    Beijing        2100    NaN    False
two    2017   Shanghai        2300  100.0    False
three  2016  Guangzhou        1000  200.0    False
four   2017   Shenzhen         700    NaN    False
five   2016   Hangzhou         500  300.0    False
six    2016  Chongqing         500    NaN     True
  1. DataFrame 的转置;
>>> pop = {'Beijing': {2016: 2100, 2017: 2200},
>>>        'Shanghai': {2015: 2400, 2016: 2500, 2017: 2600}}
>>> frame3 = pd.DataFrame(pop)
>>> print(frame3)
      Beijing  Shanghai
2016   2100.0      2500
2017   2200.0      2600
2015      NaN      2400

>>> print(frame3.T)
            2016    2017    2015
Beijing   2100.0  2200.0     NaN
Shanghai  2500.0  2600.0  2400.0
  1. 行序重组;
>>> pd.DataFrame(pop, index=[2015, 2016, 2017])
	Beijing	Shanghai
2015	NaN	2400
2016	2100.0	2500
2017	2200.0	2600
  1. 使用切片初始化数据;
>>> pdata = {'Beijing': frame3['Beijing'][:-1],
>>>          'Shanghai': frame3['Shanghai'][:-1]}
>>> pd.DataFrame(pdata)
	Beijing	Shanghai
2016	2100.0	2500
2017	2200.0	2600
  1. 指定 index 的名字和列的名字;
>>> frame3.index.name = 'year'
>>> frame3.columns.name = 'city'
>>> print(frame3)
city  Beijing  Shanghai
year
2016   2100.0      2500
2017   2200.0      2600
2015      NaN      2400

>>> print(frame2.values)
[[2016 'Beijing' 2100 nan False]
 [2017 'Shanghai' 2300 100.0 False]
 [2016 'Guangzhou' 1000 200.0 False]
 [2017 'Shenzhen' 700 nan False]
 [2016 'Hangzhou' 500 300.0 False]
 [2016 'Chongqing' 500 nan True]]

>>> print(frame2)
      year       city  population   debt  western
one    2016    Beijing        2100    NaN    False
two    2017   Shanghai        2300  100.0    False
three  2016  Guangzhou        1000  200.0    False
four   2017   Shenzhen         700    NaN    False
five   2016   Hangzhou         500  300.0    False
six    2016  Chongqing         500    NaN     True

>>> print(type(frame2.values))
<class 'numpy.ndarray'>

3. Index

1. Index 对象

>>> obj = pd.Series(range(3), index=['a', 'b', 'c'])
>>> index = obj.index
>>> index
Index(['a', 'b', 'c'], dtype='object')

>>> index[1:]
Index(['b', 'c'], dtype='object')

>>> # index 不能被动态改动
>>> # index[1]='d'

>>> index = pd.Index(np.arange(3))
>>> obj2 = pd.Series([2, 5, 7], index=index)
>>> print(obj2)
0    2
1    5
2    7
dtype: int64

>>> print(obj2.index is index)
True

>>> 2 in obj2.index
True

>>> pop = {'Beijing': {2016: 2100, 2017: 2200},
>>>        'Shanghai': {2015: 2400, 2016: 2500, 2017: 2600}}
>>> frame3 = pd.DataFrame(pop)
>>> print('Shanghai' in frame3.columns)
True

>>> print('2015' in frame3.index)
False

>>> print(2015 in frame3.index)
True

2. Index 索引和切片

>>> obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
>>> print(obj)
a    0
b    1
c    2
d    3
dtype: int64

>>> print(obj['b'])
1
  1. 使用默认的数字 index;
>>> print(obj[3])
3

>>> print(obj[[1, 3]])
b    1
d    3
dtype: int64

>>> print(obj[['b', 'd']])
b    1
d    3
dtype: int64
  1. 条件筛选;
>>> print(obj[obj < 2])
a    0
b    1
dtype: int32
  1. 切片筛选与赋值
>>> print(obj['b':'c'])
b    1
c    2
dtype: int64

>>> obj['b':'c'] = 5
>>> print(obj)
a    0
b    5
c    5
d    3
dtype: int64
  1. DataFrame 的 Indexing
>>> a = np.arange(9).reshape(3, 3)
>>> print(a)
[[0 1 2]
 [3 4 5]
 [6 7 8]]

>>> frame = pd.DataFrame(a,
>>>                      index=['a', 'c', 'd'],
>>>                      columns=['Hangzhou', 'Shenzhen', 'Nanjing'])
>>> frame
   Hangzhou  Shenzhen  Nanjing
a         0         1        2
c         3         4        5
d         6         7        8

>>> frame['Hangzhou']
a    0
c    3
d    6
Name: Hangzhou, dtype: int64

>>> frame[['Shenzhen', 'Nanjing']]
   Shenzhen  Nanjing
a         1        2
c         4        5
d         7        8

>>> frame[:2]
   Hangzhou  Shenzhen  Nanjing
a         0         1        2
c         3         4        5

>>> frame.loc['a']
Hangzhou    0
Shenzhen    1
Nanjing     2
Name: a, dtype: int64

>>> frame.loc[['a', 'd'], ['Shenzhen', 'Nanjing']]
   Shenzhen  Nanjing
a         1        2
d         7        8

>>> frame.loc[:'c', 'Hangzhou']
a    0
c    3
Name: Hangzhou, dtype: int64
  1. DataFrame 的 condition selection
>>> frame[frame.Hangzhou > 1]
   Hangzhou  Shenzhen  Nanjing
c         3         4        5
d         6         7        8

>>> frame < 5
   Hangzhou  Shenzhen  Nanjing
a      True      True     True
c      True      True    False
d     False     False    False

>>> frame[frame < 5] = 0
>>> frame
   Hangzhou  Shenzhen  Nanjing
a         0         0        0
c         0         0        5
d         6         7        8

3. reindex

  1. 按照新的 index 顺序进行重排;
>>> obj = pd.Series([4.5, 7.2, -5.3, 3.2], index=['d', 'b', 'a', 'c'])
>>> print(obj)
d    4.5
b    7.2
a   -5.3
c    3.2
dtype: float64

>>> obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
>>> obj2
a   -5.3
b    7.2
c    3.2
d    4.5
e    NaN
dtype: float64
  1. 在新的 index 上填充指定值;
>>> obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
a   -5.3
b    7.2
c    3.2
d    4.5
e    0.0
dtype: float64
  1. 在新的 index 上填充前面最近的值;
>>> obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
>>> obj3
0      blue
2    purple
4    yellow
dtype: object

>>> # forward
>>> obj3.reindex(range(6), method='ffill')
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
  1. 在新的 index 上填充后面最近的值;
>>> # backward
>>> obj3.reindex(range(6), method='bfill')
0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object
  1. 对 DataFrame 进行 reindex;
>>> frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'c', 'd'],
                     columns=['Hangzhou', 'Shenzhen', 'Nanjing'])
>>> frame
   Hangzhou  Shenzhen  Nanjing
a         0         1        2
c         3         4        5
d         6         7        8

>>> frame2 = frame.reindex(['a', 'b', 'c', 'd'])
>>> frame2
   Hangzhou  Shenzhen  Nanjing
a       0.0       1.0      2.0
b       NaN       NaN      NaN
c       3.0       4.0      5.0
d       6.0       7.0      8.0
  1. 重新指定 columns;
>>> frame.reindex(columns=['Shenzhen', 'Hangzhou', 'Chongqing'])
   Shenzhen  Hangzhou  Chongqing
a         1         0        NaN
c         4         3        NaN
d         7         6        NaN

>>> frame3 = frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill').reindex(
>>>     columns=['Shenzhen', 'Hangzhou', 'Chongqing'])
>>> print(frame3)
   Shenzhen  Hangzhou  Chongqing
a         1         0        NaN
b         1         0        NaN
c         4         3        NaN
d         7         6        NaN

>>> print(frame3.loc[['a', 'b', 'd', 'c'],
>>>       ['Shenzhen', 'Hangzhou', 'Chongqing']])
   Shenzhen  Hangzhou  Chongqing
a         1         0        NaN
b         1         0        NaN
d         7         6        NaN
c         4         3        NaN

4. drop

  1. 删除 Series 和 DataFrame 中的 index;
>>> print(obj3)
0      blue
2    purple
4    yellow
dtype: object

>>> obj4 = obj3.drop(2)
>>> print(obj4)
0      blue
4    yellow
dtype: object

>>> print(obj3.drop([2, 4]))
0    blue
dtype: object
>>> print(frame)
   Hangzhou  Shenzhen  Nanjing
a         0         1        2
c         3         4        5
d         6         7        8

>>> print(frame.drop(['a', 'c']))
   Hangzhou  Shenzhen  Nanjing
d         6         7        8
  1. 删除 DataFrame 中的 columns;
>>> print(frame.drop('Shenzhen', axis=1))
   Hangzhou  Nanjing
a         0        2
c         3        5
d         6        8

>>> print(frame.drop(['Shenzhen', 'Hangzhou'], axis=1))
   Nanjing
a        2
c        5
d        8

5. Hierarchical Indexing

  1. Series 的 hierarchical indexing;
>>> data = pd.Series(np.random.randn(10),
>>>                  index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd'],
>>>                         [1, 2, 3, 1, 2, 1, 2, 3, 1, 2]])
>>> data
a  1   -0.587772
   2    0.597073
   3   -2.354382
b  1    1.403719
   2   -0.612704
c  1   -1.409393
   2    2.098933
   3    0.076322
d  1    0.295683
   2    1.188039

>>> data.index
MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 2),
            ('c', 1),
            ('c', 2),
            ('c', 3),
            ('d', 1),
            ('d', 2)],
           )

>>> data.b
1    1.403719
2   -0.612704
dtype: float64

>>> data['b':'c']
b  1    1.403719
   2   -0.612704
c  1   -1.409393
   2    2.098933
   3    0.076322
dtype: float64

>>> data[:2]
a  1   -0.587772
   2    0.597073
dtype: float64
  1. unstack 和 stack;
>>> # 将 hierarchical indexing 的 Series 转换为 DataFrame
>>> data.unstack()
          1         2         3
a -0.587772  0.597073 -2.354382
b  1.403719 -0.612704       NaN
c -1.409393  2.098933  0.076322
d  0.295683  1.188039       NaN

>>> type(data.unstack())
pandas.core.frame.DataFrame

>>> # 将 DataFrame 转换为 hierarchical indexing 的 Series
>>> data.unstack().stack()
a  1   -0.587772
   2    0.597073
   3   -2.354382
b  1    1.403719
   2   -0.612704
c  1   -1.409393
   2    2.098933
   3    0.076322
d  1    0.295683
   2    1.188039
dtype: float64
  1. DataFrame 的 hierarchical indexing;
>>> frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
>>>                      index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
>>>                      columns=[['Beijing', 'Beijing', 'Shanghai'],
>>>                               ['apts', 'cars', 'apts']])
>>> frame
    Beijing      Shanghai
       apts cars     apts
a 1       0    1        2
  2       3    4        5
b 1       6    7        8
  2       9   10       11

>>> frame.index.names = ['key1', 'key2']
>>> frame.columns.names = ['city', 'type']
>>> frame
city      Beijing      Shanghai
type         apts cars     apts
key1 key2
a    1          0    1        2
     2          3    4        5
b    1          6    7        8
     2          9   10       11

>>> frame.loc['a', 1]
city      type
Beijing   apts    0
          cars    1
Shanghai  apts    2
Name: (a, 1), dtype: int64

>>> frame.loc['a', 2]['Beijing']
type
apts    3
cars    4
Name: (a, 2), dtype: int64

>>> frame.loc['a', 2]['Beijing']['apts'] # 等价:frame.loc['a', 2]['Beijing', 'apts']
3

4. Concatenate 与 Append

>>> df1 = pd.DataFrame({'apts': [55000, 60000],
>>>                    'cars': [200000, 300000], },
>>>                    index=['Shanghai', 'Beijing'])
>>> print(df1)
           apts    cars
Shanghai  55000  200000
Beijing   60000  300000

>>> df2 = pd.DataFrame({'apts': [25000, 20000],
>>>                    'cars': [150000, 120000], },
>>>                    index=['Hangzhou', 'Najing'])
>>> print(df2)
           apts    cars
Hangzhou  25000  150000
Najing    20000  120000

>>> df3 = pd.DataFrame({'apts': [30000, 10000],
>>>                    'cars': [180000, 100000], },
>>>                    index=['Guangzhou', 'Chongqing'])
>>> print(df3)
            apts    cars
Guangzhou  30000  180000
Chongqing  10000  100000

1. 纵向 concat

>>> frames = [df1, df2, df3]
>>> print(frames)
[           apts    cars
Shanghai  55000  200000
Beijing   60000  300000,            apts    cars
Hangzhou  25000  150000
Najing    20000  120000,             apts    cars
Guangzhou  30000  180000
Chongqing  10000  100000]

>>> result = pd.concat(frames)
>>> print(result)
            apts    cars
Shanghai   55000  200000
Beijing    60000  300000
Hangzhou   25000  150000
Najing     20000  120000
Guangzhou  30000  180000
Chongqing  10000  100000
>>> # 给 concat 的每一个部分加上一个 Key
>>> result2 = pd.concat(frames, keys=['x', 'y', 'z'])
>>> print(result2)
              apts    cars
x Shanghai   55000  200000
  Beijing    60000  300000
y Hangzhou   25000  150000
  Najing     20000  120000
z Guangzhou  30000  180000
  Chongqing  10000  100000

>>> result2.loc['y']
           apts    cars
Hangzhou  25000  150000
Najing    20000  120000

2. 横向 concat

>>> df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000]},
>>>                    index=['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin'])
>>> print(df4)
           salaries
Suzhou        10000
Beijing       30000
Shanghai      30000
Guangzhou     20000
Tianjin       15000

>>> result3 = pd.concat([result, df4], axis=1, sort=True)
>>> print(result3)
              apts      cars  salaries
Beijing    60000.0  300000.0   30000.0
Chongqing  10000.0  100000.0       NaN
Guangzhou  30000.0  180000.0   20000.0
Hangzhou   25000.0  150000.0       NaN
Najing     20000.0  120000.0       NaN
Shanghai   55000.0  200000.0   30000.0
Suzhou         NaN       NaN   10000.0
Tianjin        NaN       NaN   15000.0

>>> # DataFrame 转化为 hierarchical indexing 的 Series
>>> print(result3.stack())
Beijing    apts         60000.0
           cars        300000.0
           salaries     30000.0
Chongqing  apts         10000.0
           cars        100000.0
Guangzhou  apts         30000.0
           cars        180000.0
           salaries     20000.0
Hangzhou   apts         25000.0
           cars        150000.0
Najing     apts         20000.0
           cars        120000.0
Shanghai   apts         55000.0
           cars        200000.0
           salaries     30000.0
Suzhou     salaries     10000.0
Tianjin    salaries     15000.0
dtype: float64

3. join concat

  1. 按 index 进行 inner join;
>>> print(result)
            apts    cars
Shanghai   55000  200000
Beijing    60000  300000
Hangzhou   25000  150000
Najing     20000  120000
Guangzhou  30000  180000
Chongqing  10000  100000

>>> print(df4)
           salaries
Suzhou        10000
Beijing       30000
Shanghai      30000
Guangzhou     20000
Tianjin       15000

>>> result3 = pd.concat([result, df4], axis=1, join='inner')
>>> print(result3)
           apts    cars  salaries
Shanghai   55000  200000     30000
Beijing    60000  300000     30000
Guangzhou  30000  180000     20000

4. append(deprecated)

  1. 纵向 concat;
>>> print(df1)
           apts    cars
Shanghai  55000  200000
Beijing   60000  300000

>>> print(df2)
           apts    cars
Hangzhou  25000  150000
Najing    20000  120000

>>> df1.append(df2)
           apts    cars
Shanghai  55000  200000
Beijing   60000  300000
Hangzhou  25000  150000
Najing    20000  120000
  1. 横向 concat;
>>> print(df1)
           apts    cars
Shanghai  55000  200000
Beijing   60000  300000

>>> print(df4)
           salaries
Suzhou        10000
Beijing       30000
Shanghai      30000
Guangzhou     20000
Tianjin       15000

>>> df1.append(df4, sort=True)
              apts      cars  salaries
Shanghai   55000.0  200000.0       NaN
Beijing    60000.0  300000.0       NaN
Suzhou         NaN       NaN   10000.0
Beijing        NaN       NaN   30000.0
Shanghai       NaN       NaN   30000.0
Guangzhou      NaN       NaN   20000.0
Tianjin        NaN       NaN   15000.0

5. Series 与 DataFrame 进行 concatenate

  1. Series 作为 Column 进行 concat;
>>> s1 = pd.Series([60, 50], index=['Shanghai', 'Beijing'], name='meal')
>>> s1
Shanghai    60
Beijing     50
Name: meal, dtype: int64

>>> print(df1)
           apts    cars
Shanghai  55000  200000
Beijing   60000  300000

>>> print(s1)
Shanghai    60
Beijing     50
Name: meal, dtype: int64

>>> print(pd.concat([df1, s1], axis=1))
           apts    cars  meal
Shanghai  55000  200000    60
Beijing   60000  300000    50
  1. Series 作为 Row 进行 concat;
>>> s2 = pd.Series([18000, 12000], index=['apts', 'cars'], name='Xiamen')
>>> s2
apts    18000
cars    12000
Name: Xiamen, dtype: int64

>>> print(df1.append(s2))
           apts    cars
Shanghai  55000  200000
Beijing   60000  300000
Xiamen    18000   12000

5. Merge

1. 按指定列进行 inner join

>>> df1 = pd.DataFrame({'apts': [55000, 60000, 58000],
>>>                    'cars': [200000, 300000, 250000],
>>>                     'cities': ['Shanghai', 'Beijing', 'Shenzhen']})
>>> df1
    apts    cars    cities
0  55000  200000  Shanghai
1  60000  300000   Beijing
2  58000  250000  Shenzhen

>>> df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000],
>>>                     'cities': ['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin']})
>>> df4
   salaries     cities
0     10000     Suzhou
1     30000    Beijing
2     30000   Shanghai
3     20000  Guangzhou
4     15000    Tianjin

>>> pd.merge(df1, df4, on='cities')
    apts    cars    cities  salaries
0  55000  200000  Shanghai     30000
1  60000  300000   Beijing     30000

2. 按指定列进行 outer join

>>> pd.merge(df1, df4, on='cities', how='outer')
      apts      cars     cities  salaries
0  55000.0  200000.0   Shanghai   30000.0
1  60000.0  300000.0    Beijing   30000.0
2  58000.0  250000.0   Shenzhen       NaN
3      NaN       NaN     Suzhou   10000.0
4      NaN       NaN  Guangzhou   20000.0
5      NaN       NaN    Tianjin   15000.0

3. 按指定列进行 right join

>>> pd.merge(df1, df4, on='cities', how='right')
      apts      cars     cities  salaries
0      NaN       NaN     Suzhou     10000
1  60000.0  300000.0    Beijing     30000
2  55000.0  200000.0   Shanghai     30000
3      NaN       NaN  Guangzhou     20000
4      NaN       NaN    Tianjin     15000

4. 按指定列进行 left join

>>> pd.merge(df1, df4, on='cities', how='left')
    apts    cars    cities  salaries
0  55000  200000  Shanghai   30000.0
1  60000  300000   Beijing   30000.0
2  58000  250000  Shenzhen       NaN

6. Join

>>> df1 = pd.DataFrame({'apts': [55000, 60000, 58000],
>>>                    'cars': [200000, 300000, 250000]},
>>>                    index=['Shanghai', 'Beijing', 'Shenzhen'])
>>> df1
           apts    cars
Shanghai  55000  200000
Beijing   60000  300000
Shenzhen  58000  250000

>>> df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000]},
>>>                    index=['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin'])
>>> df4
           salaries
Suzhou        10000
Beijing       30000
Shanghai      30000
Guangzhou     20000
Tianjin       15000

1. 按 index 进行 inner join

>>> df1.join(df4)
           apts    cars  salaries
Shanghai  55000  200000   30000.0
Beijing   60000  300000   30000.0
Shenzhen  58000  250000       NaN

2. 按 index 进行 outer join

>>> df1.join(df4, how='outer')
              apts      cars  salaries
Beijing    60000.0  300000.0   30000.0
Guangzhou      NaN       NaN   20000.0
Shanghai   55000.0  200000.0   30000.0
Shenzhen   58000.0  250000.0       NaN
Suzhou         NaN       NaN   10000.0
Tianjin        NaN       NaN   15000.0

>>> pd.merge(df1, df4, left_index=True, right_index=True, how='outer')
              apts      cars  salaries
Beijing    60000.0  300000.0   30000.0
Guangzhou      NaN       NaN   20000.0
Shanghai   55000.0  200000.0   30000.0
Shenzhen   58000.0  250000.0       NaN
Suzhou         NaN       NaN   10000.0
Tianjin        NaN       NaN   15000.0

7. Group By

1. 分组求和

>>> salaries = pd.DataFrame({
>>>     'Name': ['July', 'Chu', 'Chu', 'Lin', 'July', 'July', 'Chu', 'July'],
>>>     'Year': [2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017],
>>>     'Salary': [10000, 2000, 4000, 5000, 18000, 25000, 3000, 4000],
>>>     'Bonus': [3000, 1000, 1000, 1200, 4000, 2300, 500, 1000]
>>> })
>>> salaries
   Name  Year  Salary  Bonus
0  July  2016   10000   3000
1   Chu  2016    2000   1000
2   Chu  2016    4000   1000
3   Lin  2016    5000   1200
4  July  2017   18000   4000
5  July  2017   25000   2300
6   Chu  2017    3000    500
7  July  2017    4000   1000

>>> group_by_name = salaries.groupby('Name')
>>> group_by_name
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11c48e550>

>>> group_by_name.aggregate(sum)
      Year  Salary  Bonus
Name
Chu   6049    9000   2500
July  8067   57000  10300
Lin   2016    5000   1200

>>> group_by_name.sum()
      Year  Salary  Bonus
Name
Chu   6049    9000   2500
July  8067   57000  10300
Lin   2016    5000   1200

>>> group_by_name_year = salaries.groupby(['Name', 'Year'])
>>> group_by_name_year.sum()
           Salary  Bonus
Name Year
Chu  2016    6000   2000
     2017    3000    500
July 2016   10000   3000
     2017   47000   7300
Lin  2016    5000   1200

>>> group_by_name_year.size()
Name  Year
Chu   2016    2
      2017    1
July  2016    1
      2017    3
Lin   2016    1
dtype: int64

2. 展示分组的各种统计信息

>>> group_by_name_year.describe()
          Salary                                                         \
           count          mean           std      min      25%      50%
Name Year
Chu  2016    2.0   3000.000000   1414.213562   2000.0   2500.0   3000.0
     2017    1.0   3000.000000           NaN   3000.0   3000.0   3000.0
July 2016    1.0  10000.000000           NaN  10000.0  10000.0  10000.0
     2017    3.0  15666.666667  10692.676622   4000.0  11000.0  18000.0
Lin  2016    1.0   5000.000000           NaN   5000.0   5000.0   5000.0

                            Bonus                                           \
               75%      max count         mean         std     min     25%
Name Year
Chu  2016   3500.0   4000.0   2.0  1000.000000     0.00000  1000.0  1000.0
     2017   3000.0   3000.0   1.0   500.000000         NaN   500.0   500.0
July 2016  10000.0  10000.0   1.0  3000.000000         NaN  3000.0  3000.0
     2017  21500.0  25000.0   3.0  2433.333333  1504.43788  1000.0  1650.0
Lin  2016   5000.0   5000.0   1.0  1200.000000         NaN  1200.0  1200.0


              50%     75%     max
Name Year
Chu  2016  1000.0  1000.0  1000.0
     2017   500.0   500.0   500.0
July 2016  3000.0  3000.0  3000.0
     2017  2300.0  3150.0  4000.0
Lin  2016  1200.0  1200.0  1200.0

8. 应用案例(计算每个工作日的骑车人数之和)

1. 读取 bike.csv 到 DataFrame

read_csv API reference

bike.csv 记录了 Montreal 自行车路线的数据,具体有 7 条路线,数据记录了每条自行车路线每天分别有多少人;

>>> pd.set_option('display.max_columns', 60)
>>> bikes = pd.read_csv('data/bikes.csv', encoding='latin1')
>>> bikes
    Date;Berri 1;Brébeuf (données non disponibles);Côte-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (données non disponibles)
0                     01/01/2012;35;;0;38;51;26;10;16;
1                     02/01/2012;83;;1;68;153;53;6;43;
2                   03/01/2012;135;;2;104;248;89;3;58;
3                  04/01/2012;144;;1;116;318;111;8;61;
4                  05/01/2012;197;;2;124;330;97;13;95;
..                                                 ...
305      01/11/2012;2405;;1208;1701;3082;2076;165;2461
306        02/11/2012;1582;;737;1109;2277;1392;97;1888
307          03/11/2012;844;;380;612;1137;713;105;1302
308          04/11/2012;966;;446;710;1277;692;197;1374
309      05/11/2012;2247;;1170;1705;3221;2143;179;2430

[310 rows x 1 columns]

按分割好的列装载 bike.csv 到 DataFrame;

>>> bikes = pd.read_csv('data/bikes.csv', sep=';',
>>>                     parse_dates=['Date'], encoding='latin1', dayfirst=True, index_col='Date')
>>> bikes
            Berri 1  Brébeuf (données non disponibles)  Côte-Sainte-Catherine  \
Date
2012-01-01       35                                NaN                      0
2012-01-02       83                                NaN                      1
2012-01-03      135                                NaN                      2
2012-01-04      144                                NaN                      1
2012-01-05      197                                NaN                      2
...             ...                                ...                    ...
2012-11-01     2405                                NaN                   1208
2012-11-02     1582                                NaN                    737
2012-11-03      844                                NaN                    380
2012-11-04      966                                NaN                    446
2012-11-05     2247                                NaN                   1170

            Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1  \
Date
2012-01-01             38             51       26            10       16
2012-01-02             68            153       53             6       43
2012-01-03            104            248       89             3       58
2012-01-04            116            318      111             8       61
2012-01-05            124            330       97            13       95
...                   ...            ...      ...           ...      ...
2012-11-01           1701           3082     2076           165     2461
2012-11-02           1109           2277     1392            97     1888
2012-11-03            612           1137      713           105     1302
...
2012-11-03                                  NaN
2012-11-04                                  NaN
2012-11-05                                  NaN

[310 rows x 9 columns]

2. 查看数据样例

  1. 使用 head 与切片取前 5 行;
>>> bikes.head(5)
            Berri 1  Brébeuf (données non disponibles)  Côte-Sainte-Catherine  \
Date
2012-01-01       35                                NaN                      0
2012-01-02       83                                NaN                      1
2012-01-03      135                                NaN                      2
2012-01-04      144                                NaN                      1
2012-01-05      197                                NaN                      2

            Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1  \
Date
2012-01-01             38             51       26            10       16
2012-01-02             68            153       53             6       43
2012-01-03            104            248       89             3       58
2012-01-04            116            318      111             8       61
2012-01-05            124            330       97            13       95

            St-Urbain (données non disponibles)
Date
2012-01-01                                  NaN
2012-01-02                                  NaN
2012-01-03                                  NaN
2012-01-04                                  NaN
2012-01-05                                  NaN

>>> bikes[:5]
            Berri 1  Brébeuf (données non disponibles)  Côte-Sainte-Catherine  \
Date
2012-01-01       35                                NaN                      0
2012-01-02       83                                NaN                      1
2012-01-03      135                                NaN                      2
2012-01-04      144                                NaN                      1
2012-01-05      197                                NaN                      2

            Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1  \
Date
2012-01-01             38             51       26            10       16
2012-01-02             68            153       53             6       43
2012-01-03            104            248       89             3       58
2012-01-04            116            318      111             8       61
2012-01-05            124            330       97            13       95

            St-Urbain (données non disponibles)
Date
2012-01-01                                  NaN
2012-01-02                                  NaN
2012-01-03                                  NaN
2012-01-04                                  NaN
2012-01-05                                  NaN
  1. 使用 copy 复制选取的部分;
>>> berri_bikes = bikes[['Berri 1']].copy()
>>> berri_bikes.head()
            Berri 1
Date
2012-01-01       35
2012-01-02       83
2012-01-03      135
2012-01-04      144
2012-01-05      197

>>> berri_bikes.index
DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03', '2012-01-04',
               '2012-01-05', '2012-01-06', '2012-01-07', '2012-01-08',
               '2012-01-09', '2012-01-10',
               ...
               '2012-10-27', '2012-10-28', '2012-10-29', '2012-10-30',
               '2012-10-31', '2012-11-01', '2012-11-02', '2012-11-03',
               '2012-11-04', '2012-11-05'],
              dtype='datetime64[ns]', name='Date', length=310, freq=None)

>>> # 查看日期是每月的第几天
>>> berri_bikes.index.day
Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
            ...
            27, 28, 29, 30, 31,  1,  2,  3,  4,  5],
           dtype='int64', name='Date', length=310)

>>> # 查看日期是星期几
>>> berri_bikes.index.weekday
Int64Index([6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
            ...
            5, 6, 0, 1, 2, 3, 4, 5, 6, 0],
           dtype='int64', name='Date', length=310)

3. dropna

  1. 删除所有带 NaN 的行;
>>> bikes.dropna()
Empty DataFrame
Columns: [Berri 1, Brébeuf (données non disponibles), Côte-Sainte-Catherine, Maisonneuve 1, Maisonneuve 2, du Parc, Pierre-Dupuy, Rachel1, St-Urbain (données non disponibles)]
Index: []
  1. 删除整行都为 NaN 的行;
>>> bikes.dropna(how='all').head()
            Berri 1  Brébeuf (données non disponibles)  Côte-Sainte-Catherine  \
Date
2012-01-01       35                                NaN                      0
2012-01-02       83                                NaN                      1
2012-01-03      135                                NaN                      2
2012-01-04      144                                NaN                      1
2012-01-05      197                                NaN                      2

            Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1  \
Date
2012-01-01             38             51       26            10       16
2012-01-02             68            153       53             6       43
2012-01-03            104            248       89             3       58
2012-01-04            116            318      111             8       61
2012-01-05            124            330       97            13       95

            St-Urbain (données non disponibles)
Date
2012-01-01                                  NaN
2012-01-02                                  NaN
2012-01-03                                  NaN
2012-01-04                                  NaN
2012-01-05                                  NaN
  1. 删除整列都为 NaN 的列;
>>> bikes.dropna(axis=1, how='all').head()
            Berri 1  Côte-Sainte-Catherine  Maisonneuve 1  Maisonneuve 2  \
Date
2012-01-01       35                      0             38             51
2012-01-02       83                      1             68            153
2012-01-03      135                      2            104            248
2012-01-04      144                      1            116            318
2012-01-05      197                      2            124            330

            du Parc  Pierre-Dupuy  Rachel1
Date
2012-01-01       26            10       16
2012-01-02       53             6       43
2012-01-03       89             3       58
2012-01-04      111             8       61
2012-01-05       97            13       95

3. fillna

  1. 填充缺失的数据(单行);
>>> row = bikes.iloc[0].copy()
>>> print(row)
Berri 1                                35.0
Brébeuf (données non disponibles)       NaN
Côte-Sainte-Catherine                   0.0
Maisonneuve 1                          38.0
Maisonneuve 2                          51.0
du Parc                                26.0
Pierre-Dupuy                           10.0
Rachel1                                16.0
St-Urbain (données non disponibles)     NaN
Name: 2012-01-01 00:00:00, dtype: float64

>>> # 平均值
>>> print(row.mean())
25.142857142857142

>>> print(row.fillna(row.mean()))
Berri 1                                35.000000
Brébeuf (données non disponibles)      25.142857
Côte-Sainte-Catherine                   0.000000
Maisonneuve 1                          38.000000
Maisonneuve 2                          51.000000
du Parc                                26.000000
Pierre-Dupuy                           10.000000
Rachel1                                16.000000
St-Urbain (données non disponibles)    25.142857
Name: 2012-01-01 00:00:00, dtype: float64
  1. 填充缺失数据(全量);
>>> # 求所有行的行平均值;
>>> m = bikes.mean(axis=1)
>>> print(m)
Date
2012-01-01      25.142857
2012-01-02      58.142857
2012-01-03      91.285714
2012-01-04     108.428571
2012-01-05     122.571429
                 ...
2012-11-01    1871.142857
2012-11-02    1297.428571
2012-11-03     727.571429
2012-11-04     808.857143
2012-11-05    1870.714286
Length: 310, dtype: float64

>>> # 将行中缺失的部分用其行均值填充:遍历各列,将各列的所有行中为 NaN 的部分用 m 的对应位填充
>>> for i, col in enumerate(bikes):
>>>     bikes.iloc[:, i] = bikes.iloc[:, i].fillna(m)
>>>     print(i, col)
0 Berri 1
1 Brébeuf (données non disponibles)
2 Côte-Sainte-Catherine
3 Maisonneuve 1
4 Maisonneuve 2
5 du Parc
6 Pierre-Dupuy
7 Rachel1
8 St-Urbain (données non disponibles)

>>> bikes.head()
            Berri 1  Brébeuf (données non disponibles)  Côte-Sainte-Catherine  \
Date
2012-01-01       35                          25.142857                      0
2012-01-02       83                          58.142857                      1
2012-01-03      135                          91.285714                      2
2012-01-04      144                         108.428571                      1
2012-01-05      197                         122.571429                      2

            Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1  \
Date
2012-01-01             38             51       26            10       16
2012-01-02             68            153       53             6       43
2012-01-03            104            248       89             3       58
2012-01-04            116            318      111             8       61
2012-01-05            124            330       97            13       95

            St-Urbain (données non disponibles)
Date
2012-01-01                            25.142857
2012-01-02                            58.142857
2012-01-03                            91.285714
2012-01-04                           108.428571
2012-01-05                           122.571429

4. 计算单条路线每个工作日的骑车人数之和

  1. 新增一个 weekday 列;
>>> berri_bikes.loc[:, 'weekday'] = berri_bikes.index.weekday
>>> berri_bikes[:5]
            Berri 1  weekday
Date
2012-01-01       35        6
2012-01-02       83        0
2012-01-03      135        1
2012-01-04      144        2
2012-01-05      197        3
  1. 按 weekday 分组求和;
>>> weekday_counts = berri_bikes.groupby('weekday').aggregate(sum)
>>> weekday_counts
         Berri 1
weekday
0         134298
1         135305
2         152972
3         160131
4         141771
5         101578
6          99310
  1. 使用星期名词替换默认 index;
>>> weekday_counts.index = ['Monday', 'Tuesday', 'Wednesday',
>>>                         'Thursday', 'Friday', 'Saturday', 'Sunday']
>>> weekday_counts
           Berri 1
Monday      134298
Tuesday     135305
Wednesday   152972
Thursday    160131
Friday      141771
Saturday    101578
Sunday       99310

5. 计算所有路线每个工作日的骑车人数之和

  1. 按日期求和(所有路线每天的人数之和);
>>> bikes = pd.read_csv('data/bikes.csv', sep=';',
>>>                     parse_dates=['Date'], encoding='latin1', dayfirst=True, index_col='Date')
>>> bikes_sum = bikes.sum(axis=1).to_frame()
>>> print(bikes_sum.head())
                0
Date
2012-01-01  176.0
2012-01-02  407.0
2012-01-03  639.0
2012-01-04  759.0
2012-01-05  858.0
  1. 增加 weekday 列;
>>> print(bikes_sum.index)
DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03', '2012-01-04',
               '2012-01-05', '2012-01-06', '2012-01-07', '2012-01-08',
               '2012-01-09', '2012-01-10',
               ...
               '2012-10-27', '2012-10-28', '2012-10-29', '2012-10-30',
               '2012-10-31', '2012-11-01', '2012-11-02', '2012-11-03',
               '2012-11-04', '2012-11-05'],
              dtype='datetime64[ns]', name='Date', length=310, freq=None)

>>> bikes_sum.loc[:, 'weekday'] = bikes_sum.index.weekday
>>> bikes_sum.head()
                0  weekday
Date
2012-01-01  176.0        6
2012-01-02  407.0        0
2012-01-03  639.0        1
2012-01-04  759.0        2
2012-01-05  858.0        3
  1. 按工作日分钟求和(每个工作日骑车的人数之和);
>>> weekday_counts = bikes_sum.groupby('weekday').aggregate(sum)
>>> weekday_counts.index = ['Monday', 'Tuesday', 'Wednesday',
>>>                         'Thursday', 'Friday', 'Saturday', 'Sunday']
>>> weekday_counts
                  0
Monday     714963.0
Tuesday    698582.0
Wednesday  789722.0
Thursday   829069.0
Friday     738772.0
Saturday   516701.0
Sunday     518047.0

PS:欢迎各路道友阅读评论,感谢道友点赞关注收藏