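Learning Matplotlib: Drawing Box Plots (boxplot) with matplotlib

A box plot summarizes the distribution of a dataset through its quartiles: where the data is centered, how spread out it is, and whether it contains outliers.

Sort the data in ascending order and split it into four equal parts. The first quartile (Q1), second quartile (Q2) and third quartile (Q3) are the values at the 25%, 50% and 75% positions of the sorted data.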

I-------------I o I-------------I o I-------------I o I-------------I
                Q1                Q2                Q3
         (lower quartile)      (median)      (upper quartile)

Interquartile range (IQR) = upper quartile (Q3) - lower quartile (Q1)
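The definitions above can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using a small made-up sample (the values are illustrative, not taken from the births data). np.percentile interpolates linearly between neighbouring values by default, so other quartile conventions may give slightly different numbers.

```python
import numpy as np

# Hypothetical sample, listed in ascending order for readability
data = np.array([1, 3, 5, 7, 9, 11, 13, 15, 17])

# Values at the 25%, 50% and 75% positions of the sorted data
q1, q2, q3 = np.percentile(data, [25, 50, 75])

# Interquartile range: upper quartile minus lower quartile
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 5.0 9.0 13.0 8.0
```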

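A box plot has two parts: the box and the whiskers. The box covers the data from the first quartile to the third quartile, and the whiskers show the range of the data.

From top to bottom, the horizontal lines of a box plot are: the upper limit of the data (usually Q3 + 1.5*IQR), the third quartile (Q3), the second quartile (the median), the first quartile (Q1), and the lower limit of the data (usually Q1 - 1.5*IQR). Dots that sometimes appear beyond the upper and lower limits are outliers.

(Note: if the upper and lower limits lie beyond all of the data, the whiskers stop at the maximum and minimum values of the data instead.)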
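To make the 1.5*IQR rule concrete, here is a rough sketch that computes the limits by hand for a made-up sample and compares them with matplotlib.cbook.boxplot_stats, the helper Matplotlib provides for computing box plot statistics. The sample values and variable names are assumptions chosen for illustration. Note that the whiskers end at the most extreme data points that still lie inside the limits, not at the limits themselves.

```python
import numpy as np
from matplotlib import cbook

# Hypothetical sample: 30 lies far away from the other values
data = np.array([2, 4, 5, 6, 7, 8, 9, 10, 30])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers stop at the extreme data points that are still within the limits
inside = data[(data >= lower_limit) & (data <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points outside the limits are drawn as outliers
outliers = data[(data < lower_limit) | (data > upper_limit)]

# Matplotlib's own computation for the same sample (default whis=1.5)
stats = cbook.boxplot_stats(data)[0]

print(lower_limit, upper_limit)                # -1.0 15.0
print(whisker_low, whisker_high, outliers)     # 2 10 [30]
print(stats["whislo"], stats["whishi"], stats["fliers"])
```

In this sample the lower limit (-1.0) is below every data point, so the lower whisker simply ends at the minimum value 2, which is the situation the note above describes.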

The figure below shows what each part of a box plot means. (Source: https://datavizcatalogue.com/methods/box_plot.html)

[Figure: the parts of a box plot]

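Let's practice plotting with the data used in Jake VanderPlas's Python Data Science Handbook.

Data URL: https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv

This data file was already used in "Learning Matplotlib: Drawing Line Charts (line chart) with matplotlib", so the cleaned data from that post is used directly here: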

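import pandas as pd
from matplotlib import pyplot as plt

birth=pd.read_csv(r"https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv")
fig,ax=plt.subplots()

#keep only the first 15067 rows (copy() avoids pandas' SettingWithCopyWarning)
birth=birth.iloc[:15067].copy()
birth["day"]=birth["day"].astype(int)
#build a date column; invalid year/month/day combinations become NaT
birth["date"]=pd.to_datetime({"year":birth["year"],"month":birth["month"],"day":birth["day"]},errors='coerce')
#drop the rows whose date could not be built
birth=birth[birth["date"].notnull()]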
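As a side note on the cleaning step, the following small sketch shows what errors='coerce' does when a year/month/day combination is not a real date. The three rows are made up for illustration and are not taken from births.csv.

```python
import pandas as pd

# Hypothetical rows: the middle one is 31 February, which does not exist
sample = pd.DataFrame({"year": [1969, 1969, 1969],
                       "month": [1, 2, 3],
                       "day": [1, 31, 15]})

# With errors="coerce", invalid combinations become NaT instead of raising
sample["date"] = pd.to_datetime(
    {"year": sample["year"], "month": sample["month"], "day": sample["day"]},
    errors="coerce",
)

# Keeping only the rows with a valid date drops the impossible one
cleaned = sample[sample["date"].notnull()]
print(cleaned)
```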

Here are the first 5 rows of the cleaned births data:

       year  month  day gender  births       date
0      1969      1    1      F    4046 1969-01-01
1      1969      1    1      M    4440 1969-01-01
2      1969      1    2      F    4454 1969-01-02
3      1969      1    2      M    4548 1969-01-02
4      1969      1    3      F    4548 1969-01-03

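The data records the number of girls and boys born in the United States on each day from 1969 to 1988.

Let's draw a box plot to compare the distributions of daily female and male births in 1986, 1987 and 1988.

Box plot: ax.boxplot(x)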
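Before the full example, here is a minimal sketch of the call on its own, using randomly generated data rather than the births file (the group names, means and sizes are assumptions for illustration). Passing a list of arrays draws one box per array, and ax.boxplot returns a dictionary of the line artists it created.

```python
import numpy as np
from matplotlib import pyplot as plt

# Made-up data: two groups with different centers and spreads
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=200)
group_b = rng.normal(loc=60, scale=8, size=200)

fig, ax = plt.subplots()

# One box per array in the list; positions sets where each box sits on the x axis
artists = ax.boxplot([group_a, group_b], positions=[0, 1])
ax.set_xticklabels(["group A", "group B"])

# The returned dict groups the drawn lines by role (boxes, medians, whiskers, ...)
print(sorted(artists.keys()))
print(len(artists["boxes"]), len(artists["whiskers"]))  # 2 4
```

Call plt.show() afterwards to display the figure in an interactive session.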

The complete code is as follows:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

birth=pd.read_csv(r"https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv")
fig,ax=plt.subplots()

#keep only the first 15067 rows (copy() avoids pandas' SettingWithCopyWarning)
birth=birth.iloc[:15067].copy()
birth["day"]=birth["day"].astype(int)
birth["date"]=pd.to_datetime({"year":birth["year"],"month":birth["month"],"day":birth["day"]},errors='coerce')
birth=birth[birth["date"].notnull()]

#extract the female and male birth counts for 1986-1988 and convert them to numpy arrays
birth1986_female=np.array(birth.births[(birth["year"]==1986) & (birth["gender"]=="F")])
birth1986_male=np.array(birth.births[(birth["year"]==1986) & (birth["gender"]=="M")])
birth1987_female=np.array(birth.births[(birth["year"]==1987) & (birth["gender"]=="F")])
birth1987_male=np.array(birth.births[(birth["year"]==1987) & (birth["gender"]=="M")])
birth1988_female=np.array(birth.births[(birth["year"]==1988) & (birth["gender"]=="F")])
birth1988_male=np.array(birth.births[(birth["year"]==1988) & (birth["gender"]=="M")])

#several box plots are drawn at once, so put the arrays into a list
data=[birth1986_female,birth1986_male,birth1987_female,birth1987_male,birth1988_female,birth1988_male]

ax.boxplot(data,positions=[0,0.6,1.5,2.1,3,3.6]) #the positions parameter sets where each box is drawn
ax.set_xticklabels(["1986\nfemale","1986\nmale","1987\nfemale","1987\nmale","1988\nfemale","1988\nmale"]) #set the x-axis tick labels

plt.show()

The resulting figure:

[Figure: box plots of daily female and male births in 1986, 1987 and 1988]

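In all three years, the median number of daily births is higher for boys than for girls. The boxes are all of similar height, so the spread of the data is about the same. The boxes are also not symmetric about the median line: the median sits above the center of each box, which indicates a left-skewed distribution. Finally, the data shows no outliers.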
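The skewness reading can be illustrated with numbers. The sketch below uses a small made-up sample (not the births data) in which the median sits closer to Q3 than to Q1, and compares the two halves of the box. This is only a rough heuristic for reading a box plot, not a formal skewness test.

```python
import numpy as np

# Hypothetical left-skewed sample: a long tail toward small values
data = np.array([2, 6, 9, 11, 12, 13, 13, 14, 14])

q1, med, q3 = np.percentile(data, [25, 50, 75])

lower_half = med - q1   # height of the box below the median line
upper_half = q3 - med   # height of the box above the median line

print(q1, med, q3)              # 9.0 12.0 13.0
print(lower_half, upper_half)   # 3.0 1.0

# Median above the center of the box: the lower half is the taller one
print(lower_half > upper_half)  # True

# The mean being pulled below the median points the same way
print(data.mean() < med)        # True
```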

A box plot can also be drawn horizontally by adding the parameter vert=False to the boxplot call. The resulting figure:

[Figure: the same box plots drawn horizontally]
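A minimal sketch of the horizontal variant, again with randomly generated data standing in for the births arrays (group names and parameters are assumptions). With vert=False the groups run along the y axis, so the labels are set with set_yticklabels rather than set_xticklabels. Recent Matplotlib releases also appear to offer an orientation parameter for the same purpose, so it is worth checking the documentation for your installed version.

```python
import numpy as np
from matplotlib import pyplot as plt

# Made-up data for two groups
rng = np.random.default_rng(1)
samples = [rng.normal(50, 5, 100), rng.normal(60, 8, 100)]

fig, ax = plt.subplots()

# vert=False draws the boxes horizontally
artists = ax.boxplot(samples, vert=False)

# The groups now sit on the y axis, so label the y ticks
ax.set_yticklabels(["group A", "group B"])

# In a horizontal box plot each median is a vertical line segment:
# both of its endpoints share the same x coordinate
median_line = artists["medians"][0]
x_start, x_end = median_line.get_xdata()
print(x_start == x_end)  # True
```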
