用Pandas完成Excel中常见的任务

转自：http://python.jobbole.com/80870/ 本文由伯乐在线 - 艾凌风翻译，Daetalus 校稿。未经许可，禁止转载！
英文出处：pbpython.com。欢迎加入翻译组。

引言

本文的目的，是向您展示如何使用pandas 来执行一些常见的Excel任务。有些例子比较琐碎，但我觉得展示这些简单的东西与那些你可以在其他地方找到的复杂功能同等重要。作为额外的福利，我将会进行一些模糊字符串匹配，以此来展示一些小花样，以及展示pandas是如何利用完整的Python模块系统去做一些在Python中是简单，但在Excel中却很复杂的事情的。

有道理吧？让我们开始吧。

为某行添加求和项

我要介绍的第一项任务是把某几列相加然后添加一个总和栏。

首先我们将excel 数据导入到pandas数据框架中。

importpandasaspd
importnumpyasnp
df=pd.read_excel("excel-comp-data.xlsx")
df.head()

我们想要添加一个总和栏来显示Jan、Feb和Mar三个月的销售总额。

在Excel和pandas中这都是简单直接的。对于Excel，我在J列中添加了公式sum(G2:I2)。在Excel中看上去是这样的：

用Pandas完成Excel中常见的任务

下面，我们是这样在pandas中操作的：

df["total"]=df["Jan"]+df["Feb"]+df["Mar"]
df.head()

用Pandas完成Excel中常见的任务

接下来，让我们对各列计算一些汇总信息以及其他值。如下Excel表所示，我们要做这些工作：

用Pandas完成Excel中常见的任务

如你所见，我们在表示月份的列的第17行添加了SUM(G2:G16)，来取得每月的总和。

进行在pandas中进行列级别的分析很简单。下面是一些例子：

df["Jan"].sum(),df["Jan"].mean(),df["Jan"].min(),df["Jan"].max()

<span class="crayon-sy" style="font-family: inherit; white-space: pre; border: 0px; margin: 0px; padding: 0px; height: inherit; font-size: inherit !important; font-weight: inherit !important; line-height: inherit !important; color: rgb(51, 51, 51) !important; background-color: rgb(248, 248, 255);">(</span><span class="crayon-cn" style="font-family: inherit; white-space: pre; border: 0px; margin: 0px; padding: 0px; height: inherit; font-size: inherit !important; font-weight: inherit !important; line-height: inherit !important; color: rgb(0, 153, 153) !important; background-color: rgb(248, 248, 255);">1462000</span><span class="crayon-sy" style="font-family: inherit; white-space: pre; border: 0px; margin: 0px; padding: 0px; height: inherit; font-size: inherit !important; font-weight: inherit !important; line-height: inherit !important; color: rgb(51, 51, 51) !important; background-color: rgb(248, 248, 255);">,</span><span class="crayon-h" style="font-family: inherit; white-space: pre; border: 0px; margin: 0px; padding: 0px; height: inherit; font-size: inherit !important; font-weight: inherit !important; line-height: inherit !important; color: rgb(0, 111, 224) !important; background-color: rgb(248, 248, 255);"></span><span class="crayon-cn" style="font-family: inherit; white-space: pre; border: 0px; margin: 0px; padding: 0px; height: inherit; font-size: inherit !important; font-weight: inherit !important; line-height: inherit !important; color: rgb(0, 153, 153) !important; background-color: rgb(248, 248, 255);">97466.666666666672</span><span class="crayon-sy" style="font-family: inherit; white-space: pre; border: 0px; margin: 0px; padding: 0px; height: inherit; font-size: inherit !important; font-weight: inherit !important; line-height: inherit !important; color: rgb(51, 51, 51) !important; background-color: rgb(248, 248, 255);">,</span><span class="crayon-h" style="font-family: inherit; white-space: pre; border: 0px; margin: 0px; padding: 0px; height: inherit; font-size: inherit !important; font-weight: inherit !important; line-height: inherit !important; color: rgb(0, 111, 224) !important; background-color: rgb(248, 248, 255);"></span><span class="crayon-cn" style="font-family: inherit; white-space: pre; border: 0px; margin: 0px; padding: 0px; height: inherit; font-size: inherit !important; font-weight: inherit !important; line-height: inherit !important; color: rgb(0, 153, 153) !important; background-color: rgb(248, 248, 255);">10000</span><span class="crayon-sy" style="font-family: inherit; white-space: pre; border: 0px; margin: 0px; padding: 0px; height: inherit; font-size: inherit !important; font-weight: inherit !important; line-height: inherit !important; color: rgb(51, 51, 51) !important; background-color: rgb(248, 248, 255);">,</span><span class="crayon-h" style="font-family: inherit; white-space: pre; border: 0px; margin: 0px; padding: 0px; height: inherit; font-size: inherit !important; font-weight: inherit !important; line-height: inherit !important; color: rgb(0, 111, 224) !important; background-color: rgb(248, 248, 255);"></span><span class="crayon-cn" style="font-family: inherit; white-space: pre; border: 0px; margin: 0px; padding: 0px; height: inherit; font-size: inherit !important; font-weight: inherit !important; line-height: inherit !important; color: rgb(0, 153, 153) !important; background-color: rgb(248, 248, 255);">162000</span><span class="crayon-sy" style="font-family: inherit; white-space: pre; border: 0px; margin: 0px; padding: 0px; height: inherit; font-size: inherit !important; font-weight: inherit !important; line-height: inherit !important; color: rgb(51, 51, 51) !important; background-color: rgb(248, 248, 255);">)</span>

现在我们要把每月的总和相加得到它们的和。这里pandas和Excel有点不同。在Excel的单元格里把每个月的总和相加很简单。由于pandas需要维护整个DataFrame的完整性，所以需要一些额外的步骤。

首先，建立所有列的总和栏

sum_row=df[["Jan","Feb","Mar","total"]].sum()
sum_row

Jan      1462000Feb      1507000Mar      717000total    3686000dtype:int64

<span style="font-family: 'Microsoft YaHei', 宋体, 'Myriad Pro', Lato, 'Helvetica Neue', Helvetica, Arial, sans-serif; background-color: rgb(255, 255, 255);">这很符合直觉，不过如果你希望将总和值显示为表格中的单独一行，你还需要做一些微调。</span>

我们需要把数据进行变换，把这一系列数字转换为DataFrame，这样才能更加容易的把它合并进已经存在的数据中。T 函数可以让我们把按行排列的数据变换为按列排列。

df_sum=pd.DataFrame(data=sum_row).T
df_sum

用Pandas完成Excel中常见的任务

在计算总和之前我们要做的最后一件事情是添加丢失的列。我们使用reindex来帮助我们完成。技巧是添加全部的列然后让pandas去添加所有缺失的数据。

df_sum=df_sum.reindex(columns=df.columns)
df_sum

用Pandas完成Excel中常见的任务

现在我们已经有了一个格式良好的DataFrame，我们可以使用append来把它加入到已有的内容中。

df_final=df.append(df_sum,ignore_index=True)
df_final.tail()

用Pandas完成Excel中常见的任务

额外的数据变换

另外一个例子，让我们尝试给数据集添加状态的缩写。

对于Excel，最简单的方式是添加一个新的列，对州名使用vlookup函数并填充缩写栏。

我进行了这样的操作，下面是其结果的截图：

用Pandas完成Excel中常见的任务

你可以注意到，在进行了vlookup后，有一些数值并没有被正确的取得。这是因为我们拼错了一些州的名字。在Excel中处理这一问题是一个巨大的挑战（对于大型数据集而言）

幸运的是，使用pandas我们可以利用强大的python生态系统。考虑如何解决这类麻烦的数据问题，我考虑进行一些模糊文本匹配来决定正确的值。

幸运的是其他人已经做了很多这方面的工作。fuzzy wuzzy库包含一些非常有用的函数来解决这类问题。首先要确保你安装了他。

我们需要的另外一段代码是州名与其缩写的映射表。而不是亲自去输入它们，谷歌一下你就能找到这段代码code。

首先导入合适的fuzzywuzzy函数并且定义我们的州名映射表。

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
state_to_code={"VERMONT":"VT","GEORGIA":"GA","IOWA":"IA","Armed Forces Pacific":"AP","GUAM":"GU",
                "KANSAS":"KS","FLORIDA":"FL","AMERICAN SAMOA":"AS","NORTH CAROLINA":"NC","HAWAII":"HI",
                "NEW YORK":"NY","CALIFORNIA":"CA","ALABAMA":"AL","IDAHO":"ID","FEDERATED STATES OF MICRONESIA":"FM",
                "Armed Forces Americas":"AA","DELAWARE":"DE","ALASKA":"AK","ILLINOIS":"IL",
                "Armed Forces Africa":"AE","SOUTH DAKOTA":"SD","CONNECTICUT":"CT","MONTANA":"MT","MASSACHUSETTS":"MA",
                "PUERTO RICO":"PR","Armed Forces Canada":"AE","NEW HAMPSHIRE":"NH","MARYLAND":"MD","NEW MEXICO":"NM",
                "MISSISSIPPI":"MS","TENNESSEE":"TN","PALAU":"PW","COLORADO":"CO","Armed Forces Middle East":"AE",
                "NEW JERSEY":"NJ","UTAH":"UT","MICHIGAN":"MI","WEST VIRGINIA":"WV","WASHINGTON":"WA",
                "MINNESOTA":"MN","OREGON":"OR","VIRGINIA":"VA","VIRGIN ISLANDS":"VI","MARSHALL ISLANDS":"MH",
                "WYOMING":"WY","OHIO":"OH","SOUTH CAROLINA":"SC","INDIANA":"IN","NEVADA":"NV","LOUISIANA":"LA",
                "NORTHERN MARIANA ISLANDS":"MP","NEBRASKA":"NE","ARIZONA":"AZ","WISCONSIN":"WI","NORTH DAKOTA":"ND",
                "Armed Forces Europe":"AE","PENNSYLVANIA":"PA","OKLAHOMA":"OK","KENTUCKY":"KY","RHODE ISLAND":"RI",
                "DISTRICT OF COLUMBIA":"DC","ARKANSAS":"AR","MISSOURI":"MO","TEXAS":"TX","MAINE":"ME"}

这里有些介绍模糊文本匹配函数如何工作的例子。

process.extractOne("Minnesotta",choices=state_to_code.keys())

('MINNESOTA', 95)

process.extractOne("AlaBAMMazzz",choices=state_to_code.keys(),score_cutoff=80)

现在我知道它是如何工作的了，我们创建自己的函数来接受州名这一列的数据然后把他转换为一个有效的缩写。这里我们使用score_cutoff的值为80。你可以做一些调整，看看哪个值对你的数据来说比较好。你会注意到，返回值要么是一个有效的缩写，要么是一个np.nan 所以域中会有一些有效的值。

def convert_state(row):
    abbrev=process.extractOne(row["state"],choices=state_to_code.keys(),score_cutoff=80)
    if abbrev:
        return state_to_code[abbrev[0]]
    return np.nan

把这列添加到我们想要填充的单元格，然后用NaN填充它

df_final.insert(6,"abbrev",np.nan)
df_final.head()

用Pandas完成Excel中常见的任务

我们使用apply 来把缩写添加到合适的列中。

df_final['abbrev']=df_final.apply(convert_state,axis=1)
df_final.tail()

用Pandas完成Excel中常见的任务我觉的这很酷。我们已经开发出了一个非常简单的流程来智能的清理数据。显然，当你只有15行左右数据的时候这没什么了不起的。但是如果是15000行呢？在Excel中你就必须进行一些人工清理了。

分类汇总

在本文的最后一节中，让我们按州来做一些分类汇总（subtotal）。

在Excel中，我们会用subtotal 工具来完成。
用Pandas完成Excel中常见的任务

输出如下：

用Pandas完成Excel中常见的任务
在pandas中创建分类汇总，是使用groupby 来完成的。

df_sub=df_final[["abbrev","Jan","Feb","Mar","total"]].groupby('abbrev').sum()
df_sub

用Pandas完成Excel中常见的任务

然后，我们想要通过对data frame中所有的值使用 applymap 来把数据单位格式化为货币。

def money(x):
    return"${:,.0f}".format(x)

formatted_df=df_sub.applymap(money)
formatted_df

用Pandas完成Excel中常见的任务

格式化看上去进行的很顺利，现在我们可以像之前那样获取总和了。

sum_row=df_sub[["Jan","Feb","Mar","total"]].sum()
sum_row

Jan      1462000
Feb      1507000
Mar      717000
total    3686000
dtype:int64

把值变换为列然后进行格式化。

df_sub_sum=pd.DataFrame(data=sum_row).T
df_sub_sum=df_sub_sum.applymap(money)
df_sub_sum

用Pandas完成Excel中常见的任务

最后，把总和添加到DataFrame中。

final_table=formatted_df.append(df_sub_sum)
final_table

用Pandas完成Excel中常见的任务

你可以注意到总和行的索引号是‘0’。我们想要使用rename 来重命名它。

final_table=final_table.rename(index={0:"Total"})
final_table

结论

到目前为止，大部分人都已经知道使用pandas可以对数据做很多复杂的操作——就如同Excel一样。因为我一直在学习pandas，但我发现我还是会尝试记忆我是如何在Excel中完成这些操作的而不是在pandas中。我意识到把它俩作对比似乎不是很公平——它们是完全不同的工具。但是，我希望能接触到哪些了解Excel并且想要学习一些可以满足分析他们数据需求的其他替代工具的那些人。我希望这些例子可以帮助到其他人，让他们有信心认为他们可以使用pandas来替换他们零碎复杂的Excel，进行数据操作。

我发现这个练习会帮助我加强记忆。我希望这对你来说同样有帮助。如果你有一些其他的Excel任务想知道如何用pandas来完成它，请通过评论来告诉我，我会尽力帮助你。

本文由伯乐在线 - 艾凌风翻译，Daetalus 校稿。未经许可，禁止转载！
英文出处：pbpython。欢迎加入翻译组。

介绍

读者对于本系列第一篇文章的回应，让我感到很兴奋。感谢大家正面的反馈。我想把本系列继续下去，重点介绍其他的一些你经常使用Excel完成的任务，并且展示给你如何在pandas中使用相同的功能。

在第一篇文章中，我着重介绍了Excel中常见的数学计算工作，以及在pandas如何完成这些工作。在本文中，我们将着重介绍一些常见的选择和筛选任务，并且介绍如何在pandas中完成同样的事情。

设置

如果您想要继续下去，您可以下载本excel文件。

导入pandas和numpy模块。

导入我们样本公司销售年销售额的Excel文件。

import pandas as pd
import numpy as np

df=pd.read_excel('sample-salesv3.xlsx')

快速浏览一下数据类型，以确保所以事情都能如预期一样运行。

df.dtypes

account number             int64name                      objectsku                       objectquantity                   int64unit price               float64ext price                float64date              datetime64[ns]dtype: object

你会注意到，我们的date列，显示的是一个通用 对象。我们准备把它转换为日期对象，来简化将来会用到的一些选择操作。

df['date']=pd.to_datetime(df['date'])

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
1	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16	2014-01-01 10:00:47
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22
4	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26	2014-01-01 23:26:55

现在，data变成了一个datetime类型的对象，这对于将来的操作是很有用的。

筛选数据

我认为在Excel中最方便的功能是筛选。我想几乎每一次有人拿到一个任意大小的Excel文件，当他们想要筛选数据的时候，都会使用这个功能。

如图，对本数据集使用该功能：

同Excel中的筛选功能一样，你可以使用pandas来筛选和选择某个特定数据的子集。

比方说，如果我们仅仅想查看一个特定的账号，我们可以简单是在Excel中完成，或是使用pandas完成操作。

下面是Excel的筛选解决方案：

在pandas中执行相关操作比Excel中更加直观。注意，我将会使用head 函数来显示前面几个结果。这仅仅是为了让本文保持简短。

df[df["account number"]==307599].head()

你还可以以数值为基准来进行筛选。我就不再举任何Excel的例子了。我相信你能明白。

df[df["quantity"]>22].head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22
14	737550	Fritsch, Russel and Anderson	B1-53102	23	71.56	1645.88	2014-01-04 08:57:48
15	239344	Stokes LLC	S1-06532	34	71.51	2431.34	2014-01-04 11:34:58

如果我们想要更多复杂的筛选，我们可以可以使用map 来以多重标准进行筛选。在这个例子中，从B1中查找以“sku”中起始的项目。

df[df["sku"].map(lambdax:x.startswith('B1'))].head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
6	218895	Kulas Inc	B1-65551	2	31.10	62.20	2014-01-02 10:57:23
14	737550	Fritsch, Russel and Anderson	B1-53102	23	71.56	1645.88	2014-01-04 08:57:48
17	239344	Stokes LLC	B1-50809	14	16.23	227.22	2014-01-04 22:14:32

把两个或更多的语句连接起来很简单，用&就可以。

df[df["sku"].map(lambdax:x.startswith('B1'))&(df["quantity"]>22)].head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
14	737550	Fritsch, Russel and Anderson	B1-53102	23	71.56	1645.88	2014-01-04 08:57:48
26	737550	Fritsch, Russel and Anderson	B1-53636	42	42.06	1766.52	2014-01-08 00:02:11
31	714466	Trantow-Barrows	B1-33087	32	19.56	625.92	2014-01-09 10:16:32

pandas支持的另外一个很有用的函数是isin。它使得我们可以定义一个列表，里面包含我们所希望查找的值

在这个例子中，我们查找包含两个特定account number值的全部项目。

df[df["accountnumber"].isin([714466,218895])].head()

	account number	name	sku	quantity	unit price	ext price	date
1	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16	2014-01-01 10:00:47
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
5	714466	Trantow-Barrows	S2-77896	17	87.63	1489.71	2014-01-02 10:07:15
6	218895	Kulas Inc	B1-65551	2	31.10	62.20	2014-01-02 10:57:23
8	714466	Trantow-Barrows	S1-50961	22	84.09	1849.98	2014-01-03 11:29:02

pandas支持的另外一个函数叫做query，它使得我们可以有效的再数据集中选择数据。使用它需要安装numexpr ，所以请确保你在进行下面步骤前已经进行了安装。

如果你想要通过名字来得到一个消费者列表，你可以使用query来完成，和前面展示的python语法类似。

df.query('name == ["Kulas Inc","Barton LLC"]').head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
6	218895	Kulas Inc	B1-65551	2	31.10	62.20	2014-01-02 10:57:23
33	218895	Kulas Inc	S1-06532	3	22.36	67.08	2014-01-09 23:58:27
36	218895	Kulas Inc	S2-34077	16	73.04	1168.64	2014-01-10 12:07:30

这里只是做个简单的示例，query函数能做到的还不止这些。我在此展示这些函数的用法，以便当你有需要的时候，会意识到可以用它。

处理日期

使用pandas，你可以对日期进行更加复杂的筛选。在我们处理日期前，我建议你把日期栏进行一个排序，以便返回的结果如你所愿。

df=df.sort('date')
df.head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
1	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16	2014-01-01 10:00:47
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22
4	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26	2014-01-01 23:26:55

在操作日期前，为您展示python的筛选语法。

df[df['date']>='20140905'].head()

	account number	name	sku	quantity	unit price	ext price	date
1042	163416	Purdy-Kunde	B1-38851	41	98.69	4046.29	2014-09-05 01:52:32
1043	714466	Trantow-Barrows	S1-30248	1	37.16	37.16	2014-09-05 06:17:19
1044	729833	Koepp Ltd	S1-65481	48	16.04	769.92	2014-09-05 08:54:41
1045	729833	Koepp Ltd	S2-11481	6	26.50	159.00	2014-09-05 16:33:15
1046	737550	Fritsch, Russel and Anderson	B1-33364	4	76.44	305.76	2014-09-06 08:59:08

pandas的一个特别棒的特性是它能够理解日期，所以它允许我们进行部分筛选。如果我只想要查看最近几个月的日期数据，我可以这样做。

df[df['date']>='2014-03'].head()

	account number	name	sku	quantity	unit price	ext price	date
242	163416	Purdy-Kunde	S1-30248	19	65.03	1235.57	2014-03-01 16:07:40
243	527099	Sanford and Sons	S2-82423	3	76.21	228.63	2014-03-01 17:18:01
244	527099	Sanford and Sons	B1-50809	8	70.78	566.24	2014-03-01 18:53:09
245	737550	Fritsch, Russel and Anderson	B1-50809	20	50.11	1002.20	2014-03-01 23:47:17
246	688981	Keeling LLC	B1-86481	-1	97.16	-97.16	2014-03-02 01:46:44

当然，你可以把筛选标准链接起来。

df[(df['date']>='20140701')&(df['date']<='20140715')].head()

	account number	name	sku	quantity	unit price	ext price	date
778	737550	Fritsch, Russel and Anderson	S1-65481	35	70.51	2467.85	2014-07-01 00:21:58
779	218895	Kulas Inc	S1-30248	9	16.56	149.04	2014-07-01 00:52:38
780	163416	Purdy-Kunde	S2-82423	44	68.27	3003.88	2014-07-01 08:15:52
781	672390	Kuhn-Gusikowski	B1-04202	48	99.39	4770.72	2014-07-01 11:12:13
782	642753	Pollich LLC	S2-23246	1	51.29	51.29	2014-07-02 04:02:39

由于pandas可以理解日期列，所以可以将日期值设为不同的格式，都会得到正确的结果。

	account number	name	sku	quantity	unit price	ext price	date
1168	307599	Kassulke, Ondricka and Metz	S2-23246	6	88.90	533.40	2014-10-08 06:19:50
1169	424914	White-Trantow	S2-10342	25	58.54	1463.50	2014-10-08 07:31:40
1170	163416	Purdy-Kunde	S1-27722	22	34.41	757.02	2014-10-08 09:01:18
1171	163416	Purdy-Kunde	B1-33087	7	79.29	555.03	2014-10-08 15:39:13
1172	672390	Kuhn-Gusikowski	B1-38851	30	94.64	2839.20	2014-10-09 00:22:33

df[df['date']>='10-10-2014'].head()

	account number	name	sku	quantity	unit price	ext price	date
1174	257198	Cronin, Oberbrunner and Spencer	S2-34077	13	12.24	159.12	2014-10-10 02:59:06
1175	740150	Barton LLC	S1-65481	28	53.00	1484.00	2014-10-10 15:08:53
1176	146832	Kiehn-Spinka	S1-27722	15	64.39	965.85	2014-10-10 18:24:01
1177	257198	Cronin, Oberbrunner and Spencer	S2-16558	3	35.34	106.02	2014-10-11 01:48:13
1178	737550	Fritsch, Russel and Anderson	B1-53636	10	56.95	569.50	2014-10-11 10:25:53

当操作时间序列数据时，如果你把数据进行转化，以日期作为索引，我们可以做一些变相的筛选。

使用set_index 来设置新的索引。

df2=df.set_index(['date'])
df2.head()

	account number	name	sku	quantity	unit price	ext price
date
2014-01-01 07:21:51	740150	Barton LLC	B1-20000	39	86.69	3380.91
2014-01-01 10:00:47	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16
2014-01-01 13:24:58	218895	Kulas Inc	B1-69924	23	90.70	2086.10
2014-01-01 15:05:22	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05
2014-01-01 23:26:55	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26

你可以通过切分数据来获取一段区间。

df2["20140101":"20140201"].head()

	account number	name	sku	quantity	unit price	ext price
date
2014-01-01 07:21:51	740150	Barton LLC	B1-20000	39	86.69	3380.91
2014-01-01 10:00:47	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16
2014-01-01 13:24:58	218895	Kulas Inc	B1-69924	23	90.70	2086.10
2014-01-01 15:05:22	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05
2014-01-01 23:26:55	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26

再一次的，我们可以使用不同的日期表示方法来避免模棱两可的日期命名惯例。

df2["2014-Jan-1":"2014-Feb-1"].head()

	account number	name	sku	quantity	unit price	ext price
date
2014-01-01 07:21:51	740150	Barton LLC	B1-20000	39	86.69	3380.91
2014-01-01 10:00:47	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16
2014-01-01 13:24:58	218895	Kulas Inc	B1-69924	23	90.70	2086.10
2014-01-01 15:05:22	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05
2014-01-01 23:26:55	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26

df2["2014-Jan-1":"2014-Feb-1"].tail()

	account number	name	sku	quantity	unit price	ext price
date
2014-01-31 22:51:18	383080	Will LLC	B1-05914	43	80.17	3447.31
2014-02-01 09:04:59	383080	Will LLC	B1-20000	7	33.69	235.83
2014-02-01 11:51:46	412290	Jerde-Hilpert	S1-27722	11	21.12	232.32
2014-02-01 17:24:32	412290	Jerde-Hilpert	B1-86481	3	35.99	107.97
2014-02-01 19:56:48	412290	Jerde-Hilpert	B1-20000	23	78.90	1814.70

df2["2014"].head()

	account number	name	sku	quantity	unit price	ext price
date
2014-01-01 07:21:51	740150	Barton LLC	B1-20000	39	86.69	3380.91
2014-01-01 10:00:47	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16
2014-01-01 13:24:58	218895	Kulas Inc	B1-69924	23	90.70	2086.10
2014-01-01 15:05:22	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05
2014-01-01 23:26:55	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26

df2["2014-Dec"].head()

	account number	name	sku	quantity	unit price	ext price
date
2014-12-01 20:15:34	714466	Trantow-Barrows	S1-82801	3	77.97	233.91
2014-12-02 20:00:04	146832	Kiehn-Spinka	S2-23246	37	57.81	2138.97
2014-12-03 04:43:53	218895	Kulas Inc	S2-77896	30	77.44	2323.20
2014-12-03 06:05:43	141962	Herman LLC	B1-53102	20	26.12	522.40
2014-12-03 14:17:34	642753	Pollich LLC	B1-53636	19	71.21	1352.99

正如你所见到的那样，在进行基于日期的排序或者筛选时，可以有很多选择。

额外的字符串方法

Pandas同样已经支持了矢量字符串方法。

如果我们想识别出sku栏中包含某一特定值的全部值。我们可以使用str.contains。在这个例子中，我们已知sku总是以一种相同的方式表示，所以B1仅会出现在sku的前面。你需要理解你的数据来保证你能够得到你想要的结果。

df[df['sku'].str.contains('B1')].head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
6	218895	Kulas Inc	B1-65551	2	31.10	62.20	2014-01-02 10:57:23
14	737550	Fritsch, Russel and Anderson	B1-53102	23	71.56	1645.88	2014-01-04 08:57:48
17	239344	Stokes LLC	B1-50809	14	16.23	227.22	2014-01-04 22:14:32

我们可以把查询连接起来并且使用排序来控制数据的顺序。

df[(df['sku'].str.contains('B1-531')) & (df['quantity']>40)].sort(columns=['quantity','name'],ascending=[0,1])

	account number	name	sku	quantity	unit price	ext price	date
684	642753	Pollich LLC	B1-53102	46	26.07	1199.22	2014-06-08 19:33:33
792	688981	Keeling LLC	B1-53102	45	41.19	1853.55	2014-07-04 21:42:22
176	383080	Will LLC	B1-53102	45	89.22	4014.90	2014-02-11 04:14:09
1213	604255	Halvorson, Crona and Champlin	B1-53102	41	55.05	2257.05	2014-10-18 19:27:01
1215	307599	Kassulke, Ondricka and Metz	B1-53102	41	93.70	3841.70	2014-10-18 23:25:10
1128	714466	Trantow-Barrows	B1-53102	41	55.68	2282.88	2014-09-27 10:42:48
1001	424914	White-Trantow	B1-53102	41	81.25	3331.25	2014-08-26 11:44:30

彩蛋任务

在Excel中，我发现我自己经常会尝试从一个冗长的列表中，得到一个包含不重复项的小列表。在Excel中这件事情需要分几步来完成，但是在Pandas中却非常简单。有一种方式是使用Excel中提供的高级筛选工具来完成。

在pandas中，我们对某列使用这个unique函数来获取这个列表。

df["name"].unique()

如果我们想要包含账户号，我们可以使用 drop_duplicates。

df.drop_duplicates(subset=["account number","name"]).head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
1	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16	2014-01-01 10:00:47
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22
4	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26	2014-01-01 23:26:55

很显然我们放入了的数据超过了我们的需要，得到了一些无用的信息，因此，使用ix 来仅仅选择第一第二列。

df.drop_duplicates(subset=["account number","name"]).ix[:,[0,1]]

	account number	name
0	740150	Barton LLC
1	714466	Trantow-Barrows
2	218895	Kulas Inc
3	307599	Kassulke, Ondricka and Metz
4	412290	Jerde-Hilpert
7	729833	Koepp Ltd
9	737550	Fritsch, Russel and Anderson
10	146832	Kiehn-Spinka
11	688981	Keeling LLC
12	786968	Frami, Hills and Schmidt
15	239344	Stokes LLC
16	672390	Kuhn-Gusikowski
18	141962	Herman LLC
20	424914	White-Trantow
21	527099	Sanford and Sons
30	642753	Pollich LLC
37	383080	Will LLC
51	257198	Cronin, Oberbrunner and Spencer
67	604255	Halvorson, Crona and Champlin
106	163416	Purdy-Kunde

我认为这个记住这个单独的命令比记忆Excel的各步操作更容易。

如果你想要查看我的笔记请随意下载。

结论

在我发表了我的第一篇文章之后，Dave Proffer在Twitter上转发了我的文章并评论到“打破你#excel沉迷的一些好技巧”。我觉得这句话非常准确，它描述了在我们的生活中使用Excel是有多么的频繁。大多数的人只管伸手去用却从来没有意识到它的局限性。我希望这个系列的文章可以帮助大家认识到我们还有其他的替代工具，Python+Pandas是一个极其强大的组合。

秒客网

用Pandas完成Excel中常见的任务

为某行添加求和项

额外的数据变换

分类汇总

结论

设置

筛选数据

处理日期

额外的字符串方法

彩蛋任务

结论

相关文章