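1. Convenient data acquisition
1.1 Local data acquisition: opening, reading/writing, and closing files (covered in a separate chapter)
1.2 Network data acquisition:
1.2.1 urllib, urllib2, httplib, httplib2 (in Python 3: urllib.request, http.client)
Regular expressions (covered in a separate chapter)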
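As a minimal sketch of how urllib.request and a regular expression can work together in Python 3, the example below reads a text resource and pulls decimal numbers out of it. The helper names fetch_text and extract_numbers are hypothetical, and the data: URL is an assumption used only so the example runs without network access; a real page URL would be passed the same way.

```python
import re
import urllib.request


def fetch_text(url, encoding="utf-8"):
    """Open a URL and return its body decoded as text (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding)


def extract_numbers(text):
    """Return every decimal number found in the text (hypothetical helper)."""
    return [float(match) for match in re.findall(r"\d+\.\d+", text)]


# Assumption: a data: URL stands in for a real web page so nothing is downloaded.
sample_url = "data:text/plain;charset=utf-8,open%3D60.37%20close%3D61.84"

page = fetch_text(sample_url)
prices = extract_numbers(page)
print(page)
print(prices)
```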
1.2.2 Fetching Yahoo Finance data with the matplotlib.finance module
In [7]: from matplotlib.finance import quotes_historical_yahoo_ochl
In [8]: from datetime import date
In [9]: from datetime import datetime
In [10]: import pandas as pd
In [11]: today = date.today()
In [12]: start = (today.year-1, today.month, today.day)
In [14]: quotes = quotes_historical_yahoo_ochl('AXP', start, today)  # fetch the data
In [15]: fields = ['date', 'open', 'close', 'high', 'low', 'volume']
In [16]: list1 = []
In [18]: for i in range(0,len(quotes)):
    ...:     x = date.fromordinal(int(quotes[i][0]))  # take the first column of each row and convert the ordinal to a date
    ...:     y = datetime.strftime(x,'%Y-%m-%d')  # format the date as YYYY-MM-DD
    ...:     list1.append(y)  # append the date string to the list
    ...:
In [19]: quotesdf = pd.DataFrame(quotes,index=list1,columns=fields)  # dates become the index, fields become the column names
In [20]: quotesdf = quotesdf.drop(['date'],axis=1)  # drop the date column
In [21]: print quotesdf
                 open      close       high        low      volume
2016-01-20  60.374146  61.835916  62.336256  60.128882   9043800.0
2016-01-21  61.806486  61.453305  63.101479  61.325767   8992300.0
2016-01-22  57.283819  54.016907  57.774347  53.114334  43783400.0
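The loop that builds the date labels can also be written as a list comprehension. The sketch below assumes quotes is a sequence of rows shaped like the output above (ordinal date first, then open, close, high, low, volume); the three rows are copied from this document instead of being downloaded, and the name labels is introduced only for illustration.

```python
from datetime import date

import pandas as pd

# Sample rows copied from the session output; a real run would get them from the data source.
quotes = [
    (735983.0, 60.374146, 61.835916, 62.336256, 60.128882, 9043800.0),
    (735984.0, 61.806486, 61.453305, 63.101479, 61.325767, 8992300.0),
    (735985.0, 57.283819, 54.016907, 57.774347, 53.114334, 43783400.0),
]
fields = ['date', 'open', 'close', 'high', 'low', 'volume']

# Convert each ordinal in the first column to a 'YYYY-MM-DD' string.
labels = [date.fromordinal(int(row[0])).strftime('%Y-%m-%d') for row in quotes]

# Use the date strings as the index, then drop the now redundant date column.
quotesdf = pd.DataFrame(quotes, index=labels, columns=fields)
quotesdf = quotesdf.drop(['date'], axis=1)
print(quotesdf)
```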
1.2.3 Fetching corpora and other data with the Natural Language Toolkit (NLTK)
1. Install nltk: pip install nltk
2. Download a corpus:
In [1]: import nltk
In [2]: nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List   u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d
Download which package (l=list; x=cancel)?
  Identifier> gutenberg
    Downloading package gutenberg to /root/nltk_data...
      Package gutenberg is already up-to-date!
3. Load the data:
In [3]: from nltk.corpus import gutenberg
In [4]: print gutenberg.fileids()
[u'austen-emma.txt', u'austen-persuasion.txt', u'austen-sense.txt', u'bible-kjv.txt', u'blake-poems.txt', u'bryant-stories.txt', u'burgess-busterbrown.txt', u'carroll-alice.txt', u'chesterton-ball.txt', u'chesterton-brown.txt', u'chesterton-thursday.txt', u'edgeworth-parents.txt', u'melville-moby_dick.txt', u'milton-paradise.txt', u'shakespeare-caesar.txt', u'shakespeare-hamlet.txt', u'shakespeare-macbeth.txt', u'whitman-leaves.txt']
In [5]: texts = gutenberg.words('shakespeare-hamlet.txt')
In [6]: texts
Out[6]: [u'[', u'The', u'Tragedie', u'of', u'Hamlet', u'by', ...]
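Once the words are loaded they can be handled like an ordinary list of strings. The sketch below assumes that behaviour and uses only the six tokens visible in Out[6] as stand-in data, so it runs without NLTK or the downloaded corpus; the names tokens, words and counts are illustrative.

```python
from collections import Counter

# Stand-in for gutenberg.words('shakespeare-hamlet.txt'): the tokens shown in Out[6].
tokens = ['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by']

# Keep alphabetic tokens only and normalise case before counting.
words = [token.lower() for token in tokens if token.isalpha()]
counts = Counter(words)

print(len(tokens), len(words))
print(counts.most_common(3))
```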
2. Data preparation and cleaning
2.1 Adding [ column ] names to the quotes data
In [79]: quotesdf = pd.DataFrame(quotes)
In [80]: quotesdf
Out[80]:
          0          1          2          3          4           5
0  735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
1  735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
2  735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
3  735988.0  53.428272  53.977664  54.713455  53.114334  18498300.0
[253 rows x 6 columns]
In [81]: fields = ['date','open','close','high','low','volume']
In [82]: quotesdf = pd.DataFrame(quotes,columns=fields)  # set the column names
In [83]: quotesdf
Out[83]:
       date       open      close       high        low      volume
0  735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
1  735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
2  735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
3  735988.0  53.428272  53.977664  54.713455  53.114334  18498300.0
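If the DataFrame already exists, the column names can also be assigned afterwards instead of rebuilding the frame. This is a minimal sketch using two rows copied from the output above as sample data.

```python
import pandas as pd

# Two sample rows copied from the session output.
quotes = [
    (735983.0, 60.374146, 61.835916, 62.336256, 60.128882, 9043800.0),
    (735984.0, 61.806486, 61.453305, 63.101479, 61.325767, 8992300.0),
]
fields = ['date', 'open', 'close', 'high', 'low', 'volume']

quotesdf = pd.DataFrame(quotes)   # columns default to 0..5
quotesdf.columns = fields         # assign the names to the existing frame
print(quotesdf.columns.tolist())
```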
2.2 Adding [ index ] labels to the quotes data
In [84]: quotesdf
Out[84]:
       date       open      close       high        low      volume
0  735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
1  735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
2  735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
[253 rows x 6 columns]
In [85]: quotesdf = pd.DataFrame(quotes, index=range(1,len(quotes)+1),columns=fields)  # change the index from 0,1,2... to 1,2,3...
In [86]: quotesdf
Out[86]:
       date       open      close       high        low      volume
1  735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
2  735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
3  735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
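The same relabelling can be applied to a frame that already exists by assigning to its index attribute. A minimal sketch, again with sample rows copied from the output above:

```python
import pandas as pd

# Three sample rows copied from the session output.
quotes = [
    (735983.0, 60.374146, 61.835916, 62.336256, 60.128882, 9043800.0),
    (735984.0, 61.806486, 61.453305, 63.101479, 61.325767, 8992300.0),
    (735985.0, 57.283819, 54.016907, 57.774347, 53.114334, 43783400.0),
]
fields = ['date', 'open', 'close', 'high', 'low', 'volume']

quotesdf = pd.DataFrame(quotes, columns=fields)   # index starts at 0
quotesdf.index = range(1, len(quotesdf) + 1)      # relabel the rows 1, 2, 3, ...
print(quotesdf.index.tolist())
```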
2.3 Date conversion: Gregorian ordinal => ordinary date string
In [88]: from datetime import date
In [89]: firstday = date.fromordinal(735190)
In [93]: firstday
Out[93]: datetime.date(2013, 11, 18)
In [95]: firstday = datetime.strftime(firstday,'%Y-%m-%d')
In [96]: firstday
Out[96]: '2013-11-18'
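The conversion also works in the other direction, which is a convenient way to check a result. The sketch below uses only the standard library and the ordinal from the session above; the variable names are illustrative.

```python
from datetime import date

ordinal = 735190                       # day count in the proleptic Gregorian calendar
firstday = date.fromordinal(ordinal)   # ordinal -> date object
text = firstday.strftime('%Y-%m-%d')   # date object -> formatted string

print(firstday, text)
print(firstday.toordinal() == ordinal)  # round trip back to the ordinal
```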
2.4 Creating a time series:
In [120]: import pandas as pd
In [121]: dates = pd.date_range('20170101', periods=7)  # build a date sequence from a start date and a length
In [122]: dates
Out[122]:
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06', '2017-01-07'],
              dtype='datetime64[ns]', freq='D')
In [123]: import numpy as np
In [124]: dates = pd.DataFrame(np.random.randn(7,3), index=dates, columns=list('ABC'))  # the date sequence is the index, A/B/C are the column names, and the body is 7 rows by 3 columns of random numbers
In [125]: dates
Out[125]:
                   A         B         C
2017-01-01  0.705927  0.311453  1.455362
2017-01-02 -0.331531 -0.358449  0.175375
2017-01-03 -0.284583 -1.760700 -0.582880
2017-01-04 -0.759392 -2.080658 -2.015328
2017-01-05 -0.517370  0.906072 -0.106568
2017-01-06 -0.252802 -2.135604 -0.692153
2017-01-07 -0.275184  0.142973 -1.262126
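With a date index in place, a row can be looked up by its date. The sketch below keeps the index and the frame under separate names (index and frame are illustrative), and the fixed random seed is an assumption added only so that repeated runs produce the same numbers.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed so the random values are repeatable

index = pd.date_range('2017-01-01', periods=7)   # seven consecutive days
frame = pd.DataFrame(np.random.randn(7, 3), index=index, columns=list('ABC'))

# Select one day's row by its timestamp label.
row = frame.loc[pd.Timestamp('2017-01-03')]
print(frame)
print(row)
```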
2.5 Exercises
In [101]: datetime.now()  # show the current date and time
Out[101]: datetime.datetime(2017, 1, 20, 16, 11, 50, 43258)
=========================================
In [108]: datetime.now().month  # show the current month
Out[108]: 1
=========================================
In [126]: import pandas as pd
In [127]: dates = pd.date_range('2015-02-01',periods=10)
In [128]: dates
Out[128]:
DatetimeIndex(['2015-02-01', '2015-02-02', '2015-02-03', '2015-02-04',
               '2015-02-05', '2015-02-06', '2015-02-07', '2015-02-08',
               '2015-02-09', '2015-02-10'],
              dtype='datetime64[ns]', freq='D')
In [133]: res = pd.DataFrame(range(1,11),index=dates,columns=['value'])
In [134]: res
Out[134]:
            value
2015-02-01      1
2015-02-02      2
2015-02-03      3
2015-02-04      4
2015-02-05      5
2015-02-06      6
2015-02-07      7
2015-02-08      8
2015-02-09      9
2015-02-10     10
3. Displaying data
3.1 Display methods:
In [180]: quotesdf2.index  # show the index
Out[180]:
Index([u'2016-01-20', u'2016-01-21', u'2016-01-22', u'2016-01-25',
       ...
       u'2017-01-11', u'2017-01-12', u'2017-01-13', u'2017-01-17',
       u'2017-01-18', u'2017-01-19'],
      dtype='object', length=253)
In [181]: quotesdf2.columns  # show the column names
Out[181]: Index([u'open', u'close', u'high', u'low', u'volume'], dtype='object')
In [182]: quotesdf2.values  # show the data values
Out[182]:
array([[ 6.03741455e+01, 6.18359160e+01, 6.23362562e+01, 6.01288817e+01, 9.04380000e+06],
       ...,
       [ 7.76100010e+01, 7.66900020e+01, 7.77799990e+01, 7.66100010e+01, 7.79110000e+06]])
In [183]: quotesdf2.describe  # show the data description (without parentheses this only returns the bound method echoed below; call quotesdf2.describe() to get summary statistics)
Out[183]:
<bound method DataFrame.describe of                  open      close       high        low      volume
2016-01-20  60.374146  61.835916  62.336256  60.128882   9043800.0
2016-01-21  61.806486  61.453305  63.101479  61.325767   8992300.0
2016-01-22  57.283819  54.016907  57.774347  53.114334  43783400.0
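Calling describe with parentheses returns a table of summary statistics instead of the method object. The sketch below assumes quotesdf2 is a date-indexed frame like the one shown above and rebuilds a three-row, two-column sample from this document's values; no statistics are quoted here because they depend on the full data set.

```python
import pandas as pd

# Three-row sample rebuilt from the values shown in this document.
quotesdf2 = pd.DataFrame(
    {
        'open': [60.374146, 61.806486, 57.283819],
        'close': [61.835916, 61.453305, 54.016907],
    },
    index=['2016-01-20', '2016-01-21', '2016-01-22'],
)

summary = quotesdf2.describe()   # note the parentheses: this calls the method
print(summary)
```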
3.2 Index format: the u prefix marks a unicode string in Python 2
3.3 Displaying rows:
In [193]: quotesdf.head(2)  # dedicated method: show the first two rows
Out[193]:
       date       open      close       high        low     volume
1  735983.0  60.374146  61.835916  62.336256  60.128882  9043800.0
2  735984.0  61.806486  61.453305  63.101479  61.325767  8992300.0
In [194]: quotesdf.tail(2)  # dedicated method: show the last two rows
Out[194]:
         date       open      close       high        low     volume
252  736347.0  77.110001  77.489998  77.610001  76.510002  5988400.0
253  736348.0  77.610001  76.690002  77.779999  76.610001  7791100.0
In [195]: quotesdf[:2]  # slicing: show the first two rows
Out[195]:
       date       open      close       high        low     volume
1  735983.0  60.374146  61.835916  62.336256  60.128882  9043800.0
2  735984.0  61.806486  61.453305  63.101479  61.325767  8992300.0
In [197]: quotesdf[251:]  # slicing: show the last two rows (slice positions start at 0, so position 251 is the row labelled 252)
Out[197]:
         date       open      close       high        low     volume
252  736347.0  77.110001  77.489998  77.610001  76.510002  5988400.0
253  736348.0  77.610001  76.690002  77.779999  76.610001  7791100.0
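Hard-coding 251 only works when the frame has exactly 253 rows. A minimal sketch of a length-independent alternative, using positional slicing with iloc on a four-row sample built from close values in this document:

```python
import pandas as pd

# Four sample close values from this document, labelled 1..4 as in section 2.2.
quotesdf = pd.DataFrame(
    {'close': [61.835916, 61.453305, 54.016907, 53.977664]},
    index=range(1, 5),
)

first_two = quotesdf.iloc[:2]    # same rows as quotesdf.head(2)
last_two = quotesdf.iloc[-2:]    # same rows as quotesdf.tail(2), whatever the length

print(first_two)
print(last_two)
```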
4. Data selection
5. Simple statistics and processing
6. Grouping
7. Merge