Pandas data structures (Series, DataFrame)

Date: 2022-11-15 21:28:32

In [1]: import numpy as np

In [2]: import pandas as pd

1. Series

A Series is a one-dimensional labeled array. It can be built from an ndarray plus an index:

In [3]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [4]: s
Out[4]:
a -2.7828
b 0.4264
c -0.6505
d 1.1465
e -0.6631
dtype: float64

If no index is passed, a default integer index from 0 to n-1 is created:

In [6]: pd.Series(np.random.randn(5))
Out[6]:
0 0.2939
1 -0.4049
2 1.1665
3 0.8420
4 0.5398
dtype: float64

The data can also be a dict:

In [7]: d = {'a' : 0., 'b' : 1., 'c' : 2.}

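Note that if an index is passed, it determines the labels of the result instead of the dict keys: values are looked up in the dict by label, and any label without a matching key gets NaN.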
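
A minimal sketch of how a dict and an explicit index interact, reusing the dict d from above. The names from_dict and reindexed are illustrative and not part of the original notes.

```python
import numpy as np
import pandas as pd

d = {'a': 0., 'b': 1., 'c': 2.}

# Without an index, the dict keys become the labels.
from_dict = pd.Series(d)

# With an index, values are looked up by key in the given order;
# 'd' is not a key of the dict, so its value is NaN.
reindexed = pd.Series(d, index=['b', 'c', 'd', 'a'])

print(from_dict)
print(reindexed)
```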

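Basic Series operations follow. (s[0] and s[[4, 3, 1]] select by integer position; on a label-indexed Series, newer pandas versions prefer the explicit s.iloc[...] form for this.)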

In [11]: s[0]
Out[11]: -2.7827595933769937

In [12]: s[:3]
Out[12]:
a -2.7828
b 0.4264
c -0.6505
dtype: float64

In [13]: s[s > s.median()]
Out[13]:
b 0.4264
d 1.1465
dtype: float64

In [14]: s[[4, 3, 1]]
Out[14]:
e -0.6631
d 1.1465
b 0.4264
dtype: float64

In [15]: np.exp(s)
Out[15]:
a 0.0619
b 1.5318
c 0.5218
d 3.1472
e 0.5153
dtype: float64

A Series can also be used like a dict, with the labels as keys:

In [16]: s['a']
Out[16]: -2.7827595933769937

In [19]: 'e' in s
Out[19]: True

In [20]: 'f' in s
Out[20]: False
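
The same lookups can be written with explicit accessors, which keeps position-based and label-based access apart. This is only a sketch: the fixed values below are assumed in place of the random data so that the result is reproducible.

```python
import pandas as pd

# Fixed values stand in for the random data used in the transcript.
s = pd.Series([-2.78, 0.43, -0.65, 1.15, -0.66],
              index=['a', 'b', 'c', 'd', 'e'])

first = s.iloc[0]                  # by position, like s[0]
picked = s.iloc[[4, 3, 1]]         # by position, like s[[4, 3, 1]]
by_label = s.loc['a']              # by label, like s['a']
above_median = s[s > s.median()]   # boolean mask keeps matching rows

print(first, by_label)
print(picked)
print(above_median)
```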

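A key difference between a Series and an ndarray is that operations on Series are label-based: the operands are aligned on their labels automatically, as shown here: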

In [26]: s[1:] + s[:-1]
Out[26]:
a NaN
b 0.8529
c -1.3010
d 2.2930
e NaN
dtype: float64

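(Here s[1:] holds rows b, c, d, e and s[:-1] holds rows a, b, c, d. Labels a and e each appear on only one side, so their result is NaN.)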
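
To make the contrast with ndarray concrete, here is a small sketch with assumed values 1.0 to 5.0. The fill_value variant is an addition for illustration; it treats a label missing on one side as 0 instead of producing NaN.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=['a', 'b', 'c', 'd', 'e'])

# Series: labels are aligned first, so only 'b', 'c', 'd' have both operands.
aligned = s.iloc[1:] + s.iloc[:-1]

# ndarray: purely positional, so 'b' is added to 'a', 'c' to 'b', and so on.
positional = s.to_numpy()[1:] + s.to_numpy()[:-1]

# Series.add with fill_value substitutes 0 for the missing side.
filled = s.iloc[1:].add(s.iloc[:-1], fill_value=0)

print(aligned)
print(positional)
print(filled)
```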

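2. DataFrame

DataFrame is a two-dimensional labeled table and the most commonly used data structure in pandas.

Creating a DataFrame from a dict of Series: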

In [32]: d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
....: 'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
....:

In [33]: df = pd.DataFrame(d)

In [34]: df
Out[34]:
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
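
The row index of the result is the union of the indexes of the individual Series, which is why row d has NaN in column one. A short sketch of that behavior; the subset variable is illustrative and shows that an explicit index selects and orders the rows.

```python
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

# Row labels are the union of both Series indexes.
df = pd.DataFrame(d)

# An explicit index picks rows by label, in the order given.
subset = pd.DataFrame(d, index=['d', 'b', 'a'])

print(df)
print(subset)
```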

Creating a DataFrame from a dict of lists (all lists must have the same length):

In [39]: d = {'one' : [1., 2., 3., 4.],
....: 'two' : [4., 3., 2., 1.]}
....:

In [40]: pd.DataFrame(d)

In [41]: pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
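The values are the same as for In [40]; only the row labels change, from the default 0 to 3 to a, b, c, d.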
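
Since the transcript shows no output for these two calls, the following sketch compares them directly. The variable names are illustrative.

```python
import pandas as pd

d = {'one': [1., 2., 3., 4.],
     'two': [4., 3., 2., 1.]}

# Without an index, rows are numbered 0..3.
default_idx = pd.DataFrame(d)

# With an index, the same rows get the labels a..d.
labeled = pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

print(default_idx)
print(labeled)
```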

Creating a DataFrame from a list of dicts (a key missing from one dict becomes NaN in that row):

In [47]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [48]: pd.DataFrame(data2)
Out[48]:
a b c
0 1 2 NaN
1 5 10 20.0
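
A sketch of the same construction with optional row labels and a column subset. The index and columns arguments are used here for illustration; the original notes only show the plain call.

```python
import pandas as pd

data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

# Columns are the union of all keys; the first dict has no 'c', so that cell is NaN
# and the whole column becomes floating point.
df2 = pd.DataFrame(data2)

# Row labels can be passed explicitly.
labeled = pd.DataFrame(data2, index=['first', 'second'])

# columns= restricts (and orders) the columns that are kept.
only_ab = pd.DataFrame(data2, columns=['a', 'b'])

print(df2)
print(labeled)
print(only_ab)
```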

Creating a DataFrame of random values with a date index:

In [81]: index = pd.date_range('1/1/2000', periods=8)

In [82]: df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=list('ABC'))

In [83]: df
Out[83]:
A B C
2000-01-01 0.0627 -0.0284 0.4436
2000-01-02 -0.2688 -1.5776 1.8502
2000-01-03 0.6381 -0.5566 -0.0712
2000-01-04 -0.5114 0.1563 -1.0756
2000-01-05 1.6636 -0.4377 -0.0773
2000-01-06 0.0292 0.1790 1.7401
2000-01-07 -0.7290 -0.8980 -0.3142
2000-01-08 -0.0481 -0.8756 0.1691

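Basic DataFrame operations follow. The column examples in parts 1 to 3 use the one/two DataFrame created in In [33], not the date-indexed one.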

1. Selecting and adding columns

In [56]: df['one']
Out[56]:
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64

In [57]: df['three'] = df['one'] * df['two']

Add a boolean flag column (the comparison NaN > 2 evaluates to False):
In [58]: df['flag'] = df['one'] > 2

In [59]: df
Out[59]:
one two three flag
a 1.0 1.0 1.0 False
b 2.0 2.0 4.0 False
c 3.0 3.0 9.0 True
d NaN 4.0 NaN False
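
The steps above, collected into one runnable sketch. It rebuilds the one/two frame first so that it does not depend on earlier cells.

```python
import pandas as pd

df = pd.DataFrame({
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
})

# Selecting a column returns a Series named after the column.
one = df['one']

# Assigning to a new key adds a column; NaN propagates through the product.
df['three'] = df['one'] * df['two']

# Comparisons give a boolean column; NaN > 2 is False.
df['flag'] = df['one'] > 2

print(df)
```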

2. Deleting and popping columns

In [60]: del df['two']

In [61]: three = df.pop('three')
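
Both del and pop modify the DataFrame in place; pop additionally returns the removed column. A sketch on an assumed three-column frame, with DataFrame.drop added as the non-mutating alternative (not part of the original notes):

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2., 3.],
                   'two': [4., 5., 6.],
                   'three': [7., 8., 9.]})

# del removes the column in place and returns nothing.
del df['two']

# pop removes the column in place and returns it as a Series.
three = df.pop('three')

# drop returns a new DataFrame and leaves df untouched.
dropped = df.drop(columns=['one'])

print(df.columns.tolist())
print(three)
```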

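3. Inserting columns

insert places a new column at a given position. (The foo and one_trunc columns in the output below were added in steps that this transcript does not show.)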

In [67]: df.insert(1, 'bar', df['one'])

In [68]: df
Out[68]:
one bar flag foo one_trunc
a 1.0 1.0 False bar 1.0
b 2.0 2.0 False bar 2.0
c 3.0 3.0 True bar NaN
d NaN NaN False bar NaN
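
A sketch that reproduces the column layout of Out[68]. The two assignments that create foo and one_trunc are assumptions inferred from that output, since the transcript skips from In [61] to In [67].

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1., 2., 3., np.nan]}, index=['a', 'b', 'c', 'd'])
df['flag'] = df['one'] > 2

# Assumed step: a scalar is broadcast to every row.
df['foo'] = 'bar'

# Assumed step: a shorter Series is aligned on the index; missing rows become NaN.
df['one_trunc'] = df['one'].iloc[:2]

# insert(loc, column, value) puts the new column at position 1, in place.
df.insert(1, 'bar', df['one'])

print(df)
```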

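assign returns a new DataFrame instead of modifying the original one. (In the examples below, iris is a DataFrame with SepalLength, SepalWidth, PetalLength and PetalWidth columns; loading it is not shown in this transcript.)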

In [71]: (iris.assign(sepal_ratio = iris['SepalWidth'] / iris['SepalLength']))

In [72]: iris.assign(sepal_ratio = lambda x: (x['SepalWidth'] /
....: x['SepalLength'])).head()
....:

In [73]: (iris.query('SepalLength > 5')
....: .assign(SepalRatio = lambda x: x.SepalWidth / x.SepalLength,
....: PetalRatio = lambda x: x.PetalWidth / x.PetalLength)
....: .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
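
Because the iris data is not loaded in these notes, the following sketch uses a tiny hand-made frame with the same column names. The three rows are made-up values for illustration only, and the plotting step is left out.

```python
import pandas as pd

# Hypothetical stand-in for the iris data set (values are illustrative).
iris = pd.DataFrame({
    'SepalLength': [5.1, 4.9, 6.3],
    'SepalWidth': [3.5, 3.0, 3.3],
    'PetalLength': [1.4, 1.4, 6.0],
    'PetalWidth': [0.2, 0.2, 2.5],
})

# assign returns a copy with the extra column; iris itself is unchanged.
with_ratio = iris.assign(sepal_ratio=lambda x: x['SepalWidth'] / x['SepalLength'])

# The lambda receives the frame being assigned to, here the filtered one,
# which is what makes assign usable in the middle of a method chain.
filtered = (iris.query('SepalLength > 5')
                .assign(SepalRatio=lambda x: x.SepalWidth / x.SepalLength,
                        PetalRatio=lambda x: x.PetalWidth / x.PetalLength))

print(with_ratio)
print(filtered)
```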

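4. Arithmetic

Here df is the date-indexed DataFrame with columns A, B, C from In [82]. Arithmetic with a scalar is applied element by element.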

In [86]: df * 5 + 2

In [87]: 1 / df

In [88]: df ** 4

In [95]: df[:5].T  # transpose the first five rows
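
The transcript omits the outputs of these expressions, so here is a sketch on a small frame with assumed values where the results are easy to check by hand.

```python
import pandas as pd

df = pd.DataFrame({'A': [1., 2.], 'B': [4., 0.5]}, index=['x', 'y'])

scaled = df * 5 + 2    # element-wise: multiply, then add
inverse = 1 / df       # element-wise reciprocal
powered = df ** 4      # element-wise power
transposed = df.T      # rows and columns swapped

print(scaled)
print(inverse)
print(powered)
print(transposed)
```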

NumPy functions can be applied to a DataFrame directly:

In [96]: np.exp(df)

In [97]: np.asarray(df)
Out[97]:
array([[ 0.0627, -0.0284, 0.4436],
[-0.2688, -1.5776, 1.8502],
[ 0.6381, -0.5566, -0.0712],
[-0.5114, 0.1563, -1.0756],
[ 1.6636, -0.4377, -0.0773],
[ 0.0292, 0.179 , 1.7401],
[-0.729 , -0.898 , -0.3142],
[-0.0481, -0.8756, 0.1691]])
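
A brief sketch of the difference between the two calls, on assumed values: an element-wise NumPy function such as np.exp returns a DataFrame with the labels intact, while np.asarray returns a plain ndarray without labels. DataFrame.to_numpy is shown as an equivalent way to get the array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0., 1.], 'B': [2., 3.]}, index=['x', 'y'])

# Element-wise NumPy functions keep the DataFrame type and its labels.
exp_df = np.exp(df)

# np.asarray drops the labels and returns the underlying values.
arr = np.asarray(df)

# to_numpy gives the same array through the pandas API.
arr2 = df.to_numpy()

print(type(exp_df).__name__, exp_df.columns.tolist())
print(type(arr).__name__, arr.shape)
```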