Preface
I recently ran into a task at work that required filtering CDN logs for statistics: traffic, status codes, and the top IPs, URLs, UAs, referers, and so on. I used to do this with bash shell scripts, but once the logs grow large — files of several GB with tens of millions of lines — shell starts to fall short and processing takes too long. So I looked into pandas, Python's data-processing library. A log of ten million lines finishes in about 40 seconds.
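The core idea is to read the log in chunks rather than loading the whole file at once. A minimal sketch of that pattern, using an in-memory string in place of a real log file:

```python
import pandas as pd
from io import StringIO

# A few sample space-separated rows standing in for a large log file
data = StringIO("a 1\nb 2\na 3\nb 4\na 5\n")

# Read in chunks instead of all at once, then concatenate into one DataFrame
reader = pd.read_csv(data, sep=" ", names=["key", "val"], chunksize=2)
df = pd.concat(reader, ignore_index=True)

print(len(df))          # 5 rows total
print(df["val"].sum())  # 15
```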
Code
#!/usr/bin/python
# -*- coding: utf-8 -*-
# sudo pip install pandas
__author__ = 'Loya Chen'

import sys
import pandas as pd
from collections import OrderedDict

"""
Description: This script is used to analyse qiniu cdn log.
================================================================================
Log format
IP - ResponseTime [time +0800] "Method URL HTTP/1.1" code size "referer" "UA"
================================================================================
Sample log entry
[0]            [1][2] [3]                        [4]  [5]
101.226.66.179 - 68 [16/Nov/2016:04:36:40 +0800] "GET http://www.qn.com/1.jpg -"
[6] [7] [8] [9]
200 502 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
================================================================================
"""

if len(sys.argv) != 2:
    print('Usage:', sys.argv[0], 'file_of_log')
    exit()
else:
    log_file = sys.argv[1]

# Positions of the fields to be analysed
ip = 0
url = 5
status_code = 6
size = 7
referer = 8
ua = 9

# Read the log into a DataFrame, chunk by chunk
reader = pd.read_table(log_file, sep=' ', names=[i for i in range(10)], iterator=True)
loop = True
chunkSize = 10000000
chunks = []
while loop:
    try:
        chunk = reader.get_chunk(chunkSize)
        chunks.append(chunk)
    except StopIteration:
        # Iteration is stopped
        loop = False
df = pd.concat(chunks, ignore_index=True)

byte_sum = df[size].sum()                                        # total traffic
top_status_code = pd.DataFrame(df[status_code].value_counts())   # status-code counts
top_ip = df[ip].value_counts().head(10)                          # top IPs
top_referer = df[referer].value_counts().head(10)                # top referers
top_ua = df[ua].value_counts().head(10)                          # top user agents
top_status_code['percent'] = pd.DataFrame(top_status_code / top_status_code.sum() * 100)
top_url = df[url].value_counts().head(10)                        # top URLs
top_url_byte = df[[url, size]].groupby(url).sum().apply(lambda x: x.astype(float) / 1024 / 1024) \
    .round(decimals=3).sort_values(by=[size], ascending=False)[size].head(10)  # URLs with the most traffic
top_ip_byte = df[[ip, size]].groupby(ip).sum().apply(lambda x: x.astype(float) / 1024 / 1024) \
    .round(decimals=3).sort_values(by=[size], ascending=False)[size].head(10)  # IPs with the most traffic

# Store the results in an ordered dict
result = OrderedDict([("Total traffic [GB]:", byte_sum / 1024 / 1024 / 1024),
                      ("Status codes [count | percent]:", top_status_code),
                      ("IP TOP 10:", top_ip),
                      ("Referer TOP 10:", top_referer),
                      ("UA TOP 10:", top_ua),
                      ("URL TOP 10:", top_url),
                      ("URLs with the most traffic TOP 10 [MB]:", top_url_byte),
                      ("IPs with the most traffic TOP 10 [MB]:", top_ip_byte)
                      ])

# Print the results
for k, v in result.items():
    print(k)
    print(v)
    print('=' * 80)
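To see the statistics logic of the script in isolation, here is a hypothetical toy run on a hand-built DataFrame (the column names are illustrative; the script above uses numeric field positions instead):

```python
import pandas as pd

# A tiny stand-in for the parsed log: one column per field of interest
df = pd.DataFrame({
    "ip":     ["1.1.1.1", "2.2.2.2", "1.1.1.1", "3.3.3.3"],
    "status": [200, 404, 200, 200],
    "size":   [500, 100, 700, 300],
})

top_ip = df["ip"].value_counts().head(10)      # request count per IP, descending
status = df["status"].value_counts()           # status-code distribution
bytes_per_ip = df.groupby("ip")["size"].sum()  # total traffic per IP

print(top_ip.index[0])           # "1.1.1.1" appears most often
print(bytes_per_ip["1.1.1.1"])   # 500 + 700 = 1200
```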
pandas study notes
pandas has two fundamental data structures, Series and DataFrame. A Series is a one-dimensional, array-like object consisting of a set of values and an index. A DataFrame is a tabular data structure with both a row index and a column index.
from pandas import Series, DataFrame
import pandas as pd
Series
In [1]: obj = Series([4, 7, -5, 3])
In [2]: obj
Out[2]:
0    4
1    7
2   -5
3    3
A Series prints with the index on the left and the values on the right. When no index is specified, an integer index from 0 to N-1 (N being the length of the data) is created automatically. The values and index attributes return the array representation and the index object of a Series:
In [3]: obj.values
Out[3]: array([ 4,  7, -5,  3])
In [4]: obj.index
Out[4]: RangeIndex(start=0, stop=4, step=1)
An index is usually specified when creating a Series:
In [5]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
In [6]: obj2
Out[6]:
d    4
b    7
a   -5
c    3
Single values or sets of values can be retrieved from a Series by index:
In [7]: obj2['a']
Out[7]: -5
In [8]: obj2[['c', 'd']]
Out[8]:
c    3
d    4
Sorting
In [9]: obj2.sort_index()
Out[9]:
a   -5
b    7
c    3
d    4
In [10]: obj2.sort_values()
Out[10]:
a   -5
c    3
d    4
b    7
Filtering and arithmetic
In [11]: obj2[obj2 > 0]
Out[11]:
d    4
b    7
c    3
In [12]: obj2 * 2
Out[12]:
d     8
b    14
a   -10
c     6
Membership
In [13]: 'b' in obj2
Out[13]: True
In [14]: 'e' in obj2
Out[14]: False
Creating a Series from a dict
In [15]: sdata = {'Shanghai': 35000, 'Beijing': 40000, 'Nanjing': 26000, 'Hangzhou': 30000}
In [16]: obj3 = Series(sdata)
In [17]: obj3
Out[17]:
Beijing     40000
Hangzhou    30000
Nanjing     26000
Shanghai    35000
If only a dict is passed in, the resulting Series uses the dict's keys as its index (in sorted order):
In [18]: states = ['Beijing', 'Hangzhou', 'Shanghai', 'Suzhou']
In [19]: obj4 = Series(sdata, index=states)
In [20]: obj4
Out[20]:
Beijing     40000.0
Hangzhou    30000.0
Shanghai    35000.0
Suzhou          NaN
When an index is specified, the three values in sdata that match the states index are found and placed in the corresponding positions; since sdata has no value for 'Suzhou', its result is NaN (not a number), which pandas uses to mark missing or NA values.
pandas' isnull and notnull functions detect missing data:
In [21]: pd.isnull(obj4)
Out[21]:
Beijing     False
Hangzhou    False
Shanghai    False
Suzhou       True
In [22]: pd.notnull(obj4)
Out[22]:
Beijing      True
Hangzhou     True
Shanghai     True
Suzhou      False
Series has equivalent instance methods:
In [23]: obj4.isnull()
Out[23]:
Beijing     False
Hangzhou    False
Shanghai    False
Suzhou       True
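Beyond detecting NaN, missing entries can also be dropped or filled in. A small sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([40000.0, 30000.0, 35000.0, np.nan],
              index=["Beijing", "Hangzhou", "Shanghai", "Suzhou"])

print(s.dropna())   # removes the Suzhou entry entirely
print(s.fillna(0))  # keeps Suzhou, replacing NaN with 0
```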
An important Series feature is that it automatically aligns data with differing indexes in arithmetic operations:
In [24]: obj3
Out[24]:
Beijing     40000
Hangzhou    30000
Nanjing     26000
Shanghai    35000
In [25]: obj4
Out[25]:
Beijing     40000.0
Hangzhou    30000.0
Shanghai    35000.0
Suzhou          NaN
In [26]: obj3 + obj4
Out[26]:
Beijing     80000.0
Hangzhou    60000.0
Nanjing         NaN
Shanghai    70000.0
Suzhou          NaN
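If the NaN values produced by alignment are unwanted, the `add` method with `fill_value` substitutes 0 for a label that is present on only one side (labels that are missing or NaN on both sides remain NaN). A brief sketch:

```python
import pandas as pd
import numpy as np

obj3 = pd.Series({"Shanghai": 35000, "Beijing": 40000,
                  "Nanjing": 26000, "Hangzhou": 30000})
obj4 = pd.Series({"Beijing": 40000.0, "Hangzhou": 30000.0,
                  "Shanghai": 35000.0, "Suzhou": np.nan})

# Nanjing is missing from obj4, so its obj4 side is filled with 0
total = obj3.add(obj4, fill_value=0)
print(total["Nanjing"])  # 26000.0 rather than NaN
```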
A Series's index can be modified in place by assignment:
In [27]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
In [28]: obj
Out[28]:
Bob      4
Steve    7
Jeff    -5
Ryan     3
DataFrame
Reading a file with pandas
In [29]: df = pd.read_table('pandas_test.txt', sep=' ', names=['name', 'age'])
In [30]: df
Out[30]:
    name  age
0    Bob   26
1   Loya   22
2  Denny   20
3   Mars   25
DataFrame column selection
df['name']
In [31]: df['name']
Out[31]:
0      Bob
1     Loya
2    Denny
3     Mars
Name: name, dtype: object
DataFrame row selection
df.iloc[0, :]  # first argument selects rows, second selects columns; here: row 0, all columns
df.iloc[:, 0]  # all rows, column 0
In [32]: df.iloc[0, :]
Out[32]:
name    Bob
age      26
Name: 0, dtype: object
In [33]: df.iloc[:, 0]
Out[33]:
0      Bob
1     Loya
2    Denny
3     Mars
Name: name, dtype: object
A single element can be fetched with iloc; iat is faster:
In [34]: df.iloc[1, 1]
Out[34]: 22
In [35]: df.iat[1, 1]
Out[35]: 22
DataFrame block selection
In [36]: df.loc[1:2, ['name', 'age']]
Out[36]:
    name  age
1   Loya   22
2  Denny   20
Filtering rows by condition
Rows are filtered by placing a boolean condition inside the square brackets; the condition must evaluate to True or False for each row:
In [37]: df[(df.index >= 1) & (df.index <= 3)]
Out[37]:
    name  age      city
1   Loya   22  Shanghai
2  Denny   20  Hangzhou
3   Mars   25   Nanjing
In [38]: df[df['age'] > 22]
Out[38]:
   name  age     city
0   Bob   26  Beijing
3  Mars   25  Nanjing
Adding a column
In [39]: df['city'] = ['Beijing', 'Shanghai', 'Hangzhou', 'Nanjing']
In [40]: df
Out[40]:
    name  age      city
0    Bob   26   Beijing
1   Loya   22  Shanghai
2  Denny   20  Hangzhou
3   Mars   25   Nanjing
Sorting
Sort by a specified column
In [41]: df.sort_values(by='age')
Out[41]:
    name  age      city
2  Denny   20  Hangzhou
1   Loya   22  Shanghai
3   Mars   25   Nanjing
0    Bob   26   Beijing
# import numpy to build a DataFrame
import numpy as np
In [42]: df = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])
In [43]: df
Out[43]:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
# sort by index
In [44]: df.sort_index()
Out[44]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
In [45]: df.sort_index(axis=1)
Out[45]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
# descending order
In [46]: df.sort_index(axis=1, ascending=False)
Out[46]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
Inspecting
# view the first 5 rows
df.head(5)
# view the last 5 rows
df.tail(5)
# view the column names
In [47]: df.columns
Out[47]: Index(['name', 'age', 'city'], dtype='object')
# view the current values
In [48]: df.values
Out[48]:
array([['Bob', 26, 'Beijing'],
       ['Loya', 22, 'Shanghai'],
       ['Denny', 20, 'Hangzhou'],
       ['Mars', 25, 'Nanjing']], dtype=object)
Transpose
df.T
Out[49]:
            0         1         2        3
name      Bob      Loya     Denny     Mars
age        26        22        20       25
city  Beijing  Shanghai  Hangzhou  Nanjing
Using isin
In [50]: df2 = df.copy()
In [51]: df2[df2['city'].isin(['Shanghai', 'Nanjing'])]
Out[51]:
   name  age      city
1  Loya   22  Shanghai
3  Mars   25   Nanjing
Arithmetic operations:
In [53]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    ...:                   index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
In [54]: df
Out[54]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
# sum by column
In [55]: df.sum()
Out[55]:
one    9.25
two   -5.80
# sum by row
In [56]: df.sum(axis=1)
Out[56]:
a    1.40
b    2.60
c     NaN
d   -0.55
group
Grouping refers to the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
See the Grouping section of the pandas documentation
In [57]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
    ...:                          'foo', 'bar', 'foo', 'foo'],
    ...:                    'B': ['one', 'one', 'two', 'three',
    ...:                          'two', 'two', 'one', 'three'],
    ...:                    'C': np.random.randn(8),
    ...:                    'D': np.random.randn(8)})
In [58]: df
Out[58]:
     A      B         C         D
0  foo    one -1.202872 -0.055224
1  bar    one -1.814470  2.395985
2  foo    two  1.018601  1.552825
3  bar  three -0.595447  0.166599
4  foo    two  1.395433  0.047609
5  bar    two -0.392670 -0.136473
6  foo    one  0.007207 -0.561757
7  foo  three  1.928123 -1.623033
Group, then apply the sum function:
In [59]: df.groupby('A').sum()
Out[59]:
            C        D
A
bar -2.802588  2.42611
foo  3.146492 -0.63958
In [60]: df.groupby(['A', 'B']).sum()
Out[60]:
                  C         D
A   B
bar one   -1.814470  2.395985
    three -0.595447  0.166599
    two   -0.392670 -0.136473
foo one   -1.195665 -0.616981
    three  1.928123 -1.623033
    two    2.414034  1.600434
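groupby is not limited to a single function; `agg` applies several aggregations at once. A small sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "C": [1.0, 2.0, 3.0, 4.0]})

# Compute sum, mean, and count per group in one pass
out = df.groupby("A")["C"].agg(["sum", "mean", "count"])
print(out)
# foo: sum 4.0, mean 2.0, count 2
# bar: sum 6.0, mean 3.0, count 2
```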
Summary
That covers analysing CDN logs with Python's pandas library. I hope it helps with your study or work; if you have questions, feel free to leave a comment.
Original post: http://qingkang.me/python-pandas.html