利用Python中的pandas库对cdn日志进行分析详解

前言

最近工作工作中遇到一个需求,是要根据CDN日志过滤一些数据,例如流量、状态码统计,TOP IP、URL、UA、Referer等。以前都是用 bash shell 实现的,但是当日志量较大,日志文件数G、行数达数千万亿级时,通过 shell 处理有些力不从心,处理时间过长。于是研究了下Python pandas这个数据处理库的使用。一千万行日志,处理完成在40s左右。

代码

?

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

70

71

#!/usr/bin/python

# -*- coding: utf-8 -*-

# sudo pip install pandas

__author__ = 'Loya Chen'

import sys

import pandas as pd

from collections import OrderedDict

"""

Description: This script is used to analyse qiniu cdn log.

================================================================================

日志格式

IP - ResponseTime [time +0800] "Method URL HTTP/1.1" code size "referer" "UA"

================================================================================

日志示例

[0] [1][2] [3] [4] [5]

101.226.66.179 - 68 [16/Nov/2016:04:36:40 +0800] "GET http://www.qn.com/1.jpg -"

[6] [7] [8] [9]

200 502 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"

================================================================================

"""

if len(sys.argv) != 2:

print('Usage:', sys.argv[0], 'file_of_log')

exit()

else:

log_file = sys.argv[1]

# 需统计字段对应的日志位置

ip = 0

url = 5

status_code = 6

size = 7

referer = 8

ua = 9

# 将日志读入DataFrame

reader = pd.read_table(log_file, sep=' ', names=[i for i in range(10)], iterator=True)

loop = True

chunkSize = 10000000

chunks = []

while loop:

try:

chunk = reader.get_chunk(chunkSize)

chunks.append(chunk)

except StopIteration:

#Iteration is stopped.

loop = False

df = pd.concat(chunks, ignore_index=True)

byte_sum = df[size].sum() #流量统计

top_status_code = pd.DataFrame(df[6].value_counts()) #状态码统计

top_ip = df[ip].value_counts().head(10) #TOP IP

top_referer = df[referer].value_counts().head(10) #TOP Referer

top_ua = df[ua].value_counts().head(10) #TOP User-Agent

top_status_code['persent'] = pd.DataFrame(top_status_code/top_status_code.sum()*100)

top_url = df[url].value_counts().head(10) #TOP URL

top_url_byte = df[[url,size]].groupby(url).sum().apply(lambda x:x.astype(float)/1024/1024) \

.round(decimals = 3).sort_values(by=[size], ascending=False)[size].head(10) #请求流量最大的URL

top_ip_byte = df[[ip,size]].groupby(ip).sum().apply(lambda x:x.astype(float)/1024/1024) \

.round(decimals = 3).sort_values(by=[size], ascending=False)[size].head(10) #请求流量最多的IP

# 将结果有序存入字典

result = OrderedDict([("流量总计[单位:GB]:" , byte_sum/1024/1024/1024),

("状态码统计[次数|百分比]:" , top_status_code),

("IP TOP 10:" , top_ip),

("Referer TOP 10:" , top_referer),

("UA TOP 10:" , top_ua),

("URL TOP 10:" , top_url),

("请求流量最大的URL TOP 10[单位:MB]:" , top_url_byte),

("请求流量最大的IP TOP 10[单位:MB]:" , top_ip_byte)

])

# 输出结果

for k,v in result.items():

print(k)

print(v)

print('='*80)

pandas 学习笔记

Pandas 中有两种基本的数据结构,Series 和 Dataframe。 Series 是一种类似于一维数组的对象,由一组数据和索引组成。 Dataframe 是一个表格型的数据结构,既有行索引也有列索引。

?

1

2

from pandas import Series, DataFrame

import pandas as pd

Series

?

1

2

3

4

5

6

7

In [1]: obj = Series([4, 7, -5, 3])

In [2]: obj

Out[2]:

0 4

1 7

2 -5

3 3

Series的字符串表现形式为:索引在左边,值在右边。没有指定索引时,会自动创建一个0到N-1(N为数据的长度)的整数型索引。可以通过Series的values和index属性获取其数组表示形式和索引对象:

?

1

2

3

4

In [3]: obj.values

Out[3]: array([ 4, 7, -5, 3])

In [4]: obj.index

Out[4]: RangeIndex(start=0, stop=4, step=1)

通常创建Series时会指定索引:

?

1

2

3

4

5

6

7

In [5]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [6]: obj2

Out[6]:

d 4

b 7

a -5

c 3

通过索引获取Series中的单个或一组值:

?

1

2

3

4

5

6

In [7]: obj2['a']

Out[7]: -5

In [8]: obj2[['c','d']]

Out[8]:

c 3

d 4

排序

?

1

2

3

4

5

6

7

8

9

10

11

12

In [9]: obj2.sort_index()

Out[9]:

a -5

b 7

c 3

d 4

In [10]: obj2.sort_values()

Out[10]:

a -5

c 3

d 4

b 7

筛选运算

?

1

2

3

4

5

6

7

8

9

10

11

In [11]: obj2[obj2 > 0]

Out[11]:

d 4

b 7

c 3

In [12]: obj2 * 2

Out[12]:

d 8

b 14

a -10

c 6

成员

?

1

2

3

4

In [13]: 'b' in obj2

Out[13]: True

In [14]: 'e' in obj2

Out[14]: False

通过字典创建Series

?

1

2

3

4

5

6

7

8

In [15]: sdata = {'Shanghai':35000, 'Beijing':40000, 'Nanjing':26000, 'Hangzhou':30000}

In [16]: obj3 = Series(sdata)

In [17]: obj3

Out[17]:

Beijing 40000

Hangzhou 30000

Nanjing 26000

Shanghai 35000

如果只传入一个字典,则结果Series中的索引就是原字典的键(有序排列)

?

1

2

3

4

5

6

7

8

In [18]: states = ['Beijing', 'Hangzhou', 'Shanghai', 'Suzhou']

In [19]: obj4 = Series(sdata, index=states)

In [20]: obj4

Out[20]:

Beijing 40000.0

Hangzhou 30000.0

Shanghai 35000.0

Suzhou NaN

当指定index时,sdata中跟states索引相匹配的3个值会被找出并放到响应的位置上,但由于‘Suzhou'所对应的sdata值找不到,所以其结果为NaN(not a number),pandas中用于表示缺失或NA值

pandas的isnull和notnull函数可以用于检测缺失数据:

?

1

2

3

4

5

6

7

8

9

10

11

12

In [21]: pd.isnull(obj4)

Out[21]:

Beijing False

Hangzhou False

Shanghai False

Suzhou True

In [22]: pd.notnull(obj4)

Out[22]:

Beijing True

Hangzhou True

Shanghai True

Suzhou False

Series也有类似的实例方法

?

1

2

3

4

5

6

In [23]: obj4.isnull()

Out[23]:

Beijing False

Hangzhou False

Shanghai False

Suzhou True

Series的一个重要功能是,在数据运算中,自动对齐不同索引的数据

?

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

In [24]: obj3

Out[24]:

Beijing 40000

Hangzhou 30000

Nanjing 26000

Shanghai 35000

In [25]: obj4

Out[25]:

Beijing 40000.0

Hangzhou 30000.0

Shanghai 35000.0

Suzhou NaN

In [26]: obj3 + obj4

Out[26]:

Beijing 80000.0

Hangzhou 60000.0

Nanjing NaN

Shanghai 70000.0

Suzhou NaN

Series的索引可以通过复制的方式就地修改

?

1

2

3

4

5

6

7

In [27]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [28]: obj

Out[28]:

Bob 4

Steve 7

Jeff -5

Ryan 3

DataFrame

pandas读取文件

?

1

2

3

4

5

6

7

8

In [29]: df = pd.read_table('pandas_test.txt',sep=' ', names=['name', 'age'])

In [30]: df

Out[30]:

name age

0 Bob 26

1 Loya 22

2 Denny 20

3 Mars 25

DataFrame列选取

?

1

df[name]

?

1

2

3

4

5

6

7

In [31]: df['name']

Out[31]:

0 Bob

1 Loya

2 Denny

3 Mars

Name: name, dtype: object

DataFrame行选取

?

1

2

df.iloc[0,:] #第一个参数是第几行,第二个参数是列。这里指第0行全部列

df.iloc[:,0] #全部行,第0列

?

1

2

3

4

5

6

7

8

9

10

11

12

In [32]: df.iloc[0,:]

Out[32]:

name Bob

age 26

Name: 0, dtype: object

In [33]: df.iloc[:,0]

Out[33]:

0 Bob

1 Loya

2 Denny

3 Mars

Name: name, dtype: object

获取一个元素,可以通过iloc,更快的方式是iat

?

1

2

3

4

In [34]: df.iloc[1,1]

Out[34]: 22

In [35]: df.iat[1,1]

Out[35]: 22

DataFrame块选取

?

1

2

3

4

5

In [36]: df.loc[1:2,['name','age']]

Out[36]:

name age

1 Loya 22

2 Denny 20

根据条件过滤行

在方括号中加入判断条件来过滤行,条件必需返回 True 或者 False

?

1

2

3

4

5

6

7

8

9

10

11

In [37]: df[(df.index >= 1) & (df.index <= 3)]

Out[37]:

name age city

1 Loya 22 Shanghai

2 Denny 20 Hangzhou

3 Mars 25 Nanjing

In [38]: df[df['age'] > 22]

Out[38]:

name age city

0 Bob 26 Beijing

3 Mars 25 Nanjing

增加列

?

1

2

3

4

5

6

7

8

In [39]: df['city'] = ['Beijing', 'Shanghai', 'Hangzhou', 'Nanjing']

In [40]: df

Out[40]:

name age city

0 Bob 26 Beijing

1 Loya 22 Shanghai

2 Denny 20 Hangzhou

3 Mars 25 Nanjing

排序

按指定列排序

?

1

2

3

4

5

6

7

In [41]: df.sort_values(by='age')

Out[41]:

name age city

2 Denny 20 Hangzhou

1 Loya 22 Shanghai

3 Mars 25 Nanjing

0 Bob 26 Beijing

?

1

2

# 引入numpy 构建 DataFrame

import numpy as np

?

1

2

3

4

5

6

In [42]: df = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])

In [43]: df

Out[43]:

d a b c

three 0 1 2 3

one 4 5 6 7

?

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

# 以索引排序

In [44]: df.sort_index()

Out[44]:

d a b c

one 4 5 6 7

three 0 1 2 3

In [45]: df.sort_index(axis=1)

Out[45]:

a b c d

three 1 2 3 0

one 5 6 7 4

# 降序

In [46]: df.sort_index(axis=1, ascending=False)

Out[46]:

d c b a

three 0 3 2 1

one 4 7 6 5

查看

?

1

2

3

4

5

6

7

8

9

10

11

12

13

14

# 查看表头5行

df.head(5)

# 查看表末5行

df.tail(5)

# 查看列的名字

In [47]: df.columns

Out[47]: Index(['name', 'age', 'city'], dtype='object')

# 查看表格当前的值

In [48]: df.values

Out[48]:

array([['Bob', 26, 'Beijing'],

['Loya', 22, 'Shanghai'],

['Denny', 20, 'Hangzhou'],

['Mars', 25, 'Nanjing']], dtype=object)

转置

?

1

2

3

4

5

6

df.T

Out[49]:

0 1 2 3

name Bob Loya Denny Mars

age 26 22 20 25

city Beijing Shanghai Hangzhou Nanjing

使用isin

?

1

2

3

4

5

6

In [50]: df2 = df.copy()

In [51]: df2[df2['city'].isin(['Shanghai','Nanjing'])]

Out[52]:

name age city

1 Loya 22 Shanghai

3 Mars 25 Nanjing

运算操作:

?

1

2

3

4

5

6

7

8

9

In [53]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],

...: index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

In [54]: df

Out[54]:

one two

a 1.40 NaN

b 7.10 -4.5

c NaN NaN

d 0.75 -1.3

?

1

2

3

4

5

6

7

8

9

10

11

12

#按列求和

In [55]: df.sum()

Out[55]:

one 9.25

two -5.80

# 按行求和

In [56]: df.sum(axis=1)

Out[56]:

a 1.40

b 2.60

c NaN

d -0.55

group

group 指的如下几步:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure

See the Grouping section

?

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

In [57]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',

....: 'foo', 'bar', 'foo', 'foo'],

....: 'B' : ['one', 'one', 'two', 'three',

....: 'two', 'two', 'one', 'three'],

....: 'C' : np.random.randn(8),

....: 'D' : np.random.randn(8)})

....:

In [58]: df

Out[58]:

A B C D

0 foo one -1.202872 -0.055224

1 bar one -1.814470 2.395985

2 foo two 1.018601 1.552825

3 bar three -0.595447 0.166599

4 foo two 1.395433 0.047609

5 bar two -0.392670 -0.136473

6 foo one 0.007207 -0.561757

7 foo three 1.928123 -1.623033

group一下,然后应用sum函数

?

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

In [59]: df.groupby('A').sum()

Out[59]:

C D

A

bar -2.802588 2.42611

foo 3.146492 -0.63958

In [60]: df.groupby(['A','B']).sum()

Out[60]:

C D

A B

bar one -1.814470 2.395985

three -0.595447 0.166599

two -0.392670 -0.136473

foo one -1.195665 -0.616981

three 1.928123 -1.623033

two 2.414034 1.600434

总结

以上就是关于利用Python中的pandas库进行cdn日志分析的全部内容了,希望本文的内容对大家的学习或者工作能带来一定的帮助,如果有疑问大家可以留言交流,谢谢大家对服务器之家的支持。

原文链接:http://qingkang.me/python-pandas.html

本文链接:https://my.lmcjl.com/post/14862.html

展开阅读全文

4 评论

留下您的评论.