[Python Study Notes] Coursera course "Using Python to Access Web Data", University of Michigan, Charles Severance: Week 2 Regular Expressions lecture notes

Coursera course "Using Python to Access Web Data", University of Michigan, Charles Severance

**Week 2 Regular Expressions**

11.1 Regular Expressions

11.1.1 Python Regular Expression Quick Guide

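^	Matches the beginning of a line
$	Matches the end of a line
.	Matches any character
\s	Matches a whitespace character
\S	Matches any non-whitespace character
*	Repeats a character zero or more times
*?	Repeats a character zero or more times (non-greedy)
+	Repeats a character one or more times
+?	Repeats a character one or more times (non-greedy)
[aeiou]	Matches a single character in the listed set
[^XYZ]	Matches a single character not in the listed set
[a-z0-9]	The set of characters can include a range
(	Indicates where string extraction is to start
)	Indicates where string extraction is to end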

Note: in non-greedy mode the pattern matches as few characters as possible.
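A few entries from the quick guide can be tried out directly. The following is a minimal sketch; the sample text is made up for illustration and is not from the course material.

```python
import re

# Made-up sample text, used only for illustration
text = 'The quick brown fox'

# [aeiou] matches one listed character at a time
vowels = re.findall('[aeiou]', text)

# [^aeiou\s]+ matches runs of characters that are neither vowels nor whitespace
consonant_runs = re.findall(r'[^aeiou\s]+', text)

# \S+ matches runs of non-whitespace characters, i.e. the words
words = re.findall(r'\S+', text)

print(vowels)
print(consonant_runs)
print(words)
```

The patterns containing a backslash are written as raw strings (r'...') so that Python passes the backslash through to the re module unchanged.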

11.1.2 The Regular Expression Module

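Before you can use regular expressions in a program, you must import the module with 'import re'.

You can then use re.search() to check whether a string matches a regular expression, much like using find().

You can also use re.findall() to extract the parts of a string that match a regular expression, much like combining find() with slicing such as var[5:10].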
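To make the comparison concrete, here is a small sketch of what each call returns. The sample line is an assumption chosen for illustration.

```python
import re

# Sample line chosen for illustration
line = 'From: stephen.marquard@uct.ac.za'

# re.search() returns a match object on success and None on failure,
# so it can be used directly in an if statement
found = re.search('From:', line)
missing = re.search('To:', line)
print(found is not None)   # True
print(missing)             # None

# str.find() reports a position instead: -1 means "not found"
print(line.find('From:'))  # 0
print(line.find('To:'))    # -1

# re.findall() returns the matching pieces as a list of strings
addresses = re.findall(r'\S+@\S+', line)
print(addresses)           # ['stephen.marquard@uct.ac.za']
```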

11.1.3 Using re.search() Like find()

Code using find()

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # find() returns -1 when 'From:' is not in the line
    if line.find('From:') >= 0:
        print(line)

Code using re.search()

import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # re.search() returns None when the pattern is not found
    if re.search('From:', line):
        print(line)

11.1.4 Using re.search() Like startswith()

Code using startswith()

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if line.startswith('From:'):
        print(line)

Code using re.search()

import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # ^ anchors the match to the start of the line, like startswith()
    if re.search('^From:', line):
        print(line)
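The difference between the find()-style and the startswith()-style search comes down to the ^ anchor. A minimal sketch, using hypothetical sample lines that are not taken from mbox-short.txt:

```python
import re

# Hypothetical sample lines, made up for illustration
lines = [
    'From: stephen.marquard@uct.ac.za',
    'X-Forwarded-From: someone@example.com',
    'Reply sent. From: the help desk',
]

# Without ^ the pattern may match anywhere in the line, like find()
anywhere = [ln for ln in lines if re.search('From:', ln)]

# With ^ the pattern must match at the start of the line, like startswith()
at_start = [ln for ln in lines if re.search('^From:', ln)]

print(len(anywhere))
print(at_start)
```

All three lines contain 'From:' somewhere, but only the first one starts with it.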

11.1.5 Wild-Card Characters

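A dot matches any character. Adding an asterisk after it lets that character repeat any number of times.

So the regular expression ^X.*: looks for lines that start with X, followed by any characters, any number of them, up to a colon.

For example, it could return lines like these:

X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-Plane is behind schedule: two weeks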
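The wild-card pattern can be checked against these three lines. In this sketch a fourth line starting with From: is added as an assumption, to show a line that the pattern rejects.

```python
import re

lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-Plane is behind schedule: two weeks',
    'From: stephen.marquard@uct.ac.za',  # added for contrast, does not start with X
]

# ^X.*: accepts any line that starts with X and has a colon somewhere after it
matched = [ln for ln in lines if re.search('^X.*:', ln)]
for ln in matched:
    print(ln)
```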

11.1.6 Fine-Tuning Your Match

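To match more precisely what we want, we can refine the pattern a little.

For example, ^X-\S+: looks for lines that start with X-, followed by one or more non-whitespace characters and then a colon.

This returns the first two lines above but not the third, because "X-Plane is behind schedule: two weeks" has spaces before its colon.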
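A short sketch of the refined pattern, printing the portion of each line that it matches. The use of the match object's group() method is an addition for illustration and does not appear in the notes.

```python
import re

lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-Plane is behind schedule: two weeks',
]

results = []
for ln in lines:
    # \S+ cannot cross a space, so the colon must follow the header name directly
    m = re.search(r'^X-\S+:', ln)
    results.append(m.group() if m else None)

print(results)
```

The third entry is None because the text between X- and the colon contains whitespace.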

11.2 Extracting Data

The pattern [0-9]+ looks for one or more digits.

>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+', x)
>>> print(y)
['2', '19', '42']
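Note that findall() returns strings, not numbers. A minimal sketch of converting the result and of what happens when nothing matches; the [AEIOU]+ pattern is an extra case added here for illustration.

```python
import re

x = 'My 2 favorite numbers are 19 and 42'

# findall() returns strings, so convert before doing arithmetic
numbers = [int(s) for s in re.findall('[0-9]+', x)]
print(numbers)
print(sum(numbers))

# When nothing matches, the result is an empty list rather than an error
no_match = re.findall('[AEIOU]+', x)
print(no_match)
```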

11.2.1 Warning: Greedy Matching

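In greedy mode, which is the default, the match is the longest string that satisfies the pattern.

For example:

>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+:', x)
>>> print(y)
['From: Using the :']

Because the match is greedy, the result is not 'From:'.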

11.2.2 Non-Greedy Matching

Adding a ? after + or * switches to non-greedy mode.

>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+?:', x)
>>> print(y)
['From:']
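The two modes can be compared side by side on the same string. This sketch uses re.search() with group() instead of findall(), which is a choice made here for brevity.

```python
import re

x = 'From: Using the : character'

# Greedy: .+ grabs as much as it can, so the match ends at the last colon
greedy = re.search('^F.+:', x).group()

# Non-greedy: .+? grabs as little as it can, so the match ends at the first colon
lazy = re.search('^F.+?:', x).group()

# The same switch works for *: .*? is the non-greedy form of .*
lazy_star = re.search('^F.*?:', x).group()

print(greedy)
print(lazy)
print(lazy_star)
```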

11.2.3 Fine-Tuning String Extraction

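Suppose we want to locate the email address in the following line.

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

We can do it like this:

>>> import re
>>> x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
>>> y = re.findall(r'\S+@\S+', x)
>>> print(y)
['stephen.marquard@uct.ac.za']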

Parentheses let us specify where the text we want to extract starts and ends. For example:

>>> y = re.findall(r'From (\S+@\S+)', x)
>>> print(y)
['stephen.marquard@uct.ac.za']
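The effect of the parentheses is easier to see when the same line is searched with and without them. A minimal sketch; the third pattern, which pulls out only the host name, is an extra variation added here for illustration.

```python
import re

x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

# Without parentheses the whole match is returned
whole = re.findall(r'^From \S+@\S+', x)

# With parentheses only the parenthesized part is returned
address = re.findall(r'^From (\S+@\S+)', x)

# Moving the parentheses changes what is extracted: here only the host
# name after the @ sign, using [^ ]+ for "one or more non-blank characters"
host = re.findall(r'^From \S+@([^ ]+)', x)

print(whole)
print(address)
print(host)
```

In all three cases the full pattern still has to match; the parentheses only decide which part of the match is handed back.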

11.2.4 Spam Confidence

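An example that extracts the spam confidence values and prints the largest one.

import re

hand = open('mbox-short.txt')
numlist = list()
for line in hand:
    line = line.rstrip()
    # The parentheses extract only the number after the header name
    stuff = re.findall('X-DSPAM-Confidence: ([0-9.]+)', line)
    # Skip lines that do not contain exactly one confidence value
    if len(stuff) != 1:
        continue
    num = float(stuff[0])
    numlist.append(num)

print('Maximum:', max(numlist))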
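The same logic can be tried without the data file. In this sketch the lines and their values are made up to stand in for mbox-short.txt; they are not the real file contents.

```python
import re

# Made-up sample lines standing in for mbox-short.txt
sample = [
    'X-DSPAM-Result: Innocent',
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-DSPAM-Confidence: 0.6178',
    'Subject: confidence 0.9999',
]

numlist = []
for line in sample:
    # Only lines with the X-DSPAM-Confidence header produce a result
    stuff = re.findall('X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1:
        continue
    numlist.append(float(stuff[0]))

print(numlist)
print('Maximum:', max(numlist))
```

If no line matched, numlist would be empty and max() would raise a ValueError, so a real script may want to check for that case first.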

Assignment

import re

hand = open('actual.txt')
numlist = list()
for line in hand:
    line = line.rstrip()
    # findall() returns every run of digits in the line as a string
    stuff = re.findall('[0-9]+', line)
    if len(stuff) == 0:
        continue
    for s in stuff:
        numlist.append(int(s))

# How many numbers were found, and their sum
print(len(numlist))
print(sum(numlist))
