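Syntax:
A regular expression is a pattern language for matching and processing strings. Excel offers many formulas for similar text-handling tasks; having learned some Excel, it is worth seeing how the approach differs here.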
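import re  # import the re module, which handles regular expressions
p = re.compile("^[0-9]")  # build the pattern object: ^ anchors the match at the start and [0-9] matches any one digit from 0 to 9, so a string matches when its first character is a digit
m = p.match("12344abc")  # apply the compiled pattern to the string; on success m is a match object, otherwise m is None
print(m.group())  # m.group() returns the matched text, here 1, because only the character 1 was matched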
The second and third lines above (the compile and match calls) can also be combined into a single call:
m = re.match("^[0-9]", '14534Abc')
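The result is the same. The difference is that the first form compiles (parses) the pattern ahead of time, so later matches reuse the compiled object, while the shorthand has to look up or compile the pattern on every call. The re module caches recently compiled patterns, so the gap is usually small, but if you need to pick out every line that starts with a digit from a 50,000-line file, compiling the pattern once and then matching is still recommended and will be somewhat faster.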
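As a rough sketch of the advice above, the pattern can be compiled once and reused for every line. The `lines` list is made-up sample data standing in for the contents of a large file, and the name `starts_with_digit` is only illustrative.

```python
import re

# Hypothetical sample data standing in for the lines of a large file.
lines = ["12344abc", "abc123", "7 days", "", "x9"]

# Compile the pattern once, outside the loop.
starts_with_digit = re.compile(r"^[0-9]")

# Reuse the compiled object for every line.
matched = [line for line in lines if starts_with_digit.match(line)]
print(matched)  # ['12344abc', '7 days']
```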
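Characters:
. matches any character except a newline
\w matches a letter, digit, or underscore (for str patterns this includes Chinese characters)
\s matches any whitespace character
\d matches a digit
\b matches the start or end of a word
^ matches the start of the string
$ matches the end of the string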
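Repetition:
* repeats zero or more times
+ repeats one or more times
? repeats zero or one time
{n} repeats exactly n times
{n,} repeats n or more times
{n,m} repeats n to m times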
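The quantifiers listed above can be compared side by side on one string. This is only an illustrative sketch: the sample text and variable names are made up, and \b is used so that each pattern has to cover a whole word.

```python
import re

# Made-up sample: "a" followed by zero to four "b" characters.
text = "a ab abb abbb abbbb"

star = re.findall(r"\bab*\b", text)       # zero or more b
plus = re.findall(r"\bab+\b", text)       # one or more b
optional = re.findall(r"\bab?\b", text)   # zero or one b
exact = re.findall(r"\bab{2}\b", text)    # exactly two b
at_least = re.findall(r"\bab{2,}\b", text)  # two or more b
between = re.findall(r"\bab{2,3}\b", text)  # two to three b

print(star)      # ['a', 'ab', 'abb', 'abbb', 'abbbb']
print(plus)      # ['ab', 'abb', 'abbb', 'abbbb']
print(optional)  # ['a', 'ab']
print(exact)     # ['abb']
print(at_least)  # ['abb', 'abbb', 'abbbb']
print(between)   # ['abb', 'abbb']
```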
Functions in the re module:
1.match(pattern,string,flags=0)
def match(pattern, string, flags=0):
    # match: matches from the start of the string; returns a match object on success, None otherwise
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)
match(pattern,string,flags=0)
# pattern: the regular expression
# string: the string to match against
# flags: matching mode flags
2.fullmatch(pattern,string,flags=0)
def fullmatch(pattern, string, flags=0):
    """Try to apply the pattern to all of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).fullmatch(string)
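To illustrate how fullmatch differs from match, here is a small sketch with a made-up pattern and inputs: match only needs the pattern to fit at the start, while fullmatch needs it to cover the entire string.

```python
import re

pattern = r"[0-9]+"

prefix_only = re.match(pattern, "123abc")      # matches the leading '123'
whole_fails = re.fullmatch(pattern, "123abc")  # None: 'abc' is left over
whole_ok = re.fullmatch(pattern, "123")        # matches the entire string

print(prefix_only.group())  # 123
print(whole_fails)          # None
print(whole_ok.group())     # 123
```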
3.search(pattern,string,flags=0)
def search(pattern, string, flags=0):
    # search: scans the whole string and returns the first match, or None if nothing matches
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)
4.sub(pattern,repl,string,count=0,flags=0)
def sub(pattern, repl, string, count=0, flags=0):
    # sub: replaces the matched parts of the string
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl. repl can be either a string or a callable;
    if a string, backslash escapes in it are processed. If it is
    a callable, it's passed the match object and must return
    a replacement string to be used."""
    return _compile(pattern, flags).sub(repl, string, count)
sub replaces matches in a string; count limits how many are replaced.
sub(pattern,repl,string,count=0,flags=0)
# pattern: the regular expression
# repl: the replacement string, or a callable
# string: the string to search
# count: maximum number of replacements (0 replaces all)
# flags: matching mode flags
The following example replaces the first two digits in the string with "|":
>>> m = re.sub("[0-9]","|","alex1is2sb6dese8",count=2)
>>> m
'alex|is|sb6dese8'
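The docstring for sub notes that repl may also be a callable. A minimal sketch of that form follows; the helper name `double` and the sample string are made up for illustration.

```python
import re

def double(match):
    # Receives the match object and returns the replacement text.
    return str(int(match.group()) * 2)

result = re.sub(r"[0-9]+", double, "a1b22c333")
print(result)  # a2b44c666
```

A callable is useful when the replacement depends on what was matched, which a fixed replacement string cannot express.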
5.subn(pattern,repl,string,count=0,flags=0)
def subn(pattern, repl, string, count=0, flags=0):
    """Return a 2-tuple containing (new_string, number).
    new_string is the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in the source
    string by the replacement repl. number is the number of
    substitutions that were made. repl can be either a string or a
    callable; if a string, backslash escapes in it are processed.
    If it is a callable, it's passed the match object and must
    return a replacement string to be used."""
    return _compile(pattern, flags).subn(repl, string, count)
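A short sketch of subn, using a made-up sample string: it behaves like sub but also reports how many replacements were made.

```python
import re

new_string, number = re.subn("[0-9]", "|", "alex1rain2jack3")
print(new_string)  # alex|rain|jack|
print(number)      # 3
```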
6.split(pattern,string,maxsplit=0,flags=0)
def split(pattern, string, maxsplit=0, flags=0):
    # split: splits the string wherever the pattern matches
    """Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings. If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list. If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list."""
    return _compile(pattern, flags).split(string, maxsplit)
split(pattern,string,maxsplit=0,flags=0)
# pattern: the regular expression
# string: the string to split
# maxsplit: maximum number of splits (0 means no limit)
# flags: matching mode flags
In the example below, the string is split on digits and the pieces are returned as a list:
>>> import re
>>> m = re.split("[0-9]","alex1rain2jack3helen rachel8")
>>> m
['alex', 'rain', 'jack', 'helen rachel', '']
>>> m = re.split("[0-9]","alex1is2sb4heeh")
>>> m
['alex', 'is', 'sb', 'heeh']
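Two details from the split docstring, maxsplit and capturing parentheses, can be sketched as follows. The sample strings are made up for illustration.

```python
import re

# maxsplit=2: split at most twice, the rest stays in the last element.
limited = re.split("[0-9]", "alex1rain2jack3helen", maxsplit=2)
print(limited)  # ['alex', 'rain', 'jack3helen']

# A capturing group keeps the separators in the result list.
with_seps = re.split("([0-9])", "alex1rain2jack")
print(with_seps)  # ['alex', '1', 'rain', '2', 'jack']
```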
7.findall(pattern,string,flags=0)
def findall(pattern, string, flags=0):
    # findall: returns a list of all non-overlapping matches; with one group each item is that group's string, and with several groups each item is a tuple
    # empty matches are included in the result
    """Return a list of all non-overlapping matches in the string.
    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.
    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)
findall(pattern,string,flags) finds every substring that matches the pattern and returns them as a list. The example below collects all the digits in a string:
>>> m = re.findall("[0-9]","alex11rain2jack3helan rache8")
>>> m
['1', '1', '2', '3', '8'] (1)
>>> m = re.findall("[0-9]+","alex11rain2jack3helan rache8")
>>> m
['11', '2', '3', '8'] (2)
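In the code above, (1) matches each digit separately, so two adjacent digits come back as two items; at (2) the "+" matches one or more digits, so a run of digits comes back as a single item.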
8.finditer(pattern,string,flags=0)
def finditer(pattern, string, flags=0):
    """Return an iterator over all non-overlapping matches in the
    string. For each match, the iterator returns a match object.
    Empty matches are included in the result."""
    return _compile(pattern, flags).finditer(string)
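A small sketch of finditer on a made-up string: unlike findall, each item is a match object, so the position of every match is available as well as its text.

```python
import re

found = []
for m in re.finditer(r"[0-9]+", "alex11rain2jack3"):
    # Each m is a match object with the text and its span.
    found.append((m.group(), m.span()))

print(found)  # [('11', (4, 6)), ('2', (10, 11)), ('3', (15, 16))]
```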
9.compile(pattern,flags=0)
def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a pattern object."
    return _compile(pattern, flags)
10.purge()
def purge():
    "Clear the regular expression caches"
    _cache.clear()
    _cache_repl.clear()
11.template(pattern,flags=0)
def template(pattern, flags=0):
    "Compile a template pattern, returning a pattern object"
    return _compile(pattern, flags|T)
12.escape(pattern)
def escape(pattern):
    """
    Escape all the characters in pattern except ASCII letters, numbers and '_'.
    """
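Character classes
Examples
Pattern | Description |
[Pp]ython | matches "Python" or "python" |
rub[ye] | matches "ruby" or "rube" |
[aeiou] | matches any one of the letters in the brackets |
[0-9] | matches any digit. Same as [0123456789] |
[a-z] | matches any lowercase letter |
[A-Z] | matches any uppercase letter |
[a-zA-Z0-9] | matches any letter or digit |
[^aeiou] | matches any character other than a, e, i, o, u |
[^0-9] | matches any character that is not a digit |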
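Special character classes
Pattern | Description |
. | matches any single character except "\n". To match any character including "\n", use the re.DOTALL flag or a pattern such as "[\s\S]" |
\d | matches a digit character. Equivalent to [0-9] |
\D | matches a non-digit character. Equivalent to [^0-9] |
\s | matches any whitespace character, including space, tab, and form feed. Equivalent to [ \f\n\r\t\v] |
\S | matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v] |
\w | matches any word character, including underscore. Equivalent to [A-Za-z0-9_] |
\W | matches any non-word character. Equivalent to [^A-Za-z0-9_] |
The equivalences for \d, \s, and \w hold in ASCII mode (re.ASCII); for str patterns they also match the corresponding Unicode digits, whitespace, and word characters.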
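How the dot treats newlines can be checked with a short sketch. The sample string is made up; re.DOTALL and [\s\S] are two common ways to match any character including "\n", whereas inside square brackets a dot is just a literal dot.

```python
import re

text = "a\nb"

default = re.findall(r".", text)             # dot skips the newline
dotall = re.findall(r".", text, re.DOTALL)   # dot also matches "\n"
any_char = re.findall(r"[\s\S]", text)       # whitespace or non-whitespace
literal_class = re.findall(r"[.\n]", text)   # literal "." or newline only

print(default)        # ['a', 'b']
print(dotall)         # ['a', '\n', 'b']
print(any_char)       # ['a', '\n', 'b']
print(literal_class)  # ['\n']
```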
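Difference between re.match and re.search:
re.match only matches at the start of the string; if the start does not fit the pattern, the match fails and the function returns None. re.search scans the whole string until it finds a match.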
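The difference can be seen on a single made-up string in which the digits do not come first:

```python
import re

text = "abc123"

at_start = re.match(r"[0-9]+", text)   # None: the string starts with 'a'
anywhere = re.search(r"[0-9]+", text)  # finds '123' later in the string

print(at_start)          # None
print(anywhere.group())  # 123
```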
Some common regex examples:
1. Matching a mobile phone number
phone_str = "hey my name is Jersey,and my phone number is 13651054607, please call me if you are pretty!"
phone_str2 = "hey my name is alex, and my phone number is 18651054604, please call me if you are pretty!"
m = re.search(r"(1[358]\d{9})", phone_str2)
if m:
    print(m.group())
2. Matching an IPv4 address
ip_addr = "inet 192.168.60.223 netmask oxfffff00 broadcast 192.168.60.255"
m = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", ip_addr)  # \. matches a literal dot; a bare . would match any character
print(m.group())
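The following sketch shows why the dots in an IPv4 pattern should be escaped, and one possible way to reject out-of-range numbers afterwards. The helper `is_valid_ipv4` is a hypothetical name introduced for illustration, and the range check is an assumption about what counts as valid here, not part of the original example.

```python
import re

ip_addr = "inet 192.168.60.223 netmask oxfffff00 broadcast 192.168.60.255"

# An unescaped dot matches any character, so plain digits also match.
loose = re.search(r"\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}", "1234567")
print(loose is not None)  # True

# Escaped dots only match a literal ".".
candidates = re.findall(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", ip_addr)
print(candidates)  # ['192.168.60.223', '192.168.60.255']

def is_valid_ipv4(text):
    # Hypothetical helper: every part must fall in 0..255.
    return all(0 <= int(part) <= 255 for part in text.split("."))

valid = [c for c in candidates if is_valid_ipv4(c)]

# re.escape builds a pattern that matches a known string literally.
literal = re.escape("192.168.60.255")
found = re.search(literal, ip_addr)
print(found.group())  # 192.168.60.255
```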
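3. Matching an address with groups
contactInfo = "Oldboy School,Beijing Changping Shahe:010-8343245"
match = re.search(r'([\w ]+),([\w ]+):(\S+)', contactInfo)  # numbered groups (method 1); [\w ] lets each field contain spaces
match.group(1)  # 'Oldboy School'
match.group(2)  # 'Beijing Changping Shahe'
match.group(3)  # '010-8343245'
contactInfo = "Doe, John: 555-1212"
match = re.search(r'(?P<last>\w+), (?P<first>\w+): (?P<phone>\S+)', contactInfo)  # named groups (method 2)
>>> match.group('last')
'Doe'
>>> match.group('first')
'John'
>>> match.group('phone')
'555-1212'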
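Besides group(), a match object can return all groups at once. The sketch below assumes the sample string "Doe, John: 555-1212", which is what the output shown above implies.

```python
import re

# Assumed sample string, consistent with the output shown above.
contact = "Doe, John: 555-1212"
match = re.search(r"(?P<last>\w+), (?P<first>\w+): (?P<phone>\S+)", contact)

all_groups = match.groups()    # every group as a tuple
by_name = match.groupdict()    # named groups as a dict
phone_span = match.span("phone")  # where the phone group was found

print(all_groups)  # ('Doe', 'John', '555-1212')
print(by_name)     # {'last': 'Doe', 'first': 'John', 'phone': '555-1212'}
print(phone_span)  # (11, 19)
```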
4. Matching an email address
email = "alex.li@126.com http://www.oldboyedu.com"
m = re.search(r"[0-9.a-z]{0,26}@[0-9.a-z]{0,20}\.[0-9a-z]{0,8}", email)
print(m.group())
Let's look at an example that uses a simple regular expression:
import re
# import the re module, which handles regular expressions
m = re.match("abc","abcdef")
print(m)
m = re.match("abc","abcdef")
print(m.group())
m = re.match("abc","bcdef")
print(m)
The output is as follows (Python 3.7 and later print re.Match instead of _sre.SRE_Match):
<_sre.SRE_Match object; span=(0, 3), match='abc'>
abc
None
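In the code above we passed a pattern to match() and stored the result in m. match() always starts at the beginning of the string: it returns a match object if the start matches and None otherwise. Use group() to see what was matched.
m = re.match("[0-9]{0,10}","15d6afdgd")
if m:
    print(m.group())
In the pattern, [0-9] matches a digit from 0 to 9 and {0,10} repeats it 0 to 10 times, so this prints 15.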
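One consequence of the lower bound 0 is worth a quick sketch: the pattern can succeed with an empty match, so the `if m:` test passes even when the string has no leading digits. The input "abc" is a made-up example.

```python
import re

# {0,10} allows zero digits, so this succeeds with an empty match.
m = re.match("[0-9]{0,10}", "abc")
print(m is not None)    # True
print(repr(m.group()))  # ''

# With {1,10} at least one digit is required, so the match fails.
m2 = re.match("[0-9]{1,10}", "abc")
print(m2)  # None
```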
To match all the numbers in a string, use findall(pattern,string,flags):
m = re.findall("[0-9]{1,10}","15d6afd2334d2dgd3")
print(m)
The output is:
['15', '6', '2334', '2', '3']
The code above matched the numbers in the string and returned a list. Next, let's match all the letters in the string:
m = re.findall("[A-Za-z]{1,10}","15d6afd2334d2dgd3")
print(m)
The output is:
['d', 'afd', 'd', 'dgd']
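The code above matched every run of letters [A-Za-z] in the string.
The dot (.) matches any single character except "\n". To match any character including "\n", use the re.DOTALL flag or a pattern such as [\s\S]. Here is an example: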
m = re.findall(".","15d6afd2334d2dgd3")
print(m)
['1', '5', 'd', '6', 'a', 'f', 'd', '2', '3', '3', '4', 'd', '2', 'd', 'g', 'd', '3']
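In the code above the dot (.) matches one character at a time (anything except "\n"), so we get a list of single characters.
Next, let's match with dot-star (.*). Recall that * repeats the preceding element zero or more times: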
m = re.findall(".*","15d6afd2334d2dgd3")
print(m)
The output is:
['15d6afd2334d2dgd3', '']
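The dot (.) matches any character and the star (*) sets the repetition count, so dot-star (.*) matches zero or more characters (other than "\n"). The trailing '' appears because, after consuming the whole string, .* also matches the empty string at the end.
Next comes dot-plus (.+): the dot (.) is what to match and the plus (+) repeats it one or more times, so .+ matches one or more characters: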
m = re.findall(".+","15d6afd2334d2dgd3")
print(m)
The output is:
['15d6afd2334d2dgd3']
Now let's try dot-question (.?), where ? means zero or one time. The code and its output:
m = re.findall(".?","15d6afd2334d2dgd3")
print(m)
The output is:
['1', '5', 'd', '6', 'a', 'f', 'd', '2', '3', '3', '4', 'd', '2', 'd', 'g', 'd', '3', '']
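The question mark (?) fits cases where something appears at most once, the plus sign (+) means at least once, and the star (*) means zero or more times.
^ anchors the match at the start of the string, and $ anchors it at the end.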
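A brief sketch of the two anchors, reusing the sample string from the earlier examples. The variable names are illustrative only.

```python
import re

text = "15d6afd2334d2dgd3"

head = re.search(r"^[0-9]+", text)     # digits at the very start
tail = re.search(r"[0-9]+$", text)     # digits at the very end
whole = re.search(r"^[0-9]+$", text)   # the entire string must be digits

print(head.group())  # 15
print(tail.group())  # 3
print(whole)         # None
```

Using both anchors together requires the pattern to cover the whole string, which is why the last search fails on text that contains letters.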