Python Basics: Regular Expressions in a Web Scraper, the configparser Module, and the subprocess Module

Regular expressions in a web scraper (xiaohuar.com)

import json
import re

import requests


# Return the page at the given URL as a string
def getPage_str(url):
    page_string = requests.get(url)
    return page_string.text


hua_dic = {}


# Scrape each entry's name, school and number of likes
def run_re(url):
    hua_str = getPage_str(url)
    hua_list = re.finditer(r'<span class="price">(?P<name>.*?)</span>.*?class="img_album_btn">(?P<school>.*?)</a>.*?<em class.*?>(?P<like>\d+?)</em>', hua_str, re.S)
    for n in hua_list:
        # Store the school and the like count in the dict, keyed by name
        hua_dic[n.group('name')] = [n.group('school'), n.group('like')]


# Generate the URL of each list page
def url():
    for i in range(0, 43):
        urls = "http://www.xiaohuar.com/list-1-%s.html" % i
        yield urls


# Run the scraper over every page
for i in url():
    run_re(i)

print(hua_dic)

# with open('aaa', 'w', encoding='utf-8') as f:
#     f.write(str(hua_dic))
data = json.dumps(hua_dic)  # serialize the scraped dict to a JSON string
print(data)
# 'w' rather than 'a': appending on a second run would leave two JSON
# documents in one file, which json.load() cannot read back
with open('hua.json', 'w') as f:
    f.write(data)
# Deserialization
# with open('hua.json', 'r') as f1:
#     new_data = json.load(f1)
# print(new_data)
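The named groups in the pattern can be tried offline before pointing the scraper at the site. The sketch below runs the same pattern over a small HTML fragment; the fragment, the names in it and the variable names are made up for illustration, and the real page markup may differ.

```python
import json
import re

# Hypothetical HTML fragment shaped the way the pattern expects
sample_html = '''
<div class="item">
  <span class="price">Alice</span>
  <a href="#" class="img_album_btn">Example University</a>
  <em class="bold">128</em>
</div>
<div class="item">
  <span class="price">Bob</span>
  <a href="#" class="img_album_btn">Sample College</a>
  <em class="bold">7</em>
</div>
'''

# Same pattern as the scraper, split over several lines for readability.
# re.S lets "." match newlines, so one match can span several lines of HTML.
pattern = re.compile(
    r'<span class="price">(?P<name>.*?)</span>'
    r'.*?class="img_album_btn">(?P<school>.*?)</a>'
    r'.*?<em class.*?>(?P<like>\d+?)</em>',
    re.S,
)

records = {}
for m in pattern.finditer(sample_html):
    # Each named group is read back by name instead of by position
    records[m.group('name')] = [m.group('school'), m.group('like')]

print(records)

# ensure_ascii=False keeps non-ASCII text (such as Chinese names) readable
text = json.dumps(records, ensure_ascii=False)
restored = json.loads(text)
print(restored == records)
```

Note that the like count is captured as a string; convert it with int() if it needs to be compared or sorted numerically.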

The configparser module

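This module handles configuration files such as the conf files found on Linux, whose format is similar to Windows ini files. A file can contain one or more sections, and each section can hold multiple parameters (key = value).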

For example:

[DEFAULT]
ServerAliveInterval = 45
Compression = yes
CompressionLevel = 9
ForwardX11 = yes

[bitbucket.org]
User = hg

[topsecret.server.com]
Port = 50022
ForwardX11 = no

Example of generating the file:

import configparser

config = configparser.ConfigParser()  # create a parser object

# Key/value pairs for the DEFAULT section. DEFAULT is a special section:
# its entries are visible from every other section.
config["DEFAULT"] = {'ServerAliveInterval': '45',
                     'Compression': 'yes',
                     'CompressionLevel': '9',
                     'ForwardX11': 'yes'
                     }

config['bitbucket.org'] = {'User': 'hg'}  # an ordinary section

# An ordinary section; Port = 50022 matches the sample file shown earlier
config['topsecret.server.com'] = {'Port': '50022', 'ForwardX11': 'no'}

with open('example.ini', 'w') as configfile:  # write to the file
    config.write(configfile)
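To see what config.write() produces without touching the disk, the parser can write into an in-memory buffer. This is a minimal sketch using a shortened version of the sections above; the variable names buffer and text are only illustrative.

```python
import configparser
import io

config = configparser.ConfigParser()
config['DEFAULT'] = {'ServerAliveInterval': '45', 'Compression': 'yes'}
config['bitbucket.org'] = {'User': 'hg'}

# config.write() accepts any text file object, including io.StringIO
buffer = io.StringIO()
config.write(buffer)
text = buffer.getvalue()
print(text)
```

The printed keys come out in lowercase (serveraliveinterval, user), because ConfigParser lowercases option names by default. That is also why lookups such as config['bitbucket.org']['user'] work regardless of how the key was originally capitalized.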

Looking up values in the file:

import configparser

config = configparser.ConfigParser()
# -------------------------- Look up values, dict-style
print(config.sections())        # []
config.read('example.ini')
print(config.sections())        # ['bitbucket.org', 'topsecret.server.com']
print('bytebong.com' in config)  # False
print('bitbucket.org' in config)  # True

print(config['bitbucket.org']["user"])  # hg
print(config['DEFAULT']['Compression'])  # yes
print(config['topsecret.server.com']['ForwardX11'])  # no
print(config['bitbucket.org'])          # <Section: bitbucket.org>
for key in config['bitbucket.org']:     # note: the keys from DEFAULT are included as well
    print(key)
print(config.options('bitbucket.org'))  # same as the for loop: every key visible in 'bitbucket.org'
print(config.items('bitbucket.org'))    # every key/value pair visible in 'bitbucket.org'
print(config.get('bitbucket.org', 'compression'))  # yes; get() takes the section and the key, falling back to DEFAULT
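Every value read this way is a string. As a small sketch of how typed values and missing keys might be handled, the example below parses the sample text with read_string() and uses getint(), getboolean() and the fallback argument; the Timeout key is a hypothetical option that does not exist in the file.

```python
import configparser

ini_text = """
[DEFAULT]
ServerAliveInterval = 45
Compression = yes

[topsecret.server.com]
Port = 50022
ForwardX11 = no
"""

config = configparser.ConfigParser()
config.read_string(ini_text)  # parse from a string instead of a file

section = config['topsecret.server.com']
port = section.getint('Port')                    # '50022' converted to int
forward_x11 = section.getboolean('ForwardX11')   # 'no' converted to False
compression = section.getboolean('Compression')  # inherited from DEFAULT
interval = config.getint('topsecret.server.com', 'ServerAliveInterval')

# Hypothetical key that is not in the file: fallback avoids an exception
timeout = section.get('Timeout', fallback='30')

print(port, forward_x11, compression, interval, timeout)
```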

The subprocess module

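When we need to run system commands, the os module is usually the first thing that comes to mind, through os.system() and os.popen(). These two functions are too simple for more complex tasks, such as feeding input to a running command, reading its output, checking its run state, or managing several commands in parallel. In those cases Popen in subprocess does the job.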

The subprocess module lets a process create a new child process, connect to the child's stdin/stdout/stderr through pipes, and obtain the child's return code.

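The core of the module is a single class, Popen; convenience functions such as subprocess.run() and subprocess.call() are built on top of it.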
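For a command that only needs to run to completion, subprocess.run() is often enough. The sketch below is not part of the original walkthrough: it uses the current Python interpreter as the child so that it behaves the same on Windows and Linux, and the printed message is arbitrary.

```python
import subprocess
import sys

# Run a short Python one-liner as the child process and capture its output.
# capture_output and text require Python 3.7 or later.
result = subprocess.run(
    [sys.executable, '-c', "print('hello from the child')"],
    capture_output=True,  # collect stdout and stderr instead of printing them
    text=True,            # decode the output to str instead of bytes
)

print(result.returncode)
print(result.stdout.strip())
```

run() waits for the child and returns a CompletedProcess object; Popen is the better fit when the parent has to keep working while the child runs.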

Simple commands

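import subprocess
import sys

# Start a new process; the parent does not wait for it
if sys.platform.startswith('win'):
    # On Windows: dir is a shell built-in, so shell=True is required
    s = subprocess.Popen('dir', shell=True)
else:
    # On Linux:
    s = subprocess.Popen('ls')
s.wait()                  # s is a Popen instance; wait() blocks until the child has finished
print('ending...')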

Commands with options (same on Windows and Linux)

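import subprocess

# With shell=True the whole command line is passed as one string
subprocess.Popen('ls -l', shell=True)
# Without shell=True, pass the command and its options as a list
# subprocess.Popen(['ls', '-l'])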
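When a command is already available as a single string, shlex.split() from the standard library can turn it into the list form, so that shell=True is not needed. A minimal sketch; the path in the command is made up for illustration.

```python
import shlex

# Quoted arguments stay together as a single list element
command = 'ls -l "/tmp/my dir"'
args = shlex.split(command)
print(args)  # ['ls', '-l', '/tmp/my dir']
```

The resulting list can be passed directly to subprocess.Popen(args). shlex follows POSIX shell quoting rules, so it is best suited to Linux-style command lines.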

Controlling the child process

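s.poll()  # check the child's status: None while it is still running, otherwise its return code
s.kill()  # kill the child
s.send_signal(sig)  # send the signal sig to the child, e.g. signal.SIGTERM
s.terminate()  # ask the child to terminate
s.pid  # the child's process ID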
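The methods above can be tried together on a child that would otherwise run for a long time. This is a sketch under the assumption that a sleeping Python interpreter is an acceptable stand-in for a real long-running command.

```python
import subprocess
import sys

# Child process that would sleep for 30 seconds if left alone
child = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(30)'])
print(child.pid)

# poll() returns None while the child is still running
status_while_running = child.poll()

# Ask the child to stop, then wait so that its return code is collected
child.terminate()
child.wait(timeout=10)
status_after = child.poll()

print(status_while_running, status_after)
```

After termination poll() returns the child's return code. Its exact value depends on the platform: on Linux it is the negative signal number, while on Windows it is an ordinary exit code.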

Controlling the child's output streams

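When creating a child process with Popen(), its standard input, standard output and standard error can be redirected, and subprocess.PIPE can be used to connect the input and output of several child processes into a pipeline: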

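import subprocess

# s1 = subprocess.Popen(["ls", "-l"], stdout=subprocess.PIPE)
# print(s1.stdout.read())

# On Linux: the equivalent of "cat /etc/passwd | grep 0:0"
s1 = subprocess.Popen(["cat", "/etc/passwd"], stdout=subprocess.PIPE)
s2 = subprocess.Popen(["grep", "0:0"], stdin=s1.stdout, stdout=subprocess.PIPE)
s1.stdout.close()  # the parent no longer needs its copy of s1's output
out = s2.communicate()  # a tuple: (stdout as bytes, None because stderr is not piped)
print(out)

# On Windows: console output is GBK-encoded on a Chinese-language system
s = subprocess.Popen("dir", shell=True, stdout=subprocess.PIPE)
print(s.stdout.read().decode("gbk"))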

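subprocess.PIPE in effect provides a buffer for the text stream. s1's stdout writes its text into the buffer, and s2's stdin then reads the text from that pipe. s2's output is likewise held in a pipe until communicate() reads it out.
Note: communicate() is a method of the Popen object; it blocks the parent process until the child has finished.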
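communicate() can also send data to the child's stdin before collecting the output. The sketch below again uses the current Python interpreter as the child so that it runs on both Windows and Linux; the child's code and the input text are made up for illustration.

```python
import subprocess
import sys

# Child that reads all of stdin and writes it back in upper case
child_code = 'import sys; sys.stdout.write(sys.stdin.read().upper())'

child = subprocess.Popen(
    [sys.executable, '-c', child_code],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,  # exchange str instead of bytes (Python 3.7 or later)
)

# communicate() writes the input, closes stdin, reads stdout to the end
# and waits for the child to exit
out, err = child.communicate('hello pipe')

print(out)
print(err)  # None, because stderr was not redirected to a pipe
```

Since communicate() reads everything into memory, it suits modest amounts of output; for very large streams, reading from child.stdout incrementally is the usual alternative.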
