Python Web Scraping Study Notes (1)

These notes cover the basics of web scraping with Python and are shared here for reference. The details follow.

(1) Three ways to scrape a web page

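1. Regular expressions

Python's re module is implemented in C, so matching is fast. Regex-based extraction is brittle, though: a pattern can stop matching as soon as the page's markup changes.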
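The brittleness is easiest to see on a small case. The following is a minimal sketch, not taken from the original notes: the two HTML snippets and the helper name `extract_srcs` are made up for illustration. The second snippet is the same image tag after an imagined page update that swaps the attribute order and the quote style.

```python
import re

# Hypothetical markup, invented for this illustration.
html_v1 = '<img class="BDE_Image" src="https://example.com/a.jpg">'
# The same tag after an imagined page update: attributes reordered, single quotes.
html_v2 = "<img src='https://example.com/a.jpg' class='BDE_Image'>"

# A pattern written against the first layout only.
pattern = re.compile(r'<img class="BDE_Image" src="(.*?)"')


def extract_srcs(html):
    """Return every src value captured by the pattern (hypothetical helper)."""
    return pattern.findall(html)


print(extract_srcs(html_v1))  # ['https://example.com/a.jpg']
print(extract_srcs(html_v2))  # []
```

The pattern still runs quickly on the updated page, but it silently returns nothing, which is the failure mode to watch for with this approach.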

2. Beautiful Soup

Beautiful Soup is written in Python, so it is the slowest of the three.

Installation:

`pip install beautifulsoup4`
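Once the package is installed, a typical extraction looks roughly like the sketch below. It is an illustration rather than code from the original notes: the HTML string is made up, and Python's built-in `html.parser` is chosen here only so that nothing beyond `beautifulsoup4` is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
html = (
    "<div>"
    "<img src='https://example.com/a.jpg' class='BDE_Image'>"
    '<img class="other" src="https://example.com/b.jpg">'
    "</div>"
)

# Parse with the standard-library parser.
soup = BeautifulSoup(html, 'html.parser')

# Find <img> tags by class; attribute order and quote style do not matter.
srcs = [img['src'] for img in soup.find_all('img', class_='BDE_Image')]
print(srcs)  # ['https://example.com/a.jpg']
```

Because the page is parsed into a tree before searching, the lookup depends on the tag name and class rather than on the exact text of the markup.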

3. lxml

lxml is built on C libraries (libxml2 and libxslt), so it is both fast and robust. It is usually the best choice.

(2) Installing lxml

`pip install lxml`

To use lxml's CSS selectors, also install the following module:

`pip install cssselect`

(3) An lxml example

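    import urllib.request
    from urllib.error import URLError

    import lxml.html


    # Download a page and return its HTML (bytes), or None on failure
    def download(url, user_agent='Socrates', num=2):
        print('Downloading: ' + url)
        # Set the user agent (the header name is 'User-Agent')
        headers = {'User-Agent': user_agent}
        request = urllib.request.Request(url, headers=headers)
        try:
            # Download the page
            html = urllib.request.urlopen(request).read()
        except URLError as e:
            # e.reason is not always a string, so convert it before concatenating
            print('Download failed: ' + str(e.reason))
            html = None
            # On a 5XX error, call download recursively to retry, at most 2 times
            if num > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
                # Pass user_agent through so num - 1 lands in the right parameter
                return download(url, user_agent, num - 1)
        return html


    html = download('https://tieba.baidu.com/p/5475267611')

    # Only parse when the download succeeded
    if html is not None:
        # Parse the HTML into a normalized tree
        tree = lxml.html.fromstring(html)
        # Alternative: select the <img> elements with a CSS selector (needs cssselect)
        # img = tree.cssselect('img.BDE_Image')
        # Use lxml's XPath to get the src attribute values; returns a list
        img = tree.xpath('//img[@class="BDE_Image"]/@src')
        # Iterate over img and save each image in the current directory
        for x, src in enumerate(img):
            urllib.request.urlretrieve(src, '%s.jpg' % x)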


Original article: https://www.cnblogs.com/simple-free/p/8757758.html
