Python爬虫简单介绍

一．requests库简单使用

简单获取一个网页的源代码：

import requests

sessions = requests.session()

sessions.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'

url = "https://baike.baidu.com/item/%E8%8C%83%E5%86%B0%E5%86%B0/22984"

r = sessions.get(url)

print(r.status_code)

html_content = r.content.decode('utf-8')

print(html_content)

其中，r.status_code的值有如下对应关系。

Python爬虫简单介绍

r.content可以获取页面的全部内容。

二．BeautifulSoup库简单使用

Beautiful Soup是一个可以从HTML或XML文件中提取数据的Python库.

测试文档如下：

html_doc = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

Beautiful Soup库简单使用。

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "lxml")

简单调用方法如下：

soup.title

# <title>The Dormouse's story</title>

soup.title.name

# u'title'

soup.title.string

# u'The Dormouse's story'

soup.title.parent.name

# u'head'

soup.p

# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']

# u'title'

soup.a

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

print(soup.find("a", id="link1"))

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

其中，find_all或者find是比较常用的。

find_all() 方法将返回文档中符合条件的所有tag；find() 方法将返回文档中符合条件的一个tag；

三．简单下载一张图片

已知网页上图片的地址，下载该图片到本地。

import requests

sessions = requests.session()

sessions.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'

img_url = "https://gss2.bdstatic.com/9fo3dSag_xI4khGkpoWK1HF6hhy/baike/c0%3Dbaike150%2C5%2C5%2C150%2C50/sign=e95e57acd20735fa85fd46ebff3864d6/f703738da9773912f15d70d6fe198618367ae20a.jpg"

r = sessions.get(img_url)

print(r.status_code)

f = open("1.jpg","wb")

f.write(r.content)

f.close()

参考：

https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

http://docs.python-requests.org/zh_CN/latest/

秒客网

Python爬虫简单介绍

一．requests库简单使用

二．BeautifulSoup库简单使用

三．简单下载一张图片

相关文章