A Simple Web Crawler in Python

Goal: scrape user nicknames

Target site: Qiushibaike (糗事百科)

Dependencies: requests, sys, beautifulsoup4, imp, io

Python version: 3.4

Reference: http://cn.python-requests.org/zh_CN/latest/user/quickstart.html

Steps:

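I. Getting to know Requests

About Requests:

Requests is a Python HTTP library; internally it depends on urllib3.

Its features include:
international domains and URLs, Keep-Alive and connection pooling, sessions with persistent cookies, browser-style SSL verification, Basic/Digest authentication, elegant key/value cookies, automatic decompression, automatic content decoding, Unicode response bodies, multipart file uploads, connection timeouts, streaming downloads, .netrc support, chunked requests, and thread safety.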

The Requests API:

The Requests API maps directly onto the HTTP request methods:

GET, POST, PUT, DELETE, HEAD and OPTIONS

The corresponding Requests calls look like this:

r = requests.get('https://github.com/timeline.json')
r = requests.post("http://httpbin.org/post")
r = requests.put("http://httpbin.org/put")
r = requests.delete("http://httpbin.org/delete")
r = requests.head("http://httpbin.org/get")
r = requests.options("http://httpbin.org/get")
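
As a rough offline illustration of how these helpers relate to the HTTP methods, the sketch below builds (but never sends) one request per method with requests.Request and prepare(). The httpbin.org/anything URL and the q=python parameter are placeholders chosen for this example, not part of the original walkthrough.

```python
import requests

# One entry per HTTP method listed above.
METHODS = ["GET", "POST", "PUT", "DELETE", "HEAD", "OPTIONS"]

# Placeholder URL and query parameter (assumptions for illustration only).
URL = "http://httpbin.org/anything"

# prepare() builds the final request (method, URL, headers) without sending it,
# so this runs without network access.
prepared = {
    method: requests.Request(method, URL, params={"q": "python"}).prepare()
    for method in METHODS
}

for method, req in prepared.items():
    print(method, req.url)
```

Calling requests.get(url) is, roughly speaking, shorthand for building such a GET request and sending it through a session.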

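This article focuses on GET requests with Requests.

The examples below use the GitHub timeline to show the formats in which the server's response can be read: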

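1. Response content

import requests
r = requests.get('https://github.com/timeline.json')
r.text

Requests automatically decodes the content returned by the server and handles most Unicode charsets. You can also choose the encoding yourself, for example by setting r.encoding = 'utf-8' before reading r.text.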
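
To see why the encoding matters, here is a small standard-library sketch (no network, no Requests) of what happens when the same bytes are decoded with the wrong and the right charset. The sample text and the choice of ISO-8859-1 as the "wrong" charset are assumptions for illustration; I believe Requests may fall back to ISO-8859-1 when a text response declares no charset, which is one common source of garbled Chinese output.

```python
# Bytes as they might arrive from a server (here: UTF-8 encoded Chinese text).
body = "糗事百科".encode("utf-8")

# Decoding with ISO-8859-1 never raises, it just produces mojibake.
garbled = body.decode("iso-8859-1")

# Decoding with the right charset recovers the text. With Requests this
# corresponds to setting r.encoding = 'utf-8' before reading r.text.
text = body.decode("utf-8")

print(repr(garbled))
print(text)
```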

2. Binary and JSON response content

r.content
r.json()

Used in place of r.text above, r.content gives access to the response body as bytes rather than text, and r.json() decodes the body as JSON.
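
The relationship between the three views can be sketched with the standard library alone. The payload below is made up for illustration, and the mapping to r.content, r.text and r.json() in the comments is an approximation, not a description of Requests internals.

```python
import json

# Made-up JSON payload standing in for a server response body.
body = b'{"login": "octocat", "id": 1}'

as_bytes = body                    # roughly what r.content would hold
as_text = body.decode("utf-8")     # roughly what r.text would hold
as_json = json.loads(as_text)      # roughly what r.json() would return

print(type(as_bytes).__name__, type(as_text).__name__, type(as_json).__name__)
print(as_json["login"])

# A body that is not valid JSON raises a ValueError subclass when parsed.
try:
    json.loads("<html>not json</html>")
except ValueError as exc:
    print("invalid JSON:", exc)
```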

3. Raw response content

import requests
r = requests.get('https://github.com/timeline.json', stream=True)
r.raw
r.raw.read(10)
# Write the streamed data to test.txt
with open('test.txt', 'wb') as fd:
    for chunk in r.iter_content(10):
        fd.write(chunk)
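
The chunked write above can be wrapped in a small helper that accepts any iterable of byte chunks. This is only a sketch: save_chunks is a hypothetical name, and the hard-coded chunk list stands in for r.iter_content(10) so the example runs offline.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path; return the bytes written."""
    total = 0
    with open(path, "wb") as fd:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                fd.write(chunk)
                total += len(chunk)
    return total


# Offline stand-in for r.iter_content(10): chunks of at most 10 bytes.
fake_chunks = [b"0123456789", b"abcdefghij", b"xyz"]
path = os.path.join(tempfile.gettempdir(), "test_stream_demo.txt")
written = save_chunks(fake_chunks, path)
print(written)
```

With a real streamed response, the call would look like save_chunks(r.iter_content(chunk_size=8192), 'test.txt'); a chunk size larger than 10 bytes usually means far fewer write calls.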

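II. Introducing Beautiful Soup:

Beautiful Soup is a Python library; here its main job is to pull data out of the pages we crawl. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract, and because it is simple, a complete application takes very little code.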
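
A minimal sketch of searching and navigating a parse tree, run on hand-written HTML. The markup, the author class and the /users/ links are invented for illustration; the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration only.
html = """
<div class="author"><a href="/users/1/"><h2> Alice </h2></a></div>
<div class="author"><a href="/users/2/"><h2>Bob</h2></a></div>
<p class="content">not a nickname</p>
"""

# Name the parser explicitly; html.parser ships with Python.
soup = BeautifulSoup(html, "html.parser")

# Searching: every <h2> in the document, with surrounding whitespace stripped.
names = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

# Navigating: from each author <div> to its first <a>, then reading an attribute.
links = [div.a["href"] for div in soup.find_all("div", class_="author")]

print(names)
print(links)
```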

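III. Scraping the nicknames

Since this is my first time using Python, let's build the simplest possible crawler. The code is very simple: it just fetches the nicknames on the Qiushibaike home page: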

# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup
# from imp import reload  # only needed by the Python 2 workaround below; imp was removed in Python 3.12
import requests
import sys
import io

# Force UTF-8 output so Chinese text prints on consoles with another default encoding
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')
# Python 2 workaround for the unicode/ASCII mismatch; not needed on Python 3
# reload(sys)
# sys.setdefaultencoding("utf-8")


class Crawler(object):
    def __init__(self):
        print("Start crawling")

    # getSource returns the page's HTML source
    def getSource(self, url):
        html = requests.get(url)
        # print(html.text)  # uncomment to check that content was fetched
        return html.text


# Main entry point
if __name__ == '__main__':
    url = 'http://www.qiushibaike.com'
    testCrawler = Crawler()
    content = testCrawler.getSource(url)
    # Name the parser explicitly; omitting it makes bs4 pick one and emit a warning
    soup = BeautifulSoup(content, 'html.parser')
    # Write UTF-8 explicitly; the with block closes the file even on errors
    with open("crawler.txt", 'w', encoding='utf-8') as fd:
        for i in soup.find_all('h2'):
            print(i.getText())
            fd.write(i.getText() + '\n')
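
One possible way to harden the crawler, sketched under a few assumptions: the User-Agent string, the 10-second timeout, and the function names fetch_html, extract_nicknames and crawl are all my own choices for illustration. Splitting the download from the parsing lets the parsing be checked offline with a stub.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical User-Agent; some sites reject the library's default one.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; simple-crawler-demo)"}


def fetch_html(url, timeout=10):
    """Download a page, failing loudly on timeouts and HTTP error codes."""
    resp = requests.get(url, headers=HEADERS, timeout=timeout)
    resp.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return resp.text


def extract_nicknames(html):
    """Return the stripped text of every non-empty <h2> element."""
    soup = BeautifulSoup(html, "html.parser")
    names = (h2.get_text(strip=True) for h2 in soup.find_all("h2"))
    return [name for name in names if name]


def crawl(url, fetch=fetch_html):
    """Fetch url and extract nicknames; fetch can be swapped for a stub."""
    return extract_nicknames(fetch(url))


# Offline check: a stub replaces the real download, so no network is needed.
sample = "<h2>\nAlice\n</h2><h2></h2><h2>Bob</h2>"
print(crawl("http://example.invalid", fetch=lambda url: sample))
```

For a real run, crawl('http://www.qiushibaike.com') would use fetch_html by default; whether the site still serves nicknames inside h2 tags is something to verify against the live page.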
