First Web Crawler and Tests

The Basic Crawler Workflow

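1. Send the request:
Use an HTTP library to send a request (a Request) to the target site. The request can carry extra information such as headers; the client then waits for the server to respond. This is what happens when we open a browser, type www.baidu.com into the address bar, and press Enter: the browser acts as the client and sends one request to the server.
2. Get the response:
If the server responds normally, we receive a Response whose content is what we want to fetch. It may be HTML, a JSON string, or binary data (images, video, and so on). In other words, the server receives the client's request, processes it, and sends the page's HTML file back to the browser.
3. Parse the content:
If the content is HTML, it can be parsed with regular expressions or an HTML parsing library. If it is JSON, it can be converted directly into a JSON object. If it is binary data, it can be saved or processed further. This step corresponds to the browser receiving the server's files locally, interpreting them, and rendering the page.
4. Save the data:
The data can be saved as text, stored in a database, or written to files in a specific format such as jpg or mp4. This is like downloading an image or a video from a page while browsing.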
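
The four steps above can be sketched in a few lines. This is only a minimal illustration under some assumptions: the fetch function is a stub that returns a hard-coded page instead of contacting a server, and the names fetch, parse, save, and SAMPLE_RESPONSE are made up for this example.

```python
import re
import tempfile
from pathlib import Path

# Assumption: a hard-coded page stands in for a real server response
SAMPLE_RESPONSE = (
    "<html><head><title>Example page</title></head>"
    "<body><p>Hello</p></body></html>"
)


def fetch(url):
    """Steps 1 and 2 (stubbed): pretend to send a request and return the body."""
    return SAMPLE_RESPONSE


def parse(html):
    """Step 3: pull the <title> text out with a regular expression."""
    match = re.search(r"<title>(.*?)</title>", html, re.S)
    return match.group(1).strip() if match else ""


def save(text, path):
    """Step 4: store the extracted data as a UTF-8 text file."""
    Path(path).write_text(text, encoding="utf-8")


html = fetch("http://example.com")
title = parse(html)
out_path = Path(tempfile.gettempdir()) / "crawler_demo_title.txt"
save(title, out_path)
print(title)
```

In a real crawler, fetch would call an HTTP library, and parse would usually rely on a proper HTML parser rather than a regular expression.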
Request

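1. What is a Request?
The browser sends a message to the server that hosts the URL. This process is called an HTTP Request.
2. What does a Request contain?
Request method: the main methods are GET and POST; others include HEAD, PUT, and DELETE. The parameters of a GET request appear at the end of the URL. For example, searching Baidu for "图片" produces the URL https://www.baidu.com/s?wd=图片. The parameters of a POST request are carried in the request body and do not appear in the URL. For example, when logging in to Zhihu with a username and password, the Network tab of the browser's developer tools shows the request's Form Data key-value pairs, which hold the login information. Keeping credentials out of the URL helps protect the account, although only HTTPS protects them in transit;
Request URL: URL stands for Uniform Resource Locator, which is what we usually call a web address. An image, a music file, or a web document can each be identified by a unique URL, which tells the browser where the file is and how to handle it;
Request headers: the header information sent with the request, such as User-Agent (which identifies the client, for example the browser), Host, and Cookies;
Request body: extra data carried with the request, such as the login data submitted by a login form.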
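
To make the difference between GET and POST concrete, here is a small sketch that only builds the two request objects and never sends them. It uses the standard library's urllib instead of an HTTP client so that it runs offline. The login URL, the field names, and the User-Agent string are assumptions chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: the parameters are encoded into the URL itself
get_req = Request(
    "https://www.baidu.com/s?" + urlencode({"wd": "图片"}),
    headers={"User-Agent": "demo-crawler/0.1"},  # hypothetical client name
)

# POST: the parameters travel in the request body, not in the URL
form = {"username": "alice", "password": "secret"}  # made-up credentials
post_req = Request(
    "https://example.com/login",  # hypothetical login endpoint
    data=urlencode(form).encode("utf-8"),
    method="POST",
)

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

The printed GET URL contains the percent-encoded search term, while the POST URL stays clean and the form data appears only in the body.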
Response

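1. What is a Response?
After receiving the browser's message, the server processes it according to its content and sends a message back to the browser. This process is called an HTTP Response.
2. What does a Response contain?
Response status: there are many status codes. For example, 200 means success, 301 means the page has moved permanently (a redirect), 404 means the page was not found, and 502 means a bad gateway (a server-side error);
Response headers: for example content type, content length, server information, and Set-Cookie;
Response body: the most important part, containing the content of the requested resource, such as a page's HTML code or an image's binary data.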
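
The status codes mentioned above fall into classes by their first digit. The sketch below looks up the standard reason phrase with the standard library's http.HTTPStatus and labels the class; the helper name describe is made up for this example.

```python
from http import HTTPStatus


def describe(code):
    """Return the code, its standard reason phrase, and its class."""
    status = HTTPStatus(code)
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirection"
    elif 400 <= code < 500:
        kind = "client error"
    elif code >= 500:
        kind = "server error"
    else:
        kind = "informational"
    return f"{code} {status.phrase} ({kind})"


for code in (200, 301, 404, 502):
    print(describe(code))
```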
Below is the code I used to test the Baidu home page:

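# -*- coding: utf-8 -*-
import requests


def getHTMLText(url):
    """Fetch url and return the page text, or "error" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()      # raise for 4xx/5xx status codes
        r.encoding = 'utf-8'      # decode the body as UTF-8
        return r.text
    except requests.RequestException:
        return "error"


url = "http://www.baidu.com"

# Request the page 20 times and print the result of each attempt
for i in range(20):
    print(getHTMLText(url))
    print(i)

Output: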

<!DOCTYPE html>
<html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class="head_wrapper"> <div class="s_form"> <div class="s_form_wrapper"> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class="fm"> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class="s_ipt" value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class="mnav">新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class="mnav">hao123</a> <a href=http://map.baidu.com name=tj_trmap class="mnav">地图</a> <a href=http://v.baidu.com name=tj_trvideo class="mnav">视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class="mnav">贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class="lb">登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class="bri" style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>©2017 Baidu <a href=http://www.baidu.com/duty/>使用百度前必读</a>  <a href=http://jianyi.baidu.com/ class="cp-feedback">意见反馈</a> 京ICP证030173号  <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
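
The line r.encoding = 'utf-8' in the test script matters because a response body arrives as bytes and has to be decoded with the right character set. When a text response does not declare a charset, requests falls back to ISO-8859-1, which garbles Chinese text. The sketch below reproduces the effect offline with plain bytes; the variable names are made up for this example.

```python
text = "百度一下，你就知道"
raw = text.encode("utf-8")            # what the server would send over the wire

as_utf8 = raw.decode("utf-8")         # correct decoding
as_latin1 = raw.decode("iso-8859-1")  # wrong decoding: one character per byte

print(as_utf8)
print(as_latin1 == text)              # False: the text is garbled
print(len(text), len(as_latin1))      # 9 characters versus 27
```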


Crawling the Chinese University Rankings

The code is as follows:

# -*- coding: utf-8 -*-
'''
Fetch the Chinese university rankings
@author: bpf
'''
import requests
from bs4 import BeautifulSoup
import pandas
# 1. Fetch the page content
def getHTMLText(url):
    try:
        r = requests.get(url, timeout = 30)
        r.raise_for_status()
        r.encoding = 'utf-8'
        return r.text
    except Exception as e:
        print("Error:", e)
        return ""

# 2. Parse the page and extract the useful data
def fillTabelList(soup): # collect the table data
    tabel_list = []      # holds the whole table
    Tr = soup.find_all('tr')
    for tr in Tr:
        Td = tr.find_all('td')
        if len(Td) == 0:   # skip rows without data cells (e.g. the header row)
            continue
        tr_list = [] # holds one row
        for td in Td:
            tr_list.append(td.string)
        tabel_list.append(tr_list)
    return tabel_list

# 3. Display the data
def PrintTableList(tabel_list, num):
    # Print the first num rows; chr(12288) is the full-width space used as padding
    print("{1:^2}{2:{0}^10}{3:{0}^5}{4:{0}^5}{5:{0}^8}".format(chr(12288), "排名", "学校名称", "省市", "总分", "生涯质量"))
    for i in range(num):
        text = tabel_list[i]
        print("{1:{0}^2}{2:{0}^10}{3:{0}^5}{4:{0}^8}{5:{0}^10}".format(chr(12288), *text))

# 4. Save the data as a CSV file
def saveAsCsv(filename, tabel_list):
    FormData = pandas.DataFrame(tabel_list)
    FormData.columns = ["排名", "学校名称", "省市", "总分", "生涯质量", "培养结果", "科研规模", "科研质量", "顶尖成果", "顶尖人才", "科技服务", "产学研合作", "成果转化"]
    FormData.to_csv(filename, encoding='utf-8', index=False)

if __name__ == "__main__":
    url = "http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html"
    html = getHTMLText(url)
    soup = BeautifulSoup(html, features="html.parser")
    data = fillTabelList(soup)
    #print(data)
    PrintTableList(data, 10)   # print the first 10 rows
    saveAsCsv("D:\\University_Rank.csv", data)
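
The extract-then-save idea in steps 2 and 4 can be tried without network access on a small inline table. The sketch below is an illustration under assumptions, not the script's own method: it swaps BeautifulSoup and pandas for the standard library's html.parser and csv modules, and SAMPLE_TABLE, TableRowParser, and the three column names are invented for the example.

```python
import csv
import io
from html.parser import HTMLParser

# Assumption: a tiny made-up table stands in for the ranking page
SAMPLE_TABLE = """
<table>
  <tr><th>Rank</th><th>School</th><th>Score</th></tr>
  <tr><td>1</td><td>School A</td><td>95.9</td></tr>
  <tr><td>2</td><td>School B</td><td>82.6</td></tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # skip rows without <td> cells, such as the header
                self.rows.append(self._row)
            self._row = None


parser = TableRowParser()
parser.feed(SAMPLE_TABLE)
rows = parser.rows

# Write the rows as CSV into an in-memory buffer instead of a file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Rank", "School", "Score"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

As in fillTabelList, rows that contain no td cells are skipped, which is what drops the header row.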

Output:


Getting the Content of a Page's Tags and Attributes

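(I don't really know what this title means either; it sounds impressive but I can't follow it. Just kidding! You will certainly understand it!)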

Here I pick an excellent site for the demonstration:

URL = https://www.runoob.com/

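Steps:

1. Pick a URL. One is given above, but really any URL will do.

2. Fetch the page content. This is explained in detail above and is not hard.

3. This time we use the third-party BeautifulSoup library, which you need to install yourself.

4. Get to work.

The URL was given above; now fetch the page content: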

# -*- encoding:utf-8 -*-
from requests import get
def getText(url):
    try:
        r = get(url, timeout=5)
        r.raise_for_status()
        r.encoding = 'utf-8'
        return r.text
    except Exception as e:
        print("Error:", e)
        return ''

Next, import the BeautifulSoup library. Its import looks a little different from other libraries, so take care not to mistype it:

from bs4 import BeautifulSoup

Then create a BeautifulSoup object:

url = "https://www.runoob.com/"
html = getText(url)
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning

Good, now we can do whatever we like ↓↓↓

① Get the head tag

print("head:", soup.head)
print("head:", len(soup.head))

The output is long, so only the result of the second print is shown:

head: 33

② Get the body tag

print("body:", soup.body)
print("body:", len(soup.body))

The output is long, so only the result of the second print is shown:

body: 39

③ Get the title tag

print("title:", soup.title)

title: <title>菜鸟教程 - 学的不仅是技术，更是梦想！</title>

④ Get the text of the title

print("title_string:", soup.title.string)

title_string: 菜鸟教程 - 学的不仅是技术，更是梦想！

⑤ Find the element with a specific id

print("special_id:", soup.find(id='cd-login'))

special_id: <div id="cd-login"> 
<div class="cd-form">
<p class="fieldset"> ......

⑥ Find all a tags

a: [<a href="/">菜鸟教程 -- 学的不仅是技术，更是梦想！</a>, <a class="current" data-id="index" href="//www.runoob.com/" title="菜鸟教程">首页</a>, ......
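
The call that produced the output above is not shown; with the soup object it was presumably soup.find_all('a'). As a dependency-free illustration of the same idea, the sketch below collects the href of every a tag with the standard library's html.parser. The LinkCollector class and the SAMPLE snippet are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Record the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self.links.append(dict(attrs).get("href"))


# Assumption: a short hand-written snippet instead of the live page
SAMPLE = (
    '<a href="/">Home</a> <p>text</p> '
    '<a class="current" href="//www.runoob.com/">首页</a>'
)

collector = LinkCollector()
collector.feed(SAMPLE)
print(collector.links)
```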

⑦ Extract all Chinese characters

import re
def getChinese(text):
    text_unicode = text.strip() # trim surrounding whitespace (Python 3 strings are already Unicode)
    string = re.compile('[^\u4e00-\u9fff]')
    # The basic CJK Unified Ideographs block spans \u4e00-\u9fff
    # (many tutorials use the narrower range \u4e00-\u9fa5)
    chinese = "".join(string.split(text_unicode))
    return chinese

print("Chinese:", getChinese(html))
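
The same idea can also be written by matching runs of Chinese characters directly instead of splitting on everything else. The sketch below is a variant under that assumption; the name extract_chinese and the sample string are chosen for illustration, and full-width punctuation such as "，" and "！" falls outside the matched range.

```python
import re

# Runs of characters in the basic CJK Unified Ideographs block
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def extract_chinese(text):
    """Return the list of consecutive runs of Chinese characters in text."""
    return CHINESE_RUN.findall(text)


sample = "<title>菜鸟教程 - 学的不仅是技术，更是梦想！</title>"
runs = extract_chinese(sample)
print(runs)
print("".join(runs))
```

Keeping the runs as a list preserves the boundaries between phrases; joining them gives the same string that getChinese returns.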

Crawling the Bing Site

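import requests


def getHTMLText(url):
    """Fetch url and print its status code, encoding, and text."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = 'utf-8'
        print(r.status_code)
        print(r.encoding)
        print(r.text)
    except requests.RequestException as e:
        print("Error:", e)  # report the failure instead of hiding it
        return ""


url = "https://www.bing.com"  # the original had a doubled scheme: "https://https://www.bing.com"
for i in range(0, 20):
    getHTMLText(url)

This is the code for opening the Bing page. As originally written it could not get in, while Baidu worked. The most likely cause was the URL, which had been typed as "https://https://www.bing.com": the doubled "https://" makes the request fail, and the bare except then silently swallowed the error.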
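
A quick way to catch this kind of typo before sending a request is to look at how the URL is parsed. The sketch below uses urllib.parse.urlsplit; the helper looks_malformed and its rules are a rough heuristic assumed for this example, not a full URL validator.

```python
from urllib.parse import urlsplit


def looks_malformed(url):
    """Heuristic check for a missing host or a doubled scheme."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # A doubled "https://" is parsed as host "https" with a path starting "//"
    return host in ("", "http", "https") or parts.path.startswith("//")


for candidate in ("https://www.bing.com", "https://https://www.bing.com"):
    print(candidate, "->", "suspicious" if looks_malformed(candidate) else "ok")
```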
