Web Scraping 04 / asyncio, selenium detection evasion, action chains, headless browsers

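1. Coroutines with asyncio

Coroutine basics
- Special function
  - A function whose definition is marked with the async keyword (a coroutine function)
  - What makes it special:
    
    Calling it returns a coroutine object
    
    Calling it does not execute the statements in its body immediately
- Coroutine
  - An object, returned by calling a special function. It represents a specific set of operations (the function body), so conceptually coroutine == special function.
- Task object
  - A higher-level coroutine (a further wrapper around a coroutine object); it too represents a specified set of operations
    
    Conceptually: task object == coroutine == special function
  - Binding a callback (typically used for parsing):
    
    task.add_done_callback(func)
    
    func is called with one argument, task: the task object the callback is bound to
    
    task.result(): returns the return value of the special function wrapped by the task
- Event loop object
  - Create an event loop object
  - Register the task object with the loop and start the loop
  - Purpose: the loop runs all task objects registered with it asynchronously
- Code example: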
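```
import asyncio
from time import sleep

# Special function (coroutine function)
async def get_request(url):
    print('Downloading:', url)
    sleep(2)  # blocking call; tolerable here only because a single task runs
    print('Download finished:', url)
    return 'page_text'

# Callback definition (an ordinary function)
def parse(task):
    # The argument is the task object the callback is bound to
    print('i am callback!!!', task.result())

# Call the special function: this returns a coroutine object, the body does not run yet
c = get_request('www.lbzhk.com')

# Create an event loop object
loop = asyncio.new_event_loop()

# Create a task object from the coroutine
task = loop.create_task(c)

# Bind a callback to the task object
task.add_done_callback(parse)

# Register the task with the loop and start it (the loop executes one task)
loop.run_until_complete(task)
loop.close()
```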
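
The two "special" properties listed above can be checked directly. The following is a minimal sketch, assuming Python 3.7+ and using asyncio.run instead of a hand-made loop; the URL and the calls list are illustrative names, not part of the original example.

```python
import asyncio
import inspect

calls = []  # records when the function body and the callback actually run

async def get_request(url):
    calls.append(url)
    await asyncio.sleep(0.1)  # stand-in for a real download
    return f'page_text of {url}'

def parse(task):
    # A done-callback receives the finished task object
    calls.append(('parsed', task.result()))

async def main():
    coro = get_request('www.example.com')
    # Calling the function only built a coroutine object; the body has not run
    created_only = inspect.iscoroutine(coro) and calls == []
    task = asyncio.create_task(coro)
    task.add_done_callback(parse)
    result = await task
    await asyncio.sleep(0)  # one extra loop turn so the callback has surely run
    return created_only, result

created_only, result = asyncio.run(main())
print(created_only, result, calls)
```

The body of get_request only starts once the task is scheduled on the loop, and parse fires after the task has a result.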

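Multi-task coroutines

Suspend: to give up the CPU, handing control back to the event loop.

asyncio.wait(tasks): wraps the task list so the loop can run it; in effect it gives every task object permission to be suspended
await: used inside a special function, in front of a blocking operation; the coroutine is suspended there until the operation finishes
Code example: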

import asyncio

from time import sleep

import time

# Special function (coroutine function)

async def get_request(url):

    print('Downloading:', url)

    await asyncio.sleep(2)

    print('Download finished:', url)

    return 'i am page_text!!!'

def parse(task):

    page_text = task.result()

    print(page_text)

start = time.time()

urls = ['www.1.com','www.2.com','www.3.com']

loop = asyncio.new_event_loop()

tasks = []  # holds all the task objects: multiple tasks!

for url in urls:

    c = get_request(url)

    task = loop.create_task(c)

    task.add_done_callback(parse)

    tasks.append(task)

# asyncio.wait(tasks): gives every task object permission to be suspended at its await

loop.run_until_complete(asyncio.wait(tasks))

print('Total time:', time.time()-start)
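
As a complement to the loop-based version above, here is a minimal sketch of the same multi-task idea written with asyncio.run and asyncio.gather (an alternative to asyncio.wait, not the approach used in the original code). The 0.2 second sleep is an assumed stand-in for network latency.

```python
import asyncio
import time

async def get_request(url):
    await asyncio.sleep(0.2)  # non-blocking wait standing in for network I/O
    return f'page_text of {url}'

async def main(urls):
    # gather schedules every coroutine and returns the results in input order
    return await asyncio.gather(*(get_request(u) for u in urls))

urls = ['www.1.com', 'www.2.com', 'www.3.com']
start = time.perf_counter()
pages = asyncio.run(main(urls))
elapsed = time.perf_counter() - start
print(pages, round(elapsed, 2))
```

Because every task is suspended at its await, the three requests overlap and the total time stays close to that of a single request rather than the sum of all three.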

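2. Multi-task asynchronous scraping with aiohttp

Conditions for asynchronous scraping
- No code from modules that lack async support may appear inside a special function; a single blocking call stalls the event loop and cancels the whole asynchronous effect
- The requests module does not support async
- aiohttp is a network request module that does support async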
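
The first condition can be observed with the standard library alone. In this minimal sketch, time.sleep plays the role of a blocking call such as a requests download and asyncio.sleep plays the role of an awaitable one; the function names and the 0.2 second delay are assumptions made for illustration.

```python
import asyncio
import time

async def blocking_request(url):
    time.sleep(0.2)           # blocking: the event loop cannot switch tasks here
    return url

async def async_request(url):
    await asyncio.sleep(0.2)  # awaitable: the loop runs other tasks meanwhile
    return url

async def run_all(func, urls):
    start = time.perf_counter()
    await asyncio.gather(*(func(u) for u in urls))
    return time.perf_counter() - start

urls = ['www.1.com', 'www.2.com', 'www.3.com']
blocking_time = asyncio.run(run_all(blocking_request, urls))
async_time = asyncio.run(run_all(async_request, urls))
print(f'blocking: {blocking_time:.2f}s, async: {async_time:.2f}s')
```

The blocking version runs the three requests one after another, while the awaitable version overlaps them.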

Workflow for a multi-task asynchronous scraper built on aiohttp

Installation
```
pip install aiohttp
```

Coding workflow:

Rough skeleton:

with aiohttp.ClientSession() as s:

    # s.get(url, headers=..., params=..., proxy="http://ip:port")

    with s.get(url) as response:

        # response.read() returns bytes, the equivalent of .content in requests

        page_text = response.text()

        return page_text

Details to add:

Add async before every with (async with); the block itself must sit inside a special function defined with async def
Add await before every blocking operation

async with aiohttp.ClientSession() as s:

    # s.get(url, headers=..., params=..., proxy="http://ip:port")

    async with s.get(url) as response:

        # response.read() returns bytes (like .content)

        page_text = await response.text()

        return page_text

Code example:

import asyncio

import aiohttp

import time

from bs4 import BeautifulSoup

# Collect all the URLs to request into one list

urls = ['http://127.0.0.1:5000/bobo','http://127.0.0.1:5000/jay','http://127.0.0.1:5000/tom']

start = time.time()

async def get_request(url):

    async with aiohttp.ClientSession() as s:

        # s.get(url, headers=..., params=..., proxy="http://ip:port")

        async with s.get(url) as response:

            # response.read() returns bytes (like .content)

            page_text = await response.text()

            return page_text

def parse(task):

    page_text = task.result()

    soup = BeautifulSoup(page_text,'lxml')

    data = soup.find('div',class_="tang").text

    print(data)

loop = asyncio.new_event_loop()

tasks = []

for url in urls:

    c = get_request(url)

    task = loop.create_task(c)

    task.add_done_callback(parse)

    tasks.append(task)

loop.run_until_complete(asyncio.wait(tasks))

print('Total time:', time.time()-start)

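3. Using selenium

How selenium relates to scraping:
- Simulated login
- Convenient capture of dynamically loaded data
  
  Strength: what you see is what you get
  
  Weakness: low efficiency
selenium concept and installation
- Concept: a module for browser automation.
- Installing the environment: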
```
pip install selenium
```
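Using selenium in practice

Prepare the browser driver (its version must match the installed Chrome): http://chromedriver.storage.googleapis.com/index.html

The code in this section uses the Selenium 3 API (find_element_by_*, executable_path). Selenium 4 deprecated and later removed these in favor of find_element(By.ID, ...) and a Service object, so either pin selenium<4 or adapt the calls.

selenium demo program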

from selenium import webdriver

from time import sleep

# Path to your browser driver; the r'' prefix makes it a raw string so backslashes in the path are not treated as escapes

driver = webdriver.Chrome(r'chromedriver')

# Open the Baidu home page with get

driver.get("http://www.baidu.com")

# Find the "设置" (Settings) link on the page and click it

driver.find_elements_by_link_text('设置')[0].click()

sleep(2)

# In the settings menu, open "搜索设置" (Search settings)

driver.find_elements_by_link_text('搜索设置')[0].click()

sleep(2)

# Select 50 results per page: the third option of the select tag with id "nr"

m = driver.find_element_by_id('nr')

sleep(2)

m.find_element_by_xpath('.//option[3]').click()

sleep(2)

# Click "save settings"

driver.find_elements_by_class_name("prefpanelgo")[0].click()

sleep(2)

# Handle the alert that pops up: accept() confirms, dismiss() cancels

driver.switch_to.alert.accept()

sleep(2)

# Find Baidu's search box and type 美女

driver.find_element_by_id('kw').send_keys('美女')

sleep(2)

# Click the search button

driver.find_element_by_id('su').click()

sleep(2)

# On the results page, find the "美女_百度图片" link and open it

driver.find_elements_by_link_text('美女_百度图片')[0].click()

sleep(3)

# Close the browser

driver.quit()

Basic selenium commands

from selenium import webdriver

bro = webdriver.Chrome(executable_path='./chromedriver.exe')

# Send a request:

bro.get(url)

# Locating tags

# Locate by xpath

search = bro.find_element_by_xpath('//input[@id="key"]')

# Locate by id

search = bro.find_element_by_id('key')

# Locate by class value

search = bro.find_element_by_class_name('prefpanelgo')

# Type text into the located tag

search.send_keys('mac pro')

# Simulate a click

search.click()

# JS injection

bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')

# Handle a pop-up alert: accept() confirms, dismiss() cancels

bro.switch_to.alert.accept()

# switch_to.frame switches into the given child page (iframe)

bro.switch_to.frame('iframeResult')

# Capture the source of the current page

page_text = bro.page_source

# Save a screenshot of the current page

bro.save_screenshot('123.png')

# Close the browser

bro.quit()

A simple selenium usage example:

from selenium import webdriver

from time import sleep

# Instantiate a browser object bound to the browser driver

bro = webdriver.Chrome(executable_path='./chromedriver.exe')

# Send the request

url = 'https://www.jd.com/'

bro.get(url)

sleep(1)

# Locate the tag

# bro.find_element_by_xpath('//input[@id="key"]')

search = bro.find_element_by_id('key')

search.send_keys('mac pro')   # type text into the located tag

sleep(2)

btn = bro.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')

btn.click()

sleep(2)

# JS injection: scroll to the bottom of the page

bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')

# Capture the source of the current page

page_text = bro.page_source

print(page_text)

sleep(3)

bro.quit()

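Code example: capturing dynamically loaded data

Scrape the company names on the first 3 pages of the National Medical Products Administration (药监总局) site, http://125.35.6.84:81/xk/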

from selenium import webdriver

from lxml import etree

from time import sleep

bro = webdriver.Chrome(executable_path='./chromedriver.exe')

url = 'http://125.35.6.84:81/xk/'

bro.get(url)

page_text = bro.page_source

all_page_text = [page_text]

# Click through to the next pages

for i in range(2):

    # Locate the "next page" tag

    nextPage = bro.find_element_by_xpath('//*[@id="pageIto_next"]')

    # Click it

    nextPage.click()

    sleep(1)

    all_page_text.append(bro.page_source)

# Parse the page sources that were collected

for page_text in all_page_text:

    tree = etree.HTML(page_text)

    li_list = tree.xpath('//*[@id="gzlist"]/li')

    for li in li_list:

        name = li.xpath('./dl/@title')[0]

        print(name)

sleep(2)

bro.quit()

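4. Action chains

Action chain concept and workflow
- ActionChains: a series of user actions
  
  The action chain object action and the browser object bro are separate objects
- Workflow:
  1. Instantiate an action chain object, binding it to the given browser object
  2. Queue the sequence of actions to perform
  3. perform() immediately executes the actions queued on the chain

Example code: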

from selenium import webdriver

from selenium.webdriver import ActionChains # action chains

from time import sleep

bro = webdriver.Chrome(executable_path='./chromedriver.exe')

url = 'https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'

bro.get(url)

# NoSuchElementException is raised when the tag being located lives inside an iframe

# Fix: use switch_to.frame to switch into the given child page

bro.switch_to.frame('iframeResult')

div_tag = bro.find_element_by_xpath('//*[@id="draggable"]')

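# Instantiate an action chain object

action = ActionChains(bro)

action.click_and_hold(div_tag)   # click and hold

# perform() makes the action chain execute immediately

for i in range(5):

    action.move_by_offset(xoffset=15,yoffset=15).perform()

    sleep(2)

action.release().perform()   # release is only queued until perform() runs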

sleep(5)

bro.quit()
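
The three-step workflow above (bind, queue, perform) can be sketched as one helper that queues the whole drag and calls perform() once. This is a dry-run sketch: drag_in_steps and RecordingChain are hypothetical names, and RecordingChain only records calls so the snippet runs without a browser. With real selenium you would pass ActionChains(bro) and a located element instead.

```python
def drag_in_steps(action, element, steps, dx, dy):
    """Queue a hold, `steps` small moves and a release, then run them once."""
    action.click_and_hold(element)
    for _ in range(steps):
        action.move_by_offset(dx, dy)
    action.release()
    action.perform()  # nothing is sent to the browser before this call

class RecordingChain:
    """Hypothetical stand-in for ActionChains that records the queued calls."""
    def __init__(self):
        self.calls = []

    def click_and_hold(self, element):
        self.calls.append(('click_and_hold', element))
        return self

    def move_by_offset(self, xoffset, yoffset):
        self.calls.append(('move_by_offset', xoffset, yoffset))
        return self

    def release(self):
        self.calls.append(('release',))
        return self

    def perform(self):
        self.calls.append(('perform',))

chain = RecordingChain()
drag_in_steps(chain, 'draggable', steps=3, dx=15, dy=15)
print(chain.calls)
```

Queuing everything before a single perform() keeps the order of actions explicit and guarantees that the release is actually executed.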

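5. Analysis of simulated login on 12306

Simulated login workflow:
1. Save a screenshot of the current browser page
2. Crop out the region that contains the captcha
  - Get the position of the captcha tag on the page
  - Work out the rectangle to crop
  - Crop that region with the Image tool
3. Call a captcha-solving platform to recognize the captcha; it returns the coordinates to click

Code example: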

from selenium import webdriver

from selenium.webdriver import ActionChains

from time import sleep

from PIL import Image  # requires PIL or its maintained fork Pillow

from CJY import Chaojiying_Client

# Wrap captcha recognition in a function

def transformCode(imgPath,imgType):

    chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')

    im = open(imgPath, 'rb').read()

    return chaojiying.PostPic(im, imgType)['pic_str']

bro = webdriver.Chrome(executable_path='./chromedriver.exe')

bro.get('https://kyfw.12306.cn/otn/login/init')

sleep(2)

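# Save a screenshot of the current browser page

bro.save_screenshot('./main.png')

# Crop out the region that contains the captcha

# Get the position of the captcha tag on the page

img_tag = bro.find_element_by_xpath('//*[@id="loginForm"]/div/ul[2]/li[4]/div/div/div[3]/img')

location = img_tag.location   # coordinates of the tag's top-left corner

size = img_tag.size   # width and height of the tag

# Rectangle to crop: (left, upper, right, lower)

rangle = (int(location['x']),int(location['y']),int(location['x']+size['width']),int(location['y']+size['height']))

# Crop that region with the Image tool

i = Image.open('./main.png')

frame = i.crop(rangle)   # crop cuts the given rectangle out of the image

frame.save('code.png')

# Call the captcha-solving platform to recognize the captcha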

result = transformCode('./code.png',9004)

print(result) #x1,y1|x2,y2|x3,y3

# x1,y1|x2,y2|x3,y3 ==>[[x1,y1],[x2,y2],[x3,y3]]

all_list = []    # [[x1,y1],[x2,y2],[x3,y3]]

if '|' in result:

    list_1 = result.split('|')

    count_1 = len(list_1)

    for i in range(count_1):

        xy_list = []

        x = int(list_1[i].split(',')[0])

        y = int(list_1[i].split(',')[1])

        xy_list.append(x)

        xy_list.append(y)

        all_list.append(xy_list)

else:

    x = int(result.split(',')[0])

    y = int(result.split(',')[1])

    xy_list = []

    xy_list.append(x)

    xy_list.append(y)

    all_list.append(xy_list)

for point in all_list:

    x = point[0]

    y = point[1]

    ActionChains(bro).move_to_element_with_offset(img_tag,x,y).click().perform()

    sleep(1)

bro.find_element_by_id('username').send_keys('xxxxxx')

sleep(1)

bro.find_element_by_id('password').send_keys('xxxx')

sleep(1)

bro.find_element_by_id('loginSub').click()

sleep(10)

print(bro.page_source)

bro.quit()
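
The coordinate handling in the example above can be condensed into two small pure functions. This is a sketch under the same assumptions as the original code: the platform returns a string shaped like x1,y1|x2,y2, and location and size are dicts like the ones selenium returns. The names parse_points and crop_box and the sample numbers are made up for illustration.

```python
def parse_points(result):
    """Turn 'x1,y1|x2,y2' (or a single 'x,y') into [[x1, y1], [x2, y2]]."""
    # split('|') on a string without '|' yields one item, so both cases work
    return [[int(n) for n in pair.split(',')] for pair in result.split('|')]

def crop_box(location, size):
    """Build the (left, upper, right, lower) tuple that Image.crop expects."""
    return (
        int(location['x']),
        int(location['y']),
        int(location['x'] + size['width']),
        int(location['y'] + size['height']),
    )

print(parse_points('77,124|189,62'))
print(parse_points('120,80'))
print(crop_box({'x': 278, 'y': 274}, {'width': 293, 'height': 190}))
```

Each parsed point is an offset relative to the captcha image, which is why the click loop above passes it to move_to_element_with_offset together with img_tag.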

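6. Avoiding detection with selenium

Test whether a server has a selenium detection mechanism
1. Open a site normally and evaluate window.navigator.webdriver via JS: the value is undefined
2. In a page opened by selenium, the same JS returns true

Detection-evasion code example:

# Evade detection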

from selenium import webdriver

from selenium.webdriver import ChromeOptions

option = ChromeOptions()

option.add_experimental_option('excludeSwitches', ['enable-automation'])

bro = webdriver.Chrome(executable_path='./chromedriver.exe',options=option)

url = 'https://www.taobao.com/'

bro.get(url)
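
The manual test described above can also be run from code through execute_script. The sketch below assumes only that the driver object has an execute_script method, as selenium browser objects do; is_flagged_as_automation and FakeDriver are hypothetical names, and FakeDriver exists only so the snippet runs without a browser.

```python
def is_flagged_as_automation(driver):
    """Return True when the page sees window.navigator.webdriver as true."""
    # A JS value of undefined comes back to Python as None
    return driver.execute_script('return window.navigator.webdriver') is True

class FakeDriver:
    """Hypothetical stub exposing execute_script, for a dry run."""
    def __init__(self, value):
        self.value = value

    def execute_script(self, script):
        return self.value

print(is_flagged_as_automation(FakeDriver(True)))
print(is_flagged_as_automation(FakeDriver(None)))
```

With a real browser, calling is_flagged_as_automation(bro) after bro.get(url) shows whether the option set above had the intended effect on that Chrome version.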

7. Headless browsers

Available headless browsers
- PhantomJS
- Headless Chrome

Headless browser code example:

# Headless browser

from selenium import webdriver

from selenium.webdriver.chrome.options import Options

from time import sleep

chrome_options = Options()

chrome_options.add_argument('--headless')

chrome_options.add_argument('--disable-gpu')

bro = webdriver.Chrome(executable_path='./chromedriver.exe',options=chrome_options)

url = 'https://www.taobao.com/'

bro.get(url)

sleep(2)

bro.save_screenshot('123.png')

print(bro.page_source)

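Summary:

Network request modules: requests/urllib/aiohttp
Differences between aiohttp and requests:
- Proxies: requests takes proxies (a dict), aiohttp takes proxy (a single URL string)
- Binary content: requests uses response.content, aiohttp uses await response.read()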