Scraping free proxy IPs from Xici (西刺) with requests and BeautifulSoup in Python 3 and storing them in MongoDB

import requests  # fetch the page source
import pymongo  # store the results in MongoDB
from bs4 import BeautifulSoup  # parse the HTML

Multiple pages can be crawled by adding pagination. The code below crawls a single page.

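client = pymongo.MongoClient()  # connect to the MongoDB instance on this machine
database = client['IP代理']  # database name
table = database['ip池']  # collection name
head = {'User-Agent': 'Mozilla/5.0'}  # request headers
response = requests.get('http://www.xicidaili.com/wn/', headers=head)  # request the page, sending the headers
response.encoding = response.apparent_encoding  # use the detected page encoding
tables = BeautifulSoup(response.text, 'lxml').find_all('table', id='ip_list')  # find the table that holds the IPs
for ip_table in tables:  # iterate over every matching table
    # the IP address is in the second cell of each row
    ips = [td.get_text() for td in ip_table.select('tr > td:nth-of-type(2)')]
    # the port is in the third cell of each row
    ports = [td.get_text() for td in ip_table.select('tr > td:nth-of-type(3)')]
    # pair each IP with its port and build one document per proxy
    for ip, port in zip(ips, ports):
        data = {'ip': ip, 'port': port}
        # insert the document into the collection
        # (insert() was removed in PyMongo 4; insert_one() replaces it)
        table.insert_one(data)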
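
The pagination mentioned earlier could be sketched roughly as follows. This is only a minimal sketch: it assumes that later pages are reached by appending the page number to the base URL (for example `.../wn/2`), which is not confirmed by the code above. The helper names `page_urls` and `crawl_pages` and the `delay` parameter are hypothetical and introduced only for illustration.

```python
import time

BASE_URL = 'http://www.xicidaili.com/wn/'


def page_urls(first_page, last_page, base_url=BASE_URL):
    """Yield one listing URL per page number.

    Assumption: page N lives at base_url + str(N).
    """
    for page in range(first_page, last_page + 1):
        yield f'{base_url}{page}'


def crawl_pages(fetch, first_page=1, last_page=3, delay=1.0):
    """Call fetch(url) for every page and collect the results.

    fetch is any callable taking a URL, so the download and parsing
    step stays separate from the paging logic.
    """
    results = []
    for url in page_urls(first_page, last_page):
        results.append(fetch(url))
        time.sleep(delay)  # pause between requests to avoid hammering the site
    return results
```

With this in place, the single-page logic could be wrapped in a function and passed as `fetch`, for instance `crawl_pages(lambda url: requests.get(url, headers=head).text, 1, 5)`. Keeping the URL generation in its own function makes it easy to adjust if the site uses a different paging scheme.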
