Scraping Wikipedia entries with Python and storing them in MySQL

Environment:

Python 3.6
MySQL
PyCharm

First, analyze the page to be scraped.

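Inspecting the page shows that most entry links are <a> tags whose href starts with /wiki/, and that the page is plain static HTML, with no content loaded dynamically.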
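
To make the /wiki/ observation concrete, here is a minimal sketch that picks such links out of a small, made-up HTML snippet. It uses only the standard library's HTMLParser as a stand-in so that it runs without extra packages; the WikiLinkParser class and the sample markup are hypothetical and are not part of the article's code, which uses BeautifulSoup.

```python
import re
from html.parser import HTMLParser

# Entry links are the ones whose href starts with /wiki/
WIKI_HREF = re.compile(r'^/wiki/')


class WikiLinkParser(HTMLParser):
    """Collects the href of every <a> tag that starts with /wiki/."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and WIKI_HREF.match(href):
            self.hrefs.append(href)


# Hypothetical sample markup: two entry links, one external link, one anchor
sample = '''
<a href="/wiki/Python">Python</a>
<a href="https://example.org/">external</a>
<a href="#top">top</a>
<a href="/wiki/MySQL">MySQL</a>
'''

parser = WikiLinkParser()
parser.feed(sample)
print(parser.hrefs)  # ['/wiki/Python', '/wiki/MySQL']
```

The external link and the in-page anchor are skipped because their href does not start with /wiki/.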

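So the entries on a page can be extracted in the following steps:

Fetch the page with the urllib library (Python 3 merged Python 2's urllib and urllib2 into the single urllib package)
Parse the page with BeautifulSoup
Store the extracted data in the database through the pymysql database interface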
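
The first step can be sketched on its own as follows. This is only an illustration under a few assumptions: the helper name fetch_html, the 10-second timeout, the shortened User-Agent string, and the choice to return None on failure are not from the article.

```python
from urllib.error import URLError
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Fetch a URL and return its body decoded as UTF-8, or None on failure.

    The timeout and the error handling are assumptions added for this sketch.
    """
    # Send a browser-like User-Agent, since some sites reject the default one
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.read().decode('utf-8')
    except URLError as exc:  # HTTPError is a subclass of URLError
        print('request failed:', exc)
        return None
```

A call such as fetch_html(rootURL) would then return the page source as a string, ready to be handed to the parser.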

Design the database first.

Open Navicat.

Create a new database (the code below connects to one named wikiurl):

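Create a new table. The code below inserts into a table named urls with the columns urlname and urlhref.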
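
For readers who prefer SQL to the Navicat dialogs, the same setup might look roughly like the sketch below. Only the names wikiurl, urls, urlname and urlhref come from the article's code; the id column, the column types and lengths, and the create_schema helper are assumptions made for illustration.

```python
# Database name taken from the connection settings in the article's code
CREATE_DATABASE_SQL = (
    'CREATE DATABASE IF NOT EXISTS wikiurl '
    'DEFAULT CHARACTER SET utf8mb4'
)

# Column names come from the INSERT statement in the article's code;
# the id column and all types and lengths are assumed
CREATE_TABLE_SQL = '''
CREATE TABLE IF NOT EXISTS urls (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    urlname VARCHAR(255) NOT NULL,
    urlhref VARCHAR(1000) NOT NULL
) DEFAULT CHARSET = utf8mb4
'''


def create_schema(connection):
    """Create the database and table on an open DB-API connection."""
    with connection.cursor() as cursor:
        cursor.execute(CREATE_DATABASE_SQL)
        cursor.execute('USE wikiurl')
        cursor.execute(CREATE_TABLE_SQL)
    connection.commit()
```

create_schema expects an already opened connection, for example one returned by pymysql.connect without a database selected.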

Then the code:

# encoding:utf-8
'''
@author:
@time:
'''

from urllib.request import urlopen
from urllib.request import Request
from bs4 import BeautifulSoup as bs
import re
import pymysql.cursors



def getURL(url):
    # Request the URL and decode the response body as UTF-8
    req = Request(url)
    # Send a browser-like User-Agent header
    req.add_header(
        'User-Agent',
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
    )
    resq = urlopen(req).read().decode('utf-8')


    # Parse the page with BeautifulSoup
    soup = bs(resq,'html.parser')

    # Collect every <a> tag whose href starts with /wiki/
    listURL = soup.find_all('a',href= re.compile(r'^/wiki/'))

    # Drop image links. Build a new list instead of calling remove()
    # on the list being iterated, which would skip elements.
    return [a for a in listURL if not re.search(r'\.(jpg|JPG)$',a['href'])]

def store2SQL(connection,listURL):
    try:
        # Get a cursor
        with connection.cursor() as cursor:
            # Parameterized INSERT statement
            sql = 'INSERT INTO urls (urlname,urlhref) VALUES(%s,%s)'

            for url in listURL:
                print(url)
                # href already starts with /wiki/, so prepend only the site root
                cursor.execute(sql,(url.get_text(),'https://zh.wikipedia.org'+url['href']))
            # Commit once after all rows have been inserted
            connection.commit()

    finally:

        connection.close()



if __name__ == '__main__':

    rootURL = 'https://zh.wikipedia.org/wiki/Wikipedia:%E9%A6%96%E9%A1%B5' # entry point for the crawl
    listURL = getURL(rootURL)

    connection = pymysql.connect(
        host    =   'localhost',
        user    =   'root',
        password=   '',
        database=   'wikiurl',
        charset =   'utf8mb4'
    )
    store2SQL(connection,listURL)
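
Since every href collected here already starts with /wiki/, it has to be joined onto the site root rather than onto a base that itself ends in /wiki. The sketch below shows one way to do that with urljoin while filtering image links into a new list. The clean_links helper, the wider list of image extensions, and the de-duplication step are assumptions added for illustration; the code above only drops .jpg/.JPG links.

```python
import re
from urllib.parse import urljoin

BASE = 'https://zh.wikipedia.org'
# Assumed, broader set of image extensions than the .jpg/.JPG check above
IMAGE_RE = re.compile(r'\.(jpg|jpeg|png|gif|svg)$', re.IGNORECASE)


def clean_links(hrefs, base=BASE):
    """Return absolute, de-duplicated, non-image URLs in their original order."""
    seen = set()
    result = []
    for href in hrefs:
        if IMAGE_RE.search(href):
            continue  # skip image links
        absolute = urljoin(base, href)  # '/wiki/X' becomes base + '/wiki/X'
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical hrefs, including a duplicate and an image link
hrefs = ['/wiki/Python', '/wiki/File:Logo.JPG', '/wiki/Python', '/wiki/MySQL']
print(clean_links(hrefs))
# ['https://zh.wikipedia.org/wiki/Python', 'https://zh.wikipedia.org/wiki/MySQL']
```

Collecting results into a new list, rather than removing items from the list being looped over, avoids silently skipping the element that follows each removed one.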

Result:

As shown, the /wiki/ links on the page that do not point to .jpg images have all been scraped and stored in the database.
