Python Web Scraping: Implementing a Tsinghua University News Crawler

Date: 2024-01-27 10:03:18

I've recently been digging into Python web scraping and couldn't help being impressed by how concise and powerful Python is. Below I'll walk through a crawler I built with Python 3.12; the libraries involved are requests, BeautifulSoup, time, re, and jieba.

Detailed steps:

1. Crawl the URL of each news article:

This program takes quite a while to crawl everything; you can shrink the ID range as needed.

The program is as follows:

# Import the required libraries
import requests
from bs4 import BeautifulSoup
import time
raw = 'https://www.tsinghua.edu.cn/info/1177/'
# Create the file used to save the URLs
with open('d:\\清华url.txt', 'w+') as f:
    urls = []
    # Set a spoofed request header
    headers = {
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)'
    }
    # Count down through the article IDs; the -1 step is required,
    # otherwise the range is empty because start > stop
    for i in range(109542, 102140 - 1, -1):
        # Build the URL
        url = f"{raw}{i}.htm"
        try:
            # Fetch the page
            r = requests.get(url, timeout=30, headers=headers)
            time.sleep(0.5)  # throttle requests so the server isn't hammered
            r.raise_for_status()  # check the status code
            r.encoding = r.apparent_encoding
            print(f'Fetched {url}, starting to process----------')
        except requests.RequestException as e:  # request-specific failures
            print('Request error:', e)
            continue
        except Exception as e:
            print('An error occurred:', e)
            continue
        soup = BeautifulSoup(r.text, 'html.parser')  # the parser can also be 'lxml', but that must be installed
        # Look for the <p> tag that marks the start of the article body
        al = soup.find_all('p', class_='vsbcontent_start')
        if al:
            temp = url + '\n'
            # Avoid duplicates
            if temp not in urls:
                urls.append(temp)
                f.write(temp)
                print(f'Finished crawling {url}')
    print(f"\nCrawled {len(urls)} URLs in total.")
    print("The URLs have been saved to '清华url.txt' on drive D!")

My request header here is fixed, but in practice it is safer to build a list of several User-Agent strings first and then pick one at random, as sketched below.
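
A minimal sketch of that idea (the extra User-Agent strings below are placeholders for illustration, not part of the original script):

import random

# A small pool of User-Agent strings (placeholder values for illustration)
user_agents = [
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]
# Pick one at random for each request
headers = {'User-Agent': random.choice(user_agents)}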

After the program finishes, a file named '清华url.txt' appears under D:\\; opened, it looks like this:

[Screenshot: contents of the generated 清华url.txt file]


2. Read the news content

The purpose of this step is exactly what the heading says: it essentially uses BeautifulSoup to parse each page and pull out its text:

The code is as follows:

import requests
from bs4 import BeautifulSoup
count = 0   # article counter
with open('d:\\清华url.txt', 'r') as f:
    for line in f.readlines():
        line = line.strip()
        count += 1
        try:
            r = requests.get(line, timeout=20)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
        except requests.RequestException as e:
            print('Request error:', e)
            continue
        soup = BeautifulSoup(r.text, 'html.parser')
        print('Extracting text, please wait-------loading-------')
        s = soup.find_all('p')
        print(f"Article {count} fetched!-------filtering the text, please wait-------")
        # Append the text of every <p> tag to the output file
        with open('d:\\清华新闻.txt', 'a+', encoding='utf-8') as c:
            for i in s:
                c.write(i.get_text())
print('Done')
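
One thing to be aware of: find_all('p') also collects paragraphs from the page's navigation and footer, which is why boilerplate words such as '供稿' show up later. A possible refinement, sketched below, reuses the 'vsbcontent_start' class from step 1 and assumes the remaining body paragraphs are siblings of that first one (an assumption about the page layout, not something verified here):

# Sketch: keep only the article body instead of every <p> on the page.
# Assumes the paragraph with class 'vsbcontent_start' is the first body
# paragraph and that the rest of the body follows it as sibling <p> tags.
start = soup.find('p', class_='vsbcontent_start')
if start is not None:
    body_paragraphs = [start] + start.find_next_siblings('p')
    text = '\n'.join(p.get_text() for p in body_paragraphs)
else:
    text = ''  # page without the marker: nothing to keep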


3. Word segmentation with jieba

In this step we segment the text into words and finally pick out the 100 most frequent ones.
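
To get a feel for what segmentation produces before running the full script, here is a minimal sketch (the sample sentence is made up for illustration):

import jieba

# A made-up sample sentence, just to show the shape of jieba's output
sample = '清华大学举办了一场学术讲座'
print(jieba.lcut(sample))
# jieba.lcut returns a plain list of word strings, roughly like
# ['清华大学', '举办', '了', '一场', '学术', '讲座'] (the exact split may vary)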

import jieba
# Read the collected news text
with open('d:\\清华新闻.txt', 'r', encoding='utf-8', errors='replace') as f:
    txt_content = f.read()
# Load the stop-word list
with open('d:\\停用词表.txt', encoding='utf-8', errors='replace') as f:
    stopwords = [line.strip() for line in f.readlines()]
# Segment the text with jieba
words = jieba.lcut(txt_content)
# Count word frequencies
counts = {}
for word in words:
    if word in stopwords:
        continue
    # Skip boilerplate footer words; note this needs 'in', the original
    # chain of "!= ... or != ..." is always true
    if word in ('供稿', '编辑', '审核'):
        continue
    if len(word) == 1:  # ignore single characters
        continue
    counts[word] = counts.get(word, 0) + 1
# Sort by frequency and print the top 100
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:100]:
    print('{:<10}{:>7}'.format(word, count))

Finally, you end up with output like this:

[Screenshot: the top-100 word-frequency output]