I recently spent some time exploring web scraping in Python and was struck by how concise and powerful the language is. Below I walk through a crawler I built on Python 3.12, using the following libraries: requests, BeautifulSoup, time, re, and jieba (requests, beautifulsoup4, and jieba are third-party packages installable with pip; time and re ship with the standard library).
Detailed steps:
1. Collect the URL of every news article:
This program takes quite a while to run; you can narrow the ID range below to shorten it if needed.
The program is as follows:
# Import the required libraries
import requests
from bs4 import BeautifulSoup
import time

raw = 'https://www.tsinghua.edu.cn/info/1177/'

# Create the file used to save the URLs
with open('d:\\清华url.txt', 'w+') as f:
    urls = []
    # Set a spoofed request header
    headers = {
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)'
    }
    # Walk the article IDs from newest to oldest
    for i in range(109542, 102140 - 1, -1):
        # Build the URL for this article ID
        url = f"{raw}{i}.htm"
        try:
            # Fetch the page
            r = requests.get(url, timeout=30, headers=headers)
            time.sleep(0.5)  # Pause between requests to avoid hammering the server
            r.raise_for_status()  # Raise an exception on a bad status code
            r.encoding = r.apparent_encoding
            print(f'{url} fetched, starting processing----------')
        except requests.RequestException as e:  # Request-specific failures
            print('Request error:', e)
            continue
        except Exception as e:
            print('Unexpected error:', e)
            continue

        soup = BeautifulSoup(r.text, 'html.parser')  # The parser can also be 'lxml', which must be installed separately
        # Pages with a real article contain a paragraph marking the start of the body
        al = soup.find_all('p', class_='vsbcontent_start')
        if al:
            temp = url + '\n'
            # Avoid duplicates
            if temp not in urls:
                urls.append(temp)
                f.write(temp)
        print(f'{url} done')

print(f"\nCrawled {len(urls)} URLs in total,")
print("and saved them to '清华url.txt' on drive D!")
The request header here is fixed, but in practice it is safer to build a list of several User-Agent strings and pick one at random for each request.
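A minimal sketch of that idea (the User-Agent strings below are only illustrative examples, not a curated list):

import random

# A small pool of example User-Agent strings (illustrative values only)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
]

def random_headers():
    # Return a headers dict with a randomly chosen User-Agent
    return {'User-Agent': random.choice(USER_AGENTS)}

# Usage: r = requests.get(url, timeout=30, headers=random_headers())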
When the program finishes, a '清华url.txt' file appears under D:\\; opening it, you will see one news URL per line.
2. Fetch the news content
As the title says, this step simply uses BeautifulSoup to parse each collected page and pull out its text.
The code is as follows:
import requests
from bs4 import BeautifulSoup

count = 0  # Article counter
with open('d:\\清华url.txt', 'r') as f:
    for line in f.readlines():
        line = line.strip()
        count += 1
        try:
            r = requests.get(line, timeout=20)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
        except requests.RequestException as e:
            print('Request error:', e)
            continue
        soup = BeautifulSoup(r.text, 'html.parser')
        print('Extracting text, one moment-------loading-------')
        s = soup.find_all('p')
        with open('d:\\清华新闻.txt', 'a+', encoding='utf-8') as c:
            print(f"Article {count} fetched!-------filtering its content, please wait-------")
            for i in s:
                c.write(i.get_text())
            c.write('\n')  # End each article with a newline so they do not run together
print('Done')
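One caveat about this step: find_all('p') grabs every paragraph on the page, so navigation and footer text can end up in the corpus as well. Below is a rough sketch of a tighter extraction, assuming the article body sits inside a container div; the class name v_news_content is only a guess on my part and should be checked against the page's actual markup:

from bs4 import BeautifulSoup

def extract_article_text(html):
    # 'v_news_content' is an assumed container class; inspect the page to confirm it
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', class_='v_news_content')
    # Fall back to all <p> tags if the assumed container is not found
    paragraphs = body.find_all('p') if body else soup.find_all('p')
    return '\n'.join(p.get_text(strip=True) for p in paragraphs)

You could then call extract_article_text(r.text) inside the loop above instead of collecting every <p> tag.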
3. Word segmentation with jieba
In this step we tokenize the text with jieba and then filter out the 100 most frequent words.
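Before the full script, here is a quick look at what jieba.lcut does (the sample sentence is the standard one from jieba's documentation):

import jieba

# lcut returns a list of tokens; this sentence is typically split into
# ['我', '来到', '北京', '清华大学']
print(jieba.lcut("我来到北京清华大学"))

The full script for this step: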
import jieba

# Read the news text collected in step 2
with open('d:\\清华新闻.txt', 'r', encoding='utf-8', errors='replace') as f:
    txt_content = f.read()

# Load the stopword list, one word per line
with open('d:\\停用词表.txt', encoding='utf-8', errors='replace') as f:
    stopwords = [line.strip() for line in f.readlines()]

# Tokenize with jieba
words = jieba.lcut(txt_content)

# Count word frequencies
counts = {}
for word in words:
    if word in stopwords:
        continue
    if word in ('供稿', '编辑', '审核'):  # Skip boilerplate credit words (contributor/editor/reviewer)
        continue
    if len(word) == 1:  # Skip single-character tokens, which are mostly punctuation and particles
        continue
    counts[word] = counts.get(word, 0) + 1

# Sort by frequency and print the top 100
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:100]:
    print('{:<10}{:>7}'.format(word, count))
When the script finishes, the 100 most frequent words and their counts are printed to the console.
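As a side note, the manual counting dict above can be replaced by collections.Counter from the standard library, which also handles the top-N selection. A small sketch of the same counting step (the helper name top_words is my own):

from collections import Counter
import jieba

def top_words(text, stopwords, n=100):
    # Count jieba tokens, skipping stopwords and single-character tokens
    words = (w for w in jieba.lcut(text) if len(w) > 1 and w not in stopwords)
    return Counter(words).most_common(n)

# Usage: for word, count in top_words(txt_content, set(stopwords)): print(word, count)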