Source code:
'''Scrape Baidu Tieba data: different forums, different pages'''

from urllib import request
from urllib import parse

# Define commonly used variables
base_url = "https://tieba.baidu.com/f?kw="
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'}

# Build the URL (encode first, then concatenate, then request)
tb_name = input("请输入贴吧名称:")
key = parse.quote(tb_name)
url = base_url + key

print(url)

# Three steps
# Build the request object and attach the request headers
req = request.Request(url,headers=headers)
# Send the request with urlopen
res = request.urlopen(req)
# Read the response and decode it
html = res.read().decode('utf-8')

# print(html)

# Save to a file
with open('贴吧.txt','w') as f:
    f.write(html)
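The "encode first, then concatenate" step can be tried on its own, without sending any request. This is a minimal sketch that hard-codes the forum name used in the run shown in this post instead of reading it with input(); the round-trip check with unquote() is an addition for illustration only.

```python
from urllib import parse

base_url = "https://tieba.baidu.com/f?kw="
tb_name = "美女吧"  # sample input, hard-coded here instead of using input()

# quote() encodes the text as UTF-8 and percent-escapes each byte
key = parse.quote(tb_name)
url = base_url + key
print(url)

# unquote() reverses the escaping and gives back the original text
restored_name = parse.unquote(key)
print(restored_name == tb_name)
```

Each Chinese character takes three bytes in UTF-8, so the three-character name becomes nine %XX escapes in the query string.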
While scraping the data, the following error occurred:
请输入贴吧名称:美女吧
https://tieba.baidu.com/f?kw=%E7%BE%8E%E5%A5%B3%E5%90%A7
Traceback (most recent call last):
  File "D:/AID1812/Spider/day01/05_百度贴吧_练习.py", line 29, in <module>
    f.write(html)
UnicodeEncodeError: 'gbk' codec can't encode character '\U0001f236' in position 166141: illegal multibyte sequence
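The failure can likely be reproduced without any network access. The sketch below assumes the cause is the one the traceback suggests: open() was called without an encoding, so Python fell back to the locale encoding (GBK on this machine), which has no mapping for U+1F236. The variable names are made up for this example.

```python
import locale

# Encoding that open() falls back to in text mode when none is given.
# On a Chinese-locale Windows machine this is typically cp936 (GBK).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

ch = "\U0001f236"  # the character named in the traceback

# GBK has no mapping for this code point, so encoding fails
try:
    ch.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError as exc:
    gbk_ok = False
    print(exc.reason)

# UTF-8 can encode every Unicode code point; this one takes 4 bytes
utf8_bytes = ch.encode("utf-8")
print(len(utf8_bytes))
```

The first print depends on the machine it runs on, which is exactly why the same script may work on Linux or macOS and fail on Windows.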
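Solution:
Add encoding="utf-8" to the open() call. Without it, open() in text mode uses the system locale encoding, which is GBK on this Windows machine, and GBK cannot represent the character U+1F236 that appears in the page.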
# Save to a file
with open('贴吧.txt','w',encoding='utf-8') as f:
    f.write(html)
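One way to check the fix is a write and read round trip. This is only a sketch: the html string is a short stand-in for the downloaded page, and the file goes into a temporary directory under the hypothetical name tieba.txt rather than next to the script.

```python
import os
import tempfile

html = "<title>demo</title>\U0001f236"  # stand-in for the downloaded page

# Hypothetical output location: a temporary directory, removed on exit
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tieba.txt")

    # Explicit encoding on write, independent of the system locale
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)

    # Read back with the same encoding so the text decodes correctly
    with open(path, encoding="utf-8") as f:
        restored = f.read()

print(restored == html)
```

Any later code that reads the saved file should pass encoding="utf-8" as well, since the default on the same machine would again be GBK.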