GBK error when running a web scraper

Time: 2021-01-09 20:40:52

Source code:

'''Scrape Baidu Tieba data: different forums, different pages'''

from urllib import request
from urllib import parse

# Define commonly used variables
base_url = "https://tieba.baidu.com/f?kw="
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'}

# Build the URL (encode first, then concatenate, then request)
tb_name = input("请输入贴吧名称:")  # prompt text: "Enter the Tieba forum name:"
key = parse.quote(tb_name)
url = base_url + key

print(url)

# Three steps
# Build the Request object, wrapping the request headers
req = request.Request(url,headers=headers)
# Send the request with urlopen
res = request.urlopen(req)
# Read the response
html = res.read().decode('utf-8')

# print(html)

# Save to a file
with open('贴吧.txt','w') as f:
    f.write(html)

Running the scraper produced the following error:

请输入贴吧名称:美女吧
https://tieba.baidu.com/f?kw=%E7%BE%8E%E5%A5%B3%E5%90%A7
Traceback (most recent call last):
  File "D:/AID1812/Spider/day01/05_百度贴吧_练习.py", line 29, in <module>
    f.write(html)
UnicodeEncodeError: 'gbk' codec can't encode character '\U0001f236' in position 166141: illegal multibyte sequence
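
The traceback shows that the failure happens when the text is written, not when the page is downloaded. As a rough diagnostic sketch, the snippet below prints the encoding that open() is likely to fall back on in text mode and checks whether the character from the traceback can be encoded. The helper names default_text_encoding and can_encode are made up for this illustration, and the printed default depends on the machine (typically cp936, an alias for gbk, on Chinese-locale Windows).

```python
import locale


def default_text_encoding() -> str:
    """Hypothetical helper: the locale-based encoding that open() normally
    falls back on in text mode when no encoding argument is given."""
    return locale.getpreferredencoding(False)


def can_encode(text: str, encoding: str) -> bool:
    """Hypothetical helper: True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The character reported in the traceback
char = '\U0001f236'

print(default_text_encoding())    # machine dependent
print(can_encode(char, 'gbk'))    # False
print(can_encode(char, 'utf-8'))  # True
```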

Solution:

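Add encoding="utf-8" to the open() call and the error goes away. Without it, open() uses the system's default encoding in text mode (gbk on this Windows machine), and gbk cannot encode the character '\U0001f236' that appears in the page.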

# Save to a file
with open('贴吧.txt','w',encoding='utf-8') as f:
    f.write(html)
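
The fix can be tried without any network access. The following is a minimal sketch, not the original script: the save_html helper, the sample string, and the temporary file name tieba.txt are all assumptions made for illustration. It forces gbk explicitly to reproduce the failure on any platform, then writes the same text with utf-8 and reads it back.

```python
import os
import tempfile


def save_html(path: str, html: str) -> None:
    """Hypothetical helper: write text with an explicit UTF-8 encoding."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)


# Stand-in for the downloaded page; it contains the character from the traceback
sample = '<title>贴吧 \U0001f236</title>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'tieba.txt')

    # Forcing gbk reproduces the UnicodeEncodeError regardless of system locale
    try:
        with open(path, 'w', encoding='gbk') as f:
            f.write(sample)
        gbk_failed = False
    except UnicodeEncodeError:
        gbk_failed = True

    # Writing with utf-8 succeeds, and the text survives a round trip
    save_html(path, sample)
    with open(path, encoding='utf-8') as f:
        round_trip = f.read()

print(gbk_failed)
print(round_trip == sample)
```

When reading the saved file later, pass encoding='utf-8' to open() as well, otherwise the same default-encoding mismatch can show up as a UnicodeDecodeError.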