Using BeautifulSoup with Python 3 - UnicodeDecodeError: 'utf-8' codec can't decode
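I wanted to convert an HTML file to plain text by calling BeautifulSoup from Python 3. The following very simple code, which opens a local file, kept failing:

    def load_data(file_path):
        # No encoding is given, so open() decodes with the platform default.
        # SATData (a Django model) and logger come from the surrounding project.
        with open(file_path, 'r') as pf:
            try:
                soup = BeautifulSoup(pf, "html.parser")
                table = soup.find('table')
                rownum = 0
                entry_list = []
                for row in table.findAll('tr'):
                    rownum += 1
                    if rownum != 1:  # skip the header row
                        col = row.findAll('td')
                        entry_list.append(SATData(
                            hostname=col[0].getText().strip(),
                            db_instance=col[1].getText().strip(),
                            sat_type=col[2].getText().strip(),
                            os_version=col[3].getText().strip(),
                            signoff_date=col[4].getText().strip(),
                            comment=col[5].getText().strip()))
                    if rownum % 500 == 0:  # flush to the database in batches
                        SATData.objects.bulk_create(entry_list)
                        entry_list = []
                        logger.info('Insert Data %d' % rownum)
                SATData.objects.bulk_create(entry_list)
                logger.info('Insert Data %d' % (rownum - 1))
            except Exception as e:
                logger.exception(str(e))

It raised this error:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 2127: invalid start byte

The problem is in reading the file, not in BeautifulSoup's parsing! The file is opened in text mode and decoded as UTF-8, but it contains the byte 0xa0, which is not a valid UTF-8 start byte (in ISO-8859-1 it is the non-breaking space).

So the thing to look at is the file read. Here is the fix straight away, in the same handful of lines:

    from bs4 import BeautifulSoup

    # State the file's real encoding explicitly instead of relying on the default.
    with open('index.html', 'r', encoding='iso-8859-1') as file:
        soup = BeautifulSoup(file, 'lxml')
    print(soup)

After that, soup.get_text() returns the text inside the tags.

For reference, this is the function that downloads the file with requests; it also prints the response headers and the encodings declared in the content:

    import requests

    def download_satdata(tz_today_str, target_file):
        url = 'http://....../Download_SAT_Inventory.asp'
        try:
            logger.info(tz_today_str + ': Start to download OAT_SATData.')
            r = requests.get(url, stream=True)
            test = r.headers
            # get_encodings_from_content() applies str regex patterns, so it needs
            # r.text; passing r.content (bytes) fails with the TypeError in note 3.
            t1 = requests.utils.get_encodings_from_content(r.text)
            print(test, t1)
            # Write the raw bytes unchanged, hence binary mode.
            with open(target_file, "wb") as code:
                code.write(r.content)
            logger.info(tz_today_str + ': Downloading OAT_SATData is completed.')
        except Exception as e:
            logger.exception(str(e))

See also:

http://xiaorui.cc/2016/02/19/%E4%BB%A3%E7%A0%81%E5%88%86%E6%9E%90python-requests%E5%BA%93%E4%B8%AD%E6%96%87%E7%BC%96%E7%A0%81%E9%97%AE%E9%A2%98/ (Code analysis: Chinese encoding problems in the Python requests library)
http://blog.csdn.net/a491057947/article/details/47292923 (Encoding problems when using requests in Python)

Notes:

1. How do you convert between bytes and str in Python 3?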
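   Use decode() to turn bytes into str, and encode() to turn str into bytes.

2. TypeError: write() argument must be str, not bytes

   Change the file open mode to 'wb+', which opens a binary file for reading and writing. (Plain 'wb' is enough if you only write.)

3. TypeError: cannot use a string pattern on a bytes-like object

   Open the file with 'rb+' and then decode the bytes with the right encoding. This usually comes up when the file is not UTF-8. (Plain 'rb' is enough if you only read.)

    f = open(fileName, "rb+")
    content = f.read().decode('gbk')

4. Files are normally opened in text mode, which means reads and writes go through some encoding. By default this is the platform's preferred encoding, which is UTF-8 on most Linux and macOS systems but can be a legacy code page on Windows. Adding 'b' to the mode opens the file in binary mode, where reads and writes deal in raw bytes.

   In text mode, platform-specific line endings (\r\n on Windows, \n on Linux) are converted to \n when reading, and \n is converted back to the platform's line ending when writing. This behind-the-scenes translation is very useful for text, but it corrupts binary files such as JPEG or EXE files, so always open those in binary mode.

Links:

1. https://*.com/questions/26612492/python-unicodedecodeerror-utf-8-codec-cant-decode-byte-invalid-continuati (Python: UnicodeDecodeError: 'utf-8' codec can't decode byte…invalid continuation byte)
2. http://blog.csdn.net/kelindame/article/details/75014485 (Python: converting Chinese text from iso-8859-1 to utf8)
3. http://blog.csdn.net/zm2714/article/details/8012474 (Reading and writing txt files in different encodings with Python)
4. http://outofmemory.cn/code-snippet/629/python-duxie-file-setting-file-charaeter-coding-biru-utf-8 (Reading and writing files in Python and setting the file's character encoding, e.g. utf-8)
5. https://www.cnblogs.com/dengyg200891/p/6059277.html (How to solve encoding problems when reading and writing files in Python 3)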
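
Appendix: as a rough illustration of notes 1 and 3, here is a minimal sketch that reads a file as raw bytes and tries a list of candidate encodings in order. The helper names decode_with_fallback and read_text_with_fallback, and the default candidate list, are assumptions made up for this example; they are not part of the code above or of any library. Keep in mind that iso-8859-1 maps every possible byte to a character, so it never fails and has to come last, and a successful decode does not prove the guess was right.

```python
from pathlib import Path


def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-1")):
    """Return (text, encoding_used) for the first encoding that decodes raw.

    Hypothetical helper for illustration only.
    """
    for name in encodings:
        try:
            # bytes -> str, as in note 1
            return raw.decode(name), name
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


def read_text_with_fallback(path, encodings=("utf-8", "iso-8859-1")):
    """Read a file in binary mode, then decode it with decode_with_fallback."""
    raw = Path(path).read_bytes()  # binary read: no decoding happens here
    return decode_with_fallback(raw, encodings)
```

For instance, decode_with_fallback(b"a\xa0b") cannot use UTF-8, because 0xa0 is not a valid start byte (the same failure as in the traceback), so it falls back to iso-8859-1. The returned text could then be handed to BeautifulSoup in place of the open file object.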