I dump a dict object using json.dump. To avoid a UnicodeDecodeError, I set ensure_ascii=False, following this advice.
with open(my_file_path, "w") as f:
    f.write(json.dumps(my_dict, ensure_ascii=False))
The dump file is created successfully, but a UnicodeDecodeError occurs when loading it:
with open(my_file_path, "r") as f:
    return json.loads(f.read())
How can I avoid the UnicodeDecodeError when loading the dump file?
Error message and stacktrace
The error message is UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 0: invalid start byte and the stacktrace is:
/Users/name/.pyenv/versions/anaconda-2.0.1/python.app/Contents/lib/python2.7/json/__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
336 if (cls is None and encoding is None and object_hook is None and
337 parse_int is None and parse_float is None and
--> 338 parse_constant is None and object_pairs_hook is None and not kw):
339 return _default_decoder.decode(s)
340 if cls is None:
/Users/name/.pyenv/versions/anaconda-2.0.1/python.app/Contents/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
364 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
365 end = _w(s, end).end()
--> 366 if end != len(s):
367 raise ValueError(errmsg("Extra data", s, end, len(s)))
368 return obj
/Users/name/.pyenv/versions/anaconda-2.0.1/python.app/Contents/lib/python2.7/json/decoder.pyc in raw_decode(self, s, idx)
380 obj, end = self.scan_once(s, idx)
381 except StopIteration:
--> 382 raise ValueError("No JSON object could be decoded")
383 return obj, end
UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 0: invalid start byte
1 Solution
#1
In Python2, you could use ensure_ascii=False and decode the result before calling json.loads:
import json
my_dict = {b'\x93': [b'foo', b'\x93', {b'\x93': b'\x93'}]}
dumped = json.dumps(my_dict, ensure_ascii=False)
print(repr(dumped))
# '{"\x93": ["foo", "\x93", {"\x93": "\x93"}]}'
result = json.loads(dumped.decode('cp1252'))
print(result)
# {u'\u201c': [u'foo', u'\u201c', {u'\u201c': u'\u201c'}]}
However, note that the result returned by json.loads contains unicode, not strs. So the result is not exactly the same as my_dict.
Note that json.loads always decodes strings to unicode, so if you are interested in faithfully recovering the dict using json.dumps and json.loads, then you need to start with a dict which contains only unicode, no strs.
Moreover, in Python3 json.dumps requires all dicts to have keys which are unicode strings. So the above solution does not work in Python3.
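A minimal sketch of that Python3 behavior: passing a dict with bytes keys to json.dumps raises a TypeError rather than silently encoding them.

```python
import json

# In Python3, b'\x93' is bytes, not str, so it cannot be a JSON object key
try:
    json.dumps({b'\x93': [b'foo']})
    raised = False
except TypeError as exc:
    raised = True
    print('TypeError:', exc)
```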
An alternative which will work in both Python2 and Python3 is to make sure you pass json.dumps a dict whose keys and values are unicode (or contain no strs). For example, you can use convert (below) to recursively change the keys and values to unicode before passing them to json.dumps:
import json

def convert(obj, enc):
    # Python2: str here is a byte string; decode it to unicode
    if isinstance(obj, str):
        return obj.decode(enc)
    if isinstance(obj, (list, tuple)):
        return [convert(item, enc) for item in obj]
    if isinstance(obj, dict):
        return {convert(key, enc): convert(val, enc)
                for key, val in obj.items()}
    return obj
my_dict = {'\x93': ['foo', '\x93', {'\x93': '\x93'}]}
my_dict = convert(my_dict, 'cp1252')
dumped = json.dumps(my_dict)
print(repr(dumped))
# '{"\\u201c": ["foo", "\\u201c", {"\\u201c": "\\u201c"}]}'
result = json.loads(dumped)
print(result)
# {u'\u201c': [u'foo', u'\u201c', {u'\u201c': u'\u201c'}]}
assert result == my_dict
convert will decode all strs found in lists, tuples and dicts inside my_dict.
Above, I used 'cp1252' as the encoding since (as Fumu pointed out) '\x93' decoded with cp1252 is a LEFT DOUBLE QUOTATION MARK:
In [18]: import unicodedata as UDAT
In [19]: UDAT.name('\x93'.decode('cp1252'))
Out[19]: 'LEFT DOUBLE QUOTATION MARK'
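For reference, the same check written for Python3, where the byte must be spelled as a bytes literal before decoding:

```python
import unicodedata

# cp1252 maps byte 0x93 to U+201C
char = b'\x93'.decode('cp1252')
print(unicodedata.name(char))  # LEFT DOUBLE QUOTATION MARK
```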
If you know the strs in my_dict have been encoded in some other encoding, you should of course call convert using that encoding instead.
Even better, instead of using convert, take care to ensure all strs are decoded to unicode as you are building my_dict.
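As a sketch of that approach, here is a full round trip written for Python3 (where every str is already unicode); the file is opened with an explicit encoding on both sides, which is what the question's open calls were missing. io.open behaves the same way on Python2:

```python
import io
import json
import os
import tempfile

my_dict = {u'\u201c': [u'foo', u'\u201c', {u'\u201c': u'\u201c'}]}

# Write and read with the same explicit encoding so no implicit decode can fail
path = os.path.join(tempfile.mkdtemp(), 'dump.json')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(json.dumps(my_dict, ensure_ascii=False))
with io.open(path, 'r', encoding='utf-8') as f:
    result = json.loads(f.read())

print(result == my_dict)  # True: faithful round trip, no UnicodeDecodeError
```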