I dump a dict object using json.dump. To avoid a UnicodeDecodeError, I set ensure_ascii=False, following this advice.
with open(my_file_path, "w") as f:
    f.write(json.dumps(my_dict, ensure_ascii=False))
The dump file is created successfully, but a UnicodeDecodeError occurs when loading it:
with open(my_file_path, "r") as f:
    return json.loads(f.read())
How can I avoid the UnicodeDecodeError when loading the dump file?
Error message and stacktrace
The error message is UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 0: invalid start byte and the stacktrace is:
/Users/name/.pyenv/versions/anaconda-2.0.1/python.app/Contents/lib/python2.7/json/__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
336 if (cls is None and encoding is None and object_hook is None and
337 parse_int is None and parse_float is None and
--> 338 parse_constant is None and object_pairs_hook is None and not kw):
339 return _default_decoder.decode(s)
340 if cls is None:
/Users/name/.pyenv/versions/anaconda-2.0.1/python.app/Contents/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
364 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
365 end = _w(s, end).end()
--> 366 if end != len(s):
367 raise ValueError(errmsg("Extra data", s, end, len(s)))
368 return obj
/Users/name/.pyenv/versions/anaconda-2.0.1/python.app/Contents/lib/python2.7/json/decoder.pyc in raw_decode(self, s, idx)
380 obj, end = self.scan_once(s, idx)
381 except StopIteration:
--> 382 raise ValueError("No JSON object could be decoded")
383 return obj, end
UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 0: invalid start byte
1 Solution
#1
In Python2, you could use ensure_ascii=False and decode the result before calling json.loads:
import json
my_dict = {b'\x93': [b'foo', b'\x93', {b'\x93': b'\x93'}]}
dumped = json.dumps(my_dict, ensure_ascii=False)
print(repr(dumped))
# '{"\x93": ["foo", "\x93", {"\x93": "\x93"}]}'
result = json.loads(dumped.decode('cp1252'))
print(result)
# {u'\u201c': [u'foo', u'\u201c', {u'\u201c': u'\u201c'}]}
However, note that the result returned by json.loads contains unicode, not strs. So the result is not exactly the same as my_dict.
Note that json.loads always decodes strings to unicode, so if you are interested in faithfully recovering the dict using json.dumps and json.loads, then you need to start with a dict which contains only unicode, no strs.
Moreover, in Python3 json.dumps requires all dicts to have keys which are unicode strings. So the above solution does not work in Python3.
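A minimal sketch of that Python3 behavior: passing a dict with bytes keys to json.dumps raises a TypeError rather than silently encoding them.

```python
import json

# In Python3, b'\x93' is bytes, not str, so it cannot be a JSON object key
try:
    json.dumps({b'\x93': [b'foo']})
    raised = False
except TypeError as exc:
    raised = True
    print('TypeError:', exc)
```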
An alternative which will work in both Python2 and Python3 is to make sure you pass json.dumps a dict whose keys and values are unicode (or contain no strs). For example, you can use convert (below) to recursively change the keys and values to unicode before passing them to json.dumps:
import json

def convert(obj, enc):
    # Python2: str here is a byte string; decode it to unicode
    if isinstance(obj, str):
        return obj.decode(enc)
    if isinstance(obj, (list, tuple)):
        return [convert(item, enc) for item in obj]
    if isinstance(obj, dict):
        return {convert(key, enc): convert(val, enc)
                for key, val in obj.items()}
    return obj
my_dict = {'\x93': ['foo', '\x93', {'\x93': '\x93'}]}
my_dict = convert(my_dict, 'cp1252')
dumped = json.dumps(my_dict)
print(repr(dumped))
# '{"\\u201c": ["foo", "\\u201c", {"\\u201c": "\\u201c"}]}'
result = json.loads(dumped)
print(result)
# {u'\u201c': [u'foo', u'\u201c', {u'\u201c': u'\u201c'}]}
assert result == my_dict
convert will decode all strs found in lists, tuples and dicts inside my_dict.
Above, I used 'cp1252' as the encoding since (as Fumu pointed out) '\x93' decoded with cp1252 is a LEFT DOUBLE QUOTATION MARK:
In [18]: import unicodedata as UDAT
In [19]: UDAT.name('\x93'.decode('cp1252'))
Out[19]: 'LEFT DOUBLE QUOTATION MARK'
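For reference, the same check written for Python3, where the byte must be spelled as a bytes literal before decoding:

```python
import unicodedata

# cp1252 maps byte 0x93 to U+201C
char = b'\x93'.decode('cp1252')
print(unicodedata.name(char))  # LEFT DOUBLE QUOTATION MARK
```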
If you know the strs in my_dict have been encoded in some other encoding, you should of course call convert using that encoding instead.
Even better, instead of using convert, take care to ensure all strs are decoded to unicode as you are building my_dict.
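As a sketch of that approach, here is a full round trip written for Python3 (where every str is already unicode); the file is opened with an explicit encoding on both sides, which is what the question's open calls were missing. io.open behaves the same way on Python2:

```python
import io
import json
import os
import tempfile

my_dict = {u'\u201c': [u'foo', u'\u201c', {u'\u201c': u'\u201c'}]}

# Write and read with the same explicit encoding so no implicit decode can fail
path = os.path.join(tempfile.mkdtemp(), 'dump.json')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(json.dumps(my_dict, ensure_ascii=False))
with io.open(path, 'r', encoding='utf-8') as f:
    result = json.loads(f.read())

print(result == my_dict)  # True: faithful round trip, no UnicodeDecodeError
```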