Caching a large non-Unicode dictionary in a database?

Time: 2021-04-30 08:18:09

I have a large dictionary (as a string it comes to 366MB, a text file of ~383764153 lines) that I want to store in a database for fast access, and to skip the computation time involved in populating the dictionary.

My dictionary is a dictionary of dictionaries of filename/contents pairs. A small subset:

{
    'Reuters/19960916': {
        '54826newsML': '<?xml version="1.0"
encoding="iso-8859-1" ?>\r\n<newsitem itemid="54826" id="root"
date="1996-09-16" xml:lang="en">\r\n<title>USA: RESEARCH ALERT -
Crestar Financial cut.</title>\r\n<headline>RESEARCH ALERT - Crestar
Financial cut.</headline>\r\n<text>\n<p>-- Salomon Brothers analyst
Carole Berger said she cut her rating on Crestar Financial Corp to
hold from buy, at the same time lowering her 1997 earnings per share
view to $5.40 from $5.85.</p>\n<p>-- Crestar said it would buy
Citizens Bancorp in a $774 million stock swap.</p>\n<p>-- Crestar
shares were down 2-1/2 at 58-7/8. Citizens Bancorp soared 14-5/8 to
46-7/8.</p>\n</text>\r\n<copyright>(c) Reuters Limited',
        '55964newsML': '<?xml version="1.0" encoding="iso-8859-1"
?>\r\n<newsitem itemid="55964" id="root" date="1996-09-16"
xml:lang="en">\r\n<title>USA: Nebraska cattle sales thin at
$114/dressed-feedlot.</title>\r\n'
    }
}

I thought MongoDB would be a good fit, but it looks like it requires both keys and values to be Unicode, and since I am grabbing the filenames from namelist() on ZipFile, they are not guaranteed to be Unicode.

How would you recommend I serialise this dictionary into a database?

2 answers

#1


5  

pymongo doesn't require strings to be unicode; it sends ASCII strings as is and encodes unicode strings to UTF-8. When retrieving data from pymongo, you always get unicode. See http://api.mongodb.org/python/2.0/tutorial.html#a-note-on-unicode-strings

If your input contains "international" byte strings with high-order bytes (like ab\xC3cd) you need to convert these strings to unicode or encode them as UTF-8. Here's a simple recursive converter that handles arbitrary nested dicts:

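def unicode_all(s, encoding='iso-8859-1'):
    # Python 2. The default encoding is an assumption; use whatever your data is in.
    if isinstance(s, dict):
        return dict((unicode_all(k, encoding), unicode_all(v, encoding)) for k, v in s.items())
    if isinstance(s, list):
        return [unicode_all(v, encoding) for v in s]
    if isinstance(s, str):
        # Decode byte strings explicitly: plain unicode(s) assumes ASCII
        # and raises UnicodeDecodeError on high-order bytes.
        return s.decode(encoding)
    return unicode(s)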
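
The converter above is Python 2 code. As a rough sketch of the same idea on Python 3, where byte strings are `bytes` and text is `str`, something like the following might work. The name `decode_all` and the `iso-8859-1` default are my assumptions (the default is only a guess based on the encoding declared in the sample XML), so substitute the encoding your zip entries actually use.

```python
def decode_all(obj, encoding="iso-8859-1"):
    """Recursively decode bytes keys and values in nested dicts/lists to str.

    The default encoding is an assumption, not something the data guarantees.
    """
    if isinstance(obj, dict):
        return {decode_all(k, encoding): decode_all(v, encoding)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [decode_all(v, encoding) for v in obj]
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    # Anything else (str, int, None, ...) is returned unchanged.
    return obj


# Hypothetical miniature of the structure from the question.
sample = {b"Reuters/19960916": {b"54826newsML": b"ab\xc3cd"}}
decoded = decode_all(sample)
```

After this, every key and value in `decoded` is a `str`, which is the form a MongoDB driver expects for text.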

#2


0  

If you have the RAM (and you apparently do, because you populated the dictionary to begin with), use cPickle. If you want something that requires less RAM but is slower, use shelve.
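
To make the two options concrete, here is a minimal sketch assuming Python 3, where `cPickle` has been folded into the standard `pickle` module. The file names and the tiny `data` dictionary are made up for illustration. With pickle the whole dictionary is written and read back in one go; with shelve each top-level key is stored separately, so a lookup only loads that one entry.

```python
import os
import pickle
import shelve
import tempfile

# Hypothetical miniature of the dictionary from the question.
data = {
    "Reuters/19960916": {
        "54826newsML": "<title>USA: RESEARCH ALERT - Crestar Financial cut.</title>",
        "55964newsML": "<title>USA: Nebraska cattle sales thin at $114/dressed-feedlot.</title>",
    }
}

workdir = tempfile.mkdtemp()

# Option 1: pickle. The entire dictionary is loaded into RAM at once.
pickle_path = os.path.join(workdir, "corpus.pickle")
with open(pickle_path, "wb") as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
with open(pickle_path, "rb") as f:
    restored = pickle.load(f)

# Option 2: shelve. A dict-like object backed by a file on disk;
# each value is unpickled only when its key is accessed.
shelf_path = os.path.join(workdir, "corpus.shelf")
with shelve.open(shelf_path) as db:
    for folder, files in data.items():
        db[folder] = files
with shelve.open(shelf_path) as db:
    one_folder = db["Reuters/19960916"]
```

Note that shelve keys must be strings, so byte-string filenames would need decoding before being used as keys.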
