More specific dupe of 875228—Simple data storing in Python.
I have a rather large dict (6 GB) and I need to do some processing on it. I'm trying out several document clustering methods, so I need to have the whole thing in memory at once. I have other functions to run on this data, but the contents will not change.
Currently, every time I think of new functions I have to write them and then regenerate the dict. I'm looking for a way to write this dict to a file, so that I can load it into memory instead of recalculating all its values.
To oversimplify, it looks something like: {((('word','list'),(1,2),(1,3)),(...)):0.0, ....}
I feel that Python must have a better way than me looping through some string, looking for : and ( and trying to parse it into a dictionary.
6 answers
#1
60
Why not use Python's pickle? Python has a great serialization module called pickle, and it is very easy to use.
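import pickle  # on Python 2, "import cPickle as pickle" is the fast version
with open('save.p', 'wb') as f:
    pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
with open('save.p', 'rb') as f:
    obj = pickle.load(f)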
There are two disadvantages with pickle:
- It's not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
- The format is not human readable.
If you are using Python 2.6 or later, there is a built-in module called json. It is as easy to use as pickle:
import json
encoded = json.dumps(obj)
obj = json.loads(encoded)
The JSON format is human readable and very similar to the string representation of a dictionary in Python. It also doesn't have pickle's security issues, but it might be slower than cPickle.
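One caveat for a dict shaped like the one in the question: JSON object keys must be strings, so json.dumps raises TypeError on tuple keys, and tuples come back as lists. Below is a minimal sketch of one possible workaround, assuming Python 3. The helper names freeze, dump_pairs and load_pairs are made up for illustration. The dict is stored as a list of [key, value] pairs and the tuple keys are rebuilt on load.

```python
import json


def freeze(x):
    """Recursively turn lists (how JSON stores tuples) back into tuples."""
    if isinstance(x, list):
        return tuple(freeze(i) for i in x)
    return x


def dump_pairs(d):
    # JSON keys must be strings, so store [key, value] pairs instead.
    return json.dumps(list(d.items()))


def load_pairs(text):
    # Only the keys are frozen; values stay as JSON decoded them.
    return {freeze(k): v for k, v in json.loads(text)}


# Sample data shaped like the simplified dict in the question.
data = {(("word", "list"), (1, 2), (1, 3)): 0.0}
restored = load_pairs(dump_pairs(data))
```

Whether this beats pickle for a 6 GB dict is something to measure rather than assume; the sketch only shows that the structure survives a round trip.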
#2
12
I'd use shelve, json, yaml, or whatever, as suggested by other answers.
shelve is especially cool because you can keep the dict on disk and still use it. Values will be loaded on demand.
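A minimal sketch of the shelve approach, assuming Python 3.4 or later (where a shelf can be used in a with statement). shelve only accepts string keys, so this sketch stores repr() of each tuple key. That encoding, the sample data and the file location are assumptions for illustration, not part of the answer.

```python
import os
import shelve
import tempfile
from ast import literal_eval

# Sample data shaped like the simplified dict in the question.
data = {
    (("word", "list"), (1, 2), (1, 3)): 0.0,
    (("other",), (4, 5)): 1.5,
}

path = os.path.join(tempfile.mkdtemp(), "scores")  # hypothetical location

# Write once: shelve keys must be strings, so use repr() of each tuple key.
with shelve.open(path) as db:
    for key, value in data.items():
        db[repr(key)] = value

# Later runs: open the shelf and read single values on demand.
with shelve.open(path) as db:
    wanted = (("other",), (4, 5))
    score = db[repr(wanted)]
    keys = {literal_eval(k) for k in db}  # recover the tuple keys
```

Since the question asks for the whole dict in memory at once, shelve fits best when only part of the data is needed per run.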
But if you really want to parse the text of the dict, and it contains only strings, ints and tuples like you've shown, you can use ast.literal_eval to parse it. It is a lot safer than eval, since you can't evaluate full expressions with it. It only works with strings, numbers, tuples, lists, dicts, booleans, and None:
>>> import ast
>>> print(ast.literal_eval("{12: 'mydict', 14: (1, 2, 3)}"))
{12: 'mydict', 14: (1, 2, 3)}
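Applied to a file, the same idea could look like the sketch below, which assumes Python 3 and a dict made only of plain literals. The file name is hypothetical. The dict is written with repr() and read back with ast.literal_eval, which keeps tuple keys intact.

```python
import ast
import os
import tempfile

# Sample data shaped like the simplified dict in the question.
data = {(("word", "list"), (1, 2), (1, 3)): 0.0}

path = os.path.join(tempfile.mkdtemp(), "dict.txt")  # hypothetical file name

# Save: repr() of a dict of plain literals is valid Python literal syntax.
with open(path, "w") as f:
    f.write(repr(data))

# Load: literal_eval parses literals only and never runs arbitrary code.
with open(path) as f:
    restored = ast.literal_eval(f.read())
```

literal_eval builds a full syntax tree before producing the dict, so for input in the gigabyte range it is likely to be much slower and hungrier for memory than pickle.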
#3
4
I would suggest that you use YAML for your file format so you can tinker with it on disk.
What it looks like:
- It is indentation-based
- It can represent dictionaries and lists
- It is easy for humans to understand
An example: This block of code is an example of YAML (a dict holding a list and a string)
Full syntax: http://www.yaml.org/refcard.html
To get it in Python, just easy_install pyyaml (or, with current tooling, pip install pyyaml). See http://pyyaml.org/
It comes with easy file save/load functions that I can't remember right this minute.
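For reference, PyYAML's save and load functions include yaml.safe_dump and yaml.safe_load; safe_load only builds plain types such as dicts, lists, strings and numbers. A minimal sketch, assuming PyYAML is installed. The sample data (a dict holding a list and a string) and the file name are made up for illustration, and the import is guarded only so the snippet still runs where PyYAML is missing.

```python
import os
import tempfile

try:
    import yaml  # third-party PyYAML: pip install pyyaml
except ImportError:
    yaml = None  # lets the sketch run cleanly where PyYAML is missing

# A dict holding a list and a string (illustrative data).
data = {"name": "clusters", "words": ["word", "list"]}

path = os.path.join(tempfile.mkdtemp(), "data.yaml")  # hypothetical file name

if yaml is not None:
    # Save: block style (one item per line) is the easiest to edit by hand.
    with open(path, "w") as f:
        yaml.safe_dump(data, f, default_flow_style=False)

    # Load: safe_load reads the file back into plain Python objects.
    with open(path) as f:
        restored = yaml.safe_load(f)
```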
#4
0
Write it out in a serialized format, such as pickle (a Python standard library module for serialization), or perhaps JSON (a text representation that can be parsed to produce the in-memory representation again).
#5
0
This solution at SourceForge uses only standard Python modules:
y_serial.py module :: warehouse Python objects with SQLite
"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."
http://yserial.sourceforge.net
The compression bonus will probably reduce your 6 GB dictionary to 1 GB. If you do not want to store a series of dictionaries, the module also contains a file.gz solution, which might be more suitable given your dictionary size.
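The general idea in the description (pickle an object, compress it, keep it in SQLite under a label) can be sketched with the standard library alone. This is not y_serial's actual API; the table layout and the names open_store, save_object and load_latest are invented for illustration, and Python 3 is assumed.

```python
import pickle
import sqlite3
import zlib


def open_store(path):
    """Open (or create) a SQLite file with one table of labelled objects."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS objects ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, label TEXT, payload BLOB)"
    )
    return conn


def save_object(conn, label, obj):
    """Pickle, compress and insert obj under the given label."""
    payload = zlib.compress(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL))
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO objects (label, payload) VALUES (?, ?)",
            (label, payload),
        )


def load_latest(conn, label):
    """Return the most recently saved object stored under label."""
    row = conn.execute(
        "SELECT payload FROM objects WHERE label = ? ORDER BY id DESC LIMIT 1",
        (label,),
    ).fetchone()
    if row is None:
        raise KeyError(label)
    return pickle.loads(zlib.decompress(row[0]))


conn = open_store(":memory:")  # use a file path to persist between runs
data = {(("word", "list"), (1, 2), (1, 3)): 0.0}
save_object(conn, "clusters", data)
restored = load_latest(conn, "clusters")
```

SQLite limits a single BLOB to roughly 1 GB by default, so a dict of the size in the question might have to be split across several rows even after compression.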
#6
0
Here are a few alternatives depending on your requirements:
- numpy stores your plain data in a compact form and performs group/mass operations well
- shelve is like a large dict backed by a file
- some 3rd-party storage module, e.g. stash, stores arbitrary plain data
- a proper database, e.g. mongodb for hairy data, or mysql or sqlite for plain data and faster retrieval