Shelve is too slow for large dictionaries, what can I do to improve performance?

Time: 2021-08-02 06:43:26

I am storing a table using Python and I need persistence.

Essentially I am storing the table as a dictionary mapping strings to numbers, and the whole thing is stored with shelve:

self.DB = shelve.open("%s%sMoleculeLibrary.shelve" % (directory, os.sep), writeback=True)

I set writeback to True, as I found the system tends to be unstable if I don't.

After the computations the system needs to close the database and store it back. Now the database (the table) is about 540MB, and closing it is taking ages. The time exploded after the table grew to about 500MB. But I need a much bigger table. In fact I need two of them.

I am probably using the wrong form of persistence. What can I do to improve performance?

4 solutions

#1


13  

For storing a large dictionary of string : number key-value pairs, I'd suggest a JSON-native storage solution such as MongoDB. It has a wonderful Python API, PyMongo. MongoDB itself is lightweight and incredibly fast, and JSON objects are natively dictionaries in Python. This means that you can use your string key as the object ID, allowing for compressed storage and quick lookup.

As an example of how easy the code would be, see the following:

from pymongo import MongoClient

d = {'string1': 1, 'string2': 2, 'string3': 3}

client = MongoClient()  # connects to the local MongoDB server
db = client['example-database']
collection = db['example-collection']
for string, num in d.items():
    # use the string key as the document _id for compressed storage and quick lookup
    collection.replace_one({'_id': string}, {'_id': string, 'value': num}, upsert=True)

# testing
newD = {}
for obj in collection.find():
    newD[obj['_id']] = obj['value']
print(newD)
# output is: {'string1': 1, 'string2': 2, 'string3': 3}

In Python 3 the keys come back as str, so no conversion is needed; under Python 2 you'd just have to convert back from unicode, which is trivial.

#2


9  

Based on my experience, I would recommend using SQLite3, which comes with Python. It works well with larger databases and large key counts; millions of keys and gigabytes of data are not a problem. Shelve is totally wasted at that point. Having a separate database process isn't beneficial either; it just requires more context switches. In my tests I found that SQLite3 was the preferred option when handling larger data sets locally. Running a local database engine like mongo, mysql or postgresql doesn't provide any additional value, and they were also slower.

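A minimal sketch of what this could look like, using sqlite3 as a plain string-to-number key-value store (the file and table names here are just for illustration):

import sqlite3

# one local file, no server process to talk to
conn = sqlite3.connect('MoleculeLibrary.sqlite')
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value INTEGER)')

# upsert many pairs inside a single transaction for speed
data = {'string1': 1, 'string2': 2, 'string3': 3}
with conn:
    conn.executemany('INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)',
                     data.items())

# point lookup by key, served from the on-disk primary-key index
row = conn.execute('SELECT value FROM kv WHERE key = ?', ('string2',)).fetchone()
print(row[0])  # 2
conn.close()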

#3


1  

How much larger? What are the access patterns? What kinds of computation do you need to do on it?

Keep in mind that no matter how you do it, you are going to hit performance limits if you can't keep the table in memory.

You may want to look at moving to SQLAlchemy, or directly using something like bsddb, but both of those will sacrifice simplicity of code. However, with SQL you may be able to offload some of the work to the database layer, depending on the workload.

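As a rough illustration of offloading work to the database layer, assuming a kv(key, value) table like the one sketched in answer #2 (the file name is hypothetical): aggregations and filters can run inside SQLite's engine, so the full table never has to fit in Python's memory:

import sqlite3

conn = sqlite3.connect('MoleculeLibrary.sqlite')  # hypothetical file from the earlier sketch

# the scan and the sum happen inside SQLite, not in Python
total, count = conn.execute('SELECT SUM(value), COUNT(*) FROM kv').fetchone()
print(total, count)

# filters push down as WHERE clauses instead of iterating a dict in Python
for key, value in conn.execute('SELECT key, value FROM kv WHERE value > ?', (1,)):
    print(key, value)
conn.close()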

#4


0  

I think your problem is due to the fact that you use writeback=True. The documentation says (emphasis is mine):

Because of Python semantics, a shelf cannot know when a mutable persistent-dictionary entry is modified. By default modified objects are written only when assigned to the shelf (see Example). If the optional writeback parameter is set to True, all entries accessed are also cached in memory, and written back on sync() and close(); this can make it handier to mutate mutable entries in the persistent dictionary, but, if many entries are accessed, it can consume vast amounts of memory for the cache, and it can make the close operation very slow since all accessed entries are written back (there is no way to determine which accessed entries are mutable, nor which ones were actually mutated).

You could avoid using writeback=True and make sure the data is written only once (you have to pay attention that subsequent in-place modifications to mutable entries will be lost unless you assign them back to the shelf).

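A minimal sketch of that pattern, with hypothetical keys: with the default writeback=False, each entry is written when it is assigned, and updates to mutable values must be assigned back explicitly:

import shelve

# default writeback=False: no in-memory cache of accessed entries
db = shelve.open('MoleculeLibrary.shelve')  # hypothetical path
db['benzene'] = 42.0  # immutable value, written on assignment

# for a mutable value, mutate a local copy and assign it back,
# otherwise the change is silently lost
counts = db.get('counts', {})
counts['C6H6'] = 1
db['counts'] = counts

db.close()  # fast: there is no writeback cache to flush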

If you believe this is not the right storage option (it's difficult to say without knowing how the data is structured), I suggest sqlite3: it's included in Python (thus very portable) and performs very well, though it's somewhat more complicated than a simple key-value store.

See other answers for alternatives.
