Fast, searchable dict storage for Python

Date: 2021-04-01 16:58:30

Currently I use SQLite (with SQLAlchemy) to store about 5,000 dict objects. Each dict corresponds to an entry on PyPI, with keys such as name, version, and summary (sometimes 'description' can be as large as the project's documentation).


Writing these entries (from JSON) back to disk (in SQLite format) takes several seconds, and it feels slow.


Writing is done as frequently as once a day, but reading/searching for a particular entry based on a key (usually name or description) is done very often.


Just like apt-get.


Is there a storage library for use with Python that will suit my needs better than SQLite?


4 Answers

#1


Did you put indices on name and description? Searching 5,000 indexed entries should be essentially instantaneous. (Of course, ORMs will make your life much harder, as they usually do, even relatively good ones such as SQLAlchemy; but try "raw" sqlite and it absolutely should fly.)


Writing just the updated entries (again with real SQL) should also be basically instantaneous. Ideally a single UPDATE statement should do it, but even a thousand should be no real problem; just make sure to turn off autocommit at the start of the loop (and, if you want, turn it back on again later).

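A minimal raw-sqlite3 sketch of both points (the table and column names here are illustrative, not taken from the question):

    import sqlite3

    conn = sqlite3.connect("pypi.db")

    # One-time setup: indices are what make key lookups on 5,000 rows instant.
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS packages (
            name TEXT, version TEXT, summary TEXT, description TEXT
        );
        CREATE INDEX IF NOT EXISTS idx_name ON packages (name);
        CREATE INDEX IF NOT EXISTS idx_description ON packages (description);
    """)

    # Daily write: batch all rows into one transaction instead of committing
    # each row separately; the single commit at the end is what makes it fast.
    entries = [("requests", "2.25.1", "HTTP for Humans", "...")]  # from the JSON
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            "INSERT INTO packages (name, version, summary, description) "
            "VALUES (?, ?, ?, ?)",
            entries,
        )

    # Frequent read: an indexed equality lookup.
    row = conn.execute(
        "SELECT * FROM packages WHERE name = ?", ("requests",)
    ).fetchone()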

#2


It might be overkill for your application, but you ought to check out schema-free/document-oriented databases. Personally I'm a fan of CouchDB. Basically, rather than storing records as rows in a table, something like CouchDB stores key-value pairs, and then (in the case of CouchDB) you write views in JavaScript to cull the data you need. These databases are usually easier to scale than relational databases, and in your case may be much faster, since you don't have to hammer your data into a shape that fits a relational database. On the other hand, it means there is another service running.

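A rough sketch using the third-party couchdb package, assuming a local CouchDB server; the database name, document fields, and view are illustrative:

    import couchdb

    server = couchdb.Server("http://localhost:5984/")
    db = server.create("pypi")  # one JSON document per package, no schema

    db.save({"_id": "requests", "name": "requests",
             "version": "2.25.1", "summary": "HTTP for Humans"})

    # A design document holding a JavaScript map view keyed on package name.
    db.save({
        "_id": "_design/packages",
        "views": {
            "by_name": {"map": "function(doc) { emit(doc.name, doc); }"}
        },
    })

    # Query the view for a particular key.
    for row in db.view("packages/by_name", key="requests"):
        print(row.value["summary"])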

#3


Given the approximate number of objects stated (around 5,000), SQLite is probably not the cause of the slowness. More likely it's the intermediate steps, for example the JSON handling or possibly non-optimal use of SQLAlchemy.


Try this out (fairly fast even for a million objects): y_serial.py module :: warehouse Python objects with SQLite

"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful 'standard' module for a database to store schema-less data."


http://yserial.sourceforge.net

The y_serial search on your keys is done using regular-expression ("regex") code on the SQLite side, not in Python, so there's another substantial speed improvement.

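A minimal sketch of the advertised usage, adapted from the module's documentation (the module version, file path, and table name are assumptions):

    import y_serial_v060 as y_serial  # single-file module from the site above

    demo = y_serial.Main("/tmp/pypi.sqlite")  # creates/opens the SQLite file

    # Insert any picklable Python object, annotated with notes for retrieval.
    entry = {"name": "requests", "version": "2.25.1",
             "summary": "HTTP for Humans"}
    demo.insert(entry, "#pypi requests", "packages")

    # Retrieve the most recent matching object by a regex over the notes.
    result = demo.select("requests", "packages")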

Let us know how it works out.


#4


I'm solving a very similar problem for myself right now, using Nucular, which might suit your needs. It's a file-system based storage and seems very fast indeed. (It comes with an example app that indexes the whole Python source tree.) It's safe for concurrent use, requires no external libraries, and is pure Python. It searches rapidly and has powerful full-text search, indexing, and so on: a kind of specialised, in-process, native Python dict store in the manner of the trendy CouchDB and MongoDB, but much lighter.


It does have limitations, though: it can't store or query nested dictionaries, so not every JSON value can be stored in it. Moreover, although its text searching is powerful, its numerical queries are weak and unindexed. Nonetheless, it may be precisely what you are after.

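For a flavour of the API, here is a sketch pieced together from the Nucular tutorial as I remember it; treat the exact names and signatures as assumptions and check the documentation:

    from nucular import Nucular

    # An archive is just a directory on the file system.
    session = Nucular.Nucular("/tmp/pypiArchive")
    session.create()  # one-time initialisation of the archive directory

    # Index a flat (non-nested) dictionary under a unique identity.
    session.indexDictionary(
        "requests", {"name": "requests", "summary": "HTTP for Humans"}
    )
    session.store(lazy=False)  # persist the index immediately

    # Full-text query over all indexed values.
    query = session.Query()
    query.anyWord("HTTP")
    result, status = query.evaluate()
    for identity in result.identities():
        print(identity, result.describe(identity))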
