Data storage backend for an Erlang application when the data doesn't fit in memory

I'm researching options for organizing data storage for an Erlang application. The data is essentially a huge collection of binary blobs indexed by short string ids. Each blob is under 10 KB, but there are many of them; I expect the total size to reach up to 200 GB, so it obviously cannot fit into memory. The typical operations are reading a blob by its id, updating a blob by its id, or adding a new one. During any given period of the day only a subset of ids is in use, so storage access might benefit from an in-memory cache. Performance is quite critical: the target is around 500 reads and 500 updates per second on commodity hardware (say, an EC2 VM).

Any suggestions on what to use here? As I understand it, dets is out of the question as it is limited to 2 GB (or was it 4 GB?). Mnesia is probably out of the question too; my impression is that it was mainly designed for cases where the data fits in memory. I'm considering trying EDTK's Berkeley DB driver for the task. Would it work in the above scenario? Does anybody have experience using it in production under similar conditions?
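
As a rough illustration of the in-memory cache idea mentioned in the question, here is a minimal read-through / write-through cache sketch built on ETS. The module name `blob_cache` and the `FetchFun` / `StoreFun` callbacks are hypothetical; they stand in for whichever on-disk backend ends up being chosen. The sketch has no eviction, so a real cache would also need a size bound.

```erlang
-module(blob_cache).
-export([new/0, read/3, write/4]).

%% Create an unnamed, public ETS table holding {Id, Blob} pairs.
%% The table is owned by the calling process.
new() ->
    ets:new(blob_cache, [set, public]).

%% Read-through: serve from ETS if present, otherwise call the
%% (hypothetical) FetchFun(Id) -> {ok, Blob} | {error, Reason}
%% and remember a successful result.
read(Cache, Id, FetchFun) ->
    case ets:lookup(Cache, Id) of
        [{Id, Blob}] ->
            {ok, Blob};
        [] ->
            case FetchFun(Id) of
                {ok, Blob} = Ok ->
                    true = ets:insert(Cache, {Id, Blob}),
                    Ok;
                Error ->
                    Error
            end
    end.

%% Write-through: persist first via the (hypothetical)
%% StoreFun(Id, Blob) -> ok | {error, Reason}, then update the cache.
write(Cache, Id, Blob, StoreFun) ->
    case StoreFun(Id, Blob) of
        ok ->
            true = ets:insert(Cache, {Id, Blob}),
            ok;
        Error ->
            Error
    end.
```

Persisting before caching means a failed write never leaves a blob in memory that is missing on disk.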

5 Answers

#1

tcerl was created in response to the same size limit. I'm not using Erlang these days, but it sounds like what you're looking for.

#2

Have you looked at what CouchDB is doing? It might not be quite what you are after as a drop-in product, but there is a lot of Erlang code in there for storing data. There is also some talk of providing a native Erlang interface instead of the REST API.

#3

Is there any reason why you can't just use a file system, treating the file name as your string id and the file contents as the binary blob? You can choose a file system that fits your performance requirements, and you should get caching basically for free from your OS.
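
A minimal sketch of that idea in Erlang, under a few assumptions that are not part of the answer: the module name `blob_fs`, the hex-encoded file names (so that arbitrary ids are safe as file names), and the 256-way directory fan-out via `erlang:phash2/2` (so that no single directory grows too large) are all illustrative choices.

```erlang
-module(blob_fs).
-export([write/3, read/2]).

%% Map an id to Root/<shard>/<hex-encoded id>.
%% The shard directory is picked by hashing the id into 256 buckets.
path(Root, Id) when is_binary(Id) ->
    Hex = lists:flatten([io_lib:format("~2.16.0b", [B]) || <<B>> <= Id]),
    Shard = lists:flatten(io_lib:format("~2.16.0b", [erlang:phash2(Id, 256)])),
    filename:join([Root, Shard, Hex]).

%% Write to a temporary file, then rename it over the target so that
%% readers see either the old or the new blob, not a partial write.
%% Concurrent writers to the same id would share the temporary name,
%% so this assumes one writer per id at a time.
write(Root, Id, Blob) when is_binary(Blob) ->
    Path = path(Root, Id),
    Tmp = Path ++ ".tmp",
    ok = filelib:ensure_dir(Path),
    ok = file:write_file(Tmp, Blob),
    file:rename(Tmp, Path).

%% Returns {ok, Blob} or {error, Reason} (enoent for an unknown id).
read(Root, Id) ->
    file:read_file(path(Root, Id)).
```

For example, `blob_fs:write("/var/data/blobs", <<"user42">>, Blob)` followed by `blob_fs:read("/var/data/blobs", <<"user42">>)` would round-trip the blob; the root directory here is a made-up path.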

#4

Mnesia can store data on disk just fine. There's also dets (disk-based term storage), which is roughly analogous to Berkeley DB. It's in the standard library: http://www.erlang.org/doc/apps/stdlib/index.html
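
For reference, basic dets usage looks roughly like the sketch below. The module name `blob_dets`, the table name, and the sample key and value are illustrative only. Note that a single dets file cannot exceed 2 GB, which is the limit the question refers to, so one table on its own would not hold the full data set.

```erlang
-module(blob_dets).
-export([demo/1]).

%% Open (or create) a dets file, store one blob under a binary id,
%% read it back, and close the table again.
demo(File) ->
    {ok, Tab} = dets:open_file(blob_store, [{file, File}, {type, set}]),
    ok = dets:insert(Tab, {<<"id1">>, <<"payload">>}),
    Result = dets:lookup(Tab, <<"id1">>),
    ok = dets:close(Tab),
    Result.
```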

#5

I would recommend Apache CouchDB.

It's a great fit for Erlang, and from the sound of it (you mention ID-based blobs and don't mention any relational requirements) you're looking for a document-oriented database.

Since the interface is REST, you can very simply add a commodity HTTP cache in front of it if you need caching.

The documentation for CouchDB is of a very high quality.

It also has built-in Map-Reduce :)
