Most space-efficient way to store millions of simple data records?

Time: 2021-05-12 23:49:41

My data looks like this:

00000000001 : '12341234...12341234'

Basically a unique id value associated with a big string of numbers (less than 100 chars).

I want to store 10's of millions and maybe even 100's of millions of these pieces of data, just IDs pointing to big number strings. I am wondering what the most space efficient way to store them is and I also want to keep in mind a quick look up time as well. I want my application to be given a number like 550,000 and be able to quickly reference the big string of numbers associated with it.

I have looked at open source DBs as an option (MySQL) and I also considered something like JSON or XML. Are there other options? What would be best?

The reason I am uncertain is because the data is so simple. I am afraid of using certain databases because some are relational or object oriented, but I don't have a need for those features (there might be overhead here). I am also afraid my data is too simple and repetitive for something like JSON too because I feel like much of the file space will be consumed by repeating "id" : and "bignumber" : over and over.

Any suggestions?

4 solutions

#1


3  

It looks like both id and value are integer values, so storing them as binary data (as opposed to strings) would save a lot of space. This rules out JSON or XML, which are text-based.
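
To make the saving concrete, here is a minimal sketch comparing the two encodings. It assumes the values are plain decimal integers with no significant leading zeros (an integer encoding would drop them); the sample value just repeats the "1234" pattern from the question.

```python
# Compare the size of a 100-digit number stored as text versus as a binary integer.
value_text = "1234" * 25            # 100 decimal digits, mimicking the question's sample
value_int = int(value_text)

# Number of whole bytes needed to hold the integer in binary form.
byte_len = (value_int.bit_length() + 7) // 8
value_bin = value_int.to_bytes(byte_len, "big")

text_size = len(value_text.encode("ascii"))  # one byte per digit
binary_size = len(value_bin)                 # roughly 0.415 bytes per digit

print(text_size, binary_size)

# Round trip: the binary form decodes back to the same digits.
assert str(int.from_bytes(value_bin, "big")) == value_text
```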

I think you want to use a key-value store, such as BerkeleyDB. They allow fast lookup by key (but nothing else).
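
As a rough illustration of the key-value approach, the sketch below uses Python's standard `dbm` module as a stand-in. This is not BerkeleyDB itself (which backend `dbm` picks depends on the installation), and `id_key` is a hypothetical helper, but the access pattern of binary key in, binary value out is the same idea.

```python
import dbm
import os
import tempfile


def id_key(record_id: int) -> bytes:
    """Encode an id as a fixed 8-byte big-endian key (hypothetical helper)."""
    return record_id.to_bytes(8, "big")


big_number = int("1234" * 25)  # sample 100-digit value

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "numbers")

    # Write: both key and value are stored as raw bytes, not text.
    with dbm.open(path, "c") as db:
        db[id_key(550_000)] = big_number.to_bytes(42, "big")

    # Read: look the value up directly by key.
    with dbm.open(path, "r") as db:
        looked_up = int.from_bytes(db[id_key(550_000)], "big")

assert looked_up == big_number
```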

Using something like SQLite would also have very little overhead and allow for convenient access methods.

It is also important that you can access the data without first reading all of it into memory. Database engines manage that for you; with JSON or a hand-rolled format this can be a lot of work.

If you do not need network access (but want to work on local files), an embedded database system like BerkeleyDB or SQLite seems to be the best fit. Not having a server also greatly reduces the setup overhead.
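
For the SQLite route, something like the following sketch would do. The table and column names are made up for illustration, and an in-memory database is used only so the example is self-contained; a real setup would pass a file path to `sqlite3.connect`.

```python
import sqlite3

# ":memory:" keeps the example self-contained; use a file path for real data.
conn = sqlite3.connect(":memory:")

# INTEGER PRIMARY KEY makes the id the table's row key, so lookups by id
# go straight through SQLite's B-tree without a separate index.
conn.execute("CREATE TABLE numbers (id INTEGER PRIMARY KEY, value BLOB NOT NULL)")

big_number = int("1234" * 25)  # sample 100-digit value
conn.execute(
    "INSERT INTO numbers (id, value) VALUES (?, ?)",
    (550_000, big_number.to_bytes(42, "big")),  # store as a 42-byte binary blob
)
conn.commit()

row = conn.execute("SELECT value FROM numbers WHERE id = ?", (550_000,)).fetchone()
result = int.from_bytes(row[0], "big")
conn.close()

assert result == big_number
```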

#2


3  

I think the most efficient way to store this data would be to omit the "id" and just store your big numbers in fixed format. You would need about 42 bytes to store numbers with 100 digits or less and you could easily lookup the number you're after by multiplying "id" by 42 and going straight to the offset where your number is stored.
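
The 42 comes from 100 × log2(10) ≈ 332.2 bits, which rounds up to 42 bytes. A minimal sketch of the offset scheme follows, assuming ids are dense and start at 0, and that the values are integers without significant leading zeros. `write_records` and `read_record` are hypothetical helper names, and an in-memory buffer stands in for the data file.

```python
import io
import math

# Bytes needed for any number of up to 100 decimal digits.
RECORD_SIZE = math.ceil(100 * math.log2(10) / 8)


def write_records(f, values):
    """Write each value as one fixed-width big-endian record, in id order."""
    for v in values:
        f.write(v.to_bytes(RECORD_SIZE, "big"))


def read_record(f, record_id):
    """Seek straight to id * RECORD_SIZE and decode the record stored there."""
    f.seek(record_id * RECORD_SIZE)
    return int.from_bytes(f.read(RECORD_SIZE), "big")


# io.BytesIO stands in for a file opened in binary mode.
store = io.BytesIO()
write_records(store, [7, int("1234" * 25), 10**100 - 1])

assert read_record(store, 1) == int("1234" * 25)
```

With a real file the same `seek` reads only the 42 bytes of the requested record, so the lookup cost does not grow with the number of records.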

#3


1  

MySQL or similar will handle a lot of details for you. SQLite might be good too as you don't need that many features.

An integer field and a text field would work, but you can pack more data into a binary blob, packing and unpacking as necessary. I'd probably encode them two digits to a byte, though you could do better if you want to deal with bit shifts and such.
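
One way to read "two digits to a byte" is packed decimal: one digit per 4-bit nibble. The sketch below is just that interpretation, with hypothetical helper names. An odd-length string is padded with the nibble 0xF, which also means leading zeros survive the round trip.

```python
PAD = 0xF  # filler nibble for odd-length input; never a valid decimal digit


def pack_digits(digits: str) -> bytes:
    """Pack a string of ASCII decimal digits two to a byte, one per nibble."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a non-empty string of ASCII digits")
    nibbles = [int(c) for c in digits]
    if len(nibbles) % 2:
        nibbles.append(PAD)
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[::2], nibbles[1::2]))


def unpack_digits(blob: bytes) -> str:
    """Reverse pack_digits, dropping the padding nibble if present."""
    out = []
    for byte in blob:
        hi, lo = byte >> 4, byte & 0x0F
        out.append(str(hi))
        if lo != PAD:
            out.append(str(lo))
    return "".join(out)


packed = pack_digits("0012345")
assert unpack_digits(packed) == "0012345"
```

This needs 50 bytes for 100 digits, versus about 42 for a plain binary integer, but it is simpler to inspect and keeps the exact digit string.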

As @gordy suggests, if all your values have lots of digits, you might do better with a fixed row size for everything as it'll be faster for lookups. Use variable width if size is more important.

If your data is going to be read-only, you might try compressing it with MySQL's ARCHIVE storage engine.

http://dev.mysql.com/doc/refman/5.1/en/archive-storage-engine.html

#4


0  

Any old database should work fine, from BDB (or more modern variants such as Redis or Tokyo Cabinet) to standard SQL DBs like MySQL or Postgres. My own favorite among the latter is H2, a simple but reasonably performant and nicely embeddable SQL DB.

For basic storage alone, the range of choices is wider: XML/JSON (often compressed with gzip) is fine, but if you do need id lookups, a database makes more sense.
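
A small sketch of the gzip-plus-JSON option, with made-up sample data, shows both sides of that trade-off. The repetitive text compresses well, but fetching a single id still means decompressing and parsing the whole file.

```python
import gzip
import json

# Hypothetical sample: 1000 ids, each mapped to a 100-digit string.
records = {str(i): "1234" * 25 for i in range(1000)}

raw = json.dumps(records).encode("utf-8")
packed = gzip.compress(raw)
print(len(raw), len(packed))  # uncompressed vs compressed size in bytes

# To look up one id, everything has to be decompressed and parsed first.
restored = json.loads(gzip.decompress(packed))
value = restored["550"]

assert value == "1234" * 25
```

Real data will be far less repetitive than this sample, so the compression ratio here is only illustrative.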
