Cost of keys in JSON document databases (MongoDB, Elasticsearch)

Time: 2022-08-25 13:50:01

I would like to know whether anyone has experience with how the size of JSON keys affects speed or optimization in a document store database such as MongoDB or Elasticsearch.

For example, I have two documents:

doc1: { keeeeeey1: 'abc', keeeeeeey2: 'xyz' }

doc2: { k1: 'abc', k2: 'xyz' }

Let's say I have 10 million records. Storing the data in the doc1 format would then take up more database file space than storing it in the doc2 format.

Other than that, what are the disadvantages or negative effects in terms of speed, RAM, or any other optimization?

1 solution

#1


1  

You correctly noticed that the documents will differ in size. If you adopt the second schema, you will save at least 15 bytes per document (60% for similar documents). For your 10 million records, this adds up to something like 140 MB. This gives you the following advantages:

  • HDD savings. The only problem is that, at current HDD prices, a saving of this size hardly matters.

  • RAM savings. Compared with hard disks, this one can actually be useful. In MongoDB the working set (the indexes plus the frequently accessed documents) should fit in RAM to achieve good performance. Index entries hold field values rather than field names, so the saving comes from the documents themselves: you not only save about 140 MB of HDD space but also up to about 140 MB of RAM for the cached documents (which is actually noticeable).

  • I/O. Many bottlenecks are caused by the limits of the input/output system (the speed of reading from and writing to disk is limited). For your documents, this means that with schema 2 you can potentially read/write more documents per second, roughly in proportion to the size reduction.

  • Network. In many situations the network is even slower than disk I/O, and if your DB server is on a different machine than your application server, the data has to be sent over the wire. Smaller documents mean you can send correspondingly more of them in the same time.
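
To put rough numbers on the points above, here is a minimal sketch that estimates the BSON size of the two example documents and the total saving for 10 million records. The helper name `bsonSizeOfStringDoc` is made up for this illustration, and the estimate assumes flat documents with string values only: it ignores the `_id` field that MongoDB adds, as well as any compression or padding done by the storage engine, so real numbers will differ.

```javascript
// Rough BSON size of a flat document whose values are all strings (an assumption).
// Layout per the BSON spec: int32 total length, the elements, then a 0x00 terminator.
// A string element is: 1 type byte, the key as a cstring (key bytes + 0x00),
// an int32 value length, the value bytes, and a trailing 0x00.
function bsonSizeOfStringDoc(doc) {
  const utf8 = new TextEncoder();
  let size = 4 + 1; // document length prefix + document terminator
  for (const [key, value] of Object.entries(doc)) {
    size += 1 + utf8.encode(key).length + 1;   // type byte + key + key terminator
    size += 4 + utf8.encode(value).length + 1; // value length prefix + value + terminator
  }
  return size;
}

// The two documents from the question (without the _id MongoDB would add).
const doc1 = { keeeeeey1: "abc", keeeeeeey2: "xyz" };
const doc2 = { k1: "abc", k2: "xyz" };

const size1 = bsonSizeOfStringDoc(doc1);
const size2 = bsonSizeOfStringDoc(doc2);
const savedPerDoc = size1 - size2; // only the key names differ
const documentCount = 10_000_000;
const savedTotalMiB = (savedPerDoc * documentCount) / (1024 * 1024);

console.log({
  size1,
  size2,
  savedPerDoc,
  savedTotalMiB: savedTotalMiB.toFixed(1),
  sizeRatio: (size1 / size2).toFixed(2), // upper bound on "more documents per second"
});
```

Since only the key names differ, the per-document saving is simply the difference in key lengths, and the size ratio gives a rough upper bound on how many more documents fit through the same disk or network bandwidth.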

Having covered the advantages, I also have to mention the disadvantages of short keys:

  • Readability of the database. When you run db.coll.findOne() and see {_id: 1, t: 13423, a: 3, b: 0.2}, it is pretty hard to understand what exactly is stored there.

  • Readability of the application. This is similar to the database, but at least here there is a solution: with mapping logic that translates currentDate to c and price to p, you can write clean code and still have a short schema.
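
The mapping idea could look something like the sketch below. This is only one possible approach, not a feature of any driver: the names `FIELD_TO_KEY`, `renameKeys`, `toStored`, and `fromStored` are made up for this example, and it only handles top-level fields.

```javascript
// Hypothetical mapping between readable application names and short stored keys.
const FIELD_TO_KEY = new Map([
  ["currentDate", "c"],
  ["price", "p"],
]);
// Reverse mapping, built from the forward one so the two cannot drift apart.
const KEY_TO_FIELD = new Map([...FIELD_TO_KEY].map(([field, key]) => [key, field]));

// Rename the top-level keys of obj; keys without a mapping (such as _id) pass through unchanged.
function renameKeys(obj, mapping) {
  return Object.fromEntries(
    Object.entries(obj).map(([name, value]) => [mapping.get(name) ?? name, value])
  );
}

// Apply on the way into the database and on the way out.
const toStored = (appDoc) => renameKeys(appDoc, FIELD_TO_KEY);
const fromStored = (storedDoc) => renameKeys(storedDoc, KEY_TO_FIELD);

const stored = toStored({ _id: 1, currentDate: 13423, price: 0.2 });
const restored = fromStored(stored);
console.log(stored);
console.log(restored);
```

The application code then works only with the long names, while the database sees the short ones. Keep in mind that queries, projections, and index definitions must use the short keys too, so they would need to go through the same mapping.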
