Storing large amounts of data: database or file system?

Time: 2022-05-17 17:00:54

Let's say my application creates, stores and retrieves a very large number of entries (tens of millions). Each entry holds a variable amount of data (for example, some entries have only a few bytes, such as an ID and title, while others may have megabytes of supplementary data). The basic structure of each entry is the same and is in XML format.

Entries are created and edited (most likely by appending, not rewriting) arbitrarily.

Does it make sense to store entries as separate files in a file system while keeping necessary sets of indexes in the DB vs. saving everything in a DB?

7 solutions

#1


4  

It really depends on how you're going to use it. Databases can handle more entries in a table than most people think, especially with proper indexing. On the other hand, if you aren't going to be making use of the functionality that a relational database provides, there might not be much reason to use it.

Ok, enough generalizing. Given that a database eventually boils down to "files on disk" anyway, I wouldn't worry too much about what "the right thing to do" is. If the primary purpose of the database is just to efficiently retrieve these files, I think it would be perfectly fine to keep the DB entries small and look up file paths instead of actual data - especially since your file system should be pretty efficient at retrieving data given a specific location.
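
As a rough illustration of the "small DB rows that point at files" idea, here is a minimal Python sketch. It assumes SQLite as a stand-in for whatever database you use, and the table name `entries`, its columns, and the helper names `open_index`, `store_entry` and `load_entry` are all made up for this example.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_index(db_path):
    """Open (or create) a small SQLite index mapping entry IDs to file paths."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        " id TEXT PRIMARY KEY,"
        " title TEXT,"
        " path TEXT NOT NULL)"
    )
    return conn


def store_entry(conn, root, entry_id, title, xml_text):
    """Write the XML payload to its own file and record only its path in the DB."""
    path = Path(root) / f"{entry_id}.xml"
    path.write_text(xml_text, encoding="utf-8")
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO entries (id, title, path) VALUES (?, ?, ?)",
            (entry_id, title, str(path)),
        )
    return path


def load_entry(conn, entry_id):
    """Look up the path in the DB, then read the payload from the file system."""
    row = conn.execute(
        "SELECT path FROM entries WHERE id = ?", (entry_id,)
    ).fetchone()
    if row is None:
        raise KeyError(entry_id)
    return Path(row[0]).read_text(encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        conn = open_index(":memory:")
        store_entry(conn, root, "e1", "First entry", "<entry><id>e1</id></entry>")
        print(load_entry(conn, "e1"))
        conn.close()
```

The DB row stays a few dozen bytes no matter how large the XML file grows, which is the point of the pattern.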

In case you're interested, this is actually a common data storage pattern for search engines - the index will store the indexed data and a pointer to the stored data on disk, rather than storing everything in the index.

#2


3  

I would definitely store the data on the file system and a hash of the path in the DB.
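
The answer does not say how the hash would be used. One possible reading, sketched below in Python, is to derive the file's location from a hash of the entry ID so that files spread evenly over many directories. The function name `sharded_path` and its `levels` and `width` parameters are hypothetical, as is the choice of SHA-256.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(entry_id, levels=2, width=2):
    """Derive a nested relative path from a hash of the entry ID.

    With levels=2 and width=2 there are 256 * 256 possible directories,
    which keeps any single directory from holding millions of files.
    """
    digest = hashlib.sha256(entry_id.encode("utf-8")).hexdigest()
    shards = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(*shards, f"{digest}.xml")


if __name__ == "__main__":
    # The same ID always maps to the same path, so the DB only needs the hash.
    print(sharded_path("entry-42"))
```

Because the path is a pure function of the ID, the DB could store just the digest (or nothing at all) and still find the file.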

#3


1  

Well, depending on your costs, MS SQL Server has what's called a "Primary XML Index" that can be created, even on unstructured data. This allows you to write XQuery to search within the columns, and the database will assist you.

If there is any coherency at all in the data, or it can be placed into a schema then you may see a benefit to this.

Might I recommend that, if you have large amounts of binary data such as images, you strip these out and place them somewhere else, such as a file system. Or, if you use SQL Server 2008, there is a feature called "Filestream" (cheers @Marc_s) which allows you to index, store and secure all the files you write and use NTFS APIs to retrieve them (i.e. fast block transfer) while still keeping them as columns in the database.

Having the database there might give you a good layer of abstraction and scaling if your application puts large demands on searching through the XML data, which means that you don't have to build that yourself.

Just my 2c.

#4


1  

At work I often have to accumulate large sets of XML documents for later analysis. Normally this is done by sticking them into a directory, and the analysis is done by grep (or a bespoke Java program with all its XML factory/builder/wrapper/API paraphernalia).

One slow day I thought I'd try putting it in PostgreSQL. There are two features that I wanted to try out:

  • Automatic compression of large data when appropriate (TOAST).
  • Indexing using an expression.

Regarding the first feature, the DB size was less than half the size of the raw files. Doing a full text search, a table scan using WHERE data::TEXT LIKE '%pattern%', was actually quicker than running grep on the files. When you are dealing with a few GB of XML, this alone makes the DB worthwhile.
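
To get a feel for why XML shrinks so much under compression, here is a small Python sketch. It uses zlib from the standard library purely as an analogy: PostgreSQL's TOAST uses its own compressor, so the numbers will differ. The sample document and the `compression_ratio` helper are invented for this example; only the tradeHeader/tradeId element names come from the answer.

```python
import zlib


def compression_ratio(xml_text, level=6):
    """Return compressed size divided by raw size for a UTF-8 encoded string."""
    raw = xml_text.encode("utf-8")
    packed = zlib.compress(raw, level)
    # Sanity check: compression must be lossless.
    assert zlib.decompress(packed) == raw
    return len(packed) / len(raw)


def make_sample(count=1000):
    """Build a repetitive XML document, similar in spirit to real trade data."""
    trades = "".join(
        f"<trade><tradeHeader><tradeId>{i}</tradeId></tradeHeader></trade>"
        for i in range(count)
    )
    return f"<trades>{trades}</trades>"


if __name__ == "__main__":
    print(f"compressed/raw = {compression_ratio(make_sample()):.3f}")
```

Repeated tag names dominate the raw size, which is why markup-heavy data tends to compress well.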

The second feature, indexing, is a bit more work to maintain. There were a few particular elements I guessed would be good to index. An index on xpath('//tradeHeader/tradeId/text()', data) works, but it can be a pain to duplicate in each query. I found it easier to add ordinary columns for some fields, and use insert/update triggers to keep them in sync.
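
The "ordinary column kept in sync" approach can be sketched as follows. This is not the PostgreSQL trigger setup described above: it swaps in SQLite and does the extraction in application code at write time, which is a simpler stand-in for the same idea. The table name `docs` and the helpers `extract_trade_id`, `save_document` and `find_by_trade_id` are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def extract_trade_id(xml_text):
    """Return the text of the first tradeHeader/tradeId below the root, or None."""
    root = ET.fromstring(xml_text)
    node = root.find(".//tradeHeader/tradeId")
    return node.text if node is not None else None


def create_schema(conn):
    """Keep the full XML in one column and the extracted field in an indexed one."""
    conn.execute(
        "CREATE TABLE docs (id INTEGER PRIMARY KEY, trade_id TEXT, data TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX docs_trade_id ON docs (trade_id)")


def save_document(conn, doc_id, xml_text):
    """Insert or update a document, re-deriving trade_id on every write."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO docs (id, trade_id, data) VALUES (?, ?, ?)",
            (doc_id, extract_trade_id(xml_text), xml_text),
        )


def find_by_trade_id(conn, trade_id):
    """Query the plain column, so callers never repeat the XPath expression."""
    rows = conn.execute(
        "SELECT id FROM docs WHERE trade_id = ? ORDER BY id", (trade_id,)
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    xml = "<trade><tradeHeader><tradeId>T-1</tradeId></tradeHeader></trade>"
    save_document(conn, 1, xml)
    print(find_by_trade_id(conn, "T-1"))
```

A trigger does the same job inside the database, with the advantage that writers that bypass your application code still keep the column in sync.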

#5


1  

A couple of considerations:

  • transaction management;
  • backup and recovery.

These are generally easier to marshal with a database than with a file system. But probably the hardest thing is to synchronise a file system backup with a database's roll forward (redo) logging. The more transactional your application, the more these factors matter.
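
To make the transaction point concrete, here is a hedged Python sketch of what a single "write file, then record it in the DB" operation has to handle by hand once the data lives in two places. It assumes SQLite and a hypothetical `entries (id, path)` table, and the function name `save_entry` is invented. It does nothing about the harder backup problem mentioned above.

```python
import os
import sqlite3
import tempfile


def save_entry(conn, directory, entry_id, payload):
    """Write payload to a file, then record it in the DB, undoing the file on DB failure."""
    final_path = os.path.join(directory, f"{entry_id}.xml")
    if os.path.exists(final_path):
        raise FileExistsError(final_path)

    # Step 1: write to a temporary file and rename, so readers never see a partial file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

    # Step 2: record the file in the DB; if that fails, remove the orphaned file.
    try:
        with conn:
            conn.execute(
                "INSERT INTO entries (id, path) VALUES (?, ?)",
                (entry_id, final_path),
            )
    except sqlite3.Error:
        os.remove(final_path)
        raise
    return final_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE entries (id TEXT PRIMARY KEY, path TEXT NOT NULL)")
        print(save_entry(conn, directory, "e1", b"<entry/>"))
        conn.close()
```

Even this sketch leaves a window where a crash between the two steps strands a file with no DB row, which is the kind of bookkeeping a single database transaction would do for you.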

It appears from your question that you are not intending to make any use of normal database functionality (relational integrity, joining). In which case you should give strong consideration to a third option: store your data in the file system and, instead of a database, use a file-based text retrieval engine like Solr (or Lucene), Sphinx, Autonomy, etc.
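
For readers who have not used a text retrieval engine, the core idea is an inverted index: a map from each term to the files that contain it. The toy Python sketch below shows only that idea. It is not the API of Solr, Lucene or Sphinx, and the function names and sample paths are made up.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN.findall(text.lower())


def build_index(documents):
    """Map each term to the set of file paths whose text contains it.

    documents is a dict of file path -> text content.
    """
    index = defaultdict(set)
    for path, text in documents.items():
        for term in tokenize(text):
            index[term].add(path)
    return index


def search(index, query):
    """Return the paths that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    docs = {
        "a/1.xml": "<entry><title>Blue widget</title></entry>",
        "a/2.xml": "<entry><title>Red widget</title></entry>",
    }
    print(sorted(search(build_index(docs), "widget blue")))
```

The index holds terms and pointers only; the XML itself stays in the files, as in the search-engine pattern mentioned in the first answer.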

#6


1  

I would use HDFS (Hadoop Distributed File System) to store the data. The main idea is that you get high availability, scalability and replication. Any queries to your application can be written as MapReduce queries. And the main fields can be stored as a distributed index on top of Hadoop using Katta.
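
If MapReduce is unfamiliar, the shape of such a query can be shown in a few lines of single-process Python. This is only an illustration of the map, group and reduce steps, not Hadoop's API, and the `type` attribute on each entry is an assumption made for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def map_entry(xml_text):
    """Map step: emit (key, 1) for one entry, keyed by its hypothetical type attribute."""
    root = ET.fromstring(xml_text)
    yield root.get("type", "unknown"), 1


def reduce_counts(key, values):
    """Reduce step: sum all the counts emitted for one key."""
    return key, sum(values)


def count_by_type(entries):
    """Run map over every entry, group the pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for xml_text in entries:
        for key, value in map_entry(xml_text):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ['<entry type="a"/>', '<entry type="a"/>', '<entry type="b"/>']
    print(count_by_type(sample))
```

In a real cluster the map calls run on the nodes that hold the data and the grouping happens over the network, but the two functions you write keep this shape.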

Try googling for these technologies.

#7


0  

It depends on how you are going to use the data, as a previous response says.

The data in a database can be used to support a lot of different kinds of queries, and feed the results to reports, forms, OLAP engines and lots of other kinds of tools. Appropriate indexing can speed up searches dramatically.

If you know SQL, and if the database is well designed, coming up with queries is easier, quicker, and less error prone than doing the equivalent thing with files. But, as others have noted, you can plug your XML data into SQL without moving it to a database.

Designing a good multipurpose schema is harder than most beginners think it is. There's a lot to learn, and it isn't just about how to manipulate one tool or another. And a bad multipurpose schema can be even harder to work with than files.

If you decide to go with a database, be prepared to make a significant investment. And make sure you are going to get the benefits of that investment.

