Storing scientific data in a relational database

I want to store hierarchical, two-dimensional scientific datasets in a relational database (MySQL or SQLite). Each dataset contains a table of numerical data with an arbitrary number of columns. In addition, each dataset can have one or more children of the same type, each associated with a given row of its table. A dataset typically has between 1 and 100 columns and between 1 and 1,000,000 rows. The database should be able to handle many datasets (>1000), and reading and writing data should be reasonably fast.


What would be the best DB schema for storing this kind of data? Is it reasonable to have a "master" table with the names, IDs and relations of the individual datasets and, in addition, one table per dataset which contains the numerical values?


4 solutions

#1

Is it reasonable to have a "master" table with the names, IDs and relations of individual datasets and in addition one table per dataset which contains the numerical values?


That's how I'd do it.


I'm not exactly sure how the 'arbitrary columns' thing is supposed to work, because data usually doesn't work like that. Regardless, it sounds like storing it as (row, col, val) triples might work nicely.

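As a rough illustration of the (row, col, val) idea combined with a master table, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (`datasets`, `cells`, `parent_row`, `row_idx`, `col_idx`) and the helper functions are made up for this example, not something from the question. Note that a dataset with 100 columns and 1,000,000 rows becomes 100 million rows in `cells`, so this layout is worth benchmarking before committing to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE datasets (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    parent_id  INTEGER REFERENCES datasets(id),  -- NULL for a top-level dataset
    parent_row INTEGER                           -- row of the parent this child hangs off
);
CREATE TABLE cells (
    dataset_id INTEGER NOT NULL REFERENCES datasets(id),
    row_idx    INTEGER NOT NULL,
    col_idx    INTEGER NOT NULL,
    val        REAL,
    PRIMARY KEY (dataset_id, row_idx, col_idx)
);
""")


def add_dataset(name, rows, parent_id=None, parent_row=None):
    """Register a dataset in the master table and store its cells as triples."""
    cur = conn.execute(
        "INSERT INTO datasets (name, parent_id, parent_row) VALUES (?, ?, ?)",
        (name, parent_id, parent_row),
    )
    ds_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO cells VALUES (?, ?, ?, ?)",
        [(ds_id, r, c, v) for r, line in enumerate(rows) for c, v in enumerate(line)],
    )
    return ds_id


def read_row(ds_id, row_idx):
    """Return one row of a dataset as a list of values, ordered by column."""
    cur = conn.execute(
        "SELECT val FROM cells WHERE dataset_id = ? AND row_idx = ? ORDER BY col_idx",
        (ds_id, row_idx),
    )
    return [v for (v,) in cur]


def children_of(ds_id, row_idx):
    """Return (id, name) of every child dataset attached to the given row."""
    cur = conn.execute(
        "SELECT id, name FROM datasets WHERE parent_id = ? AND parent_row = ? ORDER BY id",
        (ds_id, row_idx),
    )
    return cur.fetchall()


parent = add_dataset("scan", [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
child = add_dataset("scan/detail", [[0.1, 0.2]], parent_id=parent, parent_row=1)
```

The composite primary key on `cells` doubles as the index for row lookups, and the number of columns per dataset is not fixed anywhere in the schema.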

Honestly though, if you don't need to search through it (max, min, etc.), it might be better to use some kind of flat file.


An alternative setup that might be interesting is using SQLite, with a separate database file for each dataset, plus one master one.

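The one-file-per-dataset setup could be sketched roughly like this, again with Python's sqlite3 module. The directory, the file naming scheme (`dataset_<id>.db`), the column names (`c0`, `c1`, ...) and the helper functions are all assumptions made for illustration.

```python
import os
import sqlite3
import tempfile

base_dir = tempfile.mkdtemp()  # stand-in for a real data directory

# The master database only knows names, file names and parent/child relations.
master = sqlite3.connect(os.path.join(base_dir, "master.db"))
master.execute("""
CREATE TABLE datasets (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    filename   TEXT NOT NULL,
    parent_id  INTEGER REFERENCES datasets(id),
    parent_row INTEGER
)""")


def create_dataset(name, rows, parent_id=None, parent_row=None):
    """Write the rows to their own SQLite file and register it in the master."""
    cur = master.execute(
        "INSERT INTO datasets (name, filename, parent_id, parent_row) "
        "VALUES (?, '', ?, ?)",
        (name, parent_id, parent_row),
    )
    ds_id = cur.lastrowid
    filename = f"dataset_{ds_id}.db"
    master.execute("UPDATE datasets SET filename = ? WHERE id = ?", (filename, ds_id))
    master.commit()

    ncols = len(rows[0])
    db = sqlite3.connect(os.path.join(base_dir, filename))
    column_defs = ", ".join(f"c{i} REAL" for i in range(ncols))
    db.execute(f"CREATE TABLE data ({column_defs})")
    placeholders = ", ".join("?" for _ in range(ncols))
    db.executemany(f"INSERT INTO data VALUES ({placeholders})", rows)
    db.commit()
    db.close()
    return ds_id


def load_dataset(ds_id):
    """Look up the file in the master database and read all rows from it."""
    (filename,) = master.execute(
        "SELECT filename FROM datasets WHERE id = ?", (ds_id,)
    ).fetchone()
    db = sqlite3.connect(os.path.join(base_dir, filename))
    try:
        return db.execute("SELECT * FROM data ORDER BY rowid").fetchall()
    finally:
        db.close()


top = create_dataset("scan", [(1.0, 2.0), (3.0, 4.0)])
sub = create_dataset("scan/detail", [(0.5,)], parent_id=top, parent_row=0)
```

One possible upside of this layout is that deleting or archiving a dataset is just a file operation plus one row in the master database.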

Whatever you pick, how well it will work really depends on what you're going to do with the data.


#2

You're going to end up trading off flexibility for performance, I think. You can hard-code your DB schema, which it sounds like you want to avoid but which would give you the best performance, or you can leave the schema to be determined at runtime and stored in a 'master' table, which increases your flexibility but reduces your ability to enforce referential integrity and set data types.

For a while, you could try both approaches, until you have enough information about which one performs better for your task.

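For comparison with a hard-coded schema, the runtime-determined variant might look something like the sketch below: the column list of each dataset is recorded in master tables, and the data table is generated from it. The names `datasets`, `dataset_columns` and `data_<id>` are invented for this example, and every column is assumed to be REAL, which is exactly the loss of type control mentioned above.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE datasets (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    table_name TEXT
);
CREATE TABLE dataset_columns (
    dataset_id INTEGER NOT NULL REFERENCES datasets(id),
    position   INTEGER NOT NULL,
    col_name   TEXT NOT NULL,
    PRIMARY KEY (dataset_id, position)
);
""")

# Column names end up inside generated SQL, so only plain identifiers are accepted.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def create_dataset(name, col_names):
    """Record the column list in the master tables and create the data table."""
    for col in col_names:
        if not IDENTIFIER.match(col):
            raise ValueError(f"unsupported column name: {col!r}")
    cur = conn.execute("INSERT INTO datasets (name) VALUES (?)", (name,))
    ds_id = cur.lastrowid
    table = f"data_{ds_id}"
    conn.execute("UPDATE datasets SET table_name = ? WHERE id = ?", (table, ds_id))
    conn.executemany(
        "INSERT INTO dataset_columns VALUES (?, ?, ?)",
        [(ds_id, i, col) for i, col in enumerate(col_names)],
    )
    column_defs = ", ".join(f'"{col}" REAL' for col in col_names)
    conn.execute(f"CREATE TABLE {table} ({column_defs})")
    return ds_id


def table_for(ds_id):
    """Return the name of the generated data table for a dataset."""
    return conn.execute(
        "SELECT table_name FROM datasets WHERE id = ?", (ds_id,)
    ).fetchone()[0]


def column_names(ds_id):
    """Return the dataset's column names in their stored order."""
    cur = conn.execute(
        "SELECT col_name FROM dataset_columns WHERE dataset_id = ? ORDER BY position",
        (ds_id,),
    )
    return [name for (name,) in cur]


ds = create_dataset("run 1", ["time", "voltage"])
table = table_for(ds)
conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", [(0.0, 1.5), (1.0, 2.5)])
peak = conn.execute(f'SELECT MAX("voltage") FROM {table}').fetchone()[0]
```

Because each dataset gets a real table with real columns, aggregate queries such as the `MAX` above stay simple, at the cost of generating SQL at run time.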

#3

It's hard to be specific without understanding the problem domain, but if your data is inherently relational, use a relational model. If your data is not inherently relational, I wouldn't try to force it into a relational model for the sake of it. The fact that all datasets happen to have an ID doesn't mean those IDs are the same, or even that they are suitable for use as a primary key.


I'd suggest starting by having each dataset in its own table (or tables, if there are child records), and creating a master table if you need one.


I share zebediah49's question: are you really going to use a database for this? Wouldn't flat files be better?


#4

We store a bunch of data like this, with each dataset in its own flat file. The header of the file contains enough information (timestamp, number of rows/cols, etc.) for it to be read. Meta-information about the data then goes in the database. At minimum this is the file location, but it could include other information about the data. For example, we aggregate the data into proxy variables that summarize the details at a high level. Typically this summary data is good enough, but when necessary we can read the file for all the details.

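A minimal sketch of that flat-file-plus-database arrangement, using only the Python standard library, might look like this. The header layout (a magic tag, timestamp, row count, column count), the `files` table and the choice of "mean of the first column" as the summary variable are assumptions for the example and not the format described above.

```python
import os
import sqlite3
import struct
import tempfile
import time
from array import array

# Hypothetical header: 4-byte magic tag, float64 timestamp, uint32 rows, uint32 cols.
HEADER = struct.Struct("<4sdII")
MAGIC = b"SDAT"


def write_dataset(path, rows):
    """Write the header followed by all values as float64, row by row."""
    nrows, ncols = len(rows), len(rows[0])
    values = array("d", (v for row in rows for v in row))  # native byte order
    with open(path, "wb") as f:
        f.write(HEADER.pack(MAGIC, time.time(), nrows, ncols))
        values.tofile(f)


def read_dataset(path):
    """Read the header, then rebuild the rows from the flat value array."""
    with open(path, "rb") as f:
        magic, _timestamp, nrows, ncols = HEADER.unpack(f.read(HEADER.size))
        if magic != MAGIC:
            raise ValueError("not a dataset file")
        values = array("d")
        values.fromfile(f, nrows * ncols)
    return [list(values[r * ncols:(r + 1) * ncols]) for r in range(nrows)]


conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE files (
    id        INTEGER PRIMARY KEY,
    path      TEXT NOT NULL,
    nrows     INTEGER NOT NULL,
    ncols     INTEGER NOT NULL,
    mean_col0 REAL              -- example summary ("proxy") variable
)""")


def register(path, rows):
    """Write the flat file and store its location plus a summary in the database."""
    write_dataset(path, rows)
    mean_col0 = sum(row[0] for row in rows) / len(rows)
    cur = conn.execute(
        "INSERT INTO files (path, nrows, ncols, mean_col0) VALUES (?, ?, ?, ?)",
        (path, len(rows), len(rows[0]), mean_col0),
    )
    return cur.lastrowid


data_path = os.path.join(tempfile.mkdtemp(), "run1.dat")
file_id = register(data_path, [[1.0, 2.0], [3.0, 4.0]])
summary = conn.execute(
    "SELECT nrows, ncols, mean_col0 FROM files WHERE id = ?", (file_id,)
).fetchone()
details = read_dataset(data_path)
```

Queries against the summary columns never touch the flat files; `read_dataset` is only needed when the full detail is wanted.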
