We have a large and growing dataset of experimental data taken from around 30,000 subjects. For each subject, there are several recordings of data. Within each recording, there is a collection of several time series of physiological data, each about 90 seconds long and sampled at 250 Hz. I should note that any given instance of a time series is never extended; only additional recordings are added to the dataset. Nor are these recordings all of the same length. Currently, the data for each recording is contained in its own flat file. These files are organized in a directory structure that is broken down hierarchically by version of the overall experiment, experiment location, date, and experiment terminal (in that hierarchical order).
Most of our analysis is done in MATLAB, and we plan to continue to use MATLAB extensively for further analysis. The situation as it stood was workable (if undesirable) when all researchers were co-located. We are now spread around the globe, and I am investigating the best solution for making all of this data available from remote locations. I am well-versed in MySQL and SQL Server and could easily come up with a way to structure this data within such a paradigm. I am, however, skeptical as to the efficiency of this approach. I would value any suggestions that might point me in the right direction. Should I be considering something different? Time series databases (though those seem to me to be tuned for extending existing time series)? Something else?
Analysis does not need to be done online, though the possibility of doing so would be a plus. For now, our typical use case would be to query for a specific subset of recordings and pull down the associated time series for local analysis. I appreciate any advice you might have!
Update:
In my research, I've found this paper, where they are storing and analyzing very similar signals. They've chosen MongoDB for the following reasons:
- Speed of development
- The ease of adding fields to existing documents (features extracted from signals, etc.)
- Ease of MapReduce use through the MongoDB API itself
These are all attractive advantages to me as well. The development looks dead simple, and the ability to easily augment existing documents with the results of analysis is clearly helpful (though I know this isn't exactly difficult to do in the systems with which I am already familiar).
To be clear, I know that I can leave the data stored in flat files, and I know I could simply arrange for secure access to these flat files via MATLAB over the network. There are numerous reasons I want to store this data in a database. For instance:
- There is little structure to the flat files now, other than the hierarchical directory structure stated above. For instance, it is impossible to pull all data from a particular day without pulling down every individual file for each terminal on that day.
- There is no way to query against metadata associated with a particular recording. I shudder to think of the hoops I'd need to jump through to pull all data for female subjects, for example.
The long and short of it is that I want to store these data in a database for myriad reasons (space, efficiency, and ease-of-access considerations, among many others).
Update 2
I seem not to be describing the nature of these data sufficiently, so I will attempt to clarify. These recordings are certainly time series data, but not in the way many people think of time series. I am not continually capturing data to be appended to an existing time series. I am really making multiple recordings, all with varying metadata, but of the same three signals. These signals can be thought of as vectors of numbers, and the lengths of these vectors vary from recording to recording. In a traditional RDBMS, I might create one table for recording type A, one for B, etc., and treat each row as a data point in the time series. However, this does not work, as recordings vary in length. Rather, I would prefer to have an entity that represents a person, and have that entity associated with the several recordings taken from that person. This is why I have considered MongoDB, as I can nest several arrays (of varying lengths) within one object in a collection.
Potential MongoDB Structure
As an example, here's what I sketched as a potential MongoDB BSON structure for a subject:
{
  "songs": {
    "order": ["R008", "R017", "T015"],
    "times": [
      {"start": "2012-07-02T17:38:56.000Z", "finish": "2012-07-02T17:40:56.000Z", "duration": 119188.445},
      {"start": "2012-07-02T17:42:22.000Z", "finish": "2012-07-02T17:43:41.000Z", "duration": 79593.648},
      {"start": "2012-07-02T17:44:37.000Z", "finish": "2012-07-02T17:46:19.000Z", "duration": 102450.695}
    ]
  },
  "self_report": {
    "music_styles": {"none": false, "world": true},
    "songs": [
      {"engagement": 4, "positivity": 4, "activity": 3, "power": 4, "chills": 4, "like": 4, "familiarity": 4},
      {"engagement": 4, "positivity": 4, "activity": 3, "power": 4, "chills": 4, "like": 4, "familiarity": 3},
      {"engagement": 2, "positivity": 1, "activity": 2, "power": 2, "chills": 4, "like": 1, "familiarity": 1}
    ],
    "most_engaged": 1,
    "most_enjoyed": 1,
    "emotion_indices": [0.729994, 0.471576, 28.9082]
  },
  "signals": {
    "test": {
      "timestamps": [0.010, 0.010, 0.021, ...],
      "eda": [149.200, 149.200, 149.200, ...],
      "pox": [86.957, 86.957, 86.957, ...]
    },
    "songs": [
      {
        "timestamps": [0.010, 0.010, 0.021, ...],
        "eda": [149.200, 149.200, 149.200, ...],
        "pox": [86.957, 86.957, 86.957, ...]
      },
      {
        "timestamps": [0.010, 0.010, 0.021, ...],
        "eda": [149.200, 149.200, 149.200, ...],
        "pox": [86.957, 86.957, 86.957, ...]
      },
      {
        "timestamps": [0.010, 0.010, 0.021, ...],
        "eda": [149.200, 149.200, 149.200, ...],
        "pox": [86.957, 86.957, 86.957, ...]
      }
    ]
  },
  "demographics": {
    "gender": "female",
    "dob": 1980,
    "nationality": "rest of the world",
    "musical_background": false,
    "musical_expertise": 1,
    "impairments": {"hearing": false, "visual": false}
  },
  "timestamps": {
    "start": "2012-07-02T17:37:47.000Z",
    "test": "2012-07-02T17:38:16.000Z",
    "end": "2012-07-02T17:46:56.000Z"
  }
}
Those signals are the time series.
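To make the target access pattern concrete, here is a minimal pure-Python stand-in (the subject ids and values are made up) for the kind of dotted-path metadata query a document store would execute, e.g. `find({"demographics.gender": "female"})` against documents shaped like the structure above:

```python
# Toy in-memory "collection" of subject documents, shaped like the
# BSON sketch above (heavily abbreviated, hypothetical ids).
subjects = [
    {"_id": "S001",
     "demographics": {"gender": "female", "dob": 1980},
     "signals": {"songs": [{"eda": [149.2, 149.2]}, {"eda": [150.1]}]}},
    {"_id": "S002",
     "demographics": {"gender": "male", "dob": 1975},
     "signals": {"songs": [{"eda": [140.0]}]}},
]

def find(docs, path, value):
    """Return docs whose nested field (dotted path) equals value."""
    def get(doc, dotted):
        for key in dotted.split("."):
            doc = doc[key]
        return doc
    return [d for d in docs if get(d, path) == value]

# "Pull all data for female subjects" becomes a one-liner instead of
# a crawl over the directory hierarchy.
females = find(subjects, "demographics.gender", "female")
print([d["_id"] for d in females])  # ['S001']
```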
3 Answers
#1
Quite often, when people come to NoSQL databases, they come having heard that there's no schema and life is good. However, IMHO this is a really mistaken notion.
When dealing with NoSQL, you have to think in terms of "aggregates". Typically, an aggregate is an entity that can be operated on as a single unit. In your case, one possible (but not especially efficient) approach would be to model a user and his/her data as a single aggregate. This ensures that your user aggregate can be data-centre/shard agnostic. But if the data keeps growing, loading a user will also load all the related data and become a memory hog (MongoDB as such is a bit greedy with memory).
Another option would be to store each recording as its own aggregate, "linked" back to the user with an id - this can be a synthetic key that you create, like a GUID. Even though this superficially seems like a join, it's just a "look up by property", since there's no real referential integrity here. This may be the approach I'd take if files are going to be added constantly.
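As a sketch of this linked layout (all field names and values here are hypothetical, not a prescribed schema), the subject document and its recording documents would live in separate collections, tied together by a synthetic key:

```python
# "Linked aggregates": a small subject document plus independent
# recording documents that carry a subject_id foreign-key-like field.
import uuid

subject_id = str(uuid.uuid4())  # synthetic key, GUID-style

subject = {"_id": subject_id, "gender": "female", "dob": 1980}

recordings = [
    {"_id": str(uuid.uuid4()), "subject_id": subject_id,
     "eda": [149.2, 149.2, 149.2], "pox": [86.957, 86.957]},
    {"_id": str(uuid.uuid4()), "subject_id": subject_id,
     "eda": [150.1], "pox": [87.0]},
]

# "Look up by property": fetch one subject's recordings without loading
# a giant all-in-one aggregate (an index on subject_id keeps this cheap,
# and new recordings can be appended forever without rewriting the subject).
mine = [r for r in recordings if r["subject_id"] == subject_id]
print(len(mine))  # 2
```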
The place where MongoDB shines is ad-hoc queries by a property in the document (create an index for that property if you don't want to lose hair later down the road). You will not go wrong choosing Mongo for time series data storage. You can extract data that matches an id within a date range, for example, without doing any major stunts.
Please do ensure that you have replica sets no matter which approach you take, and diligently choose your sharding approach early on - sharding later is no fun.
#2
I feel like this may not answer the right question, but here is what I would probably go for (using SQL Server):
User (table)

- UserId
- Gender
- Expertise
- etc...

Sample (table)

- SampleId
- UserId
- StartTime
- Duration
- Order
- etc...

Series (table)

- SampleId
- SecondNumber (about 1-90)
- Values (string with values)
I think this should give you fairly flexible access, as well as reasonable memory efficiency. Since the values are stored in string format, you cannot analyze the time series in SQL (they will need to be parsed first), but I don't think that should be a problem. Of course, you could also use MeasurementNumber and Value columns instead; then you have complete freedom.
Of course this is not as complete as your MongoDB setup but the gaps should be fairly easy to fill.
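A rough, runnable sketch of this layout, using Python's built-in sqlite3 as a stand-in for SQL Server (the comma-separated string encoding, the sample values, and the exact column types are assumptions, not part of the answer):

```python
# Relational sketch of the User / Sample / Series tables, with one
# Series row per second of signal and 250 Hz values packed into a string.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE User   (UserId INTEGER PRIMARY KEY, Gender TEXT, Expertise INTEGER);
CREATE TABLE Sample (SampleId INTEGER PRIMARY KEY, UserId INTEGER,
                     StartTime TEXT, Duration REAL, "Order" INTEGER);
CREATE TABLE Series (SampleId INTEGER, SecondNumber INTEGER, "Values" TEXT);
""")

cur.execute("INSERT INTO User VALUES (1, 'female', 1)")
cur.execute("INSERT INTO Sample VALUES (10, 1, '2012-07-02T17:38:56Z', 119.188, 1)")
cur.execute("INSERT INTO Series VALUES (10, 1, '149.200,149.200,149.200')")

# Query samples by subject metadata in SQL, then parse the packed
# series string client-side (e.g. in MATLAB) for analysis.
cur.execute("""
    SELECT s.SampleId, se."Values"
    FROM User u
    JOIN Sample s  ON s.UserId = u.UserId
    JOIN Series se ON se.SampleId = s.SampleId
    WHERE u.Gender = 'female'
""")
sample_id, packed = cur.fetchone()
values = [float(v) for v in packed.split(",")]
print(sample_id, values)  # 10 [149.2, 149.2, 149.2]
```

Note that Order and Values are reserved words in most SQL dialects, hence the quoting; renaming those columns would be cleaner in practice.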
#3
You should really investigate LDAP and its data model. There is clearly a strong hierarchical character to your data, and LDAP is already commonly used to store attributes about people. It's a mature, standardized network protocol, so you can choose from a variety of implementations, as opposed to being locked into a particular NoSQL flavor-of-the-month. LDAP is designed for distributed access, provides a security model for authentication (as well as authorization/access control), and is extremely efficient - more so than any of these HTTP-based protocols.