I was thinking of using a database like MongoDB or RavenDB to store a lot of stock tick data and wanted to know if this would be viable compared to a standard relational database such as SQL Server.
The data would not really be relational and would amount to a couple of huge tables. I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month, etc., for even faster calculations.
Example data: 500 symbols * 60 min * 60 sec * 300 days... (per record we store: date, open, high, low, close, volume, openint - all decimal/float)
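To make the scale concrete, here is a rough back-of-the-envelope sketch in Python; the record fields are the ones listed above, while the dataclass itself and the row-count arithmetic are just illustration.

from dataclasses import dataclass
from datetime import datetime

# One raw tick record, using the fields listed above.
@dataclass
class Tick:
    symbol: str
    date: datetime
    open: float
    high: float
    low: float
    close: float
    volume: float
    openint: float

# Rough row count for the example data set:
# 500 symbols * 60 min * 60 sec * 300 days = 540,000,000 raw ticks
rows = 500 * 60 * 60 * 300
print(f"{rows:,} raw tick rows")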
So what do you guys think?
4 Answers
#1
4
The answer here will depend on scope.
MongoDB is a great way to get the data "in" and it's really fast at querying individual pieces. It's also nice as it is built to scale horizontally.
However, what you'll have to remember is that all of your significant "queries" are actually going to result from "batch job output".
As an example, Gilt Groupe has created a system called Hummingbird that they use for real-time analytics on their web site. Presentation here. They're basically dynamically rendering pages based on collected performance data in tight intervals (15 minutes).
In their case, they have a simple cycle: post data to mongo -> run map-reduce -> push data to webs for real-time optimization -> rinse / repeat.
This is honestly pretty close to what you probably want to do. However, there are some limitations here:
- Map-reduce is new to many people. If you're familiar with SQL, you'll have to accept the learning curve of Map-reduce.
- If you're pumping in lots of data, your map-reduces are going to be slower on those boxes. You'll probably want to look at slaving / replica pairs if response times are a big deal.
On the other hand, you'll run into different variants of these problems with SQL.
Of course there are some benefits here:
- Horizontal scalability. If you have lots of boxes then you can shard them and get somewhat linear performance increases on Map/Reduce jobs (that's how they work). Building such a "cluster" with SQL databases is a lot more costly.
- Really fast speed, and as with point #1, you get the ability to add RAM horizontally to keep up that speed.
As mentioned by others though, you're going to lose access to ETL and other common analysis tools. You'll definitely be on the hook to write a lot of your own analysis tools.
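For a sense of what one of those home-grown rollup jobs might look like, here is a minimal sketch in Python with pymongo. It uses the aggregation pipeline as a stand-in for the map-reduce step described above; the connection string, database/collection names, and the MongoDB 5.0+ $dateTrunc operator are assumptions, not part of the original setup.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
ticks = client["market"]["ticks"]                  # hypothetical db/collection names

# Roll raw ticks up into per-minute high/low/volume buckets for one symbol.
# ($dateTrunc needs MongoDB 5.0+; older servers would use map-reduce as described above.)
pipeline = [
    {"$match": {"symbol": "AAPL"}},
    {"$group": {
        "_id": {"$dateTrunc": {"date": "$date", "unit": "minute"}},
        "high":   {"$max": "$high"},
        "low":    {"$min": "$low"},
        "volume": {"$sum": "$volume"},
    }},
    {"$sort": {"_id": 1}},
]

for bucket in ticks.aggregate(pipeline):
    print(bucket["_id"], bucket["low"], bucket["high"], bucket["volume"])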
#2
4
Since this question was asked in 2010, several database engines have been released, or have developed features, that specifically handle time series such as stock tick data:
- InfluxDB - see my other answer
- Cassandra
With MongoDB or other document-oriented databases, if you target performance, the advice is to contort your schema to organize ticks in an object keyed by seconds (or an object of minutes, each minute being another object with 60 seconds). With a specialized time-series database, you can query the data simply with:
SELECT open, close FROM market_data
WHERE symbol = 'AAPL' AND time > '2016-09-14' AND time < '2016-09-21'
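For contrast with the simple query above, here is a rough sketch (Python with pymongo; the ticks_by_minute collection and the exact field layout are hypothetical) of the kind of per-minute bucket document the contorted MongoDB schema ends up looking like:

from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
buckets = client["market"]["ticks_by_minute"]      # hypothetical collection

# One document per symbol per minute, with ticks keyed by second ("0"-"59").
minute_doc = {
    "symbol": "AAPL",
    "minute": datetime(2016, 9, 14, 9, 30),
    "ticks": {
        "0": {"open": 107.1, "high": 107.2, "low": 107.0, "close": 107.2, "volume": 1200},
        "1": {"open": 107.2, "high": 107.3, "low": 107.1, "close": 107.1, "volume": 800},
        # ... one entry per second
    },
}
buckets.insert_one(minute_doc)

# A later tick for the same minute is set in place rather than inserted as a new row.
buckets.update_one(
    {"symbol": "AAPL", "minute": datetime(2016, 9, 14, 9, 30)},
    {"$set": {"ticks.2": {"open": 107.1, "high": 107.1, "low": 107.0, "close": 107.0, "volume": 950}}},
)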
I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc for even faster calculations.
With InfluxDB, this is very straightforward. Here's how to get the daily minimums and maximums:
SELECT MIN("close"), MAX("close") FROM "market_data" WHERE symbol = 'AAPL'
GROUP BY time(1d)
You can group by time intervals, which can be in microseconds (u), seconds (s), minutes (m), hours (h), days (d) or weeks (w).
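If you'd rather run that query from code, here is a minimal sketch using the influxdb Python client; the host, port, database name, and measurement layout are assumptions.

from influxdb import InfluxDBClient  # pip install influxdb

client = InfluxDBClient(host="localhost", port=8086, database="market")  # hypothetical setup

# Daily min/max of the close price for one symbol, grouped into 1-day intervals.
result = client.query(
    "SELECT MIN(\"close\"), MAX(\"close\") FROM \"market_data\" "
    "WHERE \"symbol\" = 'AAPL' AND time > '2016-09-14' GROUP BY time(1d)"
)

for point in result.get_points():
    print(point["time"], point["min"], point["max"])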
TL;DR
Time-series databases are better choices than document-oriented databases for storing and querying large amounts of stock tick data.
#3
1
Here's my reservation with the idea - and I'm going to openly acknowledge that my working knowledge of document databases is weak. I’m assuming you want all of this data stored so that you can perform some aggregation or trend-based analysis on it.
If you use a document-based DB to act as your source, the loading and manipulation of each row of data (CRUD operations) is very simple. Very efficient, very straightforward, basically lovely.
What sucks is that there are very few, if any, options to extract this data and cram it into a structure more suitable for statistical analysis, e.g. a columnar database or a cube. If you load it into a basic relational database, there are a host of tools, both commercial and open source, such as Pentaho, that will accommodate the ETL and analysis very nicely.
Ultimately though, what you want to keep in mind is that every financial firm in the world has a stock analysis/ auto-trader application; they just caused a major U.S. stock market tumble and they are not toys. :)
#4
0
A simple datastore such as a key-value or document database is also beneficial in cases where performing analytics reasonably exceeds a single system's capacity. (Or it will require an exceptionally large machine to handle the load.) In these cases, it makes sense to use a simple store since the analytics require batch processing anyway. I would personally look at finding a horizontally scaling processing method to come up with the unit/time analytics required.
I would investigate using something built on Hadoop for parallel processing. Either use the framework natively in Java/C++ or through some higher-level abstraction: Pig, Wukong, binary executables through the streaming interface, etc. Amazon offers reasonably cheap processing time and storage if that route is of interest. (I have no personal experience, but many do and depend on it for their businesses.)
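As a rough illustration of the streaming-interface route, here is a minimal Hadoop Streaming mapper/reducer pair in Python that computes daily min/max close per symbol. The tab-separated input layout (symbol, timestamp, open, high, low, close, ...) is an assumption; each part would be saved as its own script and submitted with the hadoop-streaming jar.

#!/usr/bin/env python
# mapper.py - emit "symbol|day <TAB> close" for every input tick line.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        continue                      # skip malformed lines
    symbol, timestamp, close = fields[0], fields[1], fields[5]
    day = timestamp.split(" ")[0]     # YYYY-MM-DD
    print(f"{symbol}|{day}\t{close}")

#!/usr/bin/env python
# reducer.py - Hadoop Streaming sorts by key, so equal keys arrive consecutively.
import sys

current_key, lo, hi = None, None, None

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    price = float(value)
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{lo}\t{hi}")
        current_key, lo, hi = key, price, price
    else:
        lo, hi = min(lo, price), max(hi, price)

if current_key is not None:
    print(f"{current_key}\t{lo}\t{hi}")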