I am currently building an application that imports statistical data for (currently) around 15,000 products. If I maintain one database table for daily statistics from a single source, it grows by 15,000 rows per day (say 5-10 fields per row, mostly float and int) — over 5 million records per year in one table.
That doesn't concern me as much as the thought of bringing in data from other sources (and thus growing the database by another 5 million records for each new source).
The data is statistical/trending data, with basically one write per record per day and many reads. For on-the-fly reporting and graphing, however, I need fast access to subsets of the data based on rules (date ranges, value ranges, etc.).
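For the single-table InnoDB approach, these range queries are usually well served by a composite index that leads with the columns you filter on. A minimal sketch — all table and column names here are hypothetical, not from the actual application:

```sql
-- One row per product, per source, per day; names are illustrative only.
CREATE TABLE product_stats (
    product_id  INT UNSIGNED NOT NULL,
    source_id   TINYINT UNSIGNED NOT NULL,
    stat_date   DATE NOT NULL,
    price       FLOAT,
    volume      INT,
    PRIMARY KEY (product_id, source_id, stat_date)
) ENGINE=InnoDB;

-- A date-range report for one product and source walks the primary key
-- directly, regardless of how many total rows the table holds:
SELECT stat_date, price
FROM product_stats
WHERE product_id = 42
  AND source_id = 1
  AND stat_date BETWEEN '2012-01-01' AND '2012-03-31';
```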
My question is: is this the best way to store the data (MySQL InnoDB tables), or is there a better way to store and handle statistical/trend data?
Other options I have tossed around at this point:
1. Multiple databases (one per product), with separate tables for each data source within (i.e. database: ProductA, tables: Source_A, Source_B, Source_C).
2. One database, multiple tables (one for each product/data source) (i.e. database: Products, tables: ProductA_SourceA, ProductA_SourceB, etc.).
3. All factual or product-specific information in the database, and all statistical data in flat files (CSV, XML, JSON) in separate directories.
So far, none of these options are very manageable, each has its pros and cons. I need a reasonable solution before I move into the alpha stage of development.
2 Answers
#1
You could try making use of a column based database. These kinds of databases are much better at analytical queries of the kind you're describing. There are several options:
http://en.wikipedia.org/wiki/Column-oriented_DBMS
We've had good experience with InfiniDB:
and Infobright looks good as well:
Both InfiniDB and Infobright have free, open-source community editions, so I would recommend benchmarking them to see what kind of performance benefit you might get.
You might also want to look at partitioning your data to improve performance.
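In MySQL, for instance, range partitioning on the date column lets the optimizer prune partitions that fall outside a query's date range. A hedged sketch (table and partition names are hypothetical; note MySQL requires the partitioning column to be part of every unique key, which is why `stat_date` appears in the primary key):

```sql
-- Hypothetical yearly partitions for daily statistics.
CREATE TABLE product_stats (
    product_id INT UNSIGNED NOT NULL,
    stat_date  DATE NOT NULL,
    value      FLOAT,
    PRIMARY KEY (product_id, stat_date)
) ENGINE=InnoDB
PARTITION BY RANGE (YEAR(stat_date)) (
    PARTITION p2011 VALUES LESS THAN (2012),
    PARTITION p2012 VALUES LESS THAN (2013),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
```

A query restricted to 2012 dates would then only touch the `p2012` partition rather than scanning the whole table.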
#2
It's a little dependent upon what your data looks like and the kind of aggregations/trends you're looking to run. Most relational databases work just fine for this sort of chronological data. Even with billions of records, proper indexing and partitioning can make quick work of finding the records you need. DBs like Oracle, MySQL, and SQL Server fall within this category.
Let's say the products you work with are stocks, and for each stock you get a new price every day (a very realistic case). New exchanges, stocks, and trade frequencies will grow this data exponentially pretty quickly. You could, however, partition the data by exchange. Or by region.
Various Business Intelligence tools can also assist with what effectively amounts to pre-aggregating data prior to retrieval. This is basically the column-oriented database approach suggested above. (Data warehouses and OLAP structures can assist in massaging and aggregating data sets ahead of time.)
Similar to the idea of data warehousing, if it's just a matter of the aggregations taking too long, you can work off the aggregations overnight into a structure that is quicker to query. In my previous example, you may only need to retrieve large chunks of raw data very infrequently, but far more often need some aggregate such as the 52-week high. You can store the large amount of raw data in one format, and then every night have a job distill only what you need into a table that, rather than thousands of data points per stock, now holds 3 or 4.
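As a sketch of that nightly roll-up (all table and column names here are hypothetical), the job could rebuild a small summary table from the raw daily data:

```sql
-- Hypothetical summary table: a handful of values per stock
-- instead of thousands of raw daily points.
CREATE TABLE stock_summary (
    stock_id  INT UNSIGNED NOT NULL PRIMARY KEY,
    high_52wk FLOAT,
    low_52wk  FLOAT,
    avg_52wk  FLOAT
);

-- Nightly job: recompute aggregates over the trailing 52 weeks,
-- assuming a raw table daily_prices(stock_id, price_date, price).
REPLACE INTO stock_summary (stock_id, high_52wk, low_52wk, avg_52wk)
SELECT stock_id, MAX(price), MIN(price), AVG(price)
FROM daily_prices
WHERE price_date >= CURDATE() - INTERVAL 52 WEEK
GROUP BY stock_id;
```

Reports and graphs can then read from `stock_summary` without ever touching the raw table.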
If the trends you're tracking are really all over the place, or involve complex algorithms, a full-fledged BI solution might be something to investigate so you can use pre-built analytic and data mining algorithms.
If the data is not very structured, you may have better luck with a NoSQL database like Hadoop or Mongo, although admittedly my knowledge of databases is more focused around relational formats.