I need to store about 73,200 records per day, each consisting of three data points: an id, a date, and an integer.
Some members of my team suggest creating one table per month, using the month as the table name (september_2010), while others suggest having a single table with lots of data in it...
Any suggestions on how to deal with this amount of data? Thanks.
========== Thank you for all the feedback.
12 Answers
#1
20
I recommend against that. I call this antipattern Metadata Tribbles. It creates multiple problems:
- You need to remember to create a new table every year or else your app breaks.
- Querying aggregates against all rows regardless of year is harder.
- Updating a date potentially means moving a row from one table to another.
- It's harder to guarantee the uniqueness of pseudokeys across multiple tables.
My recommendation is to keep it in one table until and unless you've demonstrated that the size of the table is becoming a genuine problem, and you can't solve it any other way (e.g. caching, indexing, partitioning).
#2
3
Seems like it should be just fine holding everything in one table. Maintaining 1 table, as opposed to 12 tables per year, will make retrieval much easier in the future. At 73,200 records per day it will take you almost 4 years to hit 100,000,000 rows, which is still well within MySQL's capabilities.
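The growth figures quoted in this thread can be checked with a few lines of arithmetic. This is only a sketch: it assumes the rate stays constant at the 73,200 records per day given in the question, and the function names are made up for illustration.

```python
RECORDS_PER_DAY = 73_200  # figure taken from the question


def rows_after(days: int) -> int:
    """Total rows accumulated after the given number of days."""
    return RECORDS_PER_DAY * days


def days_to_reach(target_rows: int) -> float:
    """Days needed to accumulate target_rows at the constant daily rate."""
    return target_rows / RECORDS_PER_DAY


rows_in_100_days = rows_after(100)
rows_per_year = rows_after(365)
years_to_100m = days_to_reach(100_000_000) / 365

print(f"rows after 100 days: {rows_in_100_days:,}")
print(f"rows per year:       {rows_per_year:,}")
print(f"years to 100M rows:  {years_to_100m:.2f}")
```

Under that assumption the table gains a bit under 27 million rows a year and needs a little under 4 years to reach 100 million.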
#3
3
Absolutely not.
It will ruin the relationships between tables.
Table relations are built on field values, not on table names.
Especially for this very table, which will grow by just 300 MB/year.
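The answer does not say how the 300 MB/year figure was obtained. One way to arrive at something close is a back-of-the-envelope estimate, assuming the id and the integer are both stored as 4-byte MySQL INT columns and the date as a 3-byte DATE, and ignoring per-row overhead and indexes (which would add to the total). The column types are an assumption, not something stated in the question.

```python
RECORDS_PER_DAY = 73_200  # figure taken from the question

# Assumed column types (not stated in the question): INT, DATE, INT.
# MySQL stores an INT in 4 bytes and a DATE in 3 bytes.
BYTES_PER_ROW = 4 + 3 + 4

rows_per_year = RECORDS_PER_DAY * 365
raw_bytes_per_year = rows_per_year * BYTES_PER_ROW

print(f"raw data per year: {raw_bytes_per_year / 1_000_000:.1f} MB")
```

This counts column data only, so the real on-disk size will be larger once row headers and indexes are included.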
#4
3
So in 100 days you have 7.3M rows, and roughly 27M a year. 27M rows isn't a lot anymore. MySQL can handle tables with millions of rows. It really depends on your hardware, your query types, and your query frequency.
But you should be able to partition that table (if MySQL supports partitioning). What you're describing is an old SQL Server method of partitioning: after building those monthly tables, you'd build a view that concatenates them together to look like one big table... which is essentially what partitioning does, but it's all under the covers and fully optimized.
#5
3
Usually this creates more trouble than it's worth: it's more maintenance, your queries need more logic, and it's painful to pull data from more than one period.
We store 200+ million time-based records in one (MyISAM) table, and queries are still blazingly fast.
You just need to ensure there's an index on your time/date column and that your queries make use of the index (e.g. a query that messes around with DATE_FORMAT or similar on a date column will likely not use an index). I wouldn't put them in separate tables just for the sake of retrieval performance.
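The point about wrapping the date column in a function can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs without a server; SQLite's `strftime` plays the role of MySQL's DATE_FORMAT here, and MySQL's own planner output (via EXPLAIN) looks different. The table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings ("
    "id INTEGER PRIMARY KEY, recorded_on TEXT NOT NULL, value INTEGER NOT NULL)"
)
conn.execute("CREATE INDEX idx_readings_recorded_on ON readings (recorded_on)")


def plan(sql: str) -> str:
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text


# Function applied to the indexed column: the index cannot be used for lookup.
wrapped = plan(
    "SELECT value FROM readings "
    "WHERE strftime('%Y-%m', recorded_on) = '2010-09'"
)

# Same month expressed as a plain range on the column: the index is usable.
ranged = plan(
    "SELECT value FROM readings "
    "WHERE recorded_on >= '2010-09-01' AND recorded_on < '2010-10-01'"
)

print("function on column:", wrapped)
print("plain range:       ", ranged)
```

The first plan is a full scan of the table, while the second searches through the index. The same rewrite (a half-open range instead of a formatting function) is the usual way to keep a date filter index-friendly.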
One thing that gets very painful with such a large number of records is deleting old data. This can take a long time (10 minutes to 2 hours for e.g. wiping a month's worth of data in tables with hundreds of millions of rows). For that reason we've partitioned the tables, and we use a time_dimension relation table (see e.g. the time_dimension table a bit down here) for managing the periods instead of simple date/datetime columns or strings/varchars representing dates.
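To make the partitioning idea concrete, here is a rough sketch that generates MySQL range-partitioning statements, one partition per month. It partitions directly on a date column rather than using the time_dimension approach the answer describes, which is a simplification. The table name `readings` and column name `recorded_on` are hypothetical, and the generated statements have not been run against a MySQL server here. Note also that MySQL requires the partitioning column to be part of every unique key on the table, including the primary key.

```python
from datetime import date


def next_month(d: date) -> date:
    """First day of the month following d."""
    if d.month == 12:
        return date(d.year + 1, 1, 1)
    return date(d.year, d.month + 1, 1)


def partition_clause(first_month: date, months: int) -> str:
    """Build a PARTITION BY RANGE clause with one partition per month."""
    parts = []
    current = first_month
    for _ in range(months):
        upper = next_month(current)
        parts.append(
            f"  PARTITION p{current:%Y%m} "
            f"VALUES LESS THAN (TO_DAYS('{upper.isoformat()}'))"
        )
        current = upper
    # Catch-all partition for rows beyond the last explicit month.
    parts.append("  PARTITION pmax VALUES LESS THAN MAXVALUE")
    return (
        "PARTITION BY RANGE (TO_DAYS(recorded_on)) (\n"
        + ",\n".join(parts)
        + "\n)"
    )


def drop_partition(table: str, month: date) -> str:
    """Statement that discards a whole month without a row-by-row DELETE."""
    return f"ALTER TABLE {table} DROP PARTITION p{month:%Y%m}"


clause = partition_clause(date(2010, 9, 1), 2)
drop_stmt = drop_partition("readings", date(2010, 9, 1))
print(clause)
print(drop_stmt)
```

Dropping a partition removes that month's data as a single metadata-level operation, which is why it avoids the long-running deletes described above.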
#6
2
"Some members of my team suggest creating tables using months as the table name (september_2010), while others are suggesting having one table with lots of data in it..."
Don't listen to them. You're already storing a date stamp, so what is it about different months that makes it a good idea to split the data that way? The engine will handle the larger data sets just fine, so splitting by month does nothing but artificially segregate the data.
#7
1
My first reaction is: Aaaaaaaaahhhhhhhhh!!!!!!
Table names should not embed data values. You don't say what the data means, but suppose for the sake of argument that it is, I don't know, temperature readings. Just imagine trying to write a query to find all the months in which the average temperature increased over the previous month. You'd have to loop through table names. Worse yet, imagine trying to find all 30-day periods -- i.e. periods that might cross month boundaries -- where the temperature increased over the previous 30-day period.
Indeed, just retrieving an old record would go from a trivial operation -- "select * where id=whatever" -- to a complex one requiring the program to generate table names from the date on the fly. If you didn't know the date, you would have to scan through all the tables, searching each one for the desired record. Yuck.
With all the data in one properly normalized table, queries like the above are pretty trivial. With separate tables for each month, they're a nightmare.
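The contrast can be sketched with a few rows of made-up data. The example below uses Python's built-in sqlite3 module as a self-contained stand-in (the SQL for MySQL would use its own date functions), and the table name, column names, and sample values are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings ("
    "id INTEGER PRIMARY KEY, recorded_on TEXT NOT NULL, value INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO readings (recorded_on, value) VALUES (?, ?)",
    [("2010-08-15", 10), ("2010-08-20", 20), ("2010-09-05", 30), ("2010-09-25", 50)],
)

# One table: monthly averages are a single GROUP BY.
monthly_avg = conn.execute(
    "SELECT strftime('%Y-%m', recorded_on) AS month, AVG(value) "
    "FROM readings GROUP BY month ORDER BY month"
).fetchall()

# Months whose average increased over the previous month.
increased = [
    month
    for (_, previous), (month, current) in zip(monthly_avg, monthly_avg[1:])
    if current > previous
]


# One table per month: the application has to assemble table names itself
# and stitch the tables back together before it can ask the same question.
def union_all_query(table_names):
    return " UNION ALL ".join(
        f"SELECT recorded_on, value FROM {name}" for name in table_names
    )


stitched = union_all_query(["august_2010", "september_2010"])

print(monthly_avg)
print(increased)
print(stitched)
```

With a single table the query text never changes as time passes. With monthly tables the list of names grows every month, and every query that spans periods has to be regenerated.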
Just make the date part of the index, and the performance penalty of having all the records in one table should be very small. If the size of the table really becomes a performance problem, I could dimly comprehend making one table for archive data with all the old stuff and one for current data with everything you retrieve regularly. But don't create hundreds of tables. Most database engines have ways to partition your data across multiple drives using "table spaces" or the like. Use the sophisticated features of the database if necessary, rather than hacking together a crude simulation.
#8
0
Depends on what searches you'll need to do. If they are normally constrained by date, splitting is good.
If you do split, consider naming the tables like foo_2010_09 so the tables will sort alphanumerically.
#9
0
What is your DB platform?
In SQL Server 2K5+ you can partition on date.
My bad, I didn't notice the tag. @thetaiko is right though, and this is well within MySQL's capabilities.
#10
0
I would say it depends on how the data is used. If most queries run over the complete data, always joining the tables back together again would be an overhead. If most of the time you only need a part of the data (by date), it is a good idea to segment the tables into smaller pieces.
For the naming I would use tablename_yyyymm.
Edit: You should then certainly also think about another layer between the DB and your app to handle the segmented tables depending on some given date, which can get pretty complicated.
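As a rough idea of what such an intermediate layer has to do, the sketch below maps a date, or a date range, to table names following the tablename_yyyymm convention. The `readings` prefix and both function names are hypothetical, and a real layer would also have to create missing tables and build the actual queries.

```python
from datetime import date
from typing import List


def table_for(day: date, prefix: str = "readings") -> str:
    """Name of the monthly table that holds rows for the given day."""
    return f"{prefix}_{day:%Y%m}"


def tables_for_range(start: date, end: date, prefix: str = "readings") -> List[str]:
    """Names of every monthly table touched by the inclusive range start..end."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}_{year:04d}{month:02d}")
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return names


single = table_for(date(2010, 9, 17))
spanning = tables_for_range(date(2010, 11, 20), date(2011, 1, 10))
print(single)
print(spanning)
```

Even this small piece has to get month and year rollover right, which hints at why the answer warns that the layer can get complicated.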
#11
0
I'd suggest dropping the year and just having one table per month, named after the month. Archive your data annually by renaming all the tables $MONTH_$YEAR and re-creating the month tables. Or, since you're storing a timestamp with your data, just keep appending to the same tables. I assume by virtue of the fact that you're asking the question in the first place, that segregating your data by month fits your reporting requirements. If not, then I'd recommend keeping it all in one table and periodically archiving off historical records when performance gets to be an issue.
#12
0
I agree that this idea complicates your database needlessly. Use a single table. As others have pointed out, it's not nearly enough data to warrant extraneous handling. Unless you use SQLite, your database will handle it well.
However, it also depends on how you want to access it. If the old entries are really only there for archival purposes, then the archive pattern is an option. It's common for versioning systems to have the infrequently used data separated out. In your case you'd only want everything older than 1 year to move out of the main table. And this is strictly a database administration task, not an application behavior. The application would only join the current list and the _archive list, if at all. Again, this highly depends on the use case. Are the old entries generally needed? Is there too much data to process regularly?
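A minimal sketch of the archive pattern might look like the following. It uses Python's built-in sqlite3 module only so that the example is self-contained and runnable, not as a recommendation for the production database. The table names, the `archive_older_than` helper, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE readings (
        id INTEGER PRIMARY KEY, recorded_on TEXT NOT NULL, value INTEGER NOT NULL);
    CREATE TABLE readings_archive (
        id INTEGER PRIMARY KEY, recorded_on TEXT NOT NULL, value INTEGER NOT NULL);
    INSERT INTO readings (recorded_on, value)
    VALUES ('2009-06-01', 1), ('2010-03-01', 2), ('2010-09-01', 3);
    """
)


def archive_older_than(connection: sqlite3.Connection, cutoff: str) -> int:
    """Move rows dated before cutoff into the archive table; return how many moved."""
    with connection:  # commits both statements together, rolls back on error
        connection.execute(
            "INSERT INTO readings_archive "
            "SELECT id, recorded_on, value FROM readings WHERE recorded_on < ?",
            (cutoff,),
        )
        moved = connection.execute(
            "DELETE FROM readings WHERE recorded_on < ?", (cutoff,)
        ).rowcount
    return moved


# Pretend "today" is 2010-09-15, so the cutoff is one year earlier.
moved = archive_older_than(conn, "2009-09-15")
current_rows = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
archived_rows = conn.execute("SELECT COUNT(*) FROM readings_archive").fetchone()[0]
print(moved, current_rows, archived_rows)
```

Run as a scheduled administration job, this keeps the main table limited to recent data while the application keeps querying a single, fixed table name.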