I have a large database of normalized order data that is becoming very slow to query for reporting. Many of the queries I use in reports join five or six tables and have to examine tens or hundreds of thousands of rows.
There are lots of queries and most have been optimized as much as possible to reduce server load and increase speed. I think it's time to start keeping a copy of the data in a denormalized format.
Any ideas on an approach? Should I start with a couple of my worst queries and go from there?
8 Answers
#1
10
I know more about mssql than mysql, but I don't think the number of joins or number of rows you are talking about should cause you too many problems with the correct indexes in place. Have you analyzed the query plan to see if you are missing any?
http://dev.mysql.com/doc/refman/5.0/en/explain.html
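As a quick sketch of what that looks like (the orders/customers/order_items table and column names here are hypothetical, not taken from your schema):

-- Hypothetical reporting join; table and column names are illustrative only.
explain
select c.name, o.order_date, sum(oi.quantity * oi.unit_price) as total
from orders o
join customers c on c.id = o.customer_id
join order_items oi on oi.order_id = o.id
where o.order_date between '2008-01-01' and '2008-12-31'
group by c.name, o.order_date;
-- In the output, a NULL "key" or a very large "rows" estimate on one of the
-- joined tables usually points to a missing index.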
That being said, once you are satisfied with your indexes and have exhausted all other avenues, de-normalization might be the right answer. If you just have one or two problem queries, a manual approach is probably appropriate, whereas some sort of data warehousing tool might be better for creating a platform to develop data cubes.
Here's a site I found that touches on the subject:
http://www.meansandends.com/mysql-data-warehouse/?link_body%2Fbody=%7Bincl%3AAggregation%7D
Here's a simple technique you can use to keep denormalizing queries simple, if you're just doing a few at a time (this doesn't replace your OLTP tables; it just creates a new one for reporting purposes). Let's say you have this query in your application:
select a.name, b.address from tbla a
join tblb b on b.fk_a_id = a.id where a.id=1
You could create a denormalized table and populate with almost the same query:
create table tbl_ab (a_id, a_name, b_address);
-- (types elided)
Notice the underscores match the table aliases you use
insert tbl_ab select a.id, a.name, b.address from tbla a
join tblb b on b.fk_a_id = a.id
-- no where clause because you want everything
Then to fix your app to use the new denormalized table, switch the dots for underscores.
select a_name as name, b_address as address
from tbl_ab where a_id = 1;
For huge queries this can save a lot of time and makes it clear where the data came from, and you can re-use the queries you already have.
Remember, I'm only advocating this as the last resort. I bet there's a few indexes that would help you. And when you de-normalize, don't forget to account for the extra space on your disks, and figure out when you will run the query to populate the new tables. This should probably be at night, or whenever activity is low. And the data in that table, of course, will never exactly be up to date.
[Yet another edit] Don't forget that the new tables you create need to be indexed too! The good part is that you can index to your heart's content and not worry about update lock contention, since aside from your bulk insert the table will only see selects.
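For example (the index names below are just placeholders), the tbl_ab table from the sketch above might be indexed on whatever columns the reports filter on:

-- The reporting table only ever sees a bulk insert followed by selects,
-- so extra indexes are cheap to maintain; index whatever the reports filter on.
create index idx_tbl_ab_a_id on tbl_ab (a_id);
create index idx_tbl_ab_b_address on tbl_ab (b_address);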
#2
2
In line with some of the other comments, I would definitely have a look at your indexing.
One thing I discovered earlier this year on our MySQL databases was the power of composite indexes. For example, if you are reporting on order numbers over date ranges, a composite index on the order number and order date columns could help. I believe MySQL can generally only use one index per table in a query, so if you just had separate indexes on the order number and the order date it would have to pick just one of them to use. Using the EXPLAIN command can help determine this.
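As a sketch (the orders table and column names below are assumptions, not taken from the question), such a composite index would look like this:

-- One composite index covers filters on order_number alone as well as
-- filters on order_number plus order_date.
alter table orders add index idx_order_no_date (order_number, order_date);

-- EXPLAIN shows idx_order_no_date in the "key" column when it is chosen:
explain select *
from orders
where order_number = 12345
  and order_date between '2008-01-01' and '2008-03-31';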
To give an indication of the performance with good indexes (including numerous composite indexes), I can run queries joining 3 tables in our database and get almost instant results in most cases. For more complex reporting, most of the queries run in under 10 seconds. These 3 tables have 33 million, 110 million and 140 million rows respectively. Note that we had also already denormalised these slightly to speed up our most common query on the database.
More information regarding your tables and the types of reporting queries may allow further suggestions.
#3
1
I know this is a bit tangential, but have you tried seeing if there are more indexes you can add?
I don't have a lot of DB background, but I am working with databases a lot recently, and I've been finding that a lot of the queries can be improved just by adding indexes.
We are using DB2, which has commands called db2expln and db2advis: the first will indicate whether table scans or index scans are being used, and the second will recommend indexes you can add to improve performance. I'm sure MySQL has similar tools...
Anyways, if this is something you haven't considered yet, it has helped me a lot... but if you've already gone this route, then I guess it's not what you are looking for.
Another possibility is a "materialized view" (or whatever DB2 calls it), which lets you define a table that is essentially built from pieces of multiple tables. Thus, rather than denormalizing into actual columns, you could provide this view to access the data... but I don't know if this has severe performance impacts on inserts/updates/deletes (though if it is "materialized", it should help with selects, since the values are physically stored separately).
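MySQL doesn't have materialized views built in, but a rough equivalent is an ordinary table rebuilt on a schedule (from cron or the event scheduler). A minimal sketch, reusing the hypothetical tbla/tblb names from answer #1:

-- "Materialized view" simulated as a plain table, refreshed off-peak.
create table mv_ab as
select a.id as a_id, a.name as a_name, b.address as b_address
from tbla a
join tblb b on b.fk_a_id = a.id;

-- Periodic refresh:
truncate table mv_ab;
insert into mv_ab
select a.id, a.name, b.address
from tbla a
join tblb b on b.fk_a_id = a.id;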
#4
1
MySQL 5 does support views, which may be helpful in this scenario. It sounds like you've already done a lot of optimizing, but if not, you can use MySQL's EXPLAIN syntax to see which indexes are actually being used and what is slowing down your queries.
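For instance, a plain (non-materialized) view over the hypothetical tbla/tblb pair from answer #1 keeps the reporting SQL simple, although MySQL still runs the underlying join each time the view is queried:

create view v_ab as
select a.id as a_id, a.name as a_name, b.address as b_address
from tbla a
join tblb b on b.fk_a_id = a.id;

-- Reports can then treat the view like a single table:
select a_name, b_address from v_ab where a_id = 1;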
As far as going about denormalizing the data (whether you're using views or just duplicating data in a more efficient manner), I think starting with the slowest queries and working your way through is a good approach to take.
#5
1
For MySQL I like this talk: Real World Web: Performance & Scalability, MySQL Edition. This contains a lot of different pieces of advice for getting more speed out of MySQL.
#6
0
You might also want to consider selecting into a temporary table and then performing queries on that temporary table. This would avoid the need to rejoin your tables for every single query you issue (assuming that you can use the temporary table for numerous queries, of course). This basically gives you denormalized data, but if you are only doing select calls, there's no concern about data consistency.
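A minimal sketch of that idea, again borrowing the hypothetical tbla/tblb names from answer #1:

-- Do the join once into a session-scoped temporary table...
create temporary table tmp_ab as
select a.id as a_id, a.name as a_name, b.address as b_address
from tbla a
join tblb b on b.fk_a_id = a.id;

-- ...then run several report queries against it without repeating the join.
select count(*) from tmp_ab;
select a_name, b_address from tmp_ab where a_id = 1;

The temporary table disappears automatically when the connection closes.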
#7
0
Further to my previous answer, another approach we have taken in some situations is to store key reporting data in separate summary tables. There are certain reporting queries which are just going to be slow even after denormalising and optimising, and we found that creating a table and storing running totals or summary information throughout the month, as the data came in, made the end-of-month reporting much quicker as well.
We found this approach easy to implement as it didn't break anything that was already working - it's just additional database inserts at certain points.
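A sketch of what such a summary table might look like (the table, column names and the daily grain are assumptions; adapt them to the totals your reports actually need):

-- Running totals keyed by day; rows are created or bumped as orders arrive.
create table order_daily_totals (
    order_day   date not null primary key,
    order_count int unsigned not null default 0,
    order_total decimal(12,2) not null default 0
);

-- At the relevant insert points in the application, update the totals
-- (99.50 is just a placeholder order amount):
insert into order_daily_totals (order_day, order_count, order_total)
values (current_date, 1, 99.50)
on duplicate key update
    order_count = order_count + 1,
    order_total = order_total + values(order_total);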
#8
0
I've been toying with composite indexes and have seen some real benefits... maybe I'll set up some tests to see if that can save me here... at least for a little longer.