Why did SQL Server suddenly decide to use such a terrible execution plan?

Background

We recently had an issue with the query plans SQL Server was choosing for one of our larger tables (around 175,000,000 rows). The column and index structure of the table has not changed in more than 5 years.

The table and indexes look like this:

create table responses (
    response_uuid uniqueidentifier not null,
    session_uuid uniqueidentifier not null,
    create_datetime datetime not null,
    create_user_uuid uniqueidentifier not null,
    update_datetime datetime not null,
    update_user_uuid uniqueidentifier not null,
    question_id int not null,
    response_data varchar(4096) null,
    question_type_id varchar(3) not null,
    question_length tinyint null,
    constraint pk_responses primary key clustered (response_uuid),
    constraint idx_responses__session_uuid__question_id unique nonclustered (session_uuid asc, question_id asc) with (fillfactor=80),
    constraint fk_responses_sessions__session_uuid foreign key(session_uuid) references dbo.sessions (session_uuid),
    constraint fk_responses_users__create_user_uuid foreign key(create_user_uuid) references dbo.users (user_uuid),
    constraint fk_responses_users__update_user_uuid foreign key(update_user_uuid) references dbo.users (user_uuid)
)

create nonclustered index idx_responses__session_uuid_fk on responses(session_uuid) with (fillfactor=80)

The query that was performing poorly (~2.5 minutes instead of the normal <1 second performance) looks like this:

SELECT 
[Extent1].[response_uuid] AS [response_uuid], 
[Extent1].[session_uuid] AS [session_uuid], 
[Extent1].[create_datetime] AS [create_datetime], 
[Extent1].[create_user_uuid] AS [create_user_uuid], 
[Extent1].[update_datetime] AS [update_datetime], 
[Extent1].[update_user_uuid] AS [update_user_uuid], 
[Extent1].[question_id] AS [question_id], 
[Extent1].[response_data] AS [response_data], 
[Extent1].[question_type_id] AS [question_type_id], 
[Extent1].[question_length] AS [question_length]
FROM [dbo].[responses] AS [Extent1]
WHERE [Extent1].[session_uuid] = @f6_p__linq__0;

(The query is generated by Entity Framework and executed using sp_executesql.)
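
Because the statement goes through sp_executesql, its plan is compiled once and then reused from the plan cache for later parameter values. A minimal sketch for seeing which plan is currently cached for this statement, assuming VIEW SERVER STATE permission; the LIKE filter is only an illustrative way to locate the statement text and will also match this diagnostic query itself:

```sql
-- Find cached plans for statements that touch responses and session_uuid.
-- The LIKE pattern is an assumption about how the statement text looks.
SELECT TOP (20)
    qs.creation_time,            -- when this plan was compiled
    qs.last_execution_time,
    qs.execution_count,
    qs.total_elapsed_time / qs.execution_count AS avg_elapsed_microseconds,
    st.text       AS statement_text,
    qp.query_plan AS cached_plan_xml
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle)    AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE st.text LIKE N'%responses%session_uuid%'
ORDER BY qs.last_execution_time DESC;
```

In the plan XML, the ParameterCompiledValue attribute shows the parameter value the plan was compiled for, which helps when checking whether the plan was built for an unrepresentative value.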

The execution plan during the poor performance period looked like this:

[Image: execution plan during the poor-performance period]

Some background on the data: the query above never returns more than 400 rows. In other words, filtering on session_uuid is highly selective and pares the result set down to a tiny fraction of the table.

Some background on scheduled maintenance: a scheduled job runs weekly to update the database's statistics and rebuild the table's indexes. The job runs a script that looks like this:

alter index all on responses rebuild with (fillfactor=80)

The resolution for the performance problem was to run the rebuild index script (above) on this table.
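
An index rebuild has two side effects that could each explain the fix: it refreshes the statistics on the rebuilt indexes, and it typically causes cached plans that reference the table to be recompiled. A rough sketch for testing those two effects separately and more cheaply than a full rebuild, assuming a SQL Server version that provides sys.dm_db_stats_properties; this is one way to investigate, not a confirmed diagnosis:

```sql
-- 1. How fresh are the statistics on dbo.responses?
SELECT
    s.name AS stats_name,
    sp.last_updated,
    sp.rows,
    sp.rows_sampled,
    sp.modification_counter      -- changes since the last stats update
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'dbo.responses');

-- 2. Refresh statistics only (no index rebuild).
UPDATE STATISTICS dbo.responses WITH FULLSCAN;

-- 3. Or leave statistics alone and just force new plans for the table.
EXEC sp_recompile N'dbo.responses';
```

If step 3 alone restores the fast plan, the cached plan was the problem rather than the statistics.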

Other possibly relevant tidbits: the data distribution did not change at all since the last index rebuild. There are no joins in the query. We're a SaaS shop with 50 to 100 live production databases that share exactly the same schema, some with more data, some with less, all with the same queries executing against them, spread across a few SQL Server instances.

Question:

What could have happened that would make SQL Server start using this terrible execution plan in this particular database?

Keep in mind the problem was solved by simply rebuilding the indexes on the table.

Maybe a better question is "under what circumstances would SQL Server stop using an index?"

Another way of looking at it: why would the optimizer stop using an index that was rebuilt a few days earlier, and then start using it again after we noticed the bad query plan and did an emergency rebuild of that index?

1 Answer

#1

This is too long for a comment.

The reason is simple: the optimizer changes its mind on what the best plan is. This can be due to subtle changes in the distribution of the data (or other reasons, such as a type incompatibility in a join key). I wish there were a tool that not only gave the execution plan for a query but also showed thresholds for how close you are to another execution plan. Or a tool that would let you stash an execution plan and give an alert if the same query starts using a different plan.
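
On SQL Server 2016 and later, Query Store covers much of this wish: it keeps the plans each query has used and can pin one of them. A minimal sketch, assuming Query Store is enabled on the database (the original post does not say whether it is); the IDs in the last statement are placeholders:

```sql
-- Queries that have been compiled with more than one plan.
WITH multi_plan AS (
    SELECT p.query_id, COUNT(*) AS plan_count
    FROM sys.query_store_plan AS p
    GROUP BY p.query_id
    HAVING COUNT(*) > 1
)
SELECT m.query_id, m.plan_count, qt.query_sql_text
FROM multi_plan AS m
JOIN sys.query_store_query      AS q  ON q.query_id = m.query_id
JOIN sys.query_store_query_text AS qt ON qt.query_text_id = q.query_text_id
ORDER BY m.plan_count DESC;

-- Pin a known-good plan for one query (placeholder IDs; take real ones
-- from the result above and from sys.query_store_plan).
-- EXEC sp_query_store_force_plan @query_id = 1, @plan_id = 1;
```

Forcing a plan is a stopgap: it keeps the query stable while the reason for the plan change is investigated.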

I've asked myself this exact same question on more than one occasion. You have a system that's running nightly, for months on end. It processes lots of data using really complicated queries. Then, one day, you come in in the morning and the job that normally finishes by 11:00 p.m. is still running. Arrrggg!

The solution that we came up with was to use explicit join hints for the joins that had gone wrong: option (merge join, hash join). We also started saving the execution plans for all our complex queries, so we could compare changes from one night to the next. In the end, this was of more academic than practical interest: by the time the plans changed, we were already suffering from a bad execution plan.
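
As a sketch of what such a hint looks like, the query below joins the responses and sessions tables from the question and restricts the optimizer to merge or hash joins. The join itself and the variable are made up for illustration; the query in the original post has no join:

```sql
DECLARE @session_uuid uniqueidentifier = NEWID();  -- placeholder value

SELECT r.response_uuid, r.question_id, r.response_data
FROM dbo.responses AS r
JOIN dbo.sessions  AS s
    ON s.session_uuid = r.session_uuid
WHERE s.session_uuid = @session_uuid
OPTION (MERGE JOIN, HASH JOIN);   -- nested loops joins are ruled out
```

A query-level join hint applies to every join in the statement, so it trades optimizer flexibility for plan stability.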
