I have a table of error logs with around 300 million rows. There is an index on the Date column but I am trying to query by both date and error message. When I query by date it is fast but I need to query by message as well which slows it down.
My query is as follows:
    WITH data_cte (errorhour, message) AS (
        SELECT DATEPART(hh, date) AS errorhour,
               message
        FROM cloud.errorlog
        WHERE date <= '2016-06-02'
          AND date >= '2016-06-01'
    )
    SELECT errorhour,
           COUNT(*) AS count,
           message
    FROM data_cte
    WHERE message = 'error connecting to the server'
    GROUP BY errorhour, message
    ORDER BY errorhour
Adding the WHERE clause slows it down because Message is not indexed. How can I speed it up?
EDIT: I cannot index on Message because it is defined as varchar(max).
4 Solutions
#1
1
If you will ALWAYS be searching for the text 'error connecting to the server', then you can use a filtered index:
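    -- Filter predicates accept only simple comparisons, so the range is spelled out instead of using BETWEEN.
    CREATE INDEX ix_ectts ON ErrorLog ([Date])
    WHERE [Date] >= '2016-06-01' AND [Date] <= '2016-06-02'
      AND Message = 'error connecting to the server';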
This index should be fairly small, since it holds only the rows that match its predicate, and quick to consult. It may be fairly slow to update, however; consider creating it each time you need to run this query and dropping it afterward.
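The create, query, drop cycle described above might look roughly like the following. This is only a sketch: the index and table names are taken from the answer, the hourly aggregate mirrors the question's query, and whether a varchar(max) column is accepted in the filter predicate on your version of SQL Server is an assumption that should be tested first.

```sql
-- Sketch only. Assumption: the filter predicate on Message is accepted for a varchar(max) column.
CREATE INDEX ix_ectts ON ErrorLog ([Date])
WHERE [Date] >= '2016-06-01'
  AND [Date] <= '2016-06-02'
  AND Message = 'error connecting to the server';

-- The query repeats the index predicate with the same literal values
-- so the optimizer can match it to the filtered index.
SELECT DATEPART(hour, [Date]) AS ErrorHour,
       COUNT(*)               AS ErrorCount
FROM ErrorLog
WHERE [Date] >= '2016-06-01'
  AND [Date] <= '2016-06-02'
  AND Message = 'error connecting to the server'
GROUP BY DATEPART(hour, [Date])
ORDER BY ErrorHour;

-- Drop the index afterward so later inserts do not pay to maintain it.
DROP INDEX ix_ectts ON ErrorLog;
```

Building the index still has to read the table once, so this approach pays off mainly when the query is run several times against the same date range.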
Another choice is to use a computed column on the first few hundred characters of Message, and index on that:
ALTER TABLE ErrorLog
ADD Message_index AS (cast (Message as varchar(400)));
CREATE INDEX theIndex ON ErrorLog (Message_index, [date]);
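For the index to help, the query has to filter on the computed column rather than on Message itself. A minimal sketch, assuming the Message_index column and theIndex defined above and reusing the hourly aggregate from the question:

```sql
-- Assumes the Message_index computed column and theIndex from the answer above.
SELECT DATEPART(hour, [date]) AS ErrorHour,
       COUNT(*)               AS ErrorCount
FROM ErrorLog
WHERE Message_index = 'error connecting to the server'  -- equality on the leading index key
  AND [date] >= '2016-06-01'                            -- range on the second index key
  AND [date] <= '2016-06-02'
GROUP BY DATEPART(hour, [date])
ORDER BY ErrorHour;
```

Because the search text is shorter than 400 characters, a match on Message_index implies a match on the full Message. A search string of 400 characters or more would need an additional comparison against Message.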
EDIT: added missing parentheses after cast
#2
2
Just create a composite index on (date, message) and filter inside the CTE, not outside it.
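    WITH data_cte (errorhour, message) AS (
        SELECT DATEPART(hh, date) AS errorhour,
               message
        FROM cloud.errorlog
        WHERE date BETWEEN '2016-06-01' AND '2016-06-02'
          AND message = 'error connecting to the server'
    )
    -- Outer query carried over from the question, without the message filter.
    SELECT errorhour,
           COUNT(*) AS count
    FROM data_cte
    GROUP BY errorhour
    ORDER BY errorhour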
#3
0
If it is possible to extract a short summary of the error message, you could write it to a new column, say error_summary, as part of the INSERT into the log. You could then index that column and use it in the SELECT.
You'd parse the full error message and strip out timestamps, user IDs, and specifics such as server names and maybe stack traces. If there is no clear way to parse a message, leave error_summary as null. You could then do a preliminary search on error_summary and fall back to a search on Message if that failed.
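As a rough sketch of that idea: the column length, the index name, and the assumption that the summary for this error equals the literal message text are all hypothetical, and the code that fills error_summary at INSERT time is not shown. GO is the batch separator used by SSMS and sqlcmd; it is needed here because a new column cannot be referenced in the batch that adds it.

```sql
-- Hypothetical schema change: the varchar(200) length is an assumption.
ALTER TABLE cloud.errorlog ADD error_summary varchar(200) NULL;
GO

-- Hypothetical index name; equality column first, range column second.
CREATE INDEX ix_errorlog_summary_date
    ON cloud.errorlog (error_summary, [date]);
GO

-- Preliminary search on the indexed summary.
SELECT DATEPART(hour, [date]) AS ErrorHour,
       COUNT(*)               AS ErrorCount
FROM cloud.errorlog
WHERE error_summary = 'error connecting to the server'
  AND [date] >= '2016-06-01' AND [date] <= '2016-06-02'
GROUP BY DATEPART(hour, [date])
ORDER BY ErrorHour;

-- Fallback: rows that could not be summarized are still searched by the full message.
SELECT DATEPART(hour, [date]) AS ErrorHour,
       COUNT(*)               AS ErrorCount
FROM cloud.errorlog
WHERE error_summary IS NULL
  AND [date] >= '2016-06-01' AND [date] <= '2016-06-02'
  AND message = 'error connecting to the server'
GROUP BY DATEPART(hour, [date])
ORDER BY ErrorHour;
```

The fallback is only as fast as the original query, so the benefit depends on how many rows end up with a null summary.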
#4
0
You can simplify the query to:
    SELECT DATEPART(day, date)  AS ErrorDay,
           DATEPART(hour, date) AS ErrorHour,
           COUNT(*)
    FROM cloud.errorlog
    WHERE date <= '2016-06-02' AND date >= '2016-06-01' AND
          message = 'error connecting to the server'
    GROUP BY DATEPART(day, date), DATEPART(hour, date);
Then for this query, you want an index on errorlog(message, date). It is important that message be first in the index because it is compared by equality, which lets the date range be read as one contiguous slice of the index.
EDIT:
If the message is too long and you want queries like this, I would recommend adding a computed column and using that for the index and the WHERE clause:
    ALTER TABLE cloud.errorlog ADD message250 AS (LEFT(message, 250));
    CREATE INDEX idx_errlog_message250_date ON cloud.errorlog (message250, date);
And then write the query as:
    SELECT DATEPART(day, date)  AS ErrorDay,
           DATEPART(hour, date) AS ErrorHour,
           COUNT(*)
    FROM cloud.errorlog
    WHERE date <= '2016-06-02' AND date >= '2016-06-01' AND
          message250 = 'error connecting to the server'
    GROUP BY DATEPART(day, date), DATEPART(hour, date);