温馨提示:如果是要通过分析跟踪文件得到消耗高的语句/存储过程,不需要花哨的图表数据,建议移步至ClearTrace^_-
很早以前有看到过关于使用RML Utilities工具分析SQL Trace(.trc)的文章,但一直没有具体实践。最近接管一台数据库服务器,跟踪出一批高消耗的语句,老大需要跟踪分析报表,罗列出过程(语句)按CPU、Duration、Reads的执行情况(总消耗、平均消耗、执行次数)。很多存储过程调用或动态语句总体是一样的,只是带的参数会有不同。所以重点是针对这一类的跟踪结果进行处理。
一开始是使用fn_trace_gettable对跟踪文件进行分析,按TextData排序,通过截取TextData左边N位尝试分组,观察总体相同的过程或语句中的关键字,逐一得出跟踪分析报表。之后再次使用这种方法分析了跟踪文件,感觉这样分析太过于傻瓜,而且耗时。于是搜索RML Utilities的使用方法:
Step1:生成trc文件
开启服务端跟踪,对生产服务器进行跟踪一段时间,生成HighDuration20151212.trc文件,将文件从服务器拷贝到本地F:\TroubleShooting\Trace\HighDuration20151212.trc
Step2:打开RML Utilities Command Prompt工具
该工具在:开始->所有程序->RML Utilities for SQL Server->RML Cmd Prompt 或者进入安装目录(我的安装目录是:D:\Program Files\Microsoft Corporation\RMLUtils\)双击RML.cmd文件,打开的界面如图:
Step3:开启本地SQL server 服务
我本地安装有SQL08R2和SQL12两个实例,本次使用SQL08R2实例。
Step4:运行导入命令
ReadTrace -I"F:\TroubleShooting\Trace\HighDuration20151212.trc" -o"F:\TroubleShooting\Trace\output" -S"127.0.0.1,7777" -d"PerfAnalysis_CL" -E
注意,这里的参数是区分大小写的。具体参数说明(-I:输入文件,-o:输出文件,-S:服务器,-d:数据库,-E:windows认证),-d后面的数据库如果存在,先删除库下ReadTrace架构下的表,然后重新创建此类表;-d后面的数据库如果不存在,则创建指定数据库,并创建ReadTrace相关表。
输出结果如下:
12/12/15 16:56:41.390 [0X00002E94] Readtrace a SQL Server trace processing utility.
Version 9.04.0013 built for x86.
Copyright © 1997-2014 Microsoft. All Rights Reserved
12/12/15 16:56:41.392 [0X00002E94] Computer: WORK
12/12/15 16:56:41.392 [0X00002E94] Base Module: D:\Program Files\Microsoft Corporation\RMLUtils\ReadTrace.exe
12/12/15 16:56:41.392 [0X00002E94] Process Id: 11852
12/12/15 16:56:41.393 [0X00002E94] Active proc mask(0): 0x0000000F
12/12/15 16:56:41.393 [0X00002E94] Architecture: 0
12/12/15 16:56:41.393 [0X00002E94] Page size: 4096
12/12/15 16:56:41.394 [0X00002E94] CPUs: 4
12/12/15 16:56:41.394 [0X00002E94] Processor groups: 1
12/12/15 16:56:41.394 [0X00002E94] Highest node: 0
12/12/15 16:56:41.395 [0X00002E94] Proximity: 00 Node: 00
12/12/15 16:56:41.395 [0X00002E94] ---------------------------------------
12/12/15 16:56:41.395 [0X00002E94] Group: 0
12/12/15 16:56:41.396 [0X00002E94] ---------------------------------------
12/12/15 16:56:41.396 [0X00002E94] Processor(s): 0x00000003 Function units: Shared
12/12/15 16:56:41.399 [0X00002E94] Package mask: 0x0000000F
12/12/15 16:56:41.403 [0X00002E94] Processor(s): 0x0000000C Function units: Shared
12/12/15 16:56:41.406 [0X00002E94] Processor(s): 0x0000000F assigned to Numa node: 0
12/12/15 16:56:41.412 [0X00002E94] Current time bias: -480 minutes -8.00 hours DST not used
12/12/15 16:56:41.413 [0X00002E94] -IF:\TroubleShooting\Trace\HighDuration20151212.trc
12/12/15 16:56:41.416 [0X00002E94] -oF:\TroubleShooting\Trace\output
12/12/15 16:56:41.417 [0X00002E94] -S127.0.0.1,7777
12/12/15 16:56:41.419 [0X00002E94] -dPerfAnalysis_CL
12/12/15 16:56:41.421 [0X00002E94] -E
12/12/15 16:56:41.426 [0X00002E94] Using language id (LCID): 1024 [Chinese (Simplified)_People's *.936] for character formatting with NLS: 0x00060101 and Defined: 0x00060101
12/12/15 16:56:41.427 [0X00002E94] Attempting to cleanup existing RML files from previous execution
12/12/15 16:56:41.432 [0X00002E94] Using extended RowsetFastload synchronization
12/12/15 16:56:41.433 [0X00002E94] Establishing initial database connection
12/12/15 16:56:41.434 [0X00002E94] Server: 127.0.0.1,7777
12/12/15 16:56:41.436 [0X00002E94] Database: PerfAnalysis_CL
12/12/15 16:56:41.438 [0X00002E94] Authentication: Windows
12/12/15 16:56:41.563 [0X00002E94] Using SQLOLEDB version 11.0.2100.60
12/12/15 16:56:41.565 [0X00002E94] Connected to SQL Server Version, Major: 10, Minor: 50, Build: 1600
12/12/15 16:56:41.566 [0X00002E94] Creating or clearing the performance database
12/12/15 16:56:46.378 [0X00002E94] Processing file: F:\TroubleShooting\Trace\HighDuration20151212.trc (SQL 2008 / SQL 2008 R2 / SQL Azure*)
12/12/15 16:56:46.378 [0X00002E94] Validating core events exist
12/12/15 16:56:46.379 [0X00002E94] WARNING: The 'EventSequence' column for event id=10 [POST_RPC_EVENT_CLASS] is missing from the trace.
12/12/15 16:56:46.380 [0X00002E94] WARNING: The 'EventSequence' column for event id=12 [POST_LANG_EVENT_CLASS] is missing from the trace.
12/12/15 16:56:46.381 [0X00002E94] WARNING: The 'EventSequence' column for event id=13 [PRE_LANG_EVENT_CLASS] is missing from the trace.
12/12/15 16:56:46.382 [0X00002E94] WARNING: The 'EventSequence' column is missing from the trace. Physical file order will be used and this can lead to inaccurate results.
12/12/15 16:56:46.382 [0X00002E94] Validating necessary events exist for analysis
12/12/15 16:56:46.383 [0X00002E94] WARNING: The following trace events were not captured: [RPC:Starting, SP:StmtStarting, SP:StmtCompleted, Audit:Login, Audit:Logout, Showplan Statistics]. Review the help file to ensure that you have collected the appropriate set of events and columns for your intended analysis.
12/12/15 16:56:46.384 [0X00002E94] Validating necessary events exist for RML breakout
12/12/15 16:56:46.385 [0X00002E94] WARNING: Event [Server: Server Memory Change] missing from trace
12/12/15 16:56:46.386 [0X00002E94] WARNING REPLAY: The following trace events were not captured: [Sessions:ExistingConnection, Security Audit: Audit Login, Security Audit: Audit Logout, Stored Procedures: RPC Output Parameter, Cursors:Unprepare, Cursors:CursorClose, Cursors:CursorPrepare, Cursors:CursorCreated, Cursors:CursorExecute, TSQL: Prepare SQL, TSQL: Unprepare SQL, Errors and Warnings:Attention, Transactions:SQLTransaction, Transactions:DTCTransaction, Transactions:TM:Begin Tran starting, Transactions:TM:Begin Tran completed, Transactions:TM:Commit Tran starting, Transactions:TM:Commit Tran completed, Transactions:TM:Rollback Tran starting, Transactions:TM:Rollback Tran completed, Transactions:TM:Save Tran starting, Transactions:TM:Save Tran completed, Transactions:TM:Promote Tran starting, Transactions:TM:Promote Tran completed, RPC:Starting and RPC:Completed]. Review the help file to ensure that you have collected the appropriate set of events and columns for RML replay needs. Lack of these events can lead to severe replay problems.
12/12/15 16:56:46.387 [0X00002E94] ERROR: RML output has been disabled because the required events or columns were not captured. Consult the log and help file for the recommended set of trace events and necessary columns.
Trace flag -T28 disables the check allowing RML output processing. *** Use with caution as the output and behavior could be unpredictable.
12/12/15 16:56:46.418 [0X00002E94] Events Read: 1000 Queued: 977 Processed/sec: 4
12/12/15 16:56:46.456 [0X00002E94] Events Read: 2000 Queued: 1641 Processed/sec: 70
12/12/15 16:56:46.471 [0X00002E94] Events Read: 3000 Queued: 2634 Processed/sec: 71
12/12/15 16:56:46.483 [0X00002E94] Reached the end of the trace; found [TRACE_STOP_EVENT].
12/12/15 16:56:46.493 [0X00002E94] Reads completed
12/12/15 16:56:46.493 [0X00002E94] Signaling worker threads to complete final actions.
12/12/15 16:56:46.496 [0X00002E94] Waiting for the worker threads to complete final actions.
12/12/15 16:56:46.839 [0X00002E94] Performing general cleanup actions.
12/12/15 16:56:46.839 [0X00002E94] Reducing cached memory in background
12/12/15 16:56:46.841 [0X00002E94] Total Events Processed: 3839
12/12/15 16:56:46.841 [0X00002E94] Total Events Filtered: 0
12/12/15 16:56:46.866 [0X00002E94] Table ReadTrace.tblUniqueBatches: loaded ~26 rows
12/12/15 16:56:46.866 [0X00002E94] Table ReadTrace.tblUniqueStatements: loaded ~0 rows
12/12/15 16:56:46.867 [0X00002E94] Table ReadTrace.tblUniquePlans: loaded ~0 rows
12/12/15 16:56:46.867 [0X00002E94] Table ReadTrace.tblUniquePlanRows: loaded ~0 rows
12/12/15 16:56:46.924 [0X00002E94] Table ReadTrace.tblBatches: loaded ~3837 rows
12/12/15 16:56:46.926 [0X00002E94] Table ReadTrace.tblStatements: loaded ~0 rows
12/12/15 16:56:46.928 [0X00002E94] Table ReadTrace.tblPlans: loaded ~0 rows
12/12/15 16:56:46.929 [0X00002E94] Table ReadTrace.tblPlanRows: loaded ~0 rows
12/12/15 16:56:46.930 [0X00002E94] Table ReadTrace.tblInterestingEvents: loaded ~0 rows
12/12/15 16:56:46.946 [0X00002E94] Table ReadTrace.tblConnections: loaded ~61 rows
12/12/15 16:56:46.950 [0X00002E94] WARNING: One or more warning conditions exist that may affect the quality of the analysis data. See PerfAnalysis_CL.ReadTrace.tblWarnings table and the ReadTrace log for complete details.
12/12/15 16:56:46.954 [0X00002E94] INFO: Cleaning up unique batch hash table
12/12/15 16:56:46.955 [0X00002E94] INFO: Cleaning up unique statement hash table
12/12/15 16:56:46.956 [0X00002E94] INFO: Cleaning up unique plan hash table
12/12/15 16:56:46.957 [0X00002E94] INFO: Cleaning up unique procedure hash table
12/12/15 16:56:46.958 [0X00002E94] Indexing tables...
12/12/15 16:56:49.691 [0X00002E94] Doing post-load data cleanup...
12/12/15 16:56:51.311 [0X00002E94] Computing partial aggregates...
12/12/15 16:56:53.180 [0X00002E94] Building analysis indexes ...
12/12/15 16:56:53.188 [0X00002E94] Data load completed.
12/12/15 16:56:53.189 [0X00002E94] Using execution path: D:\Program Files\Microsoft Corporation\RMLUtils\
12/12/15 16:56:53.190 [0X00002E94] Launching [D:\Program Files\Microsoft Corporation\RMLUtils\Reporter.exe]
12/12/15 16:56:53.195 [0X00002E94] Launch failure exit code: 259
12/12/15 16:56:53.195 [0X00002E94] Attempt to launch Reporter failed. Check the error log and your RML installation.
12/12/15 16:56:53.196 [0X00002E94] *******************************************************************************
* ReadTrace encountered one or more ERRORS. An error condition typically *
* stops processing early and the ReadTrace output may be unusable. *
* Review the log file for details. *
*******************************************************************************
12/12/15 16:56:53.197 [0X00002E94] ***** ReadTrace exit code: -23
Step5:查看报表
上面的Step4完成后,会自动打开Reporter应用程序:
->总体报告
报表工具会根据Trace记录里的开始时间、结束时间、CPU、Reads、Writes、Duration等,按时间段做统计,画出一张图来。能帮助管理员了解在不同时间段SQL Server的繁忙程度。
->按程序(Application Name)、数据库(Database Id)、用户名(Login Name)排序,找出造成最大SQL工作量的程序/数据库/用户。
例如,点击"Database Id",就可以得到在Trace文件所包含的那个时间段,所有曾经访问过SQL Server的数据库,它们总共发了多少条语句(Started/Completed),有过几次超时(Attentions),总共花在SQL Server里的时间是多少(Duration),CPU的时间是多少,Reads/Writes各是多少。点击这些栏位边上的小箭头,可以按这个栏位排序。
例如下面报表就是按Duration的总数排序的:
->按照同一类型的语句,统计最昂贵的语句
点击"Unique Batches",就可以得到一个关于Batch级别的报表。Batch级别的报表针对的是存储过程或是一个TSQL Batch,存储过程或Batch内部的TSQL语句不会单独列在该报表上。很多存储过程调用或TSQL Batch语句总体是一样的,只是带的参数会有不同,ReadTrace会把这些仅是参数不同的存储过程或语句归为一类。这样,如果一个存储过程可能每次执行并不是很长时间,但是它在SQL Server里被很频繁地调用。这个工具也能按它的总调用时间把它找出来:
例如按Duration排序,列出总Duration最多的语句:
可以看出第二条语句是GSP_GetWeixinBindsPagedWithAccount被调用1222次,总耗时1768295276(µs),我们可以直接用SQL语句查看此过程执行情况:
select 'DBName' DatabaseName,count(*) [count],sum(CPU_ms) sum_CPU_ms,avg(CPU_ms) avg_CPU_ms
,sum(Duration_ms) sum_Duration_ms,avg(Duration_ms) avg_Duration_ms
,sum(Reads) sum_Reads,avg(Reads) avg_Reads
,sum(Writes) sum_Writes,avg(Writes) avg_Writes
from(
select convert(varchar(max),TextData) TextData,DatabaseName,Duration/1000.0 Duration_ms,CPU CPU_ms,Reads,Writes,StartTime,EndTime,LoginName,ApplicationName
from fn_trace_gettable(N'F:\TroubleShooting\Trace\HighDuration20151212.trc', default)
where TextData is not null
and textdata like '%KeyWords%'
) a
group by left(TextData,1),DatabaseName
order by left(TextData,1)
where条件中的KeyWords用对应过程替换,执行结果如下:
它的总Duration较高,而且调用比较频繁,平均下来每次达到1.4秒。如果能将它每次执行时间降一半下来,对整个系统都是十分有利的。我们再来看第六条语句的执行结果:
DBA_RecordUserScoreIPStat被调用5次,总耗时143476147(µs),平均耗时28695229.4(µs),可知这句话本身比较昂贵。我们点击这句话,不但可以看到它的全文,还能看到它是在哪些时间段被调用的,这个对验证管理员的猜测很有帮助。
图中Started/Completed有5次,分别在12:53、13:53、14:53、15:53、16:53(图中绿点),这个是作业每小时执行的。
如果我们在Performance Overview页面点击"Database Id",然后在Resource Consumption页面点击的DBID=1,将弹出带有过滤条件DBID=1的Top Unique Batches页面:
我们可以点击Parameters进入到Report Parameters设置页:
这里可以设置可用参数的取值,注意此处的Top/Bottom N,它返回的是CPU or Duration or Reads or Writes各项对应总值的最好N条及最差N条记录,因此最多可能返回4*2*N条记录。通常Reads/Writes与CPU/Duration存在正比关系,实际返回的记录数会小于4*2*N(当然还要受跟踪过滤后的记录数限制)。我们还可以根据自己的需求修改Top/Bottom N的取值,要实现这个只需修改过程[ReadTrace].[spReporter_TopN]:
USE [PerfAnalysis_CL]
GO
ALTER procedure [ReadTrace].[spReporter_TopN]
as
begin
set nocount on select 1 as [TopN]--add
union all
select 2 as [TopN]--add
union all
select 3 as [TopN]
union all
select 5 as [TopN]
union all
select 10 as [TopN]
union all
select 25 as [TopN]
union all
select 50 as [TopN]
union all
select 100 as [TopN] order by TopN asc
END
重新打开报表参数设置页:
在我们运行导入命令后,数据处理完会自动打开Reporter应用程序。那电脑关闭后,下次要查看之前的数据怎么办呢,难道又得重新运行命令?完全不必要,第一次导入时已经把相关数据保存到对应的数据库中,我们可以在开始->所有程序->RML Utilities for SQL Server->Reporter,只要设置好连接,点击OK就会打开报表:
Performance Overview页面的Additional Reports下有一个Data Lineage链接,对应是分析服务器信息,跟踪文件、跟踪事件等信息:
怎样的跟踪文件可以用RML分析
最开始安装好RML后,命令导入之前的跟踪文件,结果一两秒就结束了,而且不会自动打开Reporter应用程序。后来查看输出日志(日志保存于-o参数的路径下),说是缺少必要的跟踪列:
重点关注日志中ERROR罗列的Event/column(必需项),适当添加WARNING罗列的Event/column。测试创建一个默认模板的跟踪,然后用RML分析,Reporter报表正常出现,对比默认模板和自己更改过的模板的事件和列,发现必需添加BinaryData列,才能使用RML工具分析。Performance Overview页有程序(Application Name)、数据库(Database Id)、用户名(Login Name)链接,建议将这些列全部选上,这样使用RML分析时可以按这些对象进行Grouping.
疑问:曲线图中的Scale值在何处设置?是否可更改?发现部分曲线图由于Scale的取值,导致坐标在图形中不够直观。
--15:44 2015/12/14 解答:查看报表的时候开启Profiler,这样就可以知道各个页面调用的过程。实践中发现与Scale相关的过程和函数如下:
查看spReporter_BatchDetailsScaleFactor过程和fn_ReporterCalculateScaleFactor函数的代码是根据各Counter的最大值,计算Scale以让其落在坐标轴的Value内。我们可以适当修改,spReporter_BatchDetailsScaleFactor过程修改部分:
fn_ReporterCalculateScaleFactor函数修改部分:
主要是改善取值为0-1之间的Scale,修改后查看DBA_RecordUserScoreIPStat语句的执行情况如下:
图中我们可以看到Started/Completed的Scale为10,总调用5次,比之前的图形明显要直观。
另外两个Scale过程可以参照修改,三者分别对应Performance Overview、Unique Batch Details、Unique Statement Details三个界面曲线图中的Scale取值。其他相关筛选涉及过程可以查看Profiler跟踪结果去分析。
--14:37 2016/1/16 补充
Performance Overview页面有两个报表没有讲述(Unique Statements和Interesting Events)
RML实用工具的一个常见陷阱是假设一个批处理总是拥有与它相关的语句。一个批处理拥有一个或多个T-SQL语句,这是事实,但如果语句没有SP:StmtStarting和SP:StmtCompleted事件,那么批处理在ReadTrace.tblStatements表中将不会有一个条目,ReadTrace.tblStatements表是Unique Statements报表为了呈现报表从中取出数据的地方。
Interesting Events报表按照类型以及运行事件间隔将事件进行分组,显示了一组在分析的特定时间间隔期间捕获的所有值得关注的事件。需要在跟踪添加对应的事件(例如Auto Stats、Sort Warnings、Missing Column Statistics、Server Memory Change、SP:Recompile)
点击Event Name/Number of Events可以钻取对应Event Details,此页可以查看是什么Batch Template引起Event的产生: