在上一篇文章中,我们使用Grafana从SQL Server的表中取数,然后通过Dashboard实现了对Lync/Skype服务器的功能状态监控。那么在本篇文章中,我们将继续使用Grafana来实现对Lync/Skype服务器端的报错TOP 10以及用户TOP的监控数据。
如何使用Grafana监控Lync/Skype服务器端报错TOP 10以及用户TOP
Skype服务器报错数据来源
Skype服务器的报错数据来源于后端SQL Server的LcsCDR数据库的ErrorReportView和MsDiagMetaDataView视图。
- ErrorReportView:主要存储报错的信息
- MsDiagMetaDataView:主要存储报错的分类
报错主要分为两类:
- Expected Failure: 即“预期故障”。例如,如果用户已将其状态设置为“请勿打扰”,则对该用户的任何调用都会失败。
- Unexpected Failure: “意外故障”是指在看似正常的系统中发生的故障。例如,如果呼叫者处于保持状态,则不应终止呼叫。如果发生这种情况,则会被标记为意外故障。
报错信息获取
第一次处理 - 获取报错信息元数据
我们可以通过命令从上文提到的视图ErrorReportView中筛选出我们需要的数据,然后将其导入到对应的SQL表中,示例如下:
这些主要的字段包括:
- ErrorTime : 报错发生时间
- MsDiagHeader :报错信息
- FromUri:报错from方
- ToUri :报错to方
- ClientVersion:报错发生的客户端
- Source:报错发生的服务器
完整脚本如下:
#Connect to SQL database
Function SQLConnectionWindows
{
param
(
[string]$server,
[string]$database
)
$SQLCon = New-Object System.Data.SqlClient.SqlConnection
$SQLCon.ConnectionString = "Data Source=$server;Initial Catalog=$database;Integrated Security=SSPI;"
try
{
$SQLCon.Open()
return $SQLCon
}
catch [exception]
{
Write-Warning ('Connect to database failed with error message:{0}' -f ,$_)
$SQLCon.Dispose()
return $null
}
}
Function SQLConnectionSQL
{
param
(
[string]$server,
[string]$database,
[string]$uid,
[string]$pwd
)
$SQLCon = New-Object System.Data.SqlClient.SqlConnection
$SQLCon.ConnectionString = "Data Source=$server;Initial Catalog=$database;User ID=$uid;pwd=$pwd;"
try
{
$SQLCon.Open()
return $SQLCon
}
catch [exception]
{
Write-Warning ('Connect to database failed with error message:{0}' -f ,$_)
$SQLCon.Dispose()
return $null
}
}
Function SQLCommand
{
param
(
[System.Data.SqlClient.SqlConnection]$SQLConnection,
[string]$command
)
$dataset = New-Object System.Data.DataSet
$dataAdapter = New-Object System.Data.SqlClient.SqlDataAdapter($command,$SQLConnection)
$dataAdapter.Fill($dataset) | Out-Null
return $dataset.Tables[0]
}
$SQLServer = 'SkypeBackEndSQL'
$SQLDBName = 'LcsCDR'
$conn = SQLConnectionWindows -server $SQLServer -database $SQLDBName -uid 'Account' -pwd 'password'
$SqlQuery = "
declare @date1 datetime
declare @date2 datetime
set @date1 = (SELECT DATEADD(minute,-5,GETUTCDATE()))
set @date2 = (SELECT GETUTCDATE());
SELECT
ErrorTime
,MsDiagHeader
,FromUri
,ToUri
,ClientVersion
,Source
FROM [LcsCDR].[dbo].[ErrorReportView]
where [ErrorTime] > @date1 and [ErrorTime] < @date2
"
$DataTables = SQLCommand -SQLConnection $conn -command $SqlQuery
$arr = @()
foreach($Data in $DataTables){
$IsReasonExist = ($Data.MsDiagHeader -split ";") -like '*reason=*'
if($IsReasonExist){
$DiagnosticId = $Data.MsDiagHeader.split(";")[0]
$DiagnosticReason = $IsReasonExist.Trim(" .-`t`n`r").split('"')[1]
$obj = New-Object PSObject
$obj | Add-Member -MemberType NoteProperty -Name ErrorTime -Value $Data.ErrorTime
$obj | Add-Member -MemberType NoteProperty -Name DiagnosticId -Value $DiagnosticId
$obj | Add-Member -MemberType NoteProperty -Name DiagnosticReason -Value $DiagnosticReason
$obj | Add-Member -MemberType NoteProperty -Name FromUri -Value $Data.FromUri
$obj | Add-Member -MemberType NoteProperty -Name ToUri -Value $Data.ToUri
$obj | Add-Member -MemberType NoteProperty -Name ClientVersion -Value $Data.ClientVersion
$obj | Add-Member -MemberType NoteProperty -Name Source -Value $Data.Source
$arr += $obj
}
}
$Server = 'SqlServer'
$DBName = 'DBName'
$userName = "UserName"
$Cred = 'Password'
$Newconn = SQLConnectionSQL -server $Server -database $DBName -uid $userName -pwd $Cred
#更新至数据库
foreach($i in $arr){
$time = $i.ErrorTime.ToString('yyyy-MM-dd HH:mm:ss.fff')
$id = $i.DiagnosticId
$reason = $i.DiagnosticReason
$fromUri = $i.FromUri
$toUri = $i.ToUri
$clientVersion = $i.ClientVersion
$source = $i.Source
$cmd = "INSERT INTO Skype_EnterpriseVoice_ErrorReportMetaData VALUES ('$time','$id','$reason','$fromUri','$toUri','$clientVersion','$source')"
SQLCommand -SQLConnection $Newconn -command $cmd
}
第二次处理 - 获取报错分类信息
接下来,我们需要对上文中提到的视图MsDiagMetaDataView中的数据进行处理,以获取到对应的报错分类信息,然后将其导入到对应的SQL表中,示例如下:
这些主要的字段包括:
- MsDiagID: 报错ID
- ErrorCategory:报错分类
- Type:报错类型
- ReasonString:报错原因
- Description:报错描述信息
完整脚本如下:
#Connect to SQL database
Function SQLConnectionWindows
{
param
(
[string]$server,
[string]$database
)
$SQLCon = New-Object System.Data.SqlClient.SqlConnection
$SQLCon.ConnectionString = "Data Source=$server;Initial Catalog=$database;Integrated Security=SSPI;"
try
{
$SQLCon.Open()
return $SQLCon
}
catch [exception]
{
Write-Warning ('Connect to database failed with error message:{0}' -f ,$_)
$SQLCon.Dispose()
return $null
}
}
Function SQLConnectionSQL
{
param
(
[string]$server,
[string]$database,
[string]$uid,
[string]$pwd
)
$SQLCon = New-Object System.Data.SqlClient.SqlConnection
$SQLCon.ConnectionString = "Data Source=$server;Initial Catalog=$database;User ID=$uid;pwd=$pwd;"
try
{
$SQLCon.Open()
return $SQLCon
}
catch [exception]
{
Write-Warning ('Connect to database failed with error message:{0}' -f ,$_)
$SQLCon.Dispose()
return $null
}
}
Function SQLCommand
{
param
(
[System.Data.SqlClient.SqlConnection]$SQLConnection,
[string]$command
)
$dataset = New-Object System.Data.DataSet
$dataAdapter = New-Object System.Data.SqlClient.SqlDataAdapter($command,$SQLConnection)
$dataAdapter.Fill($dataset) | Out-Null
return $dataset.Tables[0]
}
$SQLServer = 'SkypeBackEndSQL'
$SQLDBName = 'LcsCDR'
$conn = SQLConnectionWindows -server $SQLServer -database $SQLDBName -uid 'Account' -pwd 'password'
$SqlQuery = "
SELECT [MsDiagId]
,[ErrorCategory]
,[Type]
,[ReasonString]
,[Description]
FROM [LcsCDR].[dbo].[MsDiagMetaDataView]
"
$DataTables = SQLCommand -SQLConnection $conn -command $SqlQuery
$Server = 'SqlServer'
$DBName = 'DBName'
$userName = "UserName"
$Cred = 'Password'
$Newconn = SQLConnectionSQL -server $Server -database $DBName -uid $userName -pwd $Cred
#更新至数据库
$bulkCopy = new-object ("Data.SqlClient.SqlBulkCopy") $conn
$bulkCopy.DestinationTableName = "Skype_EnterpriseVoice_MsDiagMetaDataView"
$bulkCopy.WriteToServer($DataTables)
第三次处理 - 统计报错数量
当我们进行了第一,二次数据处理后,为方便之后的Grafana取数,我们可以将前面两张导入的表关联起来,对数据进行计算以统计出报错的数量信息,示例如下:
完整的脚本如下:
#Connect to SQL database
Function SQLConnectionSQL
{
param
(
[string]$server,
[string]$database,
[string]$uid,
[string]$pwd
)
$SQLCon = New-Object System.Data.SqlClient.SqlConnection
$SQLCon.ConnectionString = "Data Source=$server;Initial Catalog=$database;User ID=$uid;pwd=$pwd;"
try
{
$SQLCon.Open()
return $SQLCon
}
catch [exception]
{
Write-Warning ('Connect to database failed with error message:{0}' -f ,$_)
$SQLCon.Dispose()
return $null
}
}
Function SQLCommand
{
param
(
[System.Data.SqlClient.SqlConnection]$SQLConnection,
[string]$command
)
$dataset = New-Object System.Data.DataSet
$dataAdapter = New-Object System.Data.SqlClient.SqlDataAdapter($command,$SQLConnection)
$dataAdapter.Fill($dataset) | Out-Null
return $dataset.Tables[0]
}
$SQLServer = 'SQLServer'
$SQLDBName = 'DBName'
$userName = "UserName"
$Cred = 'Password'
$conn = SQLConnectionSQL -server $SQLServer -database $SQLDBName -uid $userName -pwd $Cred
$SqlQuery = "
declare @date1 datetime
declare @date2 datetime
declare @date3 datetime
set @date1 = (SELECT DATEADD(minute,-5,GETUTCDATE()))
set @date2 = (SELECT GETUTCDATE());
select
getdate() as InsertDate,
count(ErrorReport.DiagnosticId) as Sessions,
ErrorReport.DiagnosticId,
ErrorReport.DiagnosticReason,
Metadata.ErrorCategory
--CONVERT(VARCHAR(16),ErrorReport.ErrorTime,120) as ErrorTime
FROM
[DBName].[dbo].[Skype_EnterpriseVoice_MsDiagMetaDataView] MetaData
inner join
[DBName].[dbo].[Skype_EnterpriseVoice_ErrorReport] ErrorReport on MetaData.MsDiagID = ErrorReport.DiagnosticId
where
ErrorReport.ErrorTime > @date1 AND ErrorReport.ErrorTime < @date2
group by
ErrorReport.DiagnosticId,ErrorReport.DiagnosticReason,MetaData.ErrorCategory
"
$DataTables = SQLCommand -SQLConnection $conn -command $SqlQuery
#更新至数据库
$bulkCopy = new-object ("Data.SqlClient.SqlBulkCopy") $conn
$bulkCopy.DestinationTableName = "Skype_EnterpriseVoice_ErrorReportCount"
$bulkCopy.WriteToServer($DataTables)
创建自动化任务计划(Task Schedule)
在上面的步骤中,我们主要使用了3个脚本:
- 脚本1:从Skype后端数据库获取到报错元数据,存储在SQL表1中。
- 脚本2:从Skype后端数据库获取到报错分类信息,存储在SQL表2中。- 只导入一次即可
- 脚本3:关联表1,2,之后对数据进行计算,并存储在SQL表3中。
基于以上信息,我们需要分别创建2个任务计划,以调用前文中的脚本1和3,将数据每隔一定时间导入到SQL表中。由于任务计划在此系列前几篇文章中已经有所展示,将不在此文中赘述。
在Grafana中创建Dashboard
和之前一样,我们需要在Grafana中通过新建Query来生成Dashboard,由于步骤基本相同,在此我们只展示一些关键的查询语句以及图形配置参数:
Dashboard 1: Unexpected Failure Count
select
$__timeEpoch(InsertDate),
Sessions,
DiagnosticReason
from [DBName].[dbo].[Skype_EnterpriseVoice_ErrorReportCount]
where ErrorCategory = 'UnexpectedFailure' and $__timeFilter(InsertDate)
Dashboard 2: Expected Failure Count
select
$__timeEpoch(InsertDate),
Sessions,
DiagnosticReason
from [DBName].[dbo].[Skype_EnterpriseVoice_ErrorReportCount]
where ErrorCategory = 'ExpectedFailure' and $__timeFilter(InsertDate)
Dashboard 3: Event ID Sessions - UnexpectedFailure
select TOP 10
count(ErrorReport.DiagnosticId) as Sessions,
ErrorReport.DiagnosticId,
ErrorReport.DiagnosticReason
FROM
[DBName].[dbo].[Skype_EnterpriseVoice_MsDiagMetaDataView] MetaData
inner join
[DBName].[dbo].[Skype_EnterpriseVoice_ErrorReport] ErrorReport on MetaData.MsDiagID = ErrorReport.DiagnosticId
where
ErrorCategory = 'UnexpectedFailure'
and
$__timeFilter(ErrorReport.ErrorTime)
group by
ErrorReport.DiagnosticId,ErrorReport.DiagnosticReason
Order by
Sessions desc
Dashboard 4: Event ID Sessions - ExpectedFailure
select TOP 10
count(ErrorReport.DiagnosticId) as Sessions,
ErrorReport.DiagnosticId,
ErrorReport.DiagnosticReason
FROM
[DBName].[dbo].[Skype_EnterpriseVoice_MsDiagMetaDataView] MetaData
inner join
[DBName].[dbo].[Skype_EnterpriseVoice_ErrorReport] ErrorReport on MetaData.MsDiagID = ErrorReport.DiagnosticId
where
ErrorCategory = 'ExpectedFailure'
and
$__timeFilter(ErrorReport.ErrorTime)
group by
ErrorReport.DiagnosticId,ErrorReport.DiagnosticReason
Order by
Sessions DESC
Dashboard 5: Top From Users
With Temp as
(select
Count(FromUri) as FromUserCount,ErrorReportMetaData.FromUri
from [DBName].[dbo].[Skype_EnterpriseVoice_ErrorReportMetaData] ErrorReportMetaData
inner join
[DBName].[dbo].[Skype_EnterpriseVoice_MsDiagMetaDataView] MsDiagMetaData on ErrorReportMetaData.DiagnosticId = MsDiagMetaData.MsDiagID
where $__timeFilter(ERRORTIME) and MsDiagMetaData.ErrorCategory = 'UnexpectedFailure'
group by FromUri
)
select TOP(10) *
From Temp
where FromUri <> ''
order by FromUserCount desc
Dashboard 6: Top To Users
With Temp as
(select
Count(ToUri) as ToUserCount,ErrorReportMetaData.ToUri
from [DBName].[dbo].[Skype_EnterpriseVoice_ErrorReportMetaData] ErrorReportMetaData
inner join
[DBName].[dbo].[Skype_EnterpriseVoice_MsDiagMetaDataView] MsDiagMetaData on ErrorReportMetaData.DiagnosticId = MsDiagMetaData.MsDiagID
where $__timeFilter(ERRORTIME)
and MsDiagMetaData.ErrorCategory = 'UnexpectedFailure'
and ErrorReportMetaData.ToUri not like '%FrontendServername%'
group by ToUri
)
select TOP(10) *
From Temp
where ToUri <> ''
order by ToUserCount desc
Dashboard 7: Top from user agents
With Temp as
(select
Count(ClientVersion) as ClientVersionCount,ErrorReportMetaData.ClientVersion
from [DBName].[dbo].[Skype_EnterpriseVoice_ErrorReportMetaData] ErrorReportMetaData
inner join
[DBName].[dbo].[Skype_EnterpriseVoice_MsDiagMetaDataView] MsDiagMetaData on ErrorReportMetaData.DiagnosticId = MsDiagMetaData.MsDiagID
where $__timeFilter(ERRORTIME)
and MsDiagMetaData.ErrorCategory = 'UnexpectedFailure'
group by ClientVersion
)
select TOP(10) *
From Temp
where ClientVersion <> '' and ClientVersion <> 'UCWA'
order by ClientVersionCount desc
Dashboard 8: Top sources
With Temp as
(select
Count(Sources) as SourcesCount,ErrorReportMetaData.Sources
from [DBName].[dbo].[Skype_EnterpriseVoice_ErrorReportMetaData] ErrorReportMetaData
inner join
[DBName].[dbo].[Skype_EnterpriseVoice_MsDiagMetaDataView] MsDiagMetaData on ErrorReportMetaData.DiagnosticId = MsDiagMetaData.MsDiagID
where $__timeFilter(ERRORTIME)
and MsDiagMetaData.ErrorCategory = 'UnexpectedFailure'
and ErrorReportMetaData.Sources <> ''
and ErrorReportMetaData.Sources not like '%not available%'
and ErrorReportMetaData.Sources not like 'sip.contoso.com'
group by ErrorReportMetaData.Sources
)
select *
From Temp
order by SourcesCount desc
Dashboard最终效果展示
同当我们将以上Dashboard整合后,最终效果示例如下:
本章总结
在此篇文章中,我们了解了:
- 如何获取Skype服务器报错信息
- 如何对数据进行二次处理并导入到SQL Server表
- 如何通过Grafana展示报错信息
我们搜集报错信息的好处主要有:
- 可以对主要的报错信息搜集并集中处理,以发现潜在的风险。
- 当某些报错大量出现时,我们可以先于用户报障,来发现并解决问题。
- 主动联系用户,获得用户使用反馈,提升用户体验。
本系列文章《使用Grafana监控Lync/Skype服务器》到这一篇就告一段落了。监控作为运维工作中重要的一环,无论您运维的产品是什么,以及使用什么样的产品来实现运维监控,其逻辑都是相通的。也希望这系列文章能够给作为运维工程师的您一些启发。感谢您的阅读。