Best practices for Grafana SLOs

Because SLOs are still a relatively new practice, it can feel overwhelming when you start to create SLOs for the first time. To help simplify things, some best practices for SLOs and SLO queries are provided on this page.

What is a good SLO?

A Service Level Objective (SLO) defines specific, measurable targets that represent the quality of service a provider delivers to its users. The best place to start is with the level of service your customers expect. Sometimes these expectations are written into formal service level agreements (SLAs) with customers, and sometimes they are implicit in customers' expectations for a service.

Good SLOs are simple. Don't use every metric you can track as an SLI; choose the ones that really matter to the consumers of your service. If you choose too many, it becomes hard to pay attention to the ones that matter.

A good SLO is attainable, not aspirational

Start with a realistic target. Unrealistic goals create unnecessary frustration which can then eclipse useful feedback from the SLO. Remember, this is meant to be achievable and it is meant to reflect the user experience. An SLO is not an OKR.

It’s also important to make your SLO simple and understandable. The most effective SLOs are the ones that are readable for all stakeholders.

Target services with good traffic

Too little traffic is insufficient for monitoring trends, can cause noisy alerts, and lets irregularities show up disproportionately in low-traffic environments. Conversely, too much traffic can mask customer-specific issues.

Team alignment

Teams, not managers, should be the ones to create SLOs and SLIs. Your SLOs should give you feedback about your services and your customers' experience with them, so it's best for the team to work together to create them.

Embed SLO review in team rituals

As you work with SLOs, the information they provide can help guide decision-making because they add context and correlate patterns. This can help when there’s a need to balance reliability and feature velocity. Early on, it’s good practice for teams to review SLOs at regular intervals.

Iterate and adjust

Once SLO review is part of your team rituals, it's important to iterate on the information you gather so that you can make increasingly well-informed decisions.

As you learn more from your SLOs, you may learn your assumptions don’t reflect practical reality. In the early period of SLO implementation, you may find there are a number of factors you hadn’t previously considered. If you have a lot of error budget left over, you can adjust your objectives accordingly.

Alerts and labels

SLO alerts are different from typical data source alerts. Because SLO alerts tell you when a trend in your burn rate needs attention, it's important to understand how to set up and balance fast-burn and slow-burn alerts so they keep you informed without causing alert fatigue.
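
As a rough illustration of what a fast-burn condition looks like (a sketch only, not the rules Grafana generates for you, and using a hypothetical requests_total metric), a multiwindow burn-rate check for a 99.9% availability objective might be:

```
# Hypothetical fast-burn condition for a 99.9% objective (0.1% error budget).
# A burn rate of 14.4 exhausts a 30-day budget in roughly two days, so alert
# only when both the long (1h) and the short (5m) window exceed that rate.
(
  sum(rate(requests_total{code=~"5.."}[1h]))
  /
  sum(rate(requests_total[1h]))
) > (14.4 * 0.001)
and
(
  sum(rate(requests_total{code=~"5.."}[5m]))
  /
  sum(rate(requests_total[5m]))
) > (14.4 * 0.001)
```

A slow-burn alert follows the same pattern with longer windows (for example, 6h and 3d) and a lower burn-rate threshold, so it catches gradual budget erosion without paging anyone.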

Prioritize your alerts

Have your alerts routed first to designated individuals to validate your SLI. Send notifications to designated engineers through OnCall or your main escalation channel when fast-burn alerts fire so that the appropriate people can quickly respond to possible pressing issues. Send group notifications for slow-burn alerts to analyze and respond to as a team during normal working hours.

Use labels

Establish good label practices and keep the set of labels small so that they stay navigable and easy to consume during triage.

Grafana SLOs use two label types: SLO labels and Alert labels. SLO labels are for grouping and filtering SLOs. Alert labels are added to slow and fast burn alerts and are used to route notifications and add metadata to alerts.
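
As a purely hypothetical illustration (the keys and values are yours to choose, not a prescribed schema), the two kinds of labels might look like:

```
# SLO labels: used to group and filter SLOs in the SLO list.
team="checkout", service="payments-api"

# Alert labels: attached to the generated burn-rate alerts and matched by
# notification policies to route them.
severity="critical"   # e.g. on the fast-burn alert
severity="warning"    # e.g. on the slow-burn alert
```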

Query tips and pitfalls

There are many approaches to configuring your SLO queries, and the right one depends on your needs. Just remember: if you don't have metrics that represent your users' experience, you need new metrics.

Keep queries simple

The best SLIs are based on Prometheus counter metrics (that is, monotonically increasing series) and use labels to encode the counted event as either a success or a failure (for example: requests_total{code="200"}). If your metrics don't look like this, it's usually better to reinstrument your service with well-suited metrics than to try to work around the issue with complex SLI query definitions.

Availability and latency are the most common SLOs to start with for request-driven services. For example:

  • Availability (non-5xx responses): requests_total{code!~"5.."} / requests_total
  • Latency (less than 1 second): requests_duration_seconds_bucket{code!~"5..", le="1.0"} / requests_duration_seconds_count{code!~"5.."}
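
In practice these ratios are evaluated as rates over a time window. A minimal sketch of the availability SLI above, assuming requests_total carries the HTTP status in a code label:

```
sum(rate(requests_total{code!~"5.."}[$__rate_interval]))
/
sum(rate(requests_total[$__rate_interval]))
```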

Freshness is a common SLO for message queues or batch processes where you want to ensure that each item (perhaps after several retries) gets completed before the work request grows too stale.

  • Freshness (work spent less than 120 sec in queue): completed_duration_seconds_bucket{le="120"} / completed_duration_seconds_count
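
The same rate-based pattern applies to the freshness ratio, assuming completed_duration_seconds is a histogram of how long each item waited in the queue before completing:

```
sum(rate(completed_duration_seconds_bucket{le="120"}[$__rate_interval]))
/
sum(rate(completed_duration_seconds_count[$__rate_interval]))
```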

Advanced SLIs

Define advanced SLIs as a “success/total” ratio to get the best dashboards. The “Ratio” SLO type enforces this success/total style, but you get more dashboard features if you follow the same approach with your advanced SLOs.

  • Do: <success rate> / <total rate>
  • Avoid: 1 - (<failure rate> / <total rate>)

If you can’t reinstrument your metrics to encode success/failure with labels and you must work with failure_total and all_total counters, you can do (total - fail) / total. For example:

`( sum by (…) (rate(all_total[$__rate_interval])) - sum by (…) (rate(failure_total[$__rate_interval])) ) / sum by (…) (rate(all_total[$__rate_interval]))`

Know your SLIs

There are many SLI types. A brief explanation of multidimensional and rollup SLIs follows.

Multidimensional SLI

A multidimensional SLI reports a ratio for each value of a given label, for example: sum by (cluster) (rate(<success>[5m])) / sum by (cluster) (rate(<total>[5m])). When you specify “group by” labels on the Ratio SLO type, it becomes a multidimensional SLI. A common use is to specify cluster and/or namespace in the grouping. Multidimensional SLIs enable per-cluster alerting and support more flexible dashboards where you can include or exclude values for the chosen dimension labels (see rollup SLI below).
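
For example, a sketch that groups by both cluster and namespace (with hypothetical success_total and requests_total counters standing in for your own metrics):

```
sum by (cluster, namespace) (rate(success_total[$__rate_interval]))
/
sum by (cluster, namespace) (rate(requests_total[$__rate_interval]))
```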

Rollup SLI

A rollup SLI (or aggregated SLI) is a calculation of a multidimensional SLI where the numerator and denominator are further aggregated before the final ratio calculation. When you select cluster=all on the dashboard of a multidimensional SLO that defines cluster as a group label, the dashboard calculates the aggregate ratio of the sum of all successes over the sum of all requests. This gives you alerting on each cluster and reporting on the overall rolled-up results.
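
Continuing the hypothetical example above, the rollup aggregates the per-cluster numerators and denominators before taking the ratio:

```
sum(sum by (cluster) (rate(success_total[$__rate_interval])))
/
sum(sum by (cluster) (rate(requests_total[$__rate_interval])))
```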

Additional reference materials

Google provides very clear documentation on SLOs in their [SRE Book](https://sre.google/sre-book/service-level-objectives/). They also provide useful guides on SLO implementation and alerting on SLOs.
