SLOs, SLIs, and Error Budgets

Status: Complete Category: Observability Default enforcement: Advisory Author: PushBackLog team


Tags

  • Topic: observability, reliability, sre
  • Skillset: devops, platform
  • Technology: generic
  • Stage: operations, architecture

Summary

Service Level Indicators (SLIs) are quantitative measures of service behaviour. Service Level Objectives (SLOs) are target thresholds for those measures. Error budgets are the permitted amount of unreliability before corrective action is required. Together, SLIs, SLOs, and error budgets provide a data-driven framework for making explicit trade-offs between reliability, feature velocity, and operational investment.


Rationale

Without defined reliability targets, every outage is equally critical and every deployment is equally risky. Teams without SLOs either over-react to minor degradations (treating everything as an emergency) or under-react to genuine reliability problems (because there is no agreed standard to compare against).

SLOs externalise reliability expectations as shared agreements. They enable rational decision-making: when the error budget is healthy, the team can ship features; when the error budget is depleted, reliability work takes priority. This makes reliability a technical and business conversation rather than a purely reactive operational one.


Guidance

Choosing SLIs

An SLI is a carefully defined quantitative measure of the property you care about. Good SLIs:

  • Are measurable from data you already collect (or can collect cheaply)
  • Reflect the user experience — not internal implementation details
  • Map to the four golden signals: latency, traffic, errors, and saturation

Common SLIs:

Service type     Typical SLIs
HTTP API         Request success rate, p95 latency, p99 latency
Async worker     Job completion rate, processing lag
Batch job        Completion rate, duration vs. target
Data pipeline    Data freshness, record error rate
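
As an illustration of how the HTTP API row reduces to simple arithmetic (the request records and field names below are hypothetical, not from any particular metrics library), both SLIs are just a ratio and a percentile over raw request data:

```python
import math

# Hypothetical request records; in practice these come from your metrics store.
requests = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 340},
    {"status": 503, "latency_ms": 900},
    {"status": 200, "latency_ms": 80},
    {"status": 201, "latency_ms": 150},
]

# Availability SLI: fraction of non-5xx responses.
good = sum(1 for r in requests if r["status"] < 500)
success_rate = good / len(requests)

# Latency SLI: p95 latency via the nearest-rank method.
latencies = sorted(r["latency_ms"] for r in requests)
rank = math.ceil(0.95 * len(latencies))  # 1-based nearest-rank index
p95_latency = latencies[rank - 1]

print(f"Success rate: {success_rate:.0%}, p95 latency: {p95_latency}ms")
```

Note how one slow failed request dominates the p95 here, which is exactly why percentile SLIs surface user-visible tail pain that averages hide.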

Setting SLOs

An SLO is a target percentile or rate for an SLI over a rolling window. SLOs should:

  • Be set based on actual user impact, not aspirationally
  • Begin conservatively — it is easier to tighten an SLO than to explain why you missed one
  • Be agreed upon by both engineering and the business/product teams it serves
  • Reflect a rolling window (28-day rolling is common) rather than a calendar month

Example SLO definition:

“99.5% of requests to the /api/tasks endpoint return a successful response (HTTP 2xx) within 300ms over a rolling 28-day window.”

An SLO is not an SLA. An SLA (Service Level Agreement) is an external commitment with financial consequences. An SLO is an internal operational target. SLOs should be slightly more ambitious than SLAs to preserve headroom.

Error budgets

The error budget is the complement of the SLO: an SLO of 99.5% implies a 0.5% error budget — the permitted proportion of “bad” events before the SLO is breached.

Error budgets answer: “how much unreliability do we have remaining this period?”
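
The arithmetic is just the complement of the SLO; a minimal sketch with hypothetical request counts:

```python
# Error budget arithmetic for a request-based SLO.
# The request and failure counts here are hypothetical illustrations.
slo_target = 0.995            # 99.5% SLO
total_requests = 10_000_000   # events in the rolling window

error_budget_fraction = 1 - slo_target               # 0.5% of events may fail
allowed_bad_events = error_budget_fraction * total_requests

observed_bad_events = 32_000                         # failures so far this window
budget_remaining = 1 - observed_bad_events / allowed_bad_events

print(f"Allowed bad events: {allowed_bad_events:,.0f}")   # 50,000
print(f"Budget remaining:   {budget_remaining:.0%}")      # 36%
```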

When error budget is healthy (> 50% remaining):

  • Feature development can proceed at normal pace
  • Risky deployments (large changes, schema migrations) are acceptable

When error budget is tight (< 25% remaining):

  • Reliability improvements are prioritised in the backlog
  • Risky deployments are deferred or require additional validation

When error budget is exhausted:

  • A feature freeze on risky changes is triggered
  • Reliability work takes priority until the budget recovers

Burn rate alerts

Rather than alerting when an SLO is breached (too late), alert on burn rate: how fast the error budget is being consumed relative to the expected pace.

Burn rate    Meaning                                         Alert action
1×           Budget consumed at exactly the expected rate    No alert
6×           Budget will be exhausted in ~5 days             PagerDuty / ticket
14.4×        Budget will be exhausted in ~2 days             Wake someone up

Multi-window burn rate alerts (compare a short window against a longer window) reduce false positives from brief spikes.
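
A minimal sketch of that two-window logic (the 14.4× threshold is the common fast-burn default for a 30-day window; the error rates passed in are hypothetical):

```python
# Multi-window burn-rate check: fire only when both the short and the long
# window show elevated burn, which filters out brief spikes.
SLO_TARGET = 0.999
BUDGET = 1 - SLO_TARGET  # 0.1% error budget

def burn_rate(error_rate: float) -> float:
    """How fast the budget is being consumed relative to the expected pace."""
    return error_rate / BUDGET

def should_page(short_window_error_rate: float,
                long_window_error_rate: float,
                threshold: float = 14.4) -> bool:
    """Fast-burn page: both windows (e.g. 5m and 1h) must exceed the threshold."""
    return (burn_rate(short_window_error_rate) >= threshold
            and burn_rate(long_window_error_rate) >= threshold)

# A brief spike: short window is hot, long window is calm -> no page.
print(should_page(0.05, 0.002))   # False
# Sustained errors: both windows hot -> page.
print(should_page(0.05, 0.02))    # True
```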


Common failure modes

Failure                      Description
Aspirational SLOs            SLOs set to 99.99% “because that sounds good”; never achievable; error budget always depleted
Measuring the wrong thing    SLIs track internal metrics (CPU, memory) rather than user-visible outcomes
SLO as a pass/fail grade     Teams optimise for SLO compliance rather than actual reliability
No error budget policy       Error budgets calculated but no agreed process for what to do when depleted
Alerting on SLO breach       Alerts fire only after the SLO is already breached, providing no time to react

Examples

Defining SLIs and SLOs for an API service

# slo-definitions.yaml — checked into version control alongside service code
service: auth-api
owner: platform-team

slos:
  - name: Availability
    description: Proportion of successful HTTP requests (non-5xx responses)
    sli:
      type: request_based
      good_events: http_requests{code!~"5..", service="auth-api"}
      total_events: http_requests{service="auth-api"}
    target: 99.9%   # 43.2 minutes downtime budget per month
    window: 30d

  - name: Latency
    description: Proportion of requests completing within 300ms (target 95% ⇒ p95 ≤ 300ms)
    sli:
      type: request_based
      good_events: http_request_duration_seconds_bucket{le="0.3", service="auth-api"}
      total_events: http_request_duration_seconds_count{service="auth-api"}
    target: 95%
    window: 30d
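
The downtime-budget comment in the YAML above can be checked with a quick calculation (a generic helper for illustration, not part of any SLO tooling):

```python
# Convert an SLO target and rolling window into a downtime budget.
def downtime_budget_minutes(target: float, window_days: int) -> float:
    window_minutes = window_days * 24 * 60
    return (1 - target) * window_minutes

# 99.9% over 30 days -> 43.2 minutes of allowed downtime.
print(round(downtime_budget_minutes(0.999, 30), 1))   # prints 43.2
# 99.5% over 28 days -> 201.6 minutes.
print(round(downtime_budget_minutes(0.995, 28), 1))   # prints 201.6
```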

Prometheus recording rules for an availability SLO

# prometheus/rules/auth-api-slo.yaml
groups:
  - name: auth-api-slo
    rules:
      # 5-minute burn-rate window
      - record: job:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{code!~"5..", service="auth-api"}[5m]))
          /
          sum(rate(http_requests_total{service="auth-api"}[5m]))

      # 1-hour burn-rate window (for fast-burn alerts)
      - record: job:http_availability:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{code!~"5..", service="auth-api"}[1h]))
          /
          sum(rate(http_requests_total{service="auth-api"}[1h]))

      # 6-hour and 3-day windows (for slow-burn alerts)
      - record: job:http_availability:ratio_rate6h
        expr: |
          sum(rate(http_requests_total{code!~"5..", service="auth-api"}[6h]))
          /
          sum(rate(http_requests_total{service="auth-api"}[6h]))

      - record: job:http_availability:ratio_rate3d
        expr: |
          sum(rate(http_requests_total{code!~"5..", service="auth-api"}[3d]))
          /
          sum(rate(http_requests_total{service="auth-api"}[3d]))

Multi-window, multi-burn-rate alerting

# A two-window alert fires only when burn rate is elevated at both short and long scales,
# reducing false positives while staying responsive
alerts:
  - name: HighBurnRateAuthAPI
    condition: |
      (
        # Fast burn: 14x over 5m AND 1h — depletes ~2% of budget in 1 hour
        job:http_availability:ratio_rate5m < (1 - 14 * (1 - 0.999))
        and
        job:http_availability:ratio_rate1h < (1 - 14 * (1 - 0.999))
      )
      or
      (
        # Slow burn: 2x over 6h AND 3d — depletes 20% of budget in 3 days
        job:http_availability:ratio_rate6h < (1 - 2 * (1 - 0.999))
        and
        job:http_availability:ratio_rate3d < (1 - 2 * (1 - 0.999))
      )
    severity: page    # Fast burn = page; slow burn = ticket
    labels:
      team: platform

Error budget policy

auth-api error budget policy — SLO: 99.9% availability / 30-day rolling window

Budget: 43.2 minutes downtime per month

> 75% budget remaining  →  Normal. No policy changes.
50–75% remaining        →  Review deployment frequency. Deprioritise non-essential risk.
25–50% remaining        →  Freeze optional feature deployments. Focus sprint on reliability.
< 25% remaining         →  Incident response posture: engineering focus on restoring SLO.
                            Escalate to engineering manager for resourcing decision.
0% (budget exhausted)   →  Stop all non-critical deployments until budget resets.
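
A policy table like this is mechanical enough to encode directly, which keeps the response consistent during an incident. A sketch (thresholds follow the table above; the function name and wording are hypothetical):

```python
# Map remaining error budget to the policy action from the table above.
def budget_policy(remaining_fraction: float) -> str:
    if remaining_fraction <= 0:
        return "Stop all non-critical deployments until budget resets"
    if remaining_fraction < 0.25:
        return "Incident response posture: focus on restoring SLO"
    if remaining_fraction < 0.50:
        return "Freeze optional feature deployments; reliability-focused sprint"
    if remaining_fraction < 0.75:
        return "Review deployment frequency; deprioritise non-essential risk"
    return "Normal: no policy changes"

print(budget_policy(0.80))   # Normal: no policy changes
print(budget_policy(0.10))   # Incident response posture: focus on restoring SLO
```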


Part of the PushBackLog Best Practices Library.