SLOs, SLIs, and Error Budgets

Status: Complete Category: Observability Default enforcement: Advisory Author: PushBackLog team


Tags

  • Topic: observability, reliability, sre
  • Skillset: devops, platform
  • Technology: generic
  • Stage: operations, architecture

Summary

Service Level Indicators (SLIs) are quantitative measures of service behaviour. Service Level Objectives (SLOs) are target thresholds for those measures. Error budgets are the permitted amount of unreliability before corrective action is required. Together, SLIs, SLOs, and error budgets provide a data-driven framework for making explicit trade-offs between reliability, feature velocity, and operational investment.


Rationale

Without defined reliability targets, every outage is equally critical and every deployment is equally risky. Teams without SLOs either over-react to minor degradations (treating everything as an emergency) or under-react to genuine reliability problems (because there is no agreed standard to compare against).

SLOs externalise reliability expectations as shared agreements. They enable rational decision-making: when the error budget is healthy, the team can ship features; when the error budget is depleted, reliability work takes priority. This makes reliability a technical and business conversation rather than a purely reactive operational one.


Guidance

Choosing SLIs

An SLI is a carefully defined quantitative measure of the property you care about. Good SLIs:

  • Are measurable from data you already collect (or can collect cheaply)
  • Reflect the user experience — not internal implementation details
  • Map to the four golden signals: latency, traffic, errors, and saturation

Common SLIs:

Service type     Typical SLIs
HTTP API         Request success rate, p95 latency, p99 latency
Async worker     Job completion rate, processing lag
Batch job        Completion rate, duration vs. target
Data pipeline    Data freshness, record error rate
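
As an illustration of how the HTTP API row reduces to simple arithmetic (the request records and field names below are hypothetical, not from any particular metrics library), both SLIs are just a ratio and a percentile over raw request data:

```python
import math

# Hypothetical request records; in practice these come from your metrics store.
requests = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 340},
    {"status": 503, "latency_ms": 900},
    {"status": 200, "latency_ms": 80},
    {"status": 201, "latency_ms": 150},
]

# Availability SLI: fraction of non-5xx responses.
good = sum(1 for r in requests if r["status"] < 500)
success_rate = good / len(requests)

# Latency SLI: p95 latency via the nearest-rank method.
latencies = sorted(r["latency_ms"] for r in requests)
rank = math.ceil(0.95 * len(latencies))  # 1-based nearest-rank index
p95_latency = latencies[rank - 1]

print(f"Success rate: {success_rate:.0%}, p95 latency: {p95_latency}ms")
```

Note how one slow failed request dominates the p95 here, which is exactly why percentile SLIs surface user-visible tail pain that averages hide.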

Setting SLOs

An SLO is a target percentile or rate for an SLI over a rolling window. SLOs should:

  • Be set based on actual user impact, not aspirationally
  • Begin conservatively — it is easier to tighten an SLO than to explain why you missed one
  • Be agreed upon by both engineering and the business/product teams it serves
  • Reflect a rolling window (28-day rolling is common) rather than a calendar month

Example SLO definition:

“99.5% of requests to the /api/tasks endpoint return a successful response (HTTP 2xx) within 300ms over a rolling 28-day window.”

An SLO is not an SLA. An SLA (Service Level Agreement) is an external commitment with financial consequences. An SLO is an internal operational target. SLOs should be slightly more ambitious than SLAs to preserve headroom.

Error budgets

The error budget is the complement of the SLO: an SLO of 99.5% implies a 0.5% error budget — the permitted proportion of “bad” events before the SLO is breached.

Error budgets answer: “how much unreliability do we have remaining this period?”
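
The arithmetic is just the complement of the SLO; a minimal sketch with hypothetical request counts:

```python
# Error budget arithmetic for a request-based SLO.
# The request and failure counts here are hypothetical illustrations.
slo_target = 0.995            # 99.5% SLO
total_requests = 10_000_000   # events in the rolling window

error_budget_fraction = 1 - slo_target               # 0.5% of events may fail
allowed_bad_events = error_budget_fraction * total_requests

observed_bad_events = 32_000                         # failures so far this window
budget_remaining = 1 - observed_bad_events / allowed_bad_events

print(f"Allowed bad events: {allowed_bad_events:,.0f}")   # 50,000
print(f"Budget remaining:   {budget_remaining:.0%}")      # 36%
```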

When error budget is healthy (> 50% remaining):

  • Feature development can proceed at normal pace
  • Risky deployments (large changes, schema migrations) are acceptable

When error budget is tight (< 25% remaining):

  • Reliability improvements are prioritised in the backlog
  • Risky deployments are deferred or require additional validation

When error budget is exhausted:

  • A feature freeze on risky changes is triggered
  • Reliability work takes priority until the budget recovers

Burn rate alerts

Rather than alerting when an SLO is breached (too late), alert on burn rate: how fast the error budget is being consumed relative to the expected pace.

Burn rate    Meaning                                         Alert action
1×           Budget consumed at exactly the expected rate    No alert
6×           Budget will be exhausted in ~5 days             PagerDuty / ticket
14.4×        Budget will be exhausted in ~2 days             Wake someone up

Multi-window burn rate alerts (compare a short window against a longer window) reduce false positives from brief spikes.
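
A minimal sketch of that two-window logic (the 14.4× threshold is the common fast-burn default for a 30-day window; the error rates passed in are hypothetical):

```python
# Multi-window burn-rate check: fire only when both the short and the long
# window show elevated burn, which filters out brief spikes.
SLO_TARGET = 0.999
BUDGET = 1 - SLO_TARGET  # 0.1% error budget

def burn_rate(error_rate: float) -> float:
    """How fast the budget is being consumed relative to the expected pace."""
    return error_rate / BUDGET

def should_page(short_window_error_rate: float,
                long_window_error_rate: float,
                threshold: float = 14.4) -> bool:
    """Fast-burn page: both windows (e.g. 5m and 1h) must exceed the threshold."""
    return (burn_rate(short_window_error_rate) >= threshold
            and burn_rate(long_window_error_rate) >= threshold)

# A brief spike: short window is hot, long window is calm -> no page.
print(should_page(0.05, 0.002))   # False
# Sustained errors: both windows hot -> page.
print(should_page(0.05, 0.02))    # True
```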


Common failure modes

Failure                      Description
Aspirational SLOs            SLOs set to 99.99% “because that sounds good”; never achievable; error budget always depleted
Measuring the wrong thing    SLIs track internal metrics (CPU, memory) rather than user-visible outcomes
SLO as a pass/fail grade     Teams optimise for SLO compliance rather than actual reliability
No error budget policy       Error budgets calculated but no agreed process for what to do when depleted
Alerting on SLO breach       Alerts fire only after the SLO is already breached, providing no time to react

Examples

Defining SLIs and SLOs for an API service

# slo-definitions.yaml — checked into version control alongside service code
service: auth-api
owner: platform-team

slos:
  - name: Availability
    description: Proportion of successful HTTP requests (non-5xx responses)
    sli:
      type: request_based
      good_events: http_requests{code!~"5..", service="auth-api"}
      total_events: http_requests{service="auth-api"}
    target: 99.9%   # 43.2 minutes downtime budget per month
    window: 30d

  - name: Latency
    description: Proportion of requests completing within 300ms (target 95% ⇒ p95 ≤ 300ms)
    sli:
      type: request_based
      good_events: http_request_duration_seconds_bucket{le="0.3", service="auth-api"}
      total_events: http_request_duration_seconds_count{service="auth-api"}
    target: 95%
    window: 30d
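
The downtime-budget comment in the YAML above can be checked with a quick calculation (a generic helper for illustration, not part of any SLO tooling):

```python
# Convert an SLO target and rolling window into a downtime budget.
def downtime_budget_minutes(target: float, window_days: int) -> float:
    window_minutes = window_days * 24 * 60
    return (1 - target) * window_minutes

# 99.9% over 30 days -> 43.2 minutes of allowed downtime.
print(round(downtime_budget_minutes(0.999, 30), 1))   # prints 43.2
# 99.5% over 28 days -> 201.6 minutes.
print(round(downtime_budget_minutes(0.995, 28), 1))   # prints 201.6
```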

Prometheus recording rules for an availability SLO

# prometheus/rules/auth-api-slo.yaml
groups:
  - name: auth-api-slo
    rules:
      # 5-minute burn-rate window
      - record: job:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{code!~"5..", service="auth-api"}[5m]))
          /
          sum(rate(http_requests_total{service="auth-api"}[5m]))

      # 1-hour burn-rate window (for fast-burn alerts)
      - record: job:http_availability:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{code!~"5..", service="auth-api"}[1h]))
          /
          sum(rate(http_requests_total{service="auth-api"}[1h]))

      # 6-hour and 3-day windows (for slow-burn alerts)
      - record: job:http_availability:ratio_rate6h
        expr: |
          sum(rate(http_requests_total{code!~"5..", service="auth-api"}[6h]))
          /
          sum(rate(http_requests_total{service="auth-api"}[6h]))

      - record: job:http_availability:ratio_rate3d
        expr: |
          sum(rate(http_requests_total{code!~"5..", service="auth-api"}[3d]))
          /
          sum(rate(http_requests_total{service="auth-api"}[3d]))

Multi-window, multi-burn-rate alerting

# A two-window alert fires only when burn rate is elevated at both short and long scales,
# reducing false positives while staying responsive
alerts:
  - name: HighBurnRateAuthAPI
    condition: |
      (
        # Fast burn: 14x over 5m AND 1h — depletes ~2% of budget in 1 hour
        job:http_availability:ratio_rate5m < (1 - 14 * (1 - 0.999))
        and
        job:http_availability:ratio_rate1h < (1 - 14 * (1 - 0.999))
      )
      or
      (
        # Slow burn: 2x over 6h AND 3d — depletes 20% of budget in 3 days
        job:http_availability:ratio_rate6h < (1 - 2 * (1 - 0.999))
        and
        job:http_availability:ratio_rate3d < (1 - 2 * (1 - 0.999))
      )
    severity: page    # Fast burn = page; slow burn = ticket
    labels:
      team: platform

Error budget policy

auth-api error budget policy — SLO: 99.9% availability / 30-day rolling window

Budget: 43.2 minutes downtime per month

> 75% budget remaining  →  Normal. No policy changes.
50–75% remaining        →  Review deployment frequency. Deprioritise non-essential risk.
25–50% remaining        →  Freeze optional feature deployments. Focus sprint on reliability.
< 25% remaining         →  Incident response posture: engineering focus on restoring SLO.
                            Escalate to engineering manager for resourcing decision.
0% (budget exhausted)   →  Stop all non-critical deployments until budget resets.
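
A policy table like this is mechanical enough to encode directly, which keeps the response consistent during an incident. A sketch (thresholds follow the table above; the function name and wording are hypothetical):

```python
# Map remaining error budget to the policy action from the table above.
def budget_policy(remaining_fraction: float) -> str:
    if remaining_fraction <= 0:
        return "Stop all non-critical deployments until budget resets"
    if remaining_fraction < 0.25:
        return "Incident response posture: focus on restoring SLO"
    if remaining_fraction < 0.50:
        return "Freeze optional feature deployments; reliability-focused sprint"
    if remaining_fraction < 0.75:
        return "Review deployment frequency; deprioritise non-essential risk"
    return "Normal: no policy changes"

print(budget_policy(0.80))   # Normal: no policy changes
print(budget_policy(0.10))   # Incident response posture: focus on restoring SLO
```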


Part of the PushBackLog Best Practices Library.