
Engineering Metrics

Status: Complete
Category: Management
Default enforcement: Soft
Author: PushBackLog team


Tags

  • Topic: management, metrics, delivery
  • Skillset: management
  • Technology: generic
  • Stage: planning, operations

Summary

Engineering metrics are quantitative signals used to understand and improve the effectiveness of software delivery. When chosen carefully, metrics expose bottlenecks, predict reliability risk, and guide improvement. When chosen carelessly — or used as individual performance measures — they are gamed, misleading, and harmful to team culture.


Rationale

You cannot improve what you do not measure. But you will also not improve what you measure badly. Engineering metrics provide a feedback loop between team practices and delivery outcomes — but only if the team measures outcomes (deployment frequency, lead time, incident recovery) rather than proxies (lines of code, story points closed, tickets resolved).

The DORA research programme (DevOps Research and Assessment) has produced the most robust empirical evidence for which metrics predict high-performing software delivery. DORA’s four key metrics are the recommended starting point for any engineering metrics programme.


Guidance

DORA’s four key metrics

Metric                           What it measures                                          Top performer benchmark
Deployment frequency             How often code is deployed to production                  On-demand (multiple times per day)
Lead time for changes            Time from commit to production                            Less than one hour
Change failure rate              Percentage of deployments causing incidents / rollbacks   0–15%
Time to restore service (MTTR)   How long to recover from a production incident            Less than one hour

These four metrics form two pairs: the throughput metrics (deployment frequency + lead time) and the stability metrics (change failure rate + MTTR). Elite teams score well on all four simultaneously. The data shows throughput and stability are not in tension — high-performing teams achieve both.
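As an illustration of the two pairs, here is a minimal Python sketch that derives all four metrics from a stream of deployment and incident events. The record shapes (`commit_at`, `deployed_at`, `failed`, and so on) are assumptions made for the example, not a fixed schema.

```python
from datetime import datetime
from statistics import median

# Illustrative event records for a 7-day window; field names are assumptions.
deploys = [
    {"commit_at": datetime(2024, 3, 1, 9, 0),  "deployed_at": datetime(2024, 3, 1, 9, 40),  "failed": False},
    {"commit_at": datetime(2024, 3, 1, 13, 0), "deployed_at": datetime(2024, 3, 1, 14, 5),  "failed": True},
    {"commit_at": datetime(2024, 3, 2, 10, 0), "deployed_at": datetime(2024, 3, 2, 10, 30), "failed": False},
    {"commit_at": datetime(2024, 3, 3, 11, 0), "deployed_at": datetime(2024, 3, 3, 11, 50), "failed": False},
]
incidents = [
    {"opened_at": datetime(2024, 3, 1, 14, 10), "resolved_at": datetime(2024, 3, 1, 14, 40)},
]
window_days = 7

# Throughput pair
deploy_frequency = len(deploys) / window_days  # deploys per day
lead_times = [(d["deployed_at"] - d["commit_at"]).total_seconds() / 60 for d in deploys]
median_lead_minutes = median(lead_times)

# Stability pair
change_failure_rate = sum(d["failed"] for d in deploys) / len(deploys)
mttr_minutes = sum(
    (i["resolved_at"] - i["opened_at"]).total_seconds() / 60 for i in incidents
) / len(incidents)

print(f"Deploy frequency:    {deploy_frequency:.2f}/day")
print(f"Median lead time:    {median_lead_minutes:.0f} min")
print(f"Change failure rate: {change_failure_rate:.0%}")
print(f"MTTR:                {mttr_minutes:.0f} min")
```

Note that the pairing falls straight out of the data sources: the throughput pair needs only deployment events, the stability pair also needs failure tags and incident timestamps.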

Tracking and baselines

Before optimising, establish baselines. Collect at least 90 days of data before drawing conclusions about trends. Without a baseline you cannot even make relative comparisons (“better or worse than last sprint”); for absolute ones (“are we high-performing?”) you also need external reference points such as the DORA benchmarks.

Measurement tooling should be automated and continuous, not collected manually — manual data collection introduces bias and is unsustainable.
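As a minimal sketch of the baseline comparison, the snippet below compares the current period's median lead time against a 90-day baseline; the sample values are illustrative.

```python
from statistics import median

# Illustrative daily lead-time samples in minutes: 90 days of baseline data,
# then the 7 days under review.
history = [55] * 45 + [46] * 45   # earlier slower period, later faster period
current = [44, 46, 43, 47, 45, 44, 46]

baseline = median(history)                       # 90-day baseline
now = median(current)                            # current sprint
delta_pct = (now - baseline) / baseline * 100    # change relative to baseline

print(f"Baseline (90d median): {baseline:.1f} min")
print(f"Current (7d median):   {now:.1f} min")
print(f"Change vs. baseline:   {delta_pct:+.1f}%")
```

A median rather than a mean keeps one pathological deploy (a stuck pipeline, a weekend-spanning commit) from distorting the trend.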

Additional signals

Beyond DORA, other signals can expose specific problem areas:

Signal                                                  What it detects
Test coverage trend                                     Whether test confidence is growing or eroding
Build/pipeline duration                                 CI feedback loop quality
Alert noise ratio (alerts paged per actionable alert)   On-call sustainability
P90/P99 latency trend                                   User-visible performance degradation
Escaped defect rate                                     Defects reaching production that should have been caught earlier
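Two of these signals can be computed from plain event logs; the sketch below shows the arithmetic for the alert noise ratio and the escaped defect rate. The record fields (`actionable`, `found_in`) are assumptions for the example.

```python
# Illustrative on-call and defect records; field names are assumptions.
alerts = [
    {"actionable": True},  {"actionable": False}, {"actionable": False},
    {"actionable": True},  {"actionable": False}, {"actionable": False},
]
defects = [
    {"found_in": "production"}, {"found_in": "ci"},
    {"found_in": "code-review"}, {"found_in": "production"},
    {"found_in": "ci"},
]

# Alerts paged per actionable alert: 1.0 is the ideal, higher means noise.
actionable = sum(a["actionable"] for a in alerts)
noise_ratio = len(alerts) / actionable

# Share of defects that escaped pre-production checks.
escaped = sum(d["found_in"] == "production" for d in defects)
escaped_defect_rate = escaped / len(defects)

print(f"Alert noise ratio:   {noise_ratio:.1f} alerts per actionable alert")
print(f"Escaped defect rate: {escaped_defect_rate:.0%}")
```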

Goodhart’s Law and gaming

“When a measure becomes a target, it ceases to be a good measure.” — Goodhart’s Law

Metrics used to evaluate individual engineers or teams produce gaming:

  • Deployment frequency is inflated by deploying no-ops
  • Story points closed increases by inflating estimates
  • Test coverage improves by adding trivial tests that do not validate behaviour

Engineering metrics should be used by the team for self-diagnosis, not by management to score individuals. Publish metrics at the team and organisational level; do not attach them to performance reviews.

Reviewing metrics

A monthly or quarterly metrics review with the team serves several purposes:

  • Identifies genuine bottlenecks (slow CI, high change failure rate in one service)
  • Celebrates genuine improvements
  • Keeps the metric set fresh — stop measuring things that are no longer informative
  • Grounds improvement initiatives in data rather than intuition

Common failure modes

Failure                        Description
Proxy metrics as goals         Story points, lines of code, and test count used as performance targets
Individual-level measurement   Metrics used to evaluate or rank individual engineers
No baselines                   Teams track metrics but have no context for whether numbers are good or bad
Data collected manually        Manual collection is biased, inconsistent, and abandoned under pressure
Vanity metrics                 Numbers that look good but do not reflect delivery health (e.g., total commits)

Examples

DORA metrics: definitions and measurement sources

Metric                    What to measure                                                 Source
Deployment frequency      Number of production deployments per week                       CI/CD pipeline webhook → dashboard
Lead time for changes     Commit timestamp to production deploy timestamp                 GitHub API + deployment events
Change failure rate       failed_deployments / total_deployments (rollbacks + hotfixes)   Deployment events tagged rollback or hotfix
Time to restore service   Incident open timestamp to resolution timestamp                 PagerDuty incidents closed with severity >= P2

Tracking lead time in GitHub Actions

# On every production deploy, emit the commit timestamp.
# Assumes a GNU userland (e.g. the ubuntu-latest runner): `date -d` is
# not portable to BSD/macOS date.
- name: Record deploy metadata
  run: |
    echo "Deploy SHA: ${{ github.sha }}"
    # Author timestamp of the deployed commit, in ISO 8601
    COMMIT_TIME=$(git log -1 --format='%aI' ${{ github.sha }})
    DEPLOY_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
    LEAD_SECONDS=$(( $(date -d "$DEPLOY_TIME" +%s) - $(date -d "$COMMIT_TIME" +%s) ))
    echo "Lead time (seconds): $LEAD_SECONDS"
    # Emit as a custom CloudWatch metric
    aws cloudwatch put-metric-data \
      --namespace "Engineering/DORA" \
      --metric-name "LeadTime" \
      --value "$LEAD_SECONDS" \
      --unit Seconds

Change failure rate Datadog query

# Deployment events tagged by outcome
sum:deploys.total{env:production, outcome:failure}.as_count() /
sum:deploys.total{env:production}.as_count()

Set an alert at > 15% (DORA “medium” threshold). Investigate before the next sprint if triggered.

Metric review format for sprint retrospective

Metrics review — Sprint 27

Deployment frequency:  4.2 / week    (target: ≥ 5)    ⚠️ below target
Lead time:             42 min        (target: ≤ 60)    ✅
Change failure rate:    4.8%         (target: ≤ 15%)   ✅
Time to restore:       22 min        (target: ≤ 60)    ✅

Trend: Lead time is steady. Deployment frequency fell this sprint due to
Christmas holiday; expected to normalise. No action required on DORA.

Debt register: 26 items (+1 vs. Sprint 26). TD-047 should close S28.
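A review block like the one above can be generated rather than hand-written. The Python sketch below renders one from a list of metric tuples; the targets are illustrative team-chosen thresholds, not values mandated by DORA.

```python
# Render a sprint metrics review; targets are illustrative team thresholds.
metrics = [
    # (name, value, unit, target, higher_is_better)
    ("Deployment frequency", 4.2, "/week", 5.0, True),
    ("Lead time", 42.0, "min", 60.0, False),
    ("Change failure rate", 4.8, "%", 15.0, False),
    ("Time to restore", 22.0, "min", 60.0, False),
]

rows = []
for name, value, unit, target, higher in metrics:
    on_target = value >= target if higher else value <= target
    relation = "≥" if higher else "≤"
    status = "✅" if on_target else "⚠️ off target"
    rows.append(
        f"{name + ':':<22} {value:>5g} {unit:<6} "
        f"(target: {relation} {target:g})  {status}"
    )

report = "\n".join(rows)
print(report)
```

Generating the block from the same data source that feeds the dashboard keeps the retrospective honest: the numbers in the review cannot drift from the numbers being collected.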


Part of the PushBackLog Best Practices Library.