
Engineering Metrics

Status: Complete
Category: Management
Default enforcement: Soft
Author: PushBackLog team


Tags

  • Topic: management, metrics, delivery
  • Skillset: management
  • Technology: generic
  • Stage: planning, operations

Summary

Engineering metrics are quantitative signals used to understand and improve the effectiveness of software delivery. When chosen carefully, metrics expose bottlenecks, predict reliability risk, and guide improvement. When chosen carelessly — or used as individual performance measures — they are gamed, misleading, and harmful to team culture.


Rationale

You cannot improve what you do not measure. But you will also not improve what you measure badly. Engineering metrics provide a feedback loop between team practices and delivery outcomes — but only if the team measures outcomes (deployment frequency, lead time, incident recovery) rather than proxies (lines of code, story points closed, tickets resolved).

The DORA research programme (DevOps Research and Assessment) has produced the most robust empirical evidence for which metrics predict high-performing software delivery. DORA’s four key metrics are the recommended starting point for any engineering metrics programme.


Guidance

DORA’s four key metrics

Metric                           What it measures                                          Top performer benchmark
Deployment frequency             How often code is deployed to production                  On-demand (multiple times per day)
Lead time for changes            Time from commit to production                            Less than one hour
Change failure rate              Percentage of deployments causing incidents / rollbacks   0–15%
Time to restore service (MTTR)   How long to recover from a production incident            Less than one hour

These four metrics form two pairs: the throughput metrics (deployment frequency + lead time) and the stability metrics (change failure rate + MTTR). Elite teams score well on all four simultaneously. The data shows throughput and stability are not in tension — high-performing teams achieve both.
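As an illustration of the two pairs, here is a minimal Python sketch that derives all four metrics from a stream of deployment and incident events. The record shapes (`commit_at`, `deployed_at`, `failed`, and so on) are assumptions made for the example, not a fixed schema.

```python
from datetime import datetime
from statistics import median

# Illustrative event records for a 7-day window; field names are assumptions.
deploys = [
    {"commit_at": datetime(2024, 3, 1, 9, 0),  "deployed_at": datetime(2024, 3, 1, 9, 40),  "failed": False},
    {"commit_at": datetime(2024, 3, 1, 13, 0), "deployed_at": datetime(2024, 3, 1, 14, 5),  "failed": True},
    {"commit_at": datetime(2024, 3, 2, 10, 0), "deployed_at": datetime(2024, 3, 2, 10, 30), "failed": False},
    {"commit_at": datetime(2024, 3, 3, 11, 0), "deployed_at": datetime(2024, 3, 3, 11, 50), "failed": False},
]
incidents = [
    {"opened_at": datetime(2024, 3, 1, 14, 10), "resolved_at": datetime(2024, 3, 1, 14, 40)},
]
window_days = 7

# Throughput pair
deploy_frequency = len(deploys) / window_days  # deploys per day
lead_times = [(d["deployed_at"] - d["commit_at"]).total_seconds() / 60 for d in deploys]
median_lead_minutes = median(lead_times)

# Stability pair
change_failure_rate = sum(d["failed"] for d in deploys) / len(deploys)
mttr_minutes = sum(
    (i["resolved_at"] - i["opened_at"]).total_seconds() / 60 for i in incidents
) / len(incidents)

print(f"Deploy frequency:    {deploy_frequency:.2f}/day")
print(f"Median lead time:    {median_lead_minutes:.0f} min")
print(f"Change failure rate: {change_failure_rate:.0%}")
print(f"MTTR:                {mttr_minutes:.0f} min")
```

Note that the pairing falls straight out of the data sources: the throughput pair needs only deployment events, the stability pair also needs failure tags and incident timestamps.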

Tracking and baselines

Before optimising, establish baselines. Collect at least 90 days of data before drawing conclusions about trends. Without a baseline you cannot even make relative comparisons (“better or worse than last sprint”); for absolute ones (“are we high-performing?”) you also need external reference points such as the DORA benchmarks.

Measurement tooling should be automated and continuous, not collected manually — manual data collection introduces bias and is unsustainable.
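As a minimal sketch of the baseline comparison, the snippet below compares the current period's median lead time against a 90-day baseline; the sample values are illustrative.

```python
from statistics import median

# Illustrative daily lead-time samples in minutes: 90 days of baseline data,
# then the 7 days under review.
history = [55] * 45 + [46] * 45   # earlier slower period, later faster period
current = [44, 46, 43, 47, 45, 44, 46]

baseline = median(history)                       # 90-day baseline
now = median(current)                            # current sprint
delta_pct = (now - baseline) / baseline * 100    # change relative to baseline

print(f"Baseline (90d median): {baseline:.1f} min")
print(f"Current (7d median):   {now:.1f} min")
print(f"Change vs. baseline:   {delta_pct:+.1f}%")
```

A median rather than a mean keeps one pathological deploy (a stuck pipeline, a weekend-spanning commit) from distorting the trend.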

Additional signals

Beyond DORA, other signals can expose specific problem areas:

Signal                                                  What it detects
Test coverage trend                                     Whether test confidence is growing or eroding
Build/pipeline duration                                 CI feedback loop quality
Alert noise ratio (alerts paged per actionable alert)   On-call sustainability
P90/P99 latency trend                                   User-visible performance degradation
Escaped defect rate                                     Defects reaching production that should have been caught earlier
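Two of these signals can be computed from plain event logs; the sketch below shows the arithmetic for the alert noise ratio and the escaped defect rate. The record fields (`actionable`, `found_in`) are assumptions for the example.

```python
# Illustrative on-call and defect records; field names are assumptions.
alerts = [
    {"actionable": True},  {"actionable": False}, {"actionable": False},
    {"actionable": True},  {"actionable": False}, {"actionable": False},
]
defects = [
    {"found_in": "production"}, {"found_in": "ci"},
    {"found_in": "code-review"}, {"found_in": "production"},
    {"found_in": "ci"},
]

# Alerts paged per actionable alert: 1.0 is the ideal, higher means noise.
actionable = sum(a["actionable"] for a in alerts)
noise_ratio = len(alerts) / actionable

# Share of defects that escaped pre-production checks.
escaped = sum(d["found_in"] == "production" for d in defects)
escaped_defect_rate = escaped / len(defects)

print(f"Alert noise ratio:   {noise_ratio:.1f} alerts per actionable alert")
print(f"Escaped defect rate: {escaped_defect_rate:.0%}")
```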

Goodhart’s Law and gaming

“When a measure becomes a target, it ceases to be a good measure.” — Goodhart’s Law

Metrics used to evaluate individual engineers or teams produce gaming:

  • Deployment frequency is inflated by deploying no-ops
  • Story points closed increases by inflating estimates
  • Test coverage improves by adding trivial tests that do not validate behaviour

Engineering metrics should be used by the team for self-diagnosis, not by management to score individuals. Publish metrics at the team and organisational level; do not attach them to performance reviews.

Reviewing metrics

A monthly or quarterly metrics review with the team serves several purposes:

  • Identifies genuine bottlenecks (slow CI, high change failure rate in one service)
  • Celebrates genuine improvements
  • Keeps the metric set fresh — stop measuring things that are no longer informative
  • Grounds improvement initiatives in data rather than intuition

Common failure modes

Failure                        Description
Proxy metrics as goals         Story points, lines of code, and test count used as performance targets
Individual-level measurement   Metrics used to evaluate or rank individual engineers
No baselines                   Teams track metrics but have no context for whether numbers are good or bad
Data collected manually        Manual collection is biased, inconsistent, and abandoned under pressure
Vanity metrics                 Numbers that look good but do not reflect delivery health (e.g., total commits)

Examples

DORA metrics: definitions and measurement sources

Metric                    What to measure                                                 Source
Deployment frequency      Number of production deployments per week                       CI/CD pipeline webhook → dashboard
Lead time for changes     Commit timestamp to production deploy timestamp                 GitHub API + deployment events
Change failure rate       failed_deployments / total_deployments (rollbacks + hotfixes)   Deployment events tagged rollback or hotfix
Time to restore service   Incident open timestamp to resolution timestamp                 PagerDuty incidents closed with severity >= P2

Tracking lead time in GitHub Actions

# On every production deploy, emit the commit timestamp.
# Assumes a GNU userland (e.g. the ubuntu-latest runner): `date -d` is
# not portable to BSD/macOS date.
- name: Record deploy metadata
  run: |
    echo "Deploy SHA: ${{ github.sha }}"
    # Author timestamp of the deployed commit, in ISO 8601
    COMMIT_TIME=$(git log -1 --format='%aI' ${{ github.sha }})
    DEPLOY_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
    LEAD_SECONDS=$(( $(date -d "$DEPLOY_TIME" +%s) - $(date -d "$COMMIT_TIME" +%s) ))
    echo "Lead time (seconds): $LEAD_SECONDS"
    # Emit as a custom CloudWatch metric
    aws cloudwatch put-metric-data \
      --namespace "Engineering/DORA" \
      --metric-name "LeadTime" \
      --value "$LEAD_SECONDS" \
      --unit Seconds

Change failure rate Datadog query

# Deployment events tagged by outcome
sum:deploys.total{env:production, outcome:failure}.as_count() /
sum:deploys.total{env:production}.as_count()

Set an alert at > 15% (DORA “medium” threshold). Investigate before the next sprint if triggered.

Metric review format for sprint retrospective

Metrics review — Sprint 27

Deployment frequency:  4.2 / week    (target: ≥ 5)    ⚠️ below target
Lead time:             42 min        (target: ≤ 60)    ✅
Change failure rate:    4.8%         (target: ≤ 15%)   ✅
Time to restore:       22 min        (target: ≤ 60)    ✅

Trend: Lead time is steady. Deployment frequency fell this sprint due to
Christmas holiday; expected to normalise. No action required on DORA.

Debt register: 26 items (+1 vs. Sprint 26). TD-047 should close S28.
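A review block like the one above can be generated rather than hand-written. The Python sketch below renders one from a list of metric tuples; the targets are illustrative team-chosen thresholds, not values mandated by DORA.

```python
# Render a sprint metrics review; targets are illustrative team thresholds.
metrics = [
    # (name, value, unit, target, higher_is_better)
    ("Deployment frequency", 4.2, "/week", 5.0, True),
    ("Lead time", 42.0, "min", 60.0, False),
    ("Change failure rate", 4.8, "%", 15.0, False),
    ("Time to restore", 22.0, "min", 60.0, False),
]

rows = []
for name, value, unit, target, higher in metrics:
    on_target = value >= target if higher else value <= target
    relation = "≥" if higher else "≤"
    status = "✅" if on_target else "⚠️ off target"
    rows.append(
        f"{name + ':':<22} {value:>5g} {unit:<6} "
        f"(target: {relation} {target:g})  {status}"
    )

report = "\n".join(rows)
print(report)
```

Generating the block from the same data source that feeds the dashboard keeps the retrospective honest: the numbers in the review cannot drift from the numbers being collected.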


Part of the PushBackLog Best Practices Library.