Distributed Tracing
Status: Complete
Category: Observability
Default enforcement: Advisory
Author: PushBackLog team
Tags
- Topic: observability, architecture
- Skillset: backend, devops
- Technology: generic
- Stage: execution, deployment
Summary
Distributed tracing tracks a single request or operation as it flows through multiple services, capturing a timeline of spans that show exactly where time was spent and where failures occurred. In a microservices or service-oriented environment, it is the primary tool for diagnosing latency and cross-service failures that are invisible from logs alone.
Rationale
The distributed debugging problem
In a monolith, a slow request means one process to interrogate: one set of logs, one profiler, one stack trace. In a system with multiple services, the same slow request might touch an API gateway, an auth service, an order service, a notification service, and three database calls. Logs from each service exist in isolation; without a shared identifier and timeline, stitching together what happened requires painful manual correlation.
Distributed tracing solves this by:
- Assigning a trace ID to the inbound request
- Propagating that ID through all downstream service calls (via HTTP headers)
- Having each service record spans — named, timed operations — with the trace ID attached
- Assembling those spans into a waterfall diagram in a tracing backend
The three pillars of observability
Distributed tracing is one of the three pillars:
| Pillar | What it answers |
|---|---|
| Logs | What happened at a point in time |
| Metrics | How is the system behaving in aggregate |
| Traces | Why is this specific request slow or failing |
Traces complement logs (a trace ID recorded in each log line links that log entry to its trace) and metrics (tracing reveals which service is responsible for a metric degradation).
Guidance
OpenTelemetry: the standard instrumentation layer
OpenTelemetry (OTel) is the CNCF standard for emitting traces, metrics, and logs. It provides:
- Language-specific SDKs (Node.js, Java, Python, Go, .NET, etc.)
- Auto-instrumentation for common frameworks (Express, Fastify, Spring, Django)
- A vendor-neutral protocol (OTLP) that sends to Jaeger, Zipkin, Datadog, Honeycomb, Grafana Tempo, etc.
```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
  instrumentations: [getNodeAutoInstrumentations()], // Auto-instruments HTTP, DB, etc.
});
sdk.start();
// Express routes, database queries, and outbound HTTP calls are now automatically traced
```
Trace context propagation
For traces to span service boundaries, each outbound call must carry the trace context headers:
- traceparent: the standard W3C Trace Context header (replaces X-B3-TraceId etc.)
- Auto-instrumentation handles this for HTTP; manual propagation is required for queues/workers
```typescript
// Manual context propagation to a message queue
import { propagation, context } from '@opentelemetry/api';

const carrier: Record<string, string> = {};
propagation.inject(context.active(), carrier);
await queue.publish('order.placed', { ...orderData, traceContext: carrier });

// Consumer side: extract and restore context
const activeContext = propagation.extract(context.active(), message.traceContext);
context.with(activeContext, () => processOrder(message));
```
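Under the hood, the W3C Trace Context that propagation.inject writes is a single traceparent string with four dash-separated hex fields: version, trace ID, parent span ID, and flags. A minimal, dependency-free parser as a sketch (the header value below is illustrative):

```typescript
// Parse a W3C traceparent header: version-traceId-parentSpanId-flags
interface TraceParent {
  version: string;   // 2 hex chars, currently "00"
  traceId: string;   // 32 hex chars, must not be all zeros
  parentId: string;  // 16 hex chars, must not be all zeros
  sampled: boolean;  // lowest bit of the flags byte
}

function parseTraceparent(header: string): TraceParent | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, version, traceId, parentId, flags] = m;
  // All-zero trace or span IDs are invalid per the spec
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

const ctx = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
// ctx.sampled is true: the upstream service chose to record this trace
```

In practice the OTel SDK parses and emits this header for you; seeing the shape mainly helps when debugging broken propagation at service boundaries.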
What to instrument manually
Auto-instrumentation covers HTTP and DB. Manually add spans for:
- Business logic operations worth timing (calculatePricing, validateInventory)
- External service calls not covered by auto-instrumentation
- Expensive in-process operations
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('orders-service');

async function placeOrder(order: Order) {
  return tracer.startActiveSpan('order.place', async (span) => {
    span.setAttributes({ 'order.id': order.id, 'order.value': order.total });
    try {
      const result = await processOrder(order);
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end(); // End the span on both success and failure paths
    }
  });
}
```
Examples
Trace waterfall for a slow checkout
A checkout request taking 2.3 seconds produces this span tree:
```
GET /checkout                        2300ms
├─ auth.validateToken                  12ms
├─ inventory.checkAvailability        180ms
├─ pricing.calculate                 1800ms  ← bottleneck
│  ├─ db.query (pricing_rules)         20ms
│  └─ external.taxApi                1760ms  ← slow external call
└─ cart.reserve                        40ms
```
Without tracing, the slow checkout is diagnosed as “something in checkout is slow.” With tracing, the cause is immediately visible: the external tax API.
Anti-patterns
1. Not propagating trace context across service boundaries
If Service A creates a trace but doesn’t pass traceparent to Service B, the trace breaks. Each service’s spans appear as disconnected traces, losing the end-to-end view. Auto-instrumentation handles HTTP; manually propagate context for queues and async workers.
2. Sampling too aggressively
Tracing backends charge per span ingested; teams respond by sampling at 0.1%. This means 1 in 1,000 requests is traced. The rare-but-critical slow paths — the ones that only manifest for specific users or data patterns — are never sampled. Use head-based sampling at a reasonable rate (1-5%) plus tail-based sampling (always capture errors and slow requests).
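The combined policy can be sketched as a plain decision function. This is illustrative only: the thresholds and the hash-based head decision are assumptions, and in a real deployment head sampling is configured via the SDK's sampler while tail sampling runs in the collector.

```typescript
// Illustrative head + tail sampling decision for a completed root span.
interface CompletedSpan {
  traceId: string;     // 32 hex chars
  durationMs: number;
  hasError: boolean;
}

const HEAD_SAMPLE_RATE = 0.05;   // keep ~5% of ordinary traffic
const SLOW_THRESHOLD_MS = 1000;  // always keep slow requests

// Head decision: deterministic on the trace ID, so every service in the
// same trace makes the same choice (this mirrors ratio-based samplers).
function headSampled(traceId: string, rate: number): boolean {
  const bucket = parseInt(traceId.slice(-8), 16) / 0xffffffff;
  return bucket < rate;
}

// Tail decision: always keep errors and slow requests, otherwise fall
// back to the head decision.
function keepTrace(span: CompletedSpan): boolean {
  if (span.hasError) return true;
  if (span.durationMs >= SLOW_THRESHOLD_MS) return true;
  return headSampled(span.traceId, HEAD_SAMPLE_RATE);
}
```

The key property is that the interesting traces (errors, outliers) are never lost to the ratio, while the ratio keeps the cost of ordinary traffic bounded.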
3. Adding traces without linking to logs
Tracing is most powerful when trace IDs appear in log lines too. Always inject the current trace ID and span ID into the logging context so logs and traces are joinable.
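A minimal sketch of that join, assuming a structured logger that accepts extra fields. The field names trace_id/span_id are a common convention, not a requirement, and in a real OTel app the IDs would come from the active span's context rather than being passed in by hand:

```typescript
// Attach trace context to a structured log record so a log line can be
// looked up by trace ID in the tracing backend (and vice versa).
interface ActiveSpanContext {
  traceId: string;
  spanId: string;
}

function withTraceContext(
  fields: Record<string, unknown>,
  ctx: ActiveSpanContext | undefined,
): Record<string, unknown> {
  if (!ctx) return fields; // no active span: log the fields unchanged
  return { ...fields, trace_id: ctx.traceId, span_id: ctx.spanId };
}

// Example: the context here is hard-coded for illustration
const logLine = withTraceContext(
  { level: 'warn', msg: 'tax API slow', durationMs: 1760 },
  { traceId: '4bf92f3577b34da6a3ce929d0e0e4736', spanId: '00f067aa0ba902b7' },
);
```

Most logging libraries let you install this as a hook or mixin once, so every log line in the service is enriched automatically.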
4. Treating tracing as optional dev tooling
Tracing needs to be in production to be useful for diagnosing production incidents. Instrument early; configure tracing before your first production deployment.
Related practices
Part of the PushBackLog Best Practices Library.