Distributed Tracing
Status: Complete
Category: Observability
Default enforcement: Advisory
Author: PushBackLog team
Tags
- Topic: observability, architecture
- Skillset: backend, devops
- Technology: generic
- Stage: execution, deployment
Summary
Distributed tracing tracks a single request or operation as it flows through multiple services, capturing a timeline of spans that show exactly where time was spent and where failures occurred. In a microservices or service-oriented environment, it is the primary tool for diagnosing latency and cross-service failures that are invisible from logs alone.
Rationale
The distributed debugging problem
In a monolith, a slow request means one process to interrogate: one set of logs, one profiler, one stack trace. In a system with multiple services, the same slow request might touch an API gateway, an auth service, an order service, a notification service, and three database calls. Logs from each service exist in isolation; without a shared identifier and timeline, stitching together what happened requires painful manual correlation.
Distributed tracing solves this by:
- Assigning a trace ID to the inbound request
- Propagating that ID through all downstream service calls (via HTTP headers)
- Having each service record spans — named, timed operations — with the trace ID attached
- Assembling those spans into a waterfall diagram in a tracing backend
The three pillars of observability
Distributed tracing is one of the three pillars:
| Pillar | What it answers |
|---|---|
| Logs | What happened at a point in time |
| Metrics | How is the system behaving in aggregate |
| Traces | Why is this specific request slow or failing |
Traces complement logs (a trace ID recorded in each log line links that log entry to its trace) and metrics (tracing reveals which service is responsible for a metric degradation).
Guidance
OpenTelemetry: the standard instrumentation layer
OpenTelemetry (OTel) is the CNCF standard for emitting traces, metrics, and logs. It provides:
- Language-specific SDKs (Node.js, Java, Python, Go, .NET, etc.)
- Auto-instrumentation for common frameworks (Express, Fastify, Spring, Django)
- A vendor-neutral protocol (OTLP) that sends to Jaeger, Zipkin, Datadog, Honeycomb, Grafana Tempo, etc.
```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
  instrumentations: [getNodeAutoInstrumentations()], // Auto-instruments HTTP, DB, etc.
});
sdk.start();
// Express routes, database queries, and outbound HTTP calls are now automatically traced
```
Trace context propagation
For traces to span service boundaries, each outbound call must carry the trace context headers:
- traceparent: the standard W3C Trace Context header (replaces X-B3-TraceId etc.)
- Auto-instrumentation handles this for HTTP; manual propagation is required for queues/workers
```typescript
// Manual context propagation to a message queue
import { propagation, context } from '@opentelemetry/api';

const carrier: Record<string, string> = {};
propagation.inject(context.active(), carrier);
await queue.publish('order.placed', { ...orderData, traceContext: carrier });

// Consumer side: extract and restore context
const activeContext = propagation.extract(context.active(), message.traceContext);
context.with(activeContext, () => processOrder(message));
```
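Under the hood, the W3C Trace Context that propagation.inject writes is a single traceparent string with four dash-separated hex fields: version, trace ID, parent span ID, and flags. A minimal, dependency-free parser as a sketch (the header value below is illustrative):

```typescript
// Parse a W3C traceparent header: version-traceId-parentSpanId-flags
interface TraceParent {
  version: string;   // 2 hex chars, currently "00"
  traceId: string;   // 32 hex chars, must not be all zeros
  parentId: string;  // 16 hex chars, must not be all zeros
  sampled: boolean;  // lowest bit of the flags byte
}

function parseTraceparent(header: string): TraceParent | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, version, traceId, parentId, flags] = m;
  // All-zero trace or span IDs are invalid per the spec
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

const ctx = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
// ctx.sampled is true: the upstream service chose to record this trace
```

In practice the OTel SDK parses and emits this header for you; seeing the shape mainly helps when debugging broken propagation at service boundaries.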
What to instrument manually
Auto-instrumentation covers HTTP and DB. Manually add spans for:
- Business logic operations worth timing (calculatePricing, validateInventory)
- External service calls not covered by auto-instrumentation
- Expensive in-process operations
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('orders-service');

async function placeOrder(order: Order) {
  return tracer.startActiveSpan('order.place', async (span) => {
    span.setAttributes({ 'order.id': order.id, 'order.value': order.total });
    try {
      const result = await processOrder(order);
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end(); // End the span on both success and failure paths
    }
  });
}
```
Examples
Trace waterfall for a slow checkout
A checkout request taking 2.3 seconds produces this span tree:
```
GET /checkout                        2300ms
├─ auth.validateToken                  12ms
├─ inventory.checkAvailability        180ms
├─ pricing.calculate                 1800ms  ← bottleneck
│  ├─ db.query (pricing_rules)         20ms
│  └─ external.taxApi                1760ms  ← slow external call
└─ cart.reserve                        40ms
```
Without tracing, the slow checkout is diagnosed as “something in checkout is slow.” With tracing, the cause is immediately visible: the external tax API.
Anti-patterns
1. Not propagating trace context across service boundaries
If Service A creates a trace but doesn’t pass traceparent to Service B, the trace breaks. Each service’s spans appear as disconnected traces, losing the end-to-end view. Auto-instrumentation handles HTTP; manually propagate context for queues and async workers.
2. Sampling too aggressively
Tracing backends charge per span ingested; teams respond by sampling at 0.1%. This means 1 in 1,000 requests is traced. The rare-but-critical slow paths — the ones that only manifest for specific users or data patterns — are never sampled. Use head-based sampling at a reasonable rate (1-5%) plus tail-based sampling (always capture errors and slow requests).
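The combined policy can be sketched as a plain decision function. This is illustrative only: the thresholds and the hash-based head decision are assumptions, and in a real deployment head sampling is configured via the SDK's sampler while tail sampling runs in the collector.

```typescript
// Illustrative head + tail sampling decision for a completed root span.
interface CompletedSpan {
  traceId: string;     // 32 hex chars
  durationMs: number;
  hasError: boolean;
}

const HEAD_SAMPLE_RATE = 0.05;   // keep ~5% of ordinary traffic
const SLOW_THRESHOLD_MS = 1000;  // always keep slow requests

// Head decision: deterministic on the trace ID, so every service in the
// same trace makes the same choice (this mirrors ratio-based samplers).
function headSampled(traceId: string, rate: number): boolean {
  const bucket = parseInt(traceId.slice(-8), 16) / 0xffffffff;
  return bucket < rate;
}

// Tail decision: always keep errors and slow requests, otherwise fall
// back to the head decision.
function keepTrace(span: CompletedSpan): boolean {
  if (span.hasError) return true;
  if (span.durationMs >= SLOW_THRESHOLD_MS) return true;
  return headSampled(span.traceId, HEAD_SAMPLE_RATE);
}
```

The key property is that the interesting traces (errors, outliers) are never lost to the ratio, while the ratio keeps the cost of ordinary traffic bounded.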
3. Adding traces without linking to logs
Tracing is most powerful when trace IDs appear in log lines too. Always inject the current trace ID and span ID into the logging context so logs and traces are joinable.
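A minimal sketch of that join, assuming a structured logger that accepts extra fields. The field names trace_id/span_id are a common convention, not a requirement, and in a real OTel app the IDs would come from the active span's context rather than being passed in by hand:

```typescript
// Attach trace context to a structured log record so a log line can be
// looked up by trace ID in the tracing backend (and vice versa).
interface ActiveSpanContext {
  traceId: string;
  spanId: string;
}

function withTraceContext(
  fields: Record<string, unknown>,
  ctx: ActiveSpanContext | undefined,
): Record<string, unknown> {
  if (!ctx) return fields; // no active span: log the fields unchanged
  return { ...fields, trace_id: ctx.traceId, span_id: ctx.spanId };
}

// Example: the context here is hard-coded for illustration
const logLine = withTraceContext(
  { level: 'warn', msg: 'tax API slow', durationMs: 1760 },
  { traceId: '4bf92f3577b34da6a3ce929d0e0e4736', spanId: '00f067aa0ba902b7' },
);
```

Most logging libraries let you install this as a hook or mixin once, so every log line in the service is enriched automatically.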
4. Treating tracing as optional dev tooling
Tracing needs to be in production to be useful for diagnosing production incidents. Instrument early; configure tracing before your first production deployment.
Related practices
Part of the PushBackLog Best Practices Library.