
Troubleshooting Microservices with OpenTelemetry Distributed Tracing

Updated on: February 15, 2026

Distributed tracing doesn’t just show you what happened. It shows you why things broke. While logs tell you a service returned a 500 error and metrics show latency spiked, only traces reveal the full chain of causation: the upstream timeout that triggered a retry storm, the N+1 query pattern that saturated your connection pool, or the missing cache hit that turned a 50ms call into a 3-second database roundtrip.

This guide covers practical, trace-based troubleshooting patterns for production microservices. You’ll learn how to use OpenTelemetry distributed traces to diagnose the most common, and most frustrating, problems that surface in distributed architectures.

What you’ll learn:

  • How to identify latency bottlenecks using trace waterfall analysis
  • Detecting N+1 query patterns and database performance issues in traces
  • Diagnosing retry storms, timeout cascades, and circuit breaker failures
  • Using error propagation traces to find root causes across service boundaries
  • Spotting connection pool exhaustion, cache misses, and queue backlogs
  • Correlating traces with logs and metrics for full-context debugging

For step-by-step instrumentation setup, see our companion guide: How to Implement Distributed Tracing in Microservices with OpenTelemetry Auto-Instrumentation. For production-hardening your instrumentation, see OpenTelemetry Instrumentation Best Practices for Microservices Observability.

Why Traces Are the Best Tool for Microservices Troubleshooting

Logs, metrics, and traces each serve a different purpose. But when a production incident hits a distributed system, traces are uniquely positioned to answer the hardest questions, especially those that span service boundaries.

| Troubleshooting Question | Logs | Metrics | Traces |
|---|---|---|---|
| Which service is slow? | ❌ Scattered across services | ✅ Latency dashboards | ✅ Waterfall shows exact span |
| Why is it slow? | 🟡 If you logged enough context | ❌ No causal detail | ✅ Child spans reveal cause |
| Which upstream call caused the error? | ❌ Requires correlation IDs | ❌ Only shows error rate | ✅ Error propagation is visible |
| Is it a single request or systemic? | ❌ Hard to aggregate | ✅ Rate/error trends | ✅ Trace grouping by pattern |
| What was the exact sequence of calls? | ❌ Requires reconstruction | ❌ No ordering info | ✅ Waterfall shows call graph |

The key insight is that traces give you causation, not just correlation. When service A calls service B, which calls service C, and C fails, a trace shows you the entire chain, the timing of each call, and exactly where things went wrong.

Anatomy of a Troubleshooting Trace

Before diving into specific patterns, let’s establish what you’re looking at in a trace waterfall. Understanding the structure makes pattern recognition faster during incidents.

A distributed trace consists of spans organized in a parent-child hierarchy, as defined by the OpenTelemetry Trace specification. Each span represents a single operation: an HTTP request, a database query, a cache lookup, a message publish. The root span represents the entry point, and child spans represent downstream operations.

[Root Span: GET /api/orders/12345] ─────────── 1,247ms

├── [auth-service: POST /validate] ── 23ms
├── [order-service: GET /orders/12345] ────── 1,180ms
│     ├── [PostgreSQL: SELECT * FROM orders] ── 12ms
│     ├── [inventory-service: GET /stock] ─── 890ms  ← BOTTLENECK
│     │     ├── [Redis: GET inventory:12345] ── 2ms (miss)
│     │     └── [PostgreSQL: SELECT ...] ── 875ms  ← ROOT CAUSE
│     └── [pricing-service: GET /calculate] ── 45ms
└── [notification-service: POST /email] ── 18ms

In this trace, the total request took 1,247ms. The trace waterfall immediately shows that inventory-service consumed 890ms, and within it, a database query took 875ms following a cache miss. Without the trace, you’d see a slow /api/orders endpoint in your metrics and have to investigate each service individually.

Key span attributes to examine during troubleshooting (see the full OpenTelemetry Semantic Conventions for reference):

| Attribute | What It Tells You |
|---|---|
| http.status_code | HTTP response status for service calls |
| db.statement | The actual SQL query executed |
| db.system | Which database (PostgreSQL, MySQL, Redis) |
| http.method + http.url | Which endpoint was called |
| otel.status_code = ERROR | Span completed with an error |
| exception.message | Error details if an exception occurred |
| net.peer.name | Which host the call went to |
| messaging.system | Message broker involved (Kafka, RabbitMQ) |
| Span duration | How long the operation took |
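
Auto-instrumentation populates most of these attributes for you. If you add manual spans, the sketch below shows roughly how they get attached using the OpenTelemetry Python API; the charge_card() helper and the URL are hypothetical placeholders, not part of any real service.

# Rough sketch: how the attributes above end up on a manually created span.
# charge_card() and the URL are hypothetical placeholders.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("payment-service")

def charge(order_id: str):
    with tracer.start_as_current_span("POST /charge") as span:
        span.set_attribute("http.method", "POST")
        span.set_attribute("http.url", f"https://payments.internal/charge/{order_id}")
        try:
            response = charge_card(order_id)  # hypothetical downstream call
            span.set_attribute("http.status_code", response.status_code)
            return response
        except Exception as exc:
            span.record_exception(exc)  # populates exception.type / exception.message
            span.set_status(Status(StatusCode.ERROR, str(exc)))  # otel.status_code = ERROR
            raise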

Diagnosing Latency Bottlenecks with Trace Waterfall Analysis

Latency issues are the most common reason teams reach for traces. The waterfall view transforms a vague “the API is slow” complaint into a precise diagnosis.

Pattern: The Slow Database Query

Symptoms in metrics: Elevated p95/p99 latency on a specific endpoint. Database CPU or connection usage may appear normal.

What the trace reveals:

[order-service: GET /orders] ────────── 2,340ms

├── [PostgreSQL: SELECT o.*, oi.* FROM orders o
│    JOIN order_items oi ON o.id = oi.order_id
│    WHERE o.customer_id = $1
│    ORDER BY o.created_at DESC] ──── 2,280ms  ← Problem
└── [Redis: SET order-cache:customer:789] ── 3ms

The trace shows a single database query consuming 97% of the request time. The db.statement attribute reveals the actual SQL, which is a full table scan joining orders with order items, likely missing an index on customer_id.

What to look for in spans:

  • db.statement: Check for missing WHERE clauses, full table scans, large JOINs, or unoptimized queries. Use EXPLAIN to confirm (see the sketch after this list).
  • Span duration vs. typical duration: Compare against baseline traces for the same operation
  • Sequential vs. parallel queries: Are queries running sequentially when they could be parallelized?
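
To confirm a suspect query captured in db.statement, a quick EXPLAIN check might look like the sketch below. It assumes psycopg2 and a read replica; the DSN, parameter value, and query are placeholders taken from the trace above.

# Hedged sketch: verify the query plan for a slow query seen in db.statement.
# The DSN and query are placeholders; run against a replica during an incident.
import psycopg2

suspect_query = """
    SELECT o.*, oi.* FROM orders o
    JOIN order_items oi ON o.id = oi.order_id
    WHERE o.customer_id = %s
    ORDER BY o.created_at DESC
"""

with psycopg2.connect("dbname=orders host=replica.internal") as conn:
    with conn.cursor() as cur:
        # EXPLAIN (ANALYZE, BUFFERS) executes the query and reports the real plan,
        # confirming whether the JOIN is doing a full table scan.
        cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + suspect_query, (789,))
        for (line,) in cur.fetchall():
            print(line)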

Pattern: Sequential Service Calls (Missed Parallelization)

Symptoms in metrics: High latency that seems disproportionate to what any single service reports.

What the trace reveals:

[api-gateway: GET /dashboard] ──────────── 1,850ms

├── [user-service: GET /profile] ── 320ms
├── [order-service: GET /recent] ─── 480ms    (starts after user-svc)
├── [notification-svc: GET /unread] ── 410ms  (starts after order-svc)
└── [recommendation-svc: GET /for-you] ── 590ms (starts after notif.)

The waterfall reveals that four independent service calls are executing sequentially. Total time is the sum of all calls (1,800ms) instead of the max (590ms), a 3x penalty. The trace makes this immediately visible because spans don’t overlap.

The fix: Refactor to concurrent calls. With parallelization, the trace collapses to ~620ms as all four spans overlap.
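
As a rough sketch of what the refactor might look like, assuming an async Python service using httpx (the service URLs are placeholders):

# Hedged sketch: issue the four independent dashboard calls concurrently.
# The internal service URLs are placeholders.
import asyncio
import httpx

async def load_dashboard(user_id: str) -> dict:
    async with httpx.AsyncClient(timeout=2.0) as client:
        profile, orders, unread, recs = await asyncio.gather(
            client.get(f"http://user-service/profile/{user_id}"),
            client.get(f"http://order-service/recent/{user_id}"),
            client.get(f"http://notification-svc/unread/{user_id}"),
            client.get(f"http://recommendation-svc/for-you/{user_id}"),
        )
        # Latency is now roughly the max of the four calls (~590ms in the trace
        # above) instead of their sum (~1,800ms); the spans will overlap.
        return {
            "profile": profile.json(),
            "orders": orders.json(),
            "unread": unread.json(),
            "recommendations": recs.json(),
        }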

Pattern: Fan-out Amplification

Symptoms in metrics: Latency increases with load, but individual service latencies look normal.

The trace reveals a product catalog page making 50 individual HTTP calls to the inventory service, one per product. Each call is fast (45–60ms), but the accumulated overhead of 50 sequential HTTP roundtrips adds up to over 3 seconds.

The fix: Replace individual calls with a batch API (GET /stock?skus=A001,A002,…,A050) or use a GraphQL-style query that returns all needed data in a single request.
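
A minimal client-side sketch of the batch approach, assuming the inventory service exposes the batch endpoint described above and returns a JSON list of {sku, available} objects (both are assumptions for illustration):

# Hedged sketch: one batch request instead of 50 per-product calls, so the
# trace shows a single inventory-service span. Endpoint and response shape
# are assumptions.
import httpx

def fetch_stock_batched(skus: list[str]) -> dict[str, int]:
    response = httpx.get(
        "http://inventory-service/stock",
        params={"skus": ",".join(skus)},
        timeout=2.0,
    )
    response.raise_for_status()
    return {item["sku"]: item["available"] for item in response.json()}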

Detecting N+1 Query Patterns in Traces

N+1 queries are one of the most common performance killers in microservices, and traces make them trivially easy to spot. The pattern appears as one initial query followed by N repetitive queries, and in the trace waterfall, it’s unmistakable.

Pattern: Classic ORM N+1

What the trace reveals:

[order-service: GET /orders] ─────────── 1,890ms

├── [PostgreSQL: SELECT * FROM orders WHERE status = 'active'
│    LIMIT 50] ── 15ms                              (1 query)
├── [PostgreSQL: SELECT * FROM customers WHERE id = 101] ── 8ms
├── [PostgreSQL: SELECT * FROM customers WHERE id = 102] ── 9ms
├── [PostgreSQL: SELECT * FROM customers WHERE id = 103] ── 7ms
│   ... (47 more identical-pattern queries)
└── [PostgreSQL: SELECT * FROM customers WHERE id = 150] ── 11ms

The trace shows 1 query to fetch orders + 50 individual queries to fetch each order’s customer. ORM lazy loading is the usual culprit. Each query is fast individually, but 51 database roundtrips add up to nearly 2 seconds.

How to spot N+1 patterns in your tracing tool:

  • High span count on a single trace: A trace with 50+ database spans for a simple endpoint is almost always an N+1
  • Repetitive db.statement patterns: Same query template with different parameter values
  • Low individual span duration but high total trace duration: Each query is fast, but there are too many

The fix: Replace lazy loading with eager loading (JOIN or IN clause):

-- Instead of 51 queries, use 1:
SELECT o.*, c.* FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'active' LIMIT 50
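
At the ORM level, the equivalent fix is eager loading. A minimal sketch assuming SQLAlchemy 2.x with hypothetical Order and Customer models and an Order.customer relationship:

# Hedged sketch: eager loading with SQLAlchemy; Order and Customer are
# assumed ORM models, not defined here.
from sqlalchemy import select
from sqlalchemy.orm import Session, joinedload

def active_orders_with_customers(session: Session):
    stmt = (
        select(Order)
        .options(joinedload(Order.customer))  # one JOIN instead of 50 extra queries
        .where(Order.status == "active")
        .limit(50)
    )
    return session.scalars(stmt).unique().all()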

Pattern: Service-Level N+1 (Microservice Fan-out)

The N+1 pattern isn’t limited to databases. It manifests across service boundaries too:

[checkout-service: POST /checkout] ───────── 4,100ms
├── [cart-service: GET /cart/items] ── 35ms
│    Response: [{productId: "P1"}, ..., {productId: "P20"}]
├── [product-service: GET /products/P1] ── 120ms
├── [product-service: GET /products/P2] ── 135ms
│   ... (18 more calls)
└── [product-service: GET /products/P20] ── 128ms

The checkout service fetches cart items, then calls the product service individually for each item. The fix: implement a batch endpoint (POST /products/batch accepting a list of IDs) or use request collapsing.

Diagnosing Timeout Cascades and Retry Storms

Timeout cascades are among the most dangerous failure modes in microservices. Patterns like the circuit breaker exist specifically to contain them. A single slow dependency can cause cascading failures across your entire system, and traces are the fastest way to understand the chain reaction.

Pattern: Timeout Cascade

Symptoms in metrics: Multiple services show elevated error rates simultaneously. Latency spikes propagate across services.

What the trace reveals:

[api-gateway: POST /orders] ──────────── 30,012ms (TIMEOUT)
└── [order-service: POST /create] ─────── 30,005ms (TIMEOUT)
      ├── [inventory-svc: POST /reserve] ──── 30,001ms (TIMEOUT)
      │     └── [PostgreSQL: UPDATE inventory ...] ── 30,000ms
      │           otel.status_code: ERROR
      │           exception.message: "Lock wait timeout exceeded"
      └── [payment-service: POST /charge] (NOT REACHED)

The trace reveals the cascade: a database lock timeout in inventory causes inventory to time out, which causes order-service to time out, which causes the gateway to time out. Without the trace, you’d see three services all timing out and might investigate the wrong one first.

Key diagnostic signals in timeout traces:

  • Span duration equals the configured timeout value exactly (e.g., 30,000ms), which confirms a timeout rather than slow processing
  • otel.status_code: ERROR with timeout-related exception messages
  • Child spans that were never started (like payment-service above), which confirms the timeout interrupted the flow
  • Multiple parent spans with identical durations, meaning each parent waited for the full timeout of its child

Pattern: Retry Storm

Symptoms in metrics: Sudden traffic spike to a downstream service. Error rates increase rather than decrease.

What the trace reveals:

[order-service: POST /create] ─────────── 12,450ms
├── [inventory-svc: POST /reserve] ── 5,001ms TIMEOUT
├── [inventory-svc: POST /reserve] ── 5,002ms TIMEOUT (retry 1)
├── [inventory-svc: POST /reserve] ── 2,410ms TIMEOUT (retry 2)
│     exception.message: "Connection pool exhausted"
└── Result: ERROR "Failed after 3 retries"

The trace shows the order service retrying the inventory call three times. With 100 concurrent requests all doing the same, the inventory service receives 300 requests instead of 100, a 3x amplification. The connection pool exhaustion on retry 2 confirms the retry storm is making things worse.

Multi-layer retry amplification: When multiple layers retry, the multiplication compounds:

Gateway (3 retries) → Order Service (3 retries) → Inventory

= 3 × 3 = 9 requests to inventory per user request

Troubleshooting Error Propagation Across Service Boundaries

When an error surfaces at the API boundary, the root cause often lies several services deep. Traces let you follow the error propagation chain backwards from symptom to cause.

Pattern: Hidden Error Origin

Symptoms: Users see “Internal Server Error” on the checkout page. Logs show 500 errors cascading through services.

What the trace reveals in a single view:

[api-gateway: POST /checkout] ─ 500 Internal Server Error
└── [checkout-service: POST /process] ─ 500
      ├── [cart-service: GET /cart] ─ 200 OK (45ms)
      └── [payment-service: POST /charge] ─ 500
            └── [fraud-service: POST /evaluate] ─ 500
                  └── [ML model: POST /predict] ─ 503
                        exception.message: "Model server OOM:
                        cannot allocate 2GB for inference batch"

The trace cuts through four levels of error wrapping and reveals the actual root cause: the ML model server ran out of memory. Without the trace, the on-call engineer would start by investigating the checkout service, then the payment service, before eventually reaching the fraud detection service, potentially losing 30+ minutes following the chain manually.

Pattern: Silent Error Swallowing

Sometimes errors don’t propagate. Instead, they get silently caught, and the system returns degraded results instead of errors:

[product-service: GET /product/123] ─ 200 OK (890ms)
├── [PostgreSQL: SELECT ...] ── 12ms ─ 200 OK
├── [review-service: GET /reviews] ── 5,001ms ─ TIMEOUT
│     otel.status_code: ERROR
├── [recommendation-svc: GET /similar] ── 5,002ms ─ TIMEOUT
│     otel.status_code: ERROR
└── [Redis: SET product-cache:123] ── 3ms

The product page returns 200 OK, but the trace reveals two child services timed out. Metrics show 200 OK and ~900ms latency. Only the trace reveals the degraded user experience.

To catch this pattern: Filter traces by spans with otel.status_code: ERROR even when the root span shows success.

Spotting Connection Pool Exhaustion

Connection pool exhaustion is subtle. It doesn’t always produce errors, but it silently adds latency to every request as threads wait for available connections.

Pattern: Pool Wait Time

What the trace reveals:

[order-service: GET /orders] ───────── 2,340ms
├── [PostgreSQL: SELECT ...] ── 15ms
├── [gap: 1,800ms]  ← No spans, just waiting
└── [PostgreSQL: SELECT ...] ── 12ms

The telltale sign is gaps between spans, periods where the service is doing nothing visible. The 1,800ms gap between the first and second database query indicates the thread was waiting for a connection from the pool.

Diagnostic approach: Look for consistent gaps in trace waterfalls that don’t correspond to any span. When you see this pattern across multiple traces for the same service, check connection pool metrics (active connections, wait queue depth, pool size). The trace points you to the exact service experiencing pool pressure, and metrics confirm the diagnosis.
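
One way to turn that invisible gap into an explicit span is to instrument connection checkout itself. A sketch assuming SQLAlchemy's default QueuePool; the engine URL and query are placeholders:

# Hedged sketch: record pool wait as its own span. Engine URL and query
# are placeholders.
from opentelemetry import trace
from sqlalchemy import create_engine, text

tracer = trace.get_tracer("order-service")
engine = create_engine(
    "postgresql+psycopg2://orders@db.internal/orders",
    pool_size=20,
    max_overflow=0,
    pool_timeout=30,  # how long a request may wait for a free connection
)

def fetch_orders(customer_id: int):
    # When the pool is saturated, this span's duration captures the wait
    # that otherwise appears only as an unexplained gap between DB spans.
    with tracer.start_as_current_span("db.connection.checkout"):
        conn = engine.connect()
    try:
        with tracer.start_as_current_span("db.query") as span:
            span.set_attribute("db.statement", "SELECT * FROM orders WHERE customer_id = :cid")
            return conn.execute(
                text("SELECT * FROM orders WHERE customer_id = :cid"),
                {"cid": customer_id},
            ).fetchall()
    finally:
        conn.close()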

Diagnosing Cache Effectiveness Issues

Caches are supposed to reduce latency, but misconfigured caches can make things worse. Traces reveal cache behavior that’s invisible in aggregate metrics.

Pattern: Cache Miss Cascade

[product-service: GET /product/456] ─────── 1,250ms
├── [Redis: GET product:456] ── 1ms (MISS)
├── [PostgreSQL: SELECT * FROM products ...] ── 85ms
├── [Redis: GET product:456:reviews] ── 1ms (MISS)
├── [review-service: GET /reviews] ── 890ms
│     ├── [PostgreSQL: SELECT ...reviews...] ── 45ms
│     └── [PostgreSQL: SELECT ...users...] ── 830ms  ← Slow join
├── [Redis: SET product:456] ── 2ms
└── [Redis: SET product:456:reviews] ── 1ms

The trace shows that both cache lookups missed, forcing expensive database queries and service calls. The review service’s slow user join (830ms) is the real latency contributor, normally hidden behind a cache hit.

To monitor cache effectiveness with traces: Add custom span attributes for cache hit/miss status. Then in your tracing tool, filter and group by this attribute to see miss rates per operation, not just aggregate miss rates.

# Python example: Adding cache status to spans
from opentelemetry import trace

tracer = trace.get_tracer("cache-instrumentation")

def get_from_cache(key):
    with tracer.start_as_current_span("cache.lookup") as span:
        span.set_attribute("cache.key", key)
        result = redis_client.get(key)  # redis_client: your already-configured Redis client
        span.set_attribute("cache.hit", result is not None)
        return result

Pattern: Cache Stampede

When a popular cache key expires, many concurrent requests simultaneously miss the cache and hit the database, a problem known as cache stampede. Looking at multiple traces for the same endpoint around the same timestamp reveals the stampede: each trace shows a cache miss, and database query durations increase progressively as the database becomes overloaded. All traces set the same cache key, resulting in redundant writes.

Troubleshooting Message Queue Issues

Asynchronous messaging adds complexity to troubleshooting because the producer and consumer execute at different times. OpenTelemetry’s context propagation via W3C Trace Context headers connects these spans into a single trace.
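
Messaging instrumentation libraries normally propagate this context for you; if you need to do it by hand, the producer and consumer sides might look like the sketch below. broker_publish() and process_order() are hypothetical placeholders.

# Hedged sketch: manual W3C Trace Context propagation through message headers.
# broker_publish() and process_order() are placeholders.
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("order-service")

def publish_order(order: dict):
    with tracer.start_as_current_span("orders-topic publish", kind=SpanKind.PRODUCER):
        headers: dict[str, str] = {}
        inject(headers)  # writes the traceparent header into the carrier
        broker_publish("orders-topic", order, headers)  # hypothetical producer call

def handle_message(payload: dict, headers: dict[str, str]):
    ctx = extract(headers)  # rebuilds the producer's trace context
    with tracer.start_as_current_span(
        "orders-topic process", context=ctx, kind=SpanKind.CONSUMER
    ):
        process_order(payload)  # hypothetical business logic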

Pattern: Consumer Lag

[order-service: POST /orders] ─ (publishes to Kafka)

├── [Kafka: produce to orders-topic] ── 5ms
│     messaging.kafka.partition: 3
│     messaging.kafka.offset: 1847293
│
│  ~~~ 45,000ms gap (consumer lag) ~~~
│
└── [fulfillment-svc: consume from orders-topic] ── 120ms
      └── [PostgreSQL: INSERT INTO fulfillment_queue] ── 8ms

The trace links the producer span (order-service) to the consumer span (fulfillment-service) through propagated context. The 45-second gap between produce and consume timestamps reveals consumer lag. The consumer itself processes quickly (120ms), so the problem is in Kafka consumer group throughput, not processing logic.

Pattern: Poison Message / Dead Letter

[order-service: produce to orders-topic] ── 3ms

→ [fulfillment-svc: consume attempt 1] ── 15ms ── ERROR
│    exception.message: "Invalid product SKU format: null"
→ [fulfillment-svc: consume attempt 2] ── 12ms ── ERROR
→ [fulfillment-svc: consume attempt 3] ── 14ms ── ERROR
→ [dead-letter-queue: produce to orders-dlq] ── 4ms 

The trace shows a message being consumed, failing, retried twice, and finally sent to the dead letter queue. The exception message reveals the root cause: a null product SKU, likely a producer-side validation issue.

Using Trace-Based Alerting for Proactive Troubleshooting

Reactive troubleshooting (waiting for users to complain) isn’t good enough. Modern tracing tools support alerting on trace-derived signals that catch issues before they impact users.

Alert on RED Metrics Derived from Traces

| Alert | Condition | What It Catches |
|---|---|---|
| Error rate spike | Error rate > 5% for 5 minutes | Failed deployments, dependency outages |
| Latency degradation | p95 latency > 2x baseline for 10 min | Slow queries, missing indexes, cache failures |
| Throughput drop | Request rate < 50% of expected for 5 min | Upstream routing issues, DNS failures |
| Error rate by operation | Any operation error rate > 10% | Targeted failures in specific endpoints |

Trace-Specific Alerts

Beyond RED metrics, some conditions are only visible through trace analysis:

  • Span count anomaly: Alert when average spans-per-trace exceeds a threshold, catching N+1 regressions after deployments
  • New error types: Alert when exception.type values appear that haven’t been seen in the last 7 days
  • Missing service in trace: Alert when an expected service stops appearing in traces for a critical flow

Building a Troubleshooting Workflow with Sematext Tracing

Sematext Tracing provides the trace analysis capabilities needed to apply all the patterns described above. Here’s how to build an effective troubleshooting workflow.

Step 1: Start with the Service Overview

The Tracing Overview dashboard provides RED metrics (Rate, Error, Duration) across all instrumented services. This is your starting point: identify which service has elevated error rates or latency, and in which time window the problem started.

Step 2: Drill into the Trace Explorer

Use the Trace Explorer to filter traces by the affected service, time window, and error status. Sort by duration to find the slowest traces, or filter by otel.status_code: ERROR to find failures.

Key filters for troubleshooting:

  • By service name: Isolate traces involving a specific service
  • By minimum duration: Find traces exceeding your latency SLO
  • By status: Filter for error traces only
  • By operation: Focus on a specific endpoint or database operation
  • By custom attributes: Filter by customer ID, order ID, or other business context

Step 3: Analyze the Trace Waterfall

Open the Trace Details view for a representative trace. The waterfall visualization shows the complete request flow with timing for each span. Look for the patterns described in this guide: long spans, gaps between spans, high span counts, and error spans.

Step 4: Set Up Alerts

Configure alerts on the RED metrics derived from your traces. Start with error rate and p95 latency alerts for your most critical services and endpoints, then expand to more specific alerts as you learn your system’s failure patterns.

Troubleshooting Checklist for Production Incidents

When an incident hits, use this trace-based workflow to minimize time-to-resolution:

  1. Identify the scope: Check the service overview. Is the issue isolated to one service or affecting multiple? Are error rates or latency elevated?
  2. Find representative traces: Use the trace explorer to filter for affected traces. Sort by duration for latency issues, filter by error status for failures.
  3. Read the waterfall: Open 3–5 representative traces. Look for: the longest span (bottleneck), error spans (root cause), gaps between spans (pool exhaustion), high span counts (N+1 patterns), and missing expected spans (service unreachable).
  4. Check span attributes: Examine db.statement for bad queries, http.status_code for upstream failures, exception.message for error details, and custom attributes for business context.
  5. Correlate with other signals: Jump to logs for detailed error messages and stack traces. Check infrastructure metrics for resource exhaustion. Look at deployment events for recent changes.
  6. Verify the fix: After applying a fix, compare new traces against the problematic ones. Confirm the bottleneck span duration decreased, error spans disappeared, or the N+1 pattern resolved.

Summary

Distributed tracing transforms microservices troubleshooting from guesswork into systematic diagnosis. The patterns covered in this guide, including latency bottlenecks, N+1 queries, timeout cascades, retry storms, error propagation, connection pool exhaustion, cache failures, and message queue issues, account for the vast majority of production incidents in distributed systems.

The key is developing pattern recognition: learn what healthy traces look like for your critical flows, and the unhealthy patterns will stand out immediately during incidents. OpenTelemetry auto-instrumentation provides the data foundation, and a capable tracing backend like Sematext Tracing gives you the analysis tools to turn that data into fast resolution.

Next steps:

  • How to Implement Distributed Tracing in Microservices with OpenTelemetry Auto-Instrumentation
  • OpenTelemetry Instrumentation Best Practices for Microservices Observability
  • Top 12 Distributed Tracing Tools in 2026: Complete Comparison & Reviews