
Troubleshooting Microservices with OpenTelemetry Distributed Tracing

Updated on: February 15, 2026

Distributed tracing doesn’t just show you what happened. It shows you why things broke. While logs tell you a service returned a 500 error and metrics show latency spiked, only traces reveal the full chain of causation: the upstream timeout that triggered a retry storm, the N+1 query pattern that saturated your connection pool, or the missing cache hit that turned a 50ms call into a 3-second database roundtrip.

This guide covers practical, trace-based troubleshooting patterns for production microservices. You’ll learn how to use OpenTelemetry distributed traces to diagnose the most common, and most frustrating, problems that surface in distributed architectures.

What you’ll learn:

  • How to identify latency bottlenecks using trace waterfall analysis
  • Detecting N+1 query patterns and database performance issues in traces
  • Diagnosing retry storms, timeout cascades, and circuit breaker failures
  • Using error propagation traces to find root causes across service boundaries
  • Spotting connection pool exhaustion, cache misses, and queue backlogs
  • Correlating traces with logs and metrics for full-context debugging

For step-by-step instrumentation setup, see our companion guide: How to Implement Distributed Tracing in Microservices with OpenTelemetry Auto-Instrumentation. For production-hardening your instrumentation, see OpenTelemetry Instrumentation Best Practices for Microservices Observability.

Why Traces Are the Best Tool for Microservices Troubleshooting

Logs, metrics, and traces each serve a different purpose. But when a production incident hits a distributed system, traces are uniquely positioned to answer the hardest questions, especially those that span service boundaries.

| Troubleshooting Question | Logs | Metrics | Traces |
|---|---|---|---|
| Which service is slow? | ❌ Scattered across services | ✅ Latency dashboards | ✅ Waterfall shows exact span |
| Why is it slow? | 🟡 If you logged enough context | ❌ No causal detail | ✅ Child spans reveal cause |
| Which upstream call caused the error? | ❌ Requires correlation IDs | ❌ Only shows error rate | ✅ Error propagation is visible |
| Is it a single request or systemic? | ❌ Hard to aggregate | ✅ Rate/error trends | ✅ Trace grouping by pattern |
| What was the exact sequence of calls? | ❌ Requires reconstruction | ❌ No ordering info | ✅ Waterfall shows call graph |

The key insight is that traces give you causation, not just correlation. When service A calls service B, which calls service C, and C fails, a trace shows you the entire chain, the timing of each call, and exactly where things went wrong.

Anatomy of a Troubleshooting Trace

Before diving into specific patterns, let’s establish what you’re looking at in a trace waterfall. Understanding the structure makes pattern recognition faster during incidents.

A distributed trace consists of spans organized in a parent-child hierarchy, as defined by the OpenTelemetry Trace specification. Each span represents a single operation: an HTTP request, a database query, a cache lookup, a message publish. The root span represents the entry point, and child spans represent downstream operations.

[Root Span: GET /api/orders/12345] ─────────── 1,247ms

├── [auth-service: POST /validate] ── 23ms
├── [order-service: GET /orders/12345] ────── 1,180ms
│     ├── [PostgreSQL: SELECT * FROM orders] ── 12ms
│     ├── [inventory-service: GET /stock] ─── 890ms  ← BOTTLENECK
│     │     ├── [Redis: GET inventory:12345] ── 2ms (miss)
│     │     └── [PostgreSQL: SELECT ...] ── 875ms  ← ROOT CAUSE
│     └── [pricing-service: GET /calculate] ── 45ms
└── [notification-service: POST /email] ── 18ms

In this trace, the total request took 1,247ms. The trace waterfall immediately shows that inventory-service consumed 890ms, and within it, a database query took 875ms following a cache miss. Without the trace, you’d see a slow /api/orders endpoint in your metrics and have to investigate each service individually.

Key span attributes to examine during troubleshooting (see the full OpenTelemetry Semantic Conventions for reference):

| Attribute | What It Tells You |
|---|---|
| http.status_code | HTTP response status for service calls |
| db.statement | The actual SQL query executed |
| db.system | Which database (PostgreSQL, MySQL, Redis) |
| http.method + http.url | Which endpoint was called |
| otel.status_code = ERROR | Span completed with an error |
| exception.message | Error details if an exception occurred |
| net.peer.name | Which host the call went to |
| messaging.system | Message broker involved (Kafka, RabbitMQ) |
| Span duration | How long the operation took |
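
Auto-instrumentation populates most of these attributes for you. If you add manual spans, the sketch below shows roughly how they get attached using the OpenTelemetry Python API; the charge_card() helper and the URL are hypothetical placeholders, not part of any real service.

# Rough sketch: how the attributes above end up on a manually created span.
# charge_card() and the URL are hypothetical placeholders.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("payment-service")

def charge(order_id: str):
    with tracer.start_as_current_span("POST /charge") as span:
        span.set_attribute("http.method", "POST")
        span.set_attribute("http.url", f"https://payments.internal/charge/{order_id}")
        try:
            response = charge_card(order_id)  # hypothetical downstream call
            span.set_attribute("http.status_code", response.status_code)
            return response
        except Exception as exc:
            span.record_exception(exc)  # populates exception.type / exception.message
            span.set_status(Status(StatusCode.ERROR, str(exc)))  # otel.status_code = ERROR
            raise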

Diagnosing Latency Bottlenecks with Trace Waterfall Analysis

Latency issues are the most common reason teams reach for traces. The waterfall view transforms a vague “the API is slow” complaint into a precise diagnosis.

Pattern: The Slow Database Query

Symptoms in metrics: Elevated p95/p99 latency on a specific endpoint. Database CPU or connection usage may appear normal.

What the trace reveals:

[order-service: GET /orders] ────────── 2,340ms

├── [PostgreSQL: SELECT o.*, oi.* FROM orders o
│    JOIN order_items oi ON o.id = oi.order_id
│    WHERE o.customer_id = $1
│    ORDER BY o.created_at DESC] ──── 2,280ms  ← Problem
└── [Redis: SET order-cache:customer:789] ── 3ms

The trace shows a single database query consuming 97% of the request time. The db.statement attribute reveals the actual SQL, which is a full table scan joining orders with order items, likely missing an index on customer_id.

What to look for in spans:

  • db.statement: Check for missing WHERE clauses, full table scans, large JOINs, or unoptimized queries. Use EXPLAIN to confirm (see the sketch after this list).
  • Span duration vs. typical duration: Compare against baseline traces for the same operation
  • Sequential vs. parallel queries: Are queries running sequentially when they could be parallelized?
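
To confirm a suspect query captured in db.statement, a quick EXPLAIN check might look like the sketch below. It assumes psycopg2 and a read replica; the DSN, parameter value, and query are placeholders taken from the trace above.

# Hedged sketch: verify the query plan for a slow query seen in db.statement.
# The DSN and query are placeholders; run against a replica during an incident.
import psycopg2

suspect_query = """
    SELECT o.*, oi.* FROM orders o
    JOIN order_items oi ON o.id = oi.order_id
    WHERE o.customer_id = %s
    ORDER BY o.created_at DESC
"""

with psycopg2.connect("dbname=orders host=replica.internal") as conn:
    with conn.cursor() as cur:
        # EXPLAIN (ANALYZE, BUFFERS) executes the query and reports the real plan,
        # confirming whether the JOIN is doing a full table scan.
        cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + suspect_query, (789,))
        for (line,) in cur.fetchall():
            print(line)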

Pattern: Sequential Service Calls (Missed Parallelization)

Symptoms in metrics: High latency that seems disproportionate to what any single service reports.

What the trace reveals:

[api-gateway: GET /dashboard] ──────────── 1,850ms

├── [user-service: GET /profile] ── 320ms
├── [order-service: GET /recent] ─── 480ms    (starts after user-svc)
├── [notification-svc: GET /unread] ── 410ms  (starts after order-svc)
└── [recommendation-svc: GET /for-you] ── 590ms (starts after notif.)

The waterfall reveals that four independent service calls are executing sequentially. Total time is the sum of all calls (1,800ms) instead of the max (590ms), a 3x penalty. The trace makes this immediately visible because spans don’t overlap.

The fix: Refactor to concurrent calls. With parallelization, the trace collapses to ~620ms as all four spans overlap.
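
As a rough sketch of what the refactor might look like, assuming an async Python service using httpx (the service URLs are placeholders):

# Hedged sketch: issue the four independent dashboard calls concurrently.
# The internal service URLs are placeholders.
import asyncio
import httpx

async def load_dashboard(user_id: str) -> dict:
    async with httpx.AsyncClient(timeout=2.0) as client:
        profile, orders, unread, recs = await asyncio.gather(
            client.get(f"http://user-service/profile/{user_id}"),
            client.get(f"http://order-service/recent/{user_id}"),
            client.get(f"http://notification-svc/unread/{user_id}"),
            client.get(f"http://recommendation-svc/for-you/{user_id}"),
        )
        # Latency is now roughly the max of the four calls (~590ms in the trace
        # above) instead of their sum (~1,800ms); the spans will overlap.
        return {
            "profile": profile.json(),
            "orders": orders.json(),
            "unread": unread.json(),
            "recommendations": recs.json(),
        }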

Pattern: Fan-out Amplification

Symptoms in metrics: Latency increases with load, but individual service latencies look normal.

The trace reveals a product catalog page making 50 individual HTTP calls to the inventory service, one per product. Each call is fast (45–60ms), but the accumulated overhead of 50 sequential HTTP roundtrips adds up to over 3 seconds.

The fix: Replace individual calls with a batch API (GET /stock?skus=A001,A002,…,A050) or use a GraphQL-style query that returns all needed data in a single request.
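
A minimal client-side sketch of the batch approach, assuming the inventory service exposes the batch endpoint described above and returns a JSON list of {sku, available} objects (both are assumptions for illustration):

# Hedged sketch: one batch request instead of 50 per-product calls, so the
# trace shows a single inventory-service span. Endpoint and response shape
# are assumptions.
import httpx

def fetch_stock_batched(skus: list[str]) -> dict[str, int]:
    response = httpx.get(
        "http://inventory-service/stock",
        params={"skus": ",".join(skus)},
        timeout=2.0,
    )
    response.raise_for_status()
    return {item["sku"]: item["available"] for item in response.json()}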

Detecting N+1 Query Patterns in Traces

N+1 queries are one of the most common performance killers in microservices, and traces make them trivially easy to spot. The pattern appears as one initial query followed by N repetitive queries, and in the trace waterfall, it’s unmistakable.

Pattern: Classic ORM N+1

What the trace reveals:

[order-service: GET /orders] ─────────── 1,890ms

├── [PostgreSQL: SELECT * FROM orders WHERE status = 'active'
│    LIMIT 50] ── 15ms                              (1 query)
├── [PostgreSQL: SELECT * FROM customers WHERE id = 101] ── 8ms
├── [PostgreSQL: SELECT * FROM customers WHERE id = 102] ── 9ms
├── [PostgreSQL: SELECT * FROM customers WHERE id = 103] ── 7ms
│   ... (47 more identical-pattern queries)
└── [PostgreSQL: SELECT * FROM customers WHERE id = 150] ── 11ms

The trace shows 1 query to fetch orders + 50 individual queries to fetch each order’s customer. ORM lazy loading is the usual culprit. Each query is fast individually, but 51 database roundtrips add up to nearly 2 seconds.

How to spot N+1 patterns in your tracing tool:

  • High span count on a single trace: A trace with 50+ database spans for a simple endpoint is almost always an N+1
  • Repetitive db.statement patterns: Same query template with different parameter values
  • Low individual span duration but high total trace duration: Each query is fast, but there are too many

The fix: Replace lazy loading with eager loading (JOIN or IN clause):

-- Instead of 51 queries, use 1:
SELECT o.*, c.* FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'active' LIMIT 50
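
At the ORM level, the equivalent fix is eager loading. A minimal sketch assuming SQLAlchemy 2.x with hypothetical Order and Customer models and an Order.customer relationship:

# Hedged sketch: eager loading with SQLAlchemy; Order and Customer are
# assumed ORM models, not defined here.
from sqlalchemy import select
from sqlalchemy.orm import Session, joinedload

def active_orders_with_customers(session: Session):
    stmt = (
        select(Order)
        .options(joinedload(Order.customer))  # one JOIN instead of 50 extra queries
        .where(Order.status == "active")
        .limit(50)
    )
    return session.scalars(stmt).unique().all()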

Pattern: Service-Level N+1 (Microservice Fan-out)

The N+1 pattern isn’t limited to databases. It manifests across service boundaries too:

[checkout-service: POST /checkout] ───────── 4,100ms
├── [cart-service: GET /cart/items] ── 35ms
│    Response: [{productId: "P1"}, ..., {productId: "P20"}]
├── [product-service: GET /products/P1] ── 120ms
├── [product-service: GET /products/P2] ── 135ms
│   ... (18 more calls)
└── [product-service: GET /products/P20] ── 128ms

The checkout service fetches cart items, then calls the product service individually for each item. The fix: implement a batch endpoint (POST /products/batch accepting a list of IDs) or use request collapsing.

Diagnosing Timeout Cascades and Retry Storms

Timeout cascades are among the most dangerous failure modes in microservices. Patterns like the circuit breaker exist specifically to contain them. A single slow dependency can cause cascading failures across your entire system, and traces are the fastest way to understand the chain reaction.

Pattern: Timeout Cascade

Symptoms in metrics: Multiple services show elevated error rates simultaneously. Latency spikes propagate across services.

What the trace reveals:

[api-gateway: POST /orders] ──────────── 30,012ms (TIMEOUT)
└── [order-service: POST /create] ─────── 30,005ms (TIMEOUT)
      ├── [inventory-svc: POST /reserve] ──── 30,001ms (TIMEOUT)
      │     └── [PostgreSQL: UPDATE inventory ...] ── 30,000ms
      │           otel.status_code: ERROR
      │           exception.message: "Lock wait timeout exceeded"
      └── [payment-service: POST /charge] (NOT REACHED)

The trace reveals the cascade: a database lock timeout in inventory causes inventory to time out, which causes order-service to time out, which causes the gateway to time out. Without the trace, you’d see three services all timing out and might investigate the wrong one first.

Key diagnostic signals in timeout traces:

  • Span duration equals the configured timeout value exactly (e.g., 30,000ms), which confirms a timeout rather than slow processing
  • otel.status_code: ERROR with timeout-related exception messages
  • Child spans that were never started (like payment-service above), which confirms the timeout interrupted the flow
  • Multiple parent spans with identical durations, meaning each parent waited for the full timeout of its child

Pattern: Retry Storm

Symptoms in metrics: Sudden traffic spike to a downstream service. Error rates increase rather than decrease.

What the trace reveals:

[order-service: POST /create] ─────────── 12,450ms
├── [inventory-svc: POST /reserve] ── 5,001ms TIMEOUT
├── [inventory-svc: POST /reserve] ── 5,002ms TIMEOUT (retry 1)
├── [inventory-svc: POST /reserve] ── 2,410ms TIMEOUT (retry 2)
│     exception.message: "Connection pool exhausted"
└── Result: ERROR "Failed after 3 retries"

The trace shows the order service retrying the inventory call three times. With 100 concurrent requests all doing the same, the inventory service receives 300 requests instead of 100, a 3x amplification. The connection pool exhaustion on retry 2 confirms the retry storm is making things worse.

Multi-layer retry amplification: When multiple layers retry, the multiplication compounds:

Gateway (3 retries) → Order Service (3 retries) → Inventory

= 3 × 3 = 9 requests to inventory per user request

Troubleshooting Error Propagation Across Service Boundaries

When an error surfaces at the API boundary, the root cause often lies several services deep. Traces let you follow the error propagation chain backwards from symptom to cause.

Pattern: Hidden Error Origin

Symptoms: Users see “Internal Server Error” on the checkout page. Logs show 500 errors cascading through services.

What the trace reveals in a single view:

[api-gateway: POST /checkout] ─ 500 Internal Server Error
└── [checkout-service: POST /process] ─ 500
      ├── [cart-service: GET /cart] ─ 200 OK (45ms)
      └── [payment-service: POST /charge] ─ 500
            └── [fraud-service: POST /evaluate] ─ 500
                  └── [ML model: POST /predict] ─ 503
                        exception.message: "Model server OOM:
                        cannot allocate 2GB for inference batch"

The trace cuts through four levels of error wrapping and reveals the actual root cause: the ML model server ran out of memory. Without the trace, the on-call engineer would start by investigating the checkout service, then the payment service, before eventually reaching the fraud detection service, potentially losing 30+ minutes following the chain manually.

Pattern: Silent Error Swallowing

Sometimes errors don’t propagate. Instead, they get silently caught, and the system returns degraded results instead of errors:

[product-service: GET /product/123] ─ 200 OK (890ms)
├── [PostgreSQL: SELECT ...] ── 12ms ─ 200 OK
├── [review-service: GET /reviews] ── 5,001ms ─ TIMEOUT
│     otel.status_code: ERROR
├── [recommendation-svc: GET /similar] ── 5,002ms ─ TIMEOUT
│     otel.status_code: ERROR
└── [Redis: SET product-cache:123] ── 3ms

The product page returns 200 OK, but the trace reveals two child services timed out. Metrics show 200 OK and ~900ms latency. Only the trace reveals the degraded user experience.

To catch this pattern: Filter traces by spans with otel.status_code: ERROR even when the root span shows success.

Spotting Connection Pool Exhaustion

Connection pool exhaustion is subtle. It doesn’t always produce errors, but it silently adds latency to every request as threads wait for available connections.

Pattern: Pool Wait Time

What the trace reveals:

[order-service: GET /orders] ───────── 2,340ms
├── [PostgreSQL: SELECT ...] ── 15ms
├── [gap: 1,800ms]  ← No spans, just waiting
└── [PostgreSQL: SELECT ...] ── 12ms

The telltale sign is gaps between spans, periods where the service is doing nothing visible. The 1,800ms gap between the first and second database query indicates the thread was waiting for a connection from the pool.

Diagnostic approach: Look for consistent gaps in trace waterfalls that don’t correspond to any span. When you see this pattern across multiple traces for the same service, check connection pool metrics (active connections, wait queue depth, pool size). The trace points you to the exact service experiencing pool pressure, and metrics confirm the diagnosis.
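
One way to turn that invisible gap into an explicit span is to instrument connection checkout itself. A sketch assuming SQLAlchemy's default QueuePool; the engine URL and query are placeholders:

# Hedged sketch: record pool wait as its own span. Engine URL and query
# are placeholders.
from opentelemetry import trace
from sqlalchemy import create_engine, text

tracer = trace.get_tracer("order-service")
engine = create_engine(
    "postgresql+psycopg2://orders@db.internal/orders",
    pool_size=20,
    max_overflow=0,
    pool_timeout=30,  # how long a request may wait for a free connection
)

def fetch_orders(customer_id: int):
    # When the pool is saturated, this span's duration captures the wait
    # that otherwise appears only as an unexplained gap between DB spans.
    with tracer.start_as_current_span("db.connection.checkout"):
        conn = engine.connect()
    try:
        with tracer.start_as_current_span("db.query") as span:
            span.set_attribute("db.statement", "SELECT * FROM orders WHERE customer_id = :cid")
            return conn.execute(
                text("SELECT * FROM orders WHERE customer_id = :cid"),
                {"cid": customer_id},
            ).fetchall()
    finally:
        conn.close()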

Diagnosing Cache Effectiveness Issues

Caches are supposed to reduce latency, but misconfigured caches can make things worse. Traces reveal cache behavior that’s invisible in aggregate metrics.

Pattern: Cache Miss Cascade

[product-service: GET /product/456] ─────── 1,250ms
├── [Redis: GET product:456] ── 1ms (MISS)
├── [PostgreSQL: SELECT * FROM products ...] ── 85ms
├── [Redis: GET product:456:reviews] ── 1ms (MISS)
├── [review-service: GET /reviews] ── 890ms
│     ├── [PostgreSQL: SELECT ...reviews...] ── 45ms
│     └── [PostgreSQL: SELECT ...users...] ── 830ms  ← Slow join
├── [Redis: SET product:456] ── 2ms
└── [Redis: SET product:456:reviews] ── 1ms

The trace shows that both cache lookups missed, forcing expensive database queries and service calls. The review service’s slow user join (830ms) is the real latency contributor, normally hidden behind a cache hit.

To monitor cache effectiveness with traces: Add custom span attributes for cache hit/miss status. Then in your tracing tool, filter and group by this attribute to see miss rates per operation, not just aggregate miss rates.

# Python example: Adding cache status to spans
from opentelemetry import trace

tracer = trace.get_tracer("cache-instrumentation")

def get_from_cache(key):
    with tracer.start_as_current_span("cache.lookup") as span:
        span.set_attribute("cache.key", key)
        result = redis_client.get(key)  # redis_client: your already-configured Redis client
        span.set_attribute("cache.hit", result is not None)
        return result

Pattern: Cache Stampede

When a popular cache key expires, many concurrent requests simultaneously miss the cache and hit the database, a problem known as cache stampede. Looking at multiple traces for the same endpoint around the same timestamp reveals the stampede: each trace shows a cache miss, and database query durations increase progressively as the database becomes overloaded. All traces set the same cache key, resulting in redundant writes.

Troubleshooting Message Queue Issues

Asynchronous messaging adds complexity to troubleshooting because the producer and consumer execute at different times. OpenTelemetry’s context propagation via W3C Trace Context headers connects these spans into a single trace.
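
Messaging instrumentation libraries normally propagate this context for you; if you need to do it by hand, the producer and consumer sides might look like the sketch below. broker_publish() and process_order() are hypothetical placeholders.

# Hedged sketch: manual W3C Trace Context propagation through message headers.
# broker_publish() and process_order() are placeholders.
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("order-service")

def publish_order(order: dict):
    with tracer.start_as_current_span("orders-topic publish", kind=SpanKind.PRODUCER):
        headers: dict[str, str] = {}
        inject(headers)  # writes the traceparent header into the carrier
        broker_publish("orders-topic", order, headers)  # hypothetical producer call

def handle_message(payload: dict, headers: dict[str, str]):
    ctx = extract(headers)  # rebuilds the producer's trace context
    with tracer.start_as_current_span(
        "orders-topic process", context=ctx, kind=SpanKind.CONSUMER
    ):
        process_order(payload)  # hypothetical business logic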

Pattern: Consumer Lag

[order-service: POST /orders] ─ (publishes to Kafka)

├── [Kafka: produce to orders-topic] ── 5ms
│     messaging.kafka.partition: 3
│     messaging.kafka.offset: 1847293
│
│  ~~~ 45,000ms gap (consumer lag) ~~~
│
└── [fulfillment-svc: consume from orders-topic] ── 120ms
      └── [PostgreSQL: INSERT INTO fulfillment_queue] ── 8ms

The trace links the producer span (order-service) to the consumer span (fulfillment-service) through propagated context. The 45-second gap between produce and consume timestamps reveals consumer lag. The consumer itself processes quickly (120ms), so the problem is in Kafka consumer group throughput, not processing logic.

Pattern: Poison Message / Dead Letter

[order-service: produce to orders-topic] ── 3ms

→ [fulfillment-svc: consume attempt 1] ── 15ms ── ERROR
│    exception.message: "Invalid product SKU format: null"
→ [fulfillment-svc: consume attempt 2] ── 12ms ── ERROR
→ [fulfillment-svc: consume attempt 3] ── 14ms ── ERROR
→ [dead-letter-queue: produce to orders-dlq] ── 4ms 

The trace shows a message being consumed, failing, retried twice, and finally sent to the dead letter queue. The exception message reveals the root cause: a null product SKU, likely a producer-side validation issue.

Using Trace-Based Alerting for Proactive Troubleshooting

Reactive troubleshooting (waiting for users to complain) isn’t good enough. Modern tracing tools support alerting on trace-derived signals that catch issues before they impact users.

Alert on RED Metrics Derived from Traces

| Alert | Condition | What It Catches |
|---|---|---|
| Error rate spike | Error rate > 5% for 5 minutes | Failed deployments, dependency outages |
| Latency degradation | p95 latency > 2x baseline for 10 min | Slow queries, missing indexes, cache failures |
| Throughput drop | Request rate < 50% of expected for 5 min | Upstream routing issues, DNS failures |
| Error rate by operation | Any operation error rate > 10% | Targeted failures in specific endpoints |

Trace-Specific Alerts

Beyond RED metrics, some conditions are only visible through trace analysis:

  • Span count anomaly: Alert when average spans-per-trace exceeds a threshold, catching N+1 regressions after deployments
  • New error types: Alert when exception.type values appear that haven’t been seen in the last 7 days
  • Missing service in trace: Alert when an expected service stops appearing in traces for a critical flow

Building a Troubleshooting Workflow with Sematext Tracing

Sematext Tracing provides the trace analysis capabilities needed to apply all the patterns described above. Here’s how to build an effective troubleshooting workflow.

Step 1: Start with the Service Overview

The Tracing Overview dashboard provides RED metrics (Rate, Error, Duration) across all instrumented services. This is your starting point: identify which service has elevated error rates or latency, and in which time window the problem started.

Step 2: Drill into the Trace Explorer

Use the Trace Explorer to filter traces by the affected service, time window, and error status. Sort by duration to find the slowest traces, or filter by otel.status_code: ERROR to find failures.

Key filters for troubleshooting:

  • By service name: Isolate traces involving a specific service
  • By minimum duration: Find traces exceeding your latency SLO
  • By status: Filter for error traces only
  • By operation: Focus on a specific endpoint or database operation
  • By custom attributes: Filter by customer ID, order ID, or other business context

Step 3: Analyze the Trace Waterfall

Open the Trace Details view for a representative trace. The waterfall visualization shows the complete request flow with timing for each span. Look for the patterns described in this guide: long spans, gaps between spans, high span counts, and error spans.

Step 4: Set Up Alerts

Configure alerts on the RED metrics derived from your traces. Start with error rate and p95 latency alerts for your most critical services and endpoints, then expand to more specific alerts as you learn your system’s failure patterns.

Troubleshooting Checklist for Production Incidents

When an incident hits, use this trace-based workflow to minimize time-to-resolution:

  1. Identify the scope: Check the service overview. Is the issue isolated to one service or affecting multiple? Are error rates or latency elevated?
  2. Find representative traces: Use the trace explorer to filter for affected traces. Sort by duration for latency issues, filter by error status for failures.
  3. Read the waterfall: Open 3–5 representative traces. Look for: the longest span (bottleneck), error spans (root cause), gaps between spans (pool exhaustion), high span counts (N+1 patterns), and missing expected spans (service unreachable).
  4. Check span attributes: Examine db.statement for bad queries, http.status_code for upstream failures, exception.message for error details, and custom attributes for business context.
  5. Correlate with other signals: Jump to logs for detailed error messages and stack traces. Check infrastructure metrics for resource exhaustion. Look at deployment events for recent changes.
  6. Verify the fix: After applying a fix, compare new traces against the problematic ones. Confirm the bottleneck span duration decreased, error spans disappeared, or the N+1 pattern resolved.

Summary

Distributed tracing transforms microservices troubleshooting from guesswork into systematic diagnosis. The patterns covered in this guide, including latency bottlenecks, N+1 queries, timeout cascades, retry storms, error propagation, connection pool exhaustion, cache failures, and message queue issues, account for the vast majority of production incidents in distributed systems.

The key is developing pattern recognition: learn what healthy traces look like for your critical flows, and the unhealthy patterns will stand out immediately during incidents. OpenTelemetry auto-instrumentation provides the data foundation, and a capable tracing backend like Sematext Tracing gives you the analysis tools to turn that data into fast resolution.

Next steps:

  • How to Implement Distributed Tracing in Microservices with OpenTelemetry Auto-Instrumentation
  • OpenTelemetry Instrumentation Best Practices for Microservices Observability
  • Top 12 Distributed Tracing Tools in 2026: Complete Comparison & Reviews