
OpenTelemetry Production Monitoring: What Breaks, and How to Prevent It

Updated on: February 17, 2026


OpenTelemetry almost always works beautifully in staging, demos, and videos. You enable auto-instrumentation, spans appear, metrics flow, the collector starts, and dashboards light up. Everything looks clean and predictable.
However, production has a way of humbling even the most carefully prepared setups. When real traffic hits, and it always spikes sooner or later, you start seeing dropped spans. Collector memory climbs until the process gets killed, and if you are running a single-instance collector, you can forget about collecting any telemetry until you bring it back up. Costs climb faster than anyone budgeted for. A few traces look incomplete. Your boss asks why latency increased by 12% after “just adding observability.”
None of this means OpenTelemetry is broken. It means production behaves differently than demos. This guide walks through what actually breaks when OpenTelemetry meets real-world scale, and what you can do about it before it becomes a 2 AM incident. Catching these issues early is the difference between a boring Tuesday and a war room.

For a practical setup of OpenTelemetry in microservices, see our step-by-step guide on distributed tracing with auto-instrumentation.

The First Production Surprise: Cardinality Explosions

High cardinality is one of the fastest ways to destabilize an otherwise healthy observability setup, and it almost always starts innocently. Someone with the best intentions adds a genuinely helpful attribute:

  • user_id
  • session_id
  • request_uuid
  • a fully expanded URL path

In development, nothing bad happens. In production, that single decision can create millions of unique time series. For example, if a request counter is labeled with user_id and you have two million users, you have just created two million distinct metric series for one metric. Multiply that across services and dimensions, and storage, memory, and the performance of your observability tool all degrade quickly.

You will notice it in a few ways: dashboards become noisy or slow, request latency increases, storage costs spike, and collector memory usage grows for no obvious reason.

The fix is not complicated, but it requires discipline. Metrics should use low-cardinality dimensions only, things like environment (prod, staging), service name, endpoint patterns rather than full URLs, and HTTP status classes (2xx, 4xx, 5xx). Anything that is essentially unique per request does not belong on a metric.

With auto-instrumentation, you do not always control attribute creation directly, but you can still suppress high-cardinality attributes via agent configuration, or drop and transform attributes in the collector using processors like filter, attributes, or transform. With manual instrumentation, you have full control and full responsibility. If you truly need high-cardinality identifiers, consider hashing or aggregating them before attaching them.
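On the collector side, a minimal sketch with the contrib attributes processor might look like the following. It assumes the problematic attributes arrive under the exact keys user_id, session_id, and request_uuid (adjust them to match your telemetry), deletes the purely per-request identifiers, and hashes user_id as an example of keeping an identifier without shipping the raw value:

processors:
  attributes/cardinality:
    actions:
      # Drop identifiers that are unique per request
      - key: session_id
        action: delete
      - key: request_uuid
        action: delete
      # If you genuinely need user_id, hashing the value (SHA-1) is an alternative to deleting it
      - key: user_id
        action: hash

Like any processor, it only takes effect once it is added to the processors list of the relevant pipeline, the same way memory_limiter and batch are wired in later in this article.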

The key habit is to monitor cardinality continuously, not just after a cost spike. Keep an eye on collector self-metrics such as otelcol_processor_accepted_metric_points, and track the number of distinct series per metric name in your backend. These reveal which metrics are growing out of control before they degrade performance or inflate your bill.
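These collector self-metrics are exposed in Prometheus format (on port 8888 by default). Raising the detail level is a small configuration change; the exact telemetry keys vary a little between collector versions, so treat this as a sketch:

service:
  telemetry:
    metrics:
      level: detailed

Scrape that endpoint with whatever already scrapes your other services, and alert on it like any other production metric.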

For more guidance on instrumentation hygiene and preventing cardinality issues from the start, see our OpenTelemetry instrumentation best practices.

Scaling Pressure in OpenTelemetry Production Pipelines

OpenTelemetry components (SDKs, agents, and collectors) are not magic. They are software services that can be overloaded, and in high-throughput systems they often are.

In busy environments, traces can be generated at hundreds of thousands per second. Metrics multiply across services, containers, and pods. If batching, memory limits, and exporter throughput are not tuned, the pipeline itself becomes the bottleneck. The symptoms are predictable: processor_refused_spans starts increasing, collector memory climbs steadily, export failures appear, and telemetry arrives late or gets dropped entirely.

To understand where these bottlenecks occur, it helps to picture the overall production pipeline: SDKs and agents generate and batch telemetry inside the application, one or more collectors receive, process, and re-batch it, and exporters ship it to the backend. Each stage has finite capacity, and any of them can become the choke point.

If you are using manual SDK instrumentation, you can tune batching and flush intervals directly. Larger batches reduce per-span overhead but increase memory pressure in the application itself, raising the risk of an OOM kill for containerized workloads. Smaller batches reduce memory but increase network calls. There is a balance, and you find it through load testing rather than guesswork.

With auto-instrumentation agents, you do not have direct SDK access, but most agents expose equivalent environment variables for batch size and schedule delay. These matter in production just as much as they do with manual instrumentation. A simple example showing where these settings live can save a lot of trial and error:

OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512
OTEL_BSP_SCHEDULE_DELAY=5000

For detailed information, see the OpenTelemetry Environment Variable Specification.

Regardless of instrumentation type, the collector itself must be treated like any other production service. Monitor its CPU and memory, scale it horizontally when needed, use load balancing with trace ID based routing so spans for the same trace land on the same collector instance, and watch queue lengths in the batch processor. If your collector is not monitored, you do not have observability, you have a single point of failure. 
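One common way to get that trace ID based routing is a two-tier collector layout: a thin front tier that only receives and forwards, using the contrib loadbalancing exporter to fan spans out to a back tier keyed by trace ID. A minimal sketch, assuming the back tier is exposed through a Kubernetes headless service (the hostname below is a placeholder):

exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # Placeholder: headless service that resolves to the back-tier collector pods
        hostname: otel-collector-backend-headless.observability.svc.cluster.local

Because the routing key is the trace ID, every span of a given trace lands on the same back-tier instance, which is exactly what the tail sampling discussed below depends on.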

For detailed guidance, see OpenTelemetry Collector architecture and best practices.

Sampling Strategies for OpenTelemetry in Production

At some point, you realize capturing 100% of traces is not sustainable. Sampling becomes necessary. However, sampling is not just a cost decision; it also changes what you can see, so it deserves more thought than simply dialing a number down.

Agent-Level Sampling

Agent-level sampling makes the decision immediately when a request starts, before a single span hits the collector. The benefit is immediate volume reduction: CPU, memory, and network overhead all drop. The trade-off is permanent blindness for discarded traces. If an error happens in a trace that was not sampled, it simply does not exist in your backend. There is no way to recover it after the fact.

Agent-level sampling works well as a baseline control mechanism. Many production systems start at 5 to 10% and adjust based on throughput and debugging needs. It is particularly useful when throughput is extremely high, infrastructure or observability vendor cost is the primary concern, or you need to protect the collector from being overwhelmed. Just keep in mind that it does not guarantee you will retain slow or rare traces that would have been most useful during an incident.
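Most SDKs and auto-instrumentation agents honor the standard sampler environment variables, so a 10% head-sampling baseline can be expressed without touching code (the ratio is a starting point to tune against your traffic):

OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1

The parentbased_ prefix keeps decisions consistent across services: if an upstream service sampled the trace, downstream services follow that decision rather than re-rolling the dice.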

Tail Sampling

Tail sampling moves the decision to the collector, after the entire trace has been observed. This enables smarter decisions: keep slow traces, keep error traces, retain 100% of traffic from business-critical services, and sample normal traffic probabilistically.

This is more powerful, but it comes with real operational weight. The collector has to buffer complete traces in memory while waiting for all spans to arrive, which means memory usage is meaningfully higher than with head-based sampling. It also adds latency to trace delivery, since the collector has to wait for the full trace before deciding whether to keep it. If your typical transaction takes 90 seconds to complete, your collector is buffering 90 seconds of trace data before it can act, which is a lot of memory at scale, and your traces will arrive in your backend 90 or more seconds after the fact. For short-lived transactions this is barely noticeable. For long-running workflows, plan accordingly.

In distributed systems, spans for the same trace can arrive at multiple collector instances. If each collector makes independent sampling decisions, traces become fragmented, leaving gaps that make debugging much harder. Routing all spans for a trace to the same collector instance via trace ID hashing (as in the load-balancing sketch earlier) keeps traces intact and reliable. To be precise, this sticky routing is a prerequisite for tail sampling to work correctly.

The most effective production strategy usually combines both approaches: use agent-level sampling to cut down overall span volume and prevent the collector from being overwhelmed, then use tail sampling at the collector to make sure high-value traces (slow requests, errors, and critical transactions) are preserved. Sampling is not random volume reduction. It is selecting the traces that help you debug real incidents.

For the official OpenTelemetry guidance, refer to the OpenTelemetry sampling specification.

How to Set Tail Sampling Policies in Practice

Before writing any tail sampling policy, start by asking yourself a few practical questions: What types of incidents happen most often? Are latency regressions more frequent than hard failures? Which services are business-critical or compliance-sensitive? The answers should guide your sampling decisions, not the other way around.

For example, if most of your incidents are latency-related, prioritize keeping slow traces. A common starting point is to retain 100% of traces slower than twice your SLO, while sampling just 5 to 10% of normal traffic. For compliance-sensitive endpoints, always keep those traces intact. For business-critical services, bias your sampling to capture a higher proportion of requests, perhaps 50% from your payment service but only 5% from static content services.
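As a sketch of how those rules translate into the contrib tail_sampling processor, assuming a 1-second latency SLO (so 2000 ms stands in for twice the SLO; every number here is a placeholder to adapt):

processors:
  tail_sampling:
    decision_wait: 30s
    policies:
      # Always keep traces that contain an error
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always keep traces slower than twice the SLO
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000
      # Baseline sample of normal traffic
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

A trace is kept if any policy matches, and decision_wait is how long the collector buffers spans before deciding, which ties directly back to the memory and delivery-latency trade-offs above.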

It is also worth maintaining a small baseline sample across all services, around 5 to 10% of overall traffic, even for well-behaved paths. This gives you trend data and lets you detect unknown failure modes you did not anticipate when writing the policies. Without that baseline, you lose visibility into normal system behavior and can miss gradual degradations that do not trigger your explicit rules.

Agent and Collector Stability: The Hidden Risk

Agents and collectors are not passive observers. They are active components in your application infrastructure, and they can fail like any other component.

The collector is the more straightforward case. OpenTelemetry SDKs instrument your application code directly, and the collector runs as a separate process (or set of processes) that receives, processes, and exports telemetry. When a collector crashes, all buffered data is lost, including any traces that were being held in memory for tail sampling decisions. Memory spikes can trigger OOM kills, and if you are running a single collector instance, the entire observability pipeline goes dark until it recovers.

The common causes are predictable: exporters fall behind because the backend is slow or throttling ingest, queues grow, memory fills, and eventually the collector crashes. The practical safeguard against this is the memory limiter processor, which watches the collector’s overall memory consumption and temporarily refuses incoming data when it crosses your configured threshold, giving the collector room to catch up.

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
    spike_limit_mib: 400

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]

This is one of those configurations that feels optional until the day it is not.

Auto-instrumentation adds another layer of complexity. Java agents rewrite bytecode at runtime, async context propagation in .NET or Node.js can behave unexpectedly under load, and in high-throughput systems you may spend measurable CPU time just recording spans. This is why load testing your instrumentation matters as much as load testing your application. Before rolling out to production, measure baseline latency without instrumentation, then measure P50, P95, and P99 latency with it enabled. A 5 to 10% latency increase is often acceptable. Triple-digit millisecond overhead per request is not.

For detailed instructions by language, see the OpenTelemetry auto-instrumentation documentation.

Exporter Bottlenecks: When the Backend Cannot Keep Up

Even if your SDKs and collectors are perfectly tuned, the backend you are exporting to may not be. When the backend is slow, throttling requests, or simply unable to absorb your telemetry volume, batches start piling up in the exporter queues inside the collector. Left unchecked, this cascades into collector instability.

The signals to watch for are otelcol_exporter_send_failed_spans (a counter visible in the collector’s own self-monitoring metrics), growing exporter queue lengths, increased export latency, and rising memory pressure in the collector process.

For self-hosted backends like Elasticsearch, OpenSearch, or Prometheus, ingestion capacity must match telemetry throughput and cardinality. For external vendors, you need to understand their API rate limits, network latency characteristics, and burst handling policies before you are under pressure. An asynchronous exporter with buffering, retry logic, and exponential backoff is essential. Without it, a temporary backend slowdown cascades through the entire pipeline. Your observability stack is only as reliable as its slowest component.
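In the collector, that buffering and backoff lives in the exporter's sending_queue and retry_on_failure settings. A sketch for the otlphttp exporter used earlier in this article (the endpoint and sizes are placeholders to tune against your backend's rate limits):

exporters:
  otlphttp:
    # Placeholder endpoint for your backend's OTLP/HTTP ingest
    endpoint: https://otlp.your-backend.example.com:4318
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s

queue_size bounds how much the collector will buffer before it starts dropping data, which is exactly the situation the memory limiter from the earlier section is there to survive.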

Why This Matters in Real Systems

Many OpenTelemetry tutorials and examples show instrumentation working out of the box, which it does, in a demo environment with predictable traffic and no cost constraints. Real production systems are a different beast entirely: high throughput, distributed microservices, partial network failures, uneven traffic spikes, and budgets that someone is accountable for.

OpenTelemetry is genuinely powerful, but it requires operational discipline. When you adopt it, you are not just instrumenting a few services. You are operating an observability pipeline that itself needs capacity planning, monitoring, load testing, a clear sampling strategy, and ongoing cardinality governance. Treat it as first-class infrastructure and it becomes a strong foundation for understanding your systems. Treat it as a set-and-forget library and it becomes your next incident.

