
OpenTelemetry in Production: Design for Order, High Signal, Low Noise, and Survival

Updated on: February 11, 2026


A lot of the talk around OpenTelemetry is about instrumentation, especially auto-instrumentation, and about OTel being vendor-neutral, open, and a de facto standard. But how you use the final output of OTel is what makes the business difference.

In other words, how do you use it to make your life as an SRE/DevOps/biz person easier?

How do you have to set things up to truly solve production issues faster?

And does doing that require you to spend more money on observability or can you be smart about how you set things up so that OTel doesn’t break the bank?

While we were putting the finishing touches on Sematext’s OTel support, I asked one of my friends about their experience with and use of OTel in the context of questions like the ones above. The friend, the company, and the monitoring vendor they used will go unnamed, but here are the experiences and the practices my friend shared.

We’re a mid-sized org with about 30 frontend and backend developers. We know our way around observability, but we didn’t adopt OpenTelemetry until late 2025. When we first rolled out OpenTelemetry in production, it felt like we had finally “done observability right.”

Every service was instrumented. OK, almost every service. 😉
Every request had a trace.
Every component had a metric.
Logs were nicely structured and correlated.

It was not quick and easy to set it all up, but we split the work among several team members and we did it.

However, within about two weeks we started observing – pun intended – problems:

  • our storage bill doubled
  • dashboards became slow
  • our team stopped opening traces
  • cardinality exploded
  • and we started sampling randomly just to survive

It became apparent pretty quickly that just adopting OpenTelemetry is not automatically going to give us good monitoring. OpenTelemetry doesn’t give you a signal strategy. Out of the box, with naive usage, it just gives you a firehose and enables you to drown in your own telemetry more quickly.

We kept this new firehose on, but we had to quickly start making decisions around things like:

  • what belongs in metrics
  • what belongs in traces
  • what belongs in logs
  • and, perhaps most importantly, what should never be emitted at all!

How I Think About the Three Telemetry Signals Now

Early on, we treated metrics, logs, and traces as three different ways to describe the same thing. That was a mistake. They are not the same thing: they are different tools with different costs and different failure modes.

Now I think about them like this:

  • Metrics answer: “Is the system healthy?” (both from a tech/engineering perspective and a business one – we use metrics to understand the business side of things, too)
  • Traces answer: “Where did the time go?”
  • Logs answer: “What exactly happened?”

This separation of concerns feels simple and straightforward. As long as the observability tool you’re using has good UX for cross-connecting and correlating these signals, this separation should serve you well.

The Architecture We Ended Up With

This is the shape that finally worked for us:

[Architecture diagram: applications → OpenTelemetry Collector (filtering, sampling, batching) → observability backend]

The key idea is simple:
Applications emit everything. The collector acts as a filter, among other things, and decides what survives.

If you try to enforce strategy in application code, you’ll fail. Teams move too fast, especially now with AI. You need one place where you can say:

  • keep error traces
  • drop noisy attributes
  • batch aggressively
  • deduplicate
  • enforce memory limits

That place is the collector.
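
For example, “drop noisy attributes” doesn’t have to mean touching application code at all. A minimal sketch, assuming the collector’s attributes processor; the attribute names here (user_id, http.request.header.cookie) are just illustrative:

processors:
  attributes/strip_noise:
    actions:
      - key: user_id
        action: delete
      - key: http.request.header.cookie
        action: delete

You’d then add attributes/strip_noise to the processors list of the relevant pipeline, and noisy attributes disappear for every service at once.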

Metrics: What We Actually Trust During Incidents

The first real incident after we adopted OpenTelemetry was a checkout latency spike. Nobody opened a trace first. We all looked at metrics because our alert notifications pointed us there.

Metrics are what we trust when:

  • we get an alert notification
  • the CTO asks “are we down?”
  • a deploy goes wrong

So we designed metrics to answer only three questions:

  • How many requests?
  • How many errors?
  • How slow are they?

Sound familiar? 👌 Yes, RED (Rate, Errors, Duration)!

Here’s a snippet from the relevant Python application.

Example (Python)

from opentelemetry import metrics

meter = metrics.get_meter("checkout")

# Rate and errors: one counter, with status as a bounded label
request_counter = meter.create_counter(
    "http.server.requests",
    description="Total HTTP requests"
)

# Duration: a histogram in milliseconds
latency_histogram = meter.create_histogram(
    "http.server.duration",
    unit="ms"
)

def handle_request():
    # Only low-cardinality labels: route and HTTP status
    request_counter.add(1, {"route": "/checkout", "status": "200"})
    latency_histogram.record(245, {"route": "/checkout"})

Hard Rule We Learned

It’s actually very simple: If a label (aka tag) can be different for every request, it does not belong in metrics.

These caused real problems for us:

  • user_id
  • email
  • request_id
  • order_id

You can see where this is going: cardinality. High cardinality kills storage, makes certain UI elements unusable (think dropdowns with 1000+ values – fun!), and so on.
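
What we do instead is keep those identifiers on spans and in logs, where per-request values are expected. A rough Python sketch of that pattern – handle_checkout and the attribute names are illustrative, and request_counter is the counter from the earlier metrics snippet:

from opentelemetry import trace

tracer = trace.get_tracer("checkout")

def handle_checkout(order_id, user_id):
    # Unbounded identifiers belong on the span (or in logs), not on metrics.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order_id", order_id)
        span.set_attribute("user_id", user_id)
        # Metrics keep only bounded labels such as route and status.
        request_counter.add(1, {"route": "/checkout", "status": "200"})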

Traces: How We Debugged Slow Requests

When it comes to traces, you might think they’re like logs: you want to keep them all so you can really dig in when you need to troubleshoot. For us, at least, traces became useful only after we stopped trying to store all of them.

At first, we sampled at 100%. Meaning we didn’t sample at all.
Then we realized how much that was going to cost us.
Then we went for the other extreme and sampled at 1%.
But then we missed the interesting traces.

What finally worked was tail-based sampling:
We decide after the trace finishes whether it’s worth keeping.

Earlier, I mentioned a collector acting as a filter that decides what survives. This is a perfect example of that. Here’s the collector config for sampling.

Tail Sampling Config

processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      - name: slow
        type: latency
        latency:
          threshold_ms: 500
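
One thing this snippet doesn’t show: tail-based sampling means the collector has to buffer spans until each trace is complete, so it’s worth bounding that buffering explicitly. The same config with decision_wait and num_traces added – treat the exact values as illustrative and tune them for your traffic:

processors:
  tail_sampling:
    decision_wait: 10s   # wait up to 10s for a trace to complete before deciding
    num_traces: 50000    # cap on traces buffered in memory while waiting
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      - name: slow
        type: latency
        latency:
          threshold_ms: 500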

 

So now what we have does this:

  • slow requests survive
  • failed requests survive
  • boring 200ms health checks die

This changed traces from “expensive noise” into “high-signal debugging data.”

We also learned to be careful with attributes.
Anything that explodes into millions of values makes sampling useless.

Logs: The Last Mile of Debugging

We still rely on logs just as much as we did before. The difference is that with tracing in place, logs are often what we read after a trace tells us “this DB call is slow” and we need to know why, beyond what we can see in the trace itself.

So the big change – the key – for us was correlating logs with traces.

Here’s how we do it in Python; you’d do something similar in any language. Note how we get the trace_id and span_id from the current span’s context and include them in the log event.

Python Logging with Trace Context

from opentelemetry.trace import get_current_span
import logging

logger = logging.getLogger(__name__)

# This needs to run inside an active span (e.g., in a request handler);
# otherwise the context contains all-zero IDs.
span = get_current_span()
ctx = span.get_span_context()

logger.error(
    "payment failed",
    extra={
        # W3C trace IDs are 32 hex chars, span IDs are 16 hex chars
        "trace_id": format(ctx.trace_id, "032x"),
        "span_id": format(ctx.span_id, "016x"),
        "order_id": 1234
    }
)
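
If adding these fields by hand to every log call feels tedious, there’s also an automatic route. A sketch assuming the opentelemetry-instrumentation-logging package, which can inject trace context into standard logging records for you (the exact record field names, e.g. otelTraceID, come from that instrumentation, so verify them against your version):

import logging

from opentelemetry.instrumentation.logging import LoggingInstrumentor

# Rewrites the default logging format so every record carries the
# current trace and span IDs alongside the message.
LoggingInstrumentor().instrument(set_logging_format=True)

logging.getLogger(__name__).error("payment failed")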

 

Once we had trace IDs in our logs, debugging became a flow instead of a search:

Alert → metric → trace → log.
That’s the loop we optimized for.

The Collector Is Where Strategy Lives

Here’s a simplified version of the collector config we ended up with:

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
  batch:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow
        type: latency
        latency:
          threshold_ms: 500

exporters:
  otlp:
    endpoint: backend:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp]

    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

This let us:

  • tune sampling without redeploying apps
  • cap memory
  • drop junk centrally

What Scaling Telemetry Really Means

When people say “scaling OpenTelemetry,” they usually mean handling “more traffic/observability data.”

Based on our experience, though, what we actually hit first was:

  • cardinality
  • storage
  • query performance
  • human attention

And thus, what scaling really meant for us in this context was:

  • having fewer but better metrics
  • having fewer but selectively chosen traces
  • having well-structured logs that we can not just search but really slice and dice

What I’d Do Again (and What I Wouldn’t)

Decision                      Result
Tail-sample traces            Saved money and sanity
Golden signal metrics only    Stable dashboards
Correlate logs with traces    Faster debugging
Put strategy in collector     Central control
Let teams emit anything       Mistake (at first)

The Gist

OpenTelemetry is neither an observability strategy nor an observability solution. It’s a transport, a spec, and an implementation in the form of SDKs. It’s just a tool – and one capable of drowning you in your own telemetry.

The strategy is being smart about how you set it up and how you use it. Plan on spending some time on this – it pays off in the long run. The questions to answer:

  • what questions you want answered
  • what data you’re willing to pay for
  • what engineers will actually use

Metrics tell me when things break.
Traces tell me where they break.
Logs tell me why they break.

Everything else… send to /dev/null?
