
OpenTelemetry in Production: Design for Order, High Signal, Low Noise, and Survival

Updated on: February 11, 2026


A lot of the talk around OpenTelemetry is about instrumentation, especially auto-instrumentation, and about OTel being vendor-neutral, open, and a de facto standard. But how you use the final output of OTel is what makes the business difference.

In other words, how do you use it to make your life as an SRE/DevOps/biz person easier?

How do you have to set things up to truly solve production issues faster?

And does doing that require you to spend more money on observability or can you be smart about how you set things up so that OTel doesn’t break the bank?

While we were putting the finishing touches on Sematext’s OTel support, I asked one of my friends about their experience with and use of OTel in the context of questions like the ones above. The friend, the company, and the monitoring vendor they used will go unnamed, but here are the experiences and the practices my friend shared.

We’re a mid-sized org with about 30 frontend and backend developers. We know our way around observability, but we didn’t adopt OpenTelemetry until late 2025. When we first rolled out OpenTelemetry in production, it felt like we had finally “done observability right.”

Every service was instrumented. OK, almost every service. 😉
Every request had a trace.
Every component had a metric.
Logs were nicely structured and correlated.

It was not quick and easy to set it all up, but we split the work among several team members and we did it.

However, within about two weeks we started observing – pun intended – problems:

  • our storage bill doubled
  • dashboards became slow
  • our team stopped opening traces
  • cardinality exploded
  • and we started sampling randomly just to survive

It became apparent pretty quickly that just adopting OpenTelemetry is not automatically going to give us good monitoring. OpenTelemetry doesn’t give you a signal strategy. Out of the box, with naive usage, it just gives you a firehose and enables you to drown in your own telemetry more quickly.

We kept this new firehose on, but we had to quickly start making decisions around things like:

  • what belongs in metrics
  • what belongs in traces
  • what belongs in logs
  • and, perhaps most importantly, what should never be emitted at all!

How I Think About the Three Telemetry Signals Now

Early on, we treated metrics, logs, and traces as three different ways to describe the same thing. That was a mistake. They are not the same thing: they are different tools with different costs and different failure modes.

Now I think about them like this:

  • Metrics answer: “Is the system healthy?” (both from a tech/engineering perspective and a business one – we use metrics to understand the business side of things, too)
  • Traces answer: “Where did the time go?”
  • Logs answer: “What exactly happened?”

This separation of concerns feels simple and straightforward. As long as the observability tool you’re using has good UX for cross-connecting and correlating these signals, this separation should serve you well.

The Architecture We Ended Up With

This is the shape that finally worked for us:

[Architecture diagram: applications → OpenTelemetry Collector (filtering, sampling, batching) → observability backend]

The key idea is simple:
Applications emit everything. The collector acts as a filter, among other things, and decides what survives.

If you try to enforce strategy in application code, you’ll fail. Teams move too fast, especially now with AI. You need one place where you can say:

  • keep error traces
  • drop noisy attributes
  • batch aggressively
  • deduplicate
  • enforce memory limits

That place is the collector.
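
For example, “drop noisy attributes” doesn’t have to mean touching application code at all. A minimal sketch, assuming the collector’s attributes processor; the attribute names here (user_id, http.request.header.cookie) are just illustrative:

processors:
  attributes/strip_noise:
    actions:
      - key: user_id
        action: delete
      - key: http.request.header.cookie
        action: delete

You’d then add attributes/strip_noise to the processors list of the relevant pipeline, and noisy attributes disappear for every service at once.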

Metrics: What We Actually Trust During Incidents

The first real incident after we adopted OpenTelemetry was a checkout latency spike. Nobody opened a trace first. We all looked at metrics because our alert notifications pointed us there.

Metrics are what we trust when:

  • we get an alert notification
  • the CTO asks “are we down?”
  • a deploy goes wrong

So we designed metrics to answer only three questions:

  • How many requests?
  • How many errors?
  • How slow are they?

Sound familiar? 👌 Yes, RED (Rate, Errors, Duration)!

Here’s a snippet from the relevant Python application.

Example (Python)

from opentelemetry import metrics

meter = metrics.get_meter("checkout")

# Rate and errors: one counter, with status as a bounded label
request_counter = meter.create_counter(
    "http.server.requests",
    description="Total HTTP requests"
)

# Duration: a histogram in milliseconds
latency_histogram = meter.create_histogram(
    "http.server.duration",
    unit="ms"
)

def handle_request():
    # Only low-cardinality labels: route and HTTP status
    request_counter.add(1, {"route": "/checkout", "status": "200"})
    latency_histogram.record(245, {"route": "/checkout"})

Hard Rule We Learned

It’s actually very simple: If a label (aka tag) can be different for every request, it does not belong in metrics.

These caused real problems for us:

  • user_id
  • email
  • request_id
  • order_id

You can see where this is going: cardinality. High cardinality kills storage, makes certain UI elements unusable (think dropdowns with 1000+ values – fun!), and so on.
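
What we do instead is keep those identifiers on spans and in logs, where per-request values are expected. A rough Python sketch of that pattern – handle_checkout and the attribute names are illustrative, and request_counter is the counter from the earlier metrics snippet:

from opentelemetry import trace

tracer = trace.get_tracer("checkout")

def handle_checkout(order_id, user_id):
    # Unbounded identifiers belong on the span (or in logs), not on metrics.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order_id", order_id)
        span.set_attribute("user_id", user_id)
        # Metrics keep only bounded labels such as route and status.
        request_counter.add(1, {"route": "/checkout", "status": "200"})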

Traces: How We Debugged Slow Requests

When it comes to traces, you might think they’re like logs: you want to keep them all so you can really dig in when you need to troubleshoot. For us, at least, traces became useful only after we stopped trying to store all of them.

At first, we sampled at 100%. Meaning we didn’t sample at all.
Then we realized how much that was going to cost us.
Then we went for the other extreme and sampled at 1%.
But then we missed the interesting traces.

What finally worked was tail-based sampling:
We decide after the trace finishes whether it’s worth keeping.

Earlier, I mentioned a collector acting as a filter that decides what survives. This is a perfect example of that. Here’s the collector config for sampling.

Tail Sampling Config

processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      - name: slow
        type: latency
        latency:
          threshold_ms: 500
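
One thing this snippet doesn’t show: tail-based sampling means the collector has to buffer spans until each trace is complete, so it’s worth bounding that buffering explicitly. The same config with decision_wait and num_traces added – treat the exact values as illustrative and tune them for your traffic:

processors:
  tail_sampling:
    decision_wait: 10s   # wait up to 10s for a trace to complete before deciding
    num_traces: 50000    # cap on traces buffered in memory while waiting
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      - name: slow
        type: latency
        latency:
          threshold_ms: 500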

 

So now what we have does this:

  • slow requests survive
  • failed requests survive
  • boring 200ms health checks die

This changed traces from “expensive noise” into “high-signal debugging data.”

We also learned to be careful with attributes.
Anything that explodes into millions of values makes sampling useless.

Logs: The Last Mile of Debugging

We still rely on logs just as much as we did before. The difference is that with tracing in place, logs are often what we read after a trace tells us “this DB call is slow” and we need to know why, beyond what we can see in the trace itself.

So the big change – the key – for us was correlating logs with traces.

Here’s how we do it in Python; you’d do something similar in any language. Note how we get the trace_id and span_id from the current span’s context and include them in the log event.

Python Logging with Trace Context

from opentelemetry.trace import get_current_span
import logging

logger = logging.getLogger(__name__)

# This needs to run inside an active span (e.g., in a request handler);
# otherwise the context contains all-zero IDs.
span = get_current_span()
ctx = span.get_span_context()

logger.error(
    "payment failed",
    extra={
        # W3C trace IDs are 32 hex chars, span IDs are 16 hex chars
        "trace_id": format(ctx.trace_id, "032x"),
        "span_id": format(ctx.span_id, "016x"),
        "order_id": 1234
    }
)
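
If adding these fields by hand to every log call feels tedious, there’s also an automatic route. A sketch assuming the opentelemetry-instrumentation-logging package, which can inject trace context into standard logging records for you (the exact record field names, e.g. otelTraceID, come from that instrumentation, so verify them against your version):

import logging

from opentelemetry.instrumentation.logging import LoggingInstrumentor

# Rewrites the default logging format so every record carries the
# current trace and span IDs alongside the message.
LoggingInstrumentor().instrument(set_logging_format=True)

logging.getLogger(__name__).error("payment failed")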

 

Once we had trace IDs in our logs, debugging became a flow instead of a search:

Alert → metric → trace → log.
That’s the loop we optimized for.

The Collector Is Where Strategy Lives

Here’s a simplified version of the collector config we ended up with:

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
  batch:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow
        type: latency
        latency:
          threshold_ms: 500

exporters:
  otlp:
    endpoint: backend:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp]

    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

This let us:

  • tune sampling without redeploying apps
  • cap memory
  • drop junk centrally

What Scaling Telemetry Really Means

When people say “scaling OpenTelemetry,” they usually mean handling “more traffic/observability data.”

Based on our experience, though, what we actually hit first was:

  • cardinality
  • storage
  • query performance
  • human attention

And thus, what scaling really meant for us in this context was:

  • having fewer but better metrics
  • having fewer but selectively chosen traces
  • having well-structured logs that we can not just search but really slice and dice

What I’d Do Again (and What I Wouldn’t)

Decision                      Result
Tail-sample traces            Saved money and sanity
Golden signal metrics only    Stable dashboards
Correlate logs with traces    Faster debugging
Put strategy in collector     Central control
Let teams emit anything       Mistake (at first)

The Gist

OpenTelemetry is neither an observability strategy nor an observability solution. It’s a transport, a spec, and an implementation in the form of SDKs. It’s just a tool – and one capable of drowning you in your own telemetry.

The strategy is being smart about how you set it up and how you use it. Plan on spending some time on this – it pays off in the long run. The questions to answer:

  • what questions you want answered
  • what data you’re willing to pay for
  • what engineers will actually use

Metrics tell me when things break.
Traces tell me where they break.
Logs tell me why they break.

Everything else… send to /dev/null?
