
From Debugging to SLOs: How OpenTelemetry Changes the Way Teams Do Observability

Updated on: February 23, 2026


At some point in every team’s life, someone gets paged at 2 AM because a service is ‘slow.’ Nobody knows which service. Nobody knows why. Someone opens five different dashboards, pastes a trace ID into a Slack thread, and thirty minutes later you have twelve engineers in a call arguing about whether the problem is in the database or the API gateway. By the time you find the actual culprit, half the team has memorized each other’s sleep schedules.

This is what life looks like when observability is an afterthought: logs in one place, metrics in another, and a custom monitoring agent that only works for two services because the third one was written in a language nobody on the team uses anymore. It works, technically. Until it does not.

OpenTelemetry came out of a genuine frustration with this fragmented mess. It is an open-source observability framework that gives you a vendor-neutral, standardized way to instrument your applications and then connect that instrumentation to service health, error budgets, and eventually SLOs that your entire organization actually understands. This article walks through what that shift looks like in practice, and why it matters for more than just the people who are on call.

The Old World: Logs, APM Agents, and the Dashboard Graveyard

Let’s be direct about how most teams actually do observability before they invest in it properly. You have application logs going into a log management platform, with varying levels of structure depending on who wrote which service. You have an APM tool that auto-instruments some of your services but not all of them, and the traces it produces are siloed within its own ecosystem. And you have a monitoring dashboard that someone built eighteen months ago and that might or might not reflect how the service actually behaves today.

The real cost is not the outage. It is the investigation. A 2023 industry study on downtime costs found that engineering teams spend an average of 200-plus hours per year just on incident investigation, separate from the time actually fixing things. A good chunk of that is tool-switching and context-switching because telemetry data lives in silos.

The deeper problem is not the tools themselves; it is that each one has its own instrumentation model. Your APM agent captures HTTP spans one way. Your custom metrics library reports latency percentiles slightly differently. Your logs do not correlate to your traces automatically. So when something breaks, you are stitching together three different narratives instead of reading one coherent story about what happened.

This fragmentation is actually part of Sematext’s origin story – back in 2012, Sematext was the first platform to offer both performance monitoring (i.e., metrics) and log monitoring in a single observability platform, adding distributed transaction tracing in 2015.

What OpenTelemetry Actually Is (Without the Fluff)

OpenTelemetry standardizes how you generate, collect, and export telemetry data. It covers three signal types (with more to come), which are the foundation of everything else in this article:

THE THREE PILLARS OF OPENTELEMETRY

🔗 Traces

End-to-end request paths across services. Shows exactly where time is spent and where errors propagate.

📊 Metrics

Numeric measurements over time: latency histograms, request counts, error rates, resource utilization. The raw material for SLOs.

📋 Logs

Structured event records with trace context attached. No more copy-pasting trace IDs; logs link directly to the span that generated them, and an error span links back to every log event emitted during that span.


What makes OTel different from what came before is not magic; it is the fact that all three signals share the same context propagation model. A trace ID that starts in your frontend propagates through every instrumented microservice call, and if your logs are also emitting that trace ID, you can jump from a log line to its trace in seconds. Not minutes. Seconds. If you are the person doing production troubleshooting, you know how valuable this difference is!
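To make the propagation mechanism concrete: OTel’s default propagator encodes the trace context in the W3C Trace Context `traceparent` header, in the format version-traceid-spanid-flags. Here is a minimal, library-free sketch of what actually travels between services; the function names are illustrative, not part of the OTel API.

```python
import re
import secrets

def make_traceparent() -> str:
    """Build a W3C `traceparent` header: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)   # 128-bit trace ID, hex-encoded
    span_id = secrets.token_hex(8)     # 64-bit ID of the current span
    return f"00-{trace_id}-{span_id}-01"  # flags 01 = sampled

def parse_traceparent(header: str) -> dict:
    """Extract the context a downstream service would continue from."""
    m = re.fullmatch(r"(\d{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})",
                     header)
    if not m:
        raise ValueError("malformed traceparent")
    _version, trace_id, parent_span_id, flags = m.groups()
    return {"trace_id": trace_id,
            "parent_span_id": parent_span_id,
            "sampled": flags == "01"}

# The upstream service sends the header; the downstream service parses it,
# keeps the same trace_id, and mints its own span ID for its work.
header = make_traceparent()
ctx = parse_traceparent(header)
```

Every hop repeats this: same trace ID end to end, new span ID per unit of work. That shared trace ID is what your log pipeline attaches to each log line.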

SLOs: What They Are and Why OTel Makes Them Achievable

Service Level Objectives have been a thing since Google wrote about them in the Site Reliability Engineering book, and they have been misunderstood and poorly implemented since roughly the same time. The core idea is simple: you agree on a target for how reliable a service needs to be, you measure it consistently, and you manage your engineering work in relation to how much reliability budget you have consumed or have left.

The reason SLOs often fail is not the concept; it is that teams try to define them before they have reliable telemetry. You cannot set a meaningful availability target for a service if your metrics come from three different monitoring agents that measure availability in subtly different ways. You end up with SLOs that nobody trusts, which means nobody uses them to make decisions.

Example SLOs Built on OTel Metrics

Service      | SLI                               | Target | Error Budget       | Status
Checkout API | % requests < 500 ms, non-5xx      | 99.5%  | 3 h 36 m remaining | HEALTHY
Auth Service | % successful token validations    | 99.9%  | 0 h 22 m remaining | AT RISK
Search API   | % queries returning results < 1 s | 98.0%  | Budget exhausted   | BREACHED
Order Worker | % jobs processed without retry    | 99.0%  | 5 h 12 m remaining | HEALTHY

When SLIs are computed from OTel semantic conventions, every service uses the same measurement logic regardless of language or framework.

When your SLIs are computed from OTel metrics, specifically from the semantic conventions that define how HTTP span duration and status should be recorded, you get consistency across services by default. The latency histogram for your Go service and the one for your .NET service use the same bucket boundaries. The error classification follows the same logic. Suddenly your SLOs are comparing apples to apples, and that changes what you can do with them.
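The error-budget arithmetic behind a table like the one above is simple enough to sketch in a few lines. This is the standard SLO bookkeeping, not anything OTel-specific; the numbers below are illustrative.

```python
def error_budget_remaining(target: float, window_minutes: int,
                           bad_minutes: float) -> float:
    """Minutes of unreliability still allowed in the SLO window.

    target: SLO target as a fraction, e.g. 0.995 for 99.5%
    bad_minutes: minutes so far in the window where the SLI missed target
    """
    budget = (1.0 - target) * window_minutes
    return budget - bad_minutes

# A 99.5% target over a 30-day window allows 216 minutes of badness.
# After 44 bad minutes, roughly 172 minutes (~2 h 52 m) remain.
remaining = error_budget_remaining(0.995, 30 * 24 * 60, 44)
```

The hard part was never this math; it was trusting the `bad_minutes` input, which is exactly what consistent OTel-derived SLIs give you.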

The Correlation Story: How One Trace ID Connects Everything

One of the things that sounds academic until you experience it is trace context propagation. When a request comes into your frontend and you are using OTel instrumentation, a trace ID gets generated and passed along to every downstream service call via HTTP headers, gRPC metadata, message queue attributes, or whatever transport you are using. Every span in that trace carries the same trace ID, and your logs carry it too if you have set up log correlation.

What this means in practice: when your error rate alert fires because the checkout service just breached its error budget, you do not start by guessing. You go to the traces for that time window, filter for error spans, and you are already looking at the full call path: frontend, checkout API, inventory service, payment gateway, with timing for each hop. If the inventory service was slow, you will see a long span there. If the payment gateway returned a 503, you will see that in the span status. No grep-ing through logs trying to find a request ID that someone may or may not have remembered to log. For a step-by-step breakdown of what these patterns look like in real incidents, troubleshooting microservices with distributed tracing is a good companion read.

Before vs After: What Investigation Actually Looks Like

Before OTel

Alert fires. Open APM tool, find service.

Open logging tool, search by timestamp.

Paste trace ID into search; hope the log format includes it.

Cross-reference three tools. Escalate because nobody can reproduce it.

MTTR: 45 to 90 min for medium-severity incidents.

After OTel

Alert fires with a link to the error budget burn rate.

Click through to traces for that time window.

Follow the trace to the failing span.

Logs automatically surfaced by trace ID.

MTTR: 5 to 20 min for the same incidents.

The difference in MTTR is not about effort. It is about whether correlated telemetry exists at all.

Auto-Instrumentation: Getting Value Without Rewriting Everything

One of the biggest objections to investing in observability is the instrumentation cost. If you have thirty microservices and each one needs to be manually instrumented before you see any benefit, that is a project with a very long feedback loop. This is exactly what we saw with our initial distributed tracing implementation at Sematext back in 2015 – adoption was a challenge because of how much work engineers had to invest in instrumenting their applications. OTel’s auto-instrumentation libraries change that equation significantly.

For Java, the OTel Java agent attaches to your JVM at startup and automatically instruments common frameworks such as Spring Boot, gRPC, JDBC, and Kafka without any code changes. For Python, opentelemetry-instrument does the same for Flask, Django, FastAPI, and SQLAlchemy. The .NET ecosystem has similar coverage through the automatic instrumentation package. You get spans for every incoming HTTP request, every outgoing call, and every database query without touching the application code. If you want to skip the boilerplate and start from something that already works, these language-specific OTel examples cover the setup end to end.
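For a Python service, the zero-code path looks roughly like this (service name and Collector endpoint below are placeholders you would replace with your own):

```shell
# Install the auto-instrumentation entry point, then let the bootstrap
# tool detect which installed frameworks have instrumentation available
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run the app unchanged; spans are exported to a local OTel Collector
OTEL_SERVICE_NAME=checkout-api \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
opentelemetry-instrument python app.py
```

The Java agent equivalent is a single `-javaagent:` flag on the JVM command line; in both cases the application code is untouched.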

What to Actually Watch Out For

None of this comes without tradeoffs, and articles that only cover the benefits are setting you up for some unpleasant surprises. A few things will bite you if you do not plan for them.

A deep dive into OpenTelemetry instrumentation best practices covers all of these in detail, but here is the short version.

Cardinality explodes if you are not careful

OTel metrics support rich attribute sets, which is great for debugging but problematic for storage costs if you start adding high-cardinality attributes like user IDs or request IDs to your metrics. The OTel metrics spec includes cardinality limits, and you should understand them before you start attaching attributes to everything.
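A common defensive pattern, sketched below with illustrative names, is to collapse any attribute value outside a known allowlist into a single overflow bucket, so the label set on a metric stays bounded no matter what the traffic looks like:

```python
def cap_attribute(value: str, allowed: set[str],
                  overflow: str = "other") -> str:
    """Collapse unexpected attribute values into one overflow bucket
    so a metric's attribute set stays bounded."""
    return value if value in allowed else overflow

# Templated routes are safe metric attributes; raw paths are not.
ALLOWED_ROUTES = {"/checkout", "/search", "/login"}

a = cap_attribute("/checkout", ALLOWED_ROUTES)
# A raw path with an embedded user ID would create one time series
# per user -- cardinality explosion -- so it gets folded into "other".
b = cap_attribute("/users/8f3a2c/cart", ALLOWED_ROUTES)
```

The OTel SDKs apply a similar idea internally when a cardinality limit is hit: overflowing series get folded into an overflow aggregation rather than creating unbounded time series.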

Sampling is necessary at scale and confusing to get right

Sending 100 percent of traces when you are handling thousands of requests per second is expensive. Head-based sampling, where you decide at the start of a trace whether to keep it, is simple but means you might drop the interesting traces. Tail-based sampling, where you decide after seeing the whole trace, keeps the errors but requires the OTel Collector to buffer spans, which adds complexity. There is no right answer, only tradeoffs that depend on your volume and budget.
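Head-based sampling is usually implemented as a deterministic function of the trace ID, in the spirit of OTel’s TraceIdRatioBased sampler: because the decision depends only on the trace ID, every service in the call path makes the same keep/drop decision without any coordination. A minimal sketch:

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic head-based sampling: keep a trace iff the low
    64 bits of its 128-bit trace ID fall below ratio * 2**64.

    Because the decision is a pure function of the trace ID, every
    service in the trace agrees without coordination."""
    threshold = int(ratio * (1 << 64))
    # Treat the low 64 bits of the trace ID as a uniform random value.
    return int(trace_id_hex[-16:], 16) < threshold

# Keep roughly 10% of traces; the same trace ID always gets the
# same decision on every service that sees it.
keep = head_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.10)
```

The catch the paragraph above describes is visible here: the decision is made before anything is known about the trace, so a rare error trace is exactly as likely to be dropped as a boring one. Tail-based sampling fixes that by deciding after the fact, at the cost of buffering in the Collector.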

Auto-instrumentation vs manual instrumentation: the honest tradeoff

Auto-instrumentation gets you running in an afternoon with zero code changes and gives consistent coverage across your entire fleet from day one. The tradeoff is that it understands frameworks, not business intent. It can tell you a database query took 800 ms but not that it was pricing a cart for a high-value customer.

Manual instrumentation fills the gap that actually matters for SLOs: checkout completion time, order processing latency by fulfillment partner, or time to first search result. It takes more effort, but it is what turns a latency alert into a business conversation.

In practice, auto-instrumentation provides the foundational 80 percent: requests, error rates, and durations (aka RED) from day one. You then layer manual instrumentation on top for the business-critical signals your SLOs should be measuring.
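The shape of manual instrumentation is the same everywhere: name the unit of business work, attach the attributes your SLOs care about, record the duration. The sketch below uses a toy stand-in for a tracer so it runs standalone; with the real OTel Python API you would use `tracer.start_as_current_span` from `opentelemetry.trace` instead, and the span names and attributes here are illustrative.

```python
import time
from contextlib import contextmanager

# Toy span recorder standing in for an OTel tracer, to show the shape
# of manual business instrumentation.
spans = []

@contextmanager
def business_span(name: str, **attributes):
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({"name": name,
                      "attributes": attributes,
                      "duration_s": time.perf_counter() - start})

def price_cart(items: list[dict]) -> float:
    # The business-level span auto-instrumentation cannot give you:
    # it knows this is "pricing a cart", not just "some function ran".
    with business_span("checkout.price_cart", item_count=len(items)):
        return sum(i["price"] * i["qty"] for i in items)

total = price_cart([{"price": 9.99, "qty": 2}, {"price": 4.50, "qty": 1}])
```

Swap the context manager for a real tracer and these spans nest inside the auto-instrumented HTTP span, inheriting its trace ID automatically.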

The Collector configuration gets complex fast

Once you start running multiple pipelines, applying transforms, doing tail-based sampling, and exporting to multiple backends, your collector config becomes something that needs to be tested and versioned like application code. Treat it that way from the start.
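To give a sense of the moving parts, here is a hedged sketch of a Collector config with tail-based sampling; the `tail_sampling` processor lives in the Collector contrib distribution, and the endpoint below is a placeholder:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  batch:
  tail_sampling:
    decision_wait: 10s          # buffer spans this long before deciding
    policies:
      - name: keep-errors       # always keep traces containing errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-the-rest   # keep 10% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  otlphttp:
    endpoint: https://otel-backend.example.com

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlphttp]
```

Every one of those blocks is a place a typo can silently drop telemetry, which is exactly why this file belongs in version control with a validation step in CI.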

Starting Without Starting Over

The most common mistake teams make when adopting OTel is treating it as a big-bang migration. You do not need to instrument every service before any of it becomes useful. Pick one service, ideally something that sits in the middle of your call graph so you can see upstream and downstream spans, and get it fully instrumented with OTel, exporting to a collector and from there to whatever backend you already have. Define one or two SLIs for it. Watch them for a week and see if they match your intuition about how the service is performing.

That first service will teach you things that no amount of reading can. You will find out how your framework handles context propagation. You will discover that your log format does not include trace IDs and will need to fix that. You will learn what your normal latency histogram looks like and be surprised by the long tail. Do that before you roll out to thirty services, and the rollout will go much faster.

To get started see the Sematext step-by-step setup guide for OpenTelemetry tracing. Once you have that in place, the article on building a troubleshooting workflow with Sematext tracing shows how to use those first traces to investigate issues and iterate on your instrumentation.

