What was traditionally known simply as monitoring has been going through a renaissance over the last few years. The industry as a whole is finally moving away from monitoring and logging silos, something we’ve been doing and “preaching” for years, and the term observability has emerged as the new moniker for everything that encompasses any form of infrastructure and application monitoring. Microservices have been around for over a decade under one name or another. Now that they are often deployed in separate containers, it has become obvious that we need a way to trace transactions through the various microservice layers, from the client all the way down to queues, storage, calls to external services, etc. This created renewed interest in transaction tracing which, although not new, has re-emerged as the third pillar of observability.
Traditionally, APM vendors had their own proprietary tracing agents and SDKs that would instrument applications, either automatically (blackbox instrumentation) or by having their users modify or annotate their apps’ source code (whitebox instrumentation). Long story short, this approach has drawbacks on both sides: vendor lock-in for users and, for vendors, the high cost of adding and maintaining support for an ever-increasing number of technologies and versions that need to be instrumented. Enter OpenTracing: vendor-neutral APIs and instrumentation for distributed tracing.
We’ll start this blog post series by introducing OpenTracing, explaining what it is and does, how it works, and why its adoption is growing. In subsequent posts we will cover Zipkin and then Jaeger, both popular distributed tracers, and finally compare Jaeger vs. Zipkin. If you prefer to read all four posts as a PDF, you can also download them as a free OpenTracing eBook. Alternatively, follow @sematext if you are into observability in general.
OpenTracing Basics explained
In a distributed system, a trace encapsulates the transaction’s state as it propagates through the system. During its journey, the transaction can create one or more spans. A span represents a single unit of work inside a transaction, for example, an RPC client/server call, sending a query to the database server, or publishing a message to the message bus. In terms of the OpenTracing data model, a trace can also be seen as a collection of spans structured as a directed acyclic graph (DAG). The edges indicate the causal relationships (references) between spans. A span is identified by its unique ID and may optionally include a parent identifier. If the parent identifier is omitted, we call that span the root span. A span also comprises a human-readable operation name and start and end timestamps. All spans of a trace are grouped under the same trace identifier.
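To make the data model more concrete, here is a minimal sketch in plain Python. It is not the OpenTracing API: the Span class, its field names and the helper functions are hypothetical and only mirror the concepts described above (trace identifier, span ID, optional parent identifier, operation name, timestamps).

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical span record; field names are illustrative only."""
    trace_id: str                      # shared by every span in the trace
    operation_name: str                # human-readable name of the unit of work
    parent_id: Optional[str] = None    # None marks the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start_time: float = field(default_factory=time.time)
    end_time: Optional[float] = None

    def finish(self) -> None:
        self.end_time = time.time()


def root_span(spans: List[Span]) -> Span:
    """Return the only span in the trace that has no parent."""
    roots = [s for s in spans if s.parent_id is None]
    if len(roots) != 1:
        raise ValueError("expected exactly one root span")
    return roots[0]


def children_of(spans: List[Span], parent: Span) -> List[Span]:
    """Return the spans whose parent identifier points at `parent`."""
    return [s for s in spans if s.parent_id == parent.span_id]


trace_id = uuid.uuid4().hex
http = Span(trace_id, "HTTP GET /orders")                      # root span
db = Span(trace_id, "SELECT orders", parent_id=http.span_id)   # child
mq = Span(trace_id, "publish order.viewed", parent_id=http.span_id)
for s in (db, mq, http):
    s.finish()
trace = [http, db, mq]
```

Calling root_span(trace) returns the HTTP span, and children_of(trace, http) returns the database and message bus spans, which is the parent/child structure a tracer UI would typically render as a tree.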
Spans may contain tags that represent contextual metadata relevant to a specific request. They consist of an unbounded sequence of key-value pairs, where keys are strings and values can be strings, numbers or booleans. Tags allow for context enrichment that may be useful for monitoring or debugging system behavior. While not mandatory, it’s highly recommended to follow the OpenTracing semantic conventions when naming tags. For example, we should assign the component tag to the framework, module or library that generates the span or spans, use peer.hostname and peer.port to describe target hosts, etc. Another reason for standardizing tags is that the tracer becomes aware of certain tags and can use them to add intelligence or give them special emphasis.
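As an illustration, tagging a span for an outgoing HTTP call could look like the sketch below. The tag names (component, peer.hostname, peer.port, http.method, http.status_code) come from the OpenTracing semantic conventions; the Span class and the host name are hypothetical stand-ins, not a real tracer implementation.

```python
class Span:
    """Hypothetical span that only stores tags."""

    def __init__(self, operation_name):
        self.operation_name = operation_name
        self.tags = {}

    def set_tag(self, key, value):
        if not isinstance(key, str):
            raise TypeError("tag keys must be strings")
        self.tags[key] = value
        return self  # returning self allows chained calls


span = Span("GET /users/42")
(span.set_tag("component", "http-client")          # library that produced the span
     .set_tag("peer.hostname", "users.internal")   # assumed target host
     .set_tag("peer.port", 8080)
     .set_tag("http.method", "GET")
     .set_tag("http.status_code", 200))
```

Sticking to the conventional names means a tracer backend can, for instance, group spans by component or highlight spans with error status codes without any per-application configuration.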
Besides tags, OpenTracing has a notion of log events. They represent timestamped annotations (usually textual, although not limited to textual content) that may be recorded during the lifetime of a span. Events can express any occurrence of interest to the active span, like a timer expiration, a cache miss, or the start of a build or deployment.
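The difference from tags is that each log event carries its own timestamp. A minimal sketch, assuming a hypothetical Span class with a log_kv method (the names are illustrative and not taken from a specific tracer):

```python
import time


class Span:
    """Hypothetical span that only stores log events."""

    def __init__(self, operation_name):
        self.operation_name = operation_name
        self.logs = []  # list of (timestamp, fields) tuples

    def log_kv(self, fields, timestamp=None):
        # Default to "now" when the caller does not supply a timestamp.
        ts = time.time() if timestamp is None else timestamp
        self.logs.append((ts, dict(fields)))
        return self


span = Span("load user profile")
# Explicit timestamps keep the example deterministic.
span.log_kv({"event": "cache miss", "key": "user:42"}, timestamp=100.0)
span.log_kv({"event": "timer expired", "timeout_ms": 250}, timestamp=100.25)
```

Because every entry is timestamped, the events can later be placed on the span's timeline, which is useful for seeing where inside a slow span the time was actually spent.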
Baggage items allow for cross-span propagation, i.e., they let us associate metadata with a span that also propagates to the future children of that span. In other words, the local data is transported along the full path as the request travels downstream through the system. However, this powerful feature should be used carefully because it can easily saturate network links if the propagated items are injected into many descendant spans.
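The following sketch contrasts baggage with tags. It assumes a hypothetical Span class in which a child copies its parent's baggage at creation time; real tracers carry baggage inside the span context, but the visible behavior is similar.

```python
import itertools

_ids = itertools.count(1)  # simple span ID generator for the sketch


class Span:
    """Hypothetical span: tags stay local, baggage is inherited."""

    def __init__(self, operation_name, parent=None):
        self.operation_name = operation_name
        self.span_id = next(_ids)
        self.parent_id = parent.span_id if parent else None
        self.tags = {}  # never copied to children
        # A child starts with a copy of whatever baggage the parent has now.
        self.baggage = dict(parent.baggage) if parent else {}

    def set_baggage_item(self, key, value):
        self.baggage[key] = value
        return self


root = Span("checkout")
early_child = Span("validate cart", parent=root)   # created before baggage is set
root.set_baggage_item("tenant", "acme")
root.tags["http.method"] = "POST"
child = Span("charge card", parent=root)           # a "future" child
grandchild = Span("call bank API", parent=child)
```

Here child and grandchild both see the tenant item, while early_child does not, and the http.method tag stays on the root span only. Every baggage item is copied onto each downstream hop, which is why large or numerous items get expensive quickly.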
At the time of writing, OpenTracing supports two types of relationships:
- ChildOf – expresses a causal reference between two spans. Continuing with our RPC scenario, the server-side span would be a ChildOf the initiator (request) span.
- FollowsFrom – used when the parent span doesn’t depend on the outcome of the child span. This relationship is usually used to model asynchronous executions like emitting messages to the message bus.
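The two reference types can be modeled as typed edges between spans. This is a sketch with hypothetical class and function names, intended only to show how the RPC and message bus scenarios above would be recorded.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ReferenceType(Enum):
    CHILD_OF = "child_of"
    FOLLOWS_FROM = "follows_from"


@dataclass
class Reference:
    type: ReferenceType
    span_id: str  # ID of the span being referenced


@dataclass
class Span:
    span_id: str
    operation_name: str
    references: List[Reference] = field(default_factory=list)


def parent_awaits(parent: Span, child: Span) -> bool:
    """True when `child` holds a ChildOf reference to `parent`."""
    return any(
        ref.type is ReferenceType.CHILD_OF and ref.span_id == parent.span_id
        for ref in child.references
    )


# Synchronous RPC: the client waits for the server's result.
client = Span("a1", "rpc client: GetOrder")
server = Span("b2", "rpc server: GetOrder",
              [Reference(ReferenceType.CHILD_OF, client.span_id)])

# Asynchronous messaging: the producer does not wait for the consumer.
producer = Span("c3", "publish order.created")
consumer = Span("d4", "consume order.created",
                [Reference(ReferenceType.FOLLOWS_FROM, producer.span_id)])
```

With this model, parent_awaits(client, server) is true while parent_awaits(producer, consumer) is false, even though both pairs are causally linked.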
What OpenTracing aims to offer
OpenTracing aims to offer a consistent, unified and tracer-agnostic instrumentation API for a wide range of frameworks, platforms and programming languages. It abstracts away the differences among numerous tracer implementations, so switching from an existing tracer to a new one would only require configuration changes specific to the new tracer. It is also worth mentioning the benefits of distributed tracing:
- out-of-the-box infrastructure overview: how services interact and what their dependencies are
- efficient and fast detection of latency issues
- intelligent error reporting: spans transport error messages and stack traces, and we can use that insight to identify root causes or cascading failures
- trace data can be forwarded to log processing platforms for query and analysis
How OpenTracing works
The OpenTracing API is modeled around three fundamental types:
- Tracer – knows how to create a new span as well as inject/extract span contexts across process boundaries. All OpenTracing-compatible tracers must provide clients with an implementation of the Tracer interface.
- Span – the tracer’s build method yields a newly created span. Once the span has been started, we can invoke a number of operations on it, like adding tags, changing the span’s operation name, binding references to other spans, adding baggage items, etc.
- SpanContext – consumers of the API only interact with this type when injecting the span context into, or extracting it from, the transport protocol.
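The sketch below shows how these types could fit together. It is a toy in-memory model, not an actual OpenTracing binding: InMemoryTracer, start_span and the other names are assumptions chosen for illustration, and real language bindings differ in their details.

```python
import uuid


class SpanContext:
    """The part of a span that is meant to cross process boundaries."""

    def __init__(self, trace_id, span_id, baggage=None):
        self.trace_id = trace_id
        self.span_id = span_id
        self.baggage = dict(baggage or {})


class Span:
    def __init__(self, tracer, operation_name, context, parent_id=None):
        self.tracer = tracer
        self.operation_name = operation_name
        self.context = context
        self.parent_id = parent_id
        self.tags = {}

    def set_operation_name(self, name):
        self.operation_name = name
        return self

    def set_tag(self, key, value):
        self.tags[key] = value
        return self

    def finish(self):
        # Hand the finished span back to the tracer for reporting.
        self.tracer.reported.append(self)


class InMemoryTracer:
    """Hypothetical tracer that keeps finished spans in a list."""

    def __init__(self):
        self.reported = []

    def start_span(self, operation_name, child_of=None):
        span_id = uuid.uuid4().hex[:16]
        if child_of is None:
            # No parent: start a new trace.
            return Span(self, operation_name, SpanContext(uuid.uuid4().hex, span_id))
        parent_ctx = child_of.context
        ctx = SpanContext(parent_ctx.trace_id, span_id, parent_ctx.baggage)
        return Span(self, operation_name, ctx, parent_id=parent_ctx.span_id)


tracer = InMemoryTracer()
parent = tracer.start_span("handle request")
child = tracer.start_span("query db", child_of=parent).set_tag("db.type", "sql")
child.finish()
parent.set_operation_name("GET /orders").finish()
```

Application code only ever talks to the tracer and span objects, so replacing InMemoryTracer with another implementation that offers the same methods would not require touching the instrumented code. That is the tracer-agnostic idea in miniature.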
Distributed Context Propagation
One of the most compelling and powerful features attributed to tracing systems is distributed context propagation. Context propagation composes the causal chain and dissects the transaction from inception to finalization – it illuminates the request’s path until its final destination.
From a technical point of view, context propagation is the ability of a system or application to extract the propagated span context from a variety of carriers, like HTTP headers, AMQP message headers or Thrift fields, and then join the trace from that point. Context propagation is very efficient since it only involves propagating identifiers and baggage items. All other metadata, like tags and logs, isn’t propagated but transmitted asynchronously to the tracer system. It’s the responsibility of the tracer to assemble the full trace from distinct spans that might be injected in-band or out-of-band.
OpenTracing standardizes context propagation across process boundaries through the Inject/Extract pattern.
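A minimal sketch of the Inject/Extract idea, using a plain dictionary as the HTTP header carrier. The header names and the inject and extract functions are made up for this example; each real tracer defines its own wire format.

```python
class SpanContext:
    def __init__(self, trace_id, span_id, baggage=None):
        self.trace_id = trace_id
        self.span_id = span_id
        self.baggage = dict(baggage or {})


# Hypothetical header names, chosen for this sketch only.
TRACE_HEADER = "x-demo-trace-id"
SPAN_HEADER = "x-demo-span-id"
BAGGAGE_PREFIX = "x-demo-baggage-"


def inject(context, carrier):
    """Write identifiers and baggage into a dict-like carrier."""
    carrier[TRACE_HEADER] = context.trace_id
    carrier[SPAN_HEADER] = context.span_id
    for key, value in context.baggage.items():
        carrier[BAGGAGE_PREFIX + key] = value


def extract(carrier):
    """Rebuild a span context from the carrier, or return None if absent."""
    headers = {k.lower(): v for k, v in carrier.items()}
    if TRACE_HEADER not in headers or SPAN_HEADER not in headers:
        return None
    baggage = {
        k[len(BAGGAGE_PREFIX):]: v
        for k, v in headers.items()
        if k.startswith(BAGGAGE_PREFIX)
    }
    return SpanContext(headers[TRACE_HEADER], headers[SPAN_HEADER], baggage)


# Client side: add the context to the outgoing request headers.
outgoing = {"Content-Type": "application/json"}
inject(SpanContext("abc123", "span-1", {"tenant": "acme"}), outgoing)

# Server side: recover the context and continue the same trace.
incoming = extract(outgoing)
```

Only the trace identifier, the span identifier and the baggage items travel with the request. Tags and logs are absent from the carrier, which matches the point above about propagation staying cheap.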
Why OpenTracing adoption is growing
As organizations embrace the cloud-native movement and migrate their applications from monolithic to microservice architectures, general visibility and observability into software behavior becomes an essential requirement. Because the monolithic code base is split into multiple independent services running inside their own processes, which can additionally be scaled to multiple instances, a seemingly trivial task such as diagnosing the latency of an HTTP request issued from the client can become a serious challenge. To fulfill the request, it has to propagate through load balancers, routers and gateways, cross machine boundaries to communicate with other microservices, send asynchronous messages to message brokers, etc.
Anywhere along this pipeline, there could be a bottleneck, contention or a communication issue in any of the aforementioned components. Debugging such a complex workflow wouldn’t be feasible without some kind of tracing/instrumentation mechanism. That’s why distributed tracers like Zipkin, Jaeger or AppDash were born (most of them inspired by Dapper, Google’s large-scale distributed tracing platform). All of these tracers help engineers and operations teams understand and reason about system behavior as the complexity of the infrastructure grows. Tracers expose the source of truth for the interactions originating within the system. Every transaction (if properly instrumented) can reveal performance anomalies at an early stage, when new services are being introduced by (probably) independent teams with polyglot software stacks and continuous deployments.
The cloud-native paradigm is creating a new reality and mindset for how software is built and deployed. Instead of static VM-centric infrastructures, containers are first-class citizens in the world of programmable, automated, immutable infrastructure as code. Deployments are continuous and huge monolithic code bases are split into multiple independent microservices. Gaining deep visibility into our software stack is of critical importance. OpenTracing is paving the way to make the lives of developers and DevOps engineers easier by helping us narrow down root causes or expose opportunities for optimization. Despite still being relatively young, OpenTracing is being widely adopted by big vendors as well as small organizations and teams. We are looking forward to seeing OpenTracing become a universal tracing standard where each software component (be it a web framework, application server or message broker) ships with built-in support for OpenTracing instrumentation to assemble and propagate traces out of the box. In our next post, we cover Zipkin as an OpenTracing-compatible distributed tracer.