
Running OpenTelemetry at Scale: Architecture Patterns for 100s of Services

Updated on: March 3, 2026


It feels great getting OpenTelemetry working in a demo environment. Spans appear, metrics flow, you connect it to a backend, and everything lights up in a satisfying cascade. You write the internal doc and present it to the team, but it is only a matter of time before somebody asks: “Great, so how do we roll this out to all 100 services?” If you are at that point in your OTel journey, this article will help you take OTel to production.

Running OTel across a handful of services and running it across a few hundred are genuinely different problems. The instrumentation part stays roughly the same. Everything around it — how you collect the data, how you route it, how you make sure a traffic spike in one region does not take down your entire observability pipeline — that is where teams either build something resilient or spend the next six months fire-fighting because of inadequate planning or suboptimal architecture.

I wrote this article to share the patterns that actually hold up at scale: collector tiers, load balancing strategies, sampling at volume, and multi-cluster setups. Everything comes with real config examples because “it depends” is only useful advice if you can see what it depends on.

See How OpenTelemetry changes the way teams do observability for why OpenTelemetry matters and how it shifts focus from traditional metrics and logs to full, end-to-end observability.

Why a Single Collector Falls Apart (and When)

Most OTel tutorials show you a single collector instance receiving spans from all your services and forwarding everything to a backend. That setup works until about the point where it stops working, which tends to happen quietly and at the worst possible time. You are not going to notice a single collector struggling until it is already dropping data, buffering is maxed out, and your traces have gaps you cannot explain.

The core issue is that a single collector is both a single point of failure and a resource bottleneck. At low traffic it sits there looking fine. Add a few dozen services, let traffic spike during a product launch or a retry storm, and you will watch it start falling behind. The exporter queue fills up. Backpressure kicks in. Services start dropping spans rather than blocking on the export. By the time anyone notices, you have lost the exact telemetry you needed to understand what just happened.

The failure mode is silent. When a collector falls behind, it does not usually crash spectacularly. It drops spans without loud errors, your traces become incomplete, and your dashboards show suspiciously clean latency numbers because the slow requests stopped being recorded. If your p99 looks unexpectedly healthy during an incident, check your collector queue depth before trusting it.

The solution is to stop thinking about the collector as a single process and start thinking about it as a tier. Two tiers cover most production scenarios. Three tiers cover the rest. The architecture you need depends on your traffic, whether you need tail-based sampling, and how many backends you are exporting to.

Let me make this more specific: if you have fewer than 20 services and under 500 requests per second total, a single well-configured collector will likely hold up (yes, of course it depends on the underlying hardware/resources). At 20 to 80 services or 500 to 5,000 RPS, the two-tier model becomes worthwhile. Above 80 services or 5,000 RPS, you need the full tiered setup with trace-aware load balancing and tail-based sampling at the gateway. 

For more information on common production pitfalls and strategies to prevent them, see OpenTelemetry Production Monitoring: What Breaks and How to Prevent It.

Collector Tiers: The Architecture That Actually Scales

The tiered collector model separates two concerns that should never have been combined in the first place: getting data off your services quickly, and doing something intelligent with that data before it hits your backend.

Before getting into the architecture, it helps to know that the OTel Collector can run in three modes — and in a scaled setup, you will use all three:

  • Agent — a collector running on the same host as your services, collecting telemetry locally and forwarding it upstream. It stays thin: no heavy processing, just receive-and-forward.
  • Gateway — a collector running as a standalone service, receiving data from agents (or directly from SDKs) and doing the heavier work: sampling, routing, fan-out to backends, attribute redaction.
  • Combined — the full pattern, where agent collectors feed into gateway collectors. Agents handle what only makes sense per-host (host metrics, file logs, resource detection). Gateways handle what only makes sense centrally (tail-based sampling, cross-service routing, policy management). The OTel Collector deployment docs call this the combined deployment pattern.

The tiered setup this article describes is the combined pattern. Here is what it looks like:

TWO-TIER COLLECTOR ARCHITECTURE

SERVICES:           [App A]    [App B]    [App C]   ...   [App N]
                       |          |          |              |
TIER 1 (AGENTS):    [Agent]    [Agent]    [Agent]   ...   [Agent]
                       \__________\__________/______________/
                                      |
TIER 2 (GATEWAYS):    [Gateway Collector (HA)]   [Gateway Collector (HA)]
                                      |
BACKENDS:             [Traces]     [Metrics]     [Logs]
Tier 1 agents sit close to services and do minimal work. Tier 2 gateways handle sampling, routing, and backend fan-out.

Tier 1: Collectors running as agents

The agent tier runs next to your services, either as a sidecar or, more commonly in Kubernetes, as a DaemonSet with one collector per node. Its job is exactly one thing: receive telemetry from the services and forward it as fast as possible. No tail-based sampling, no complex routing logic, no fan-out to multiple backends. The only processing you want at this tier is cheap and stateless: adding resource attributes like cluster name, node name, and environment; batching spans to reduce connection overhead; and basic filtering to drop genuinely worthless spans, such as health check endpoints generating thousands of spans per minute and telling you nothing.

Only stamp resource attributes that are low-cardinality and apply to the whole node or pod — things like environment, cluster name, and region. Adding high-cardinality values like user IDs or request IDs as resource attributes will explode your metrics storage, because each unique value becomes a separate time series.

TIER 1 AGENT COLLECTOR CONFIG
# Tier 1: runs as DaemonSet, minimal processing
receivers:
  otlp:
    protocols:
      grpc: {endpoint: "0.0.0.0:4317"}
      http: {endpoint: "0.0.0.0:4318"}

processors:
  batch:                    # batch before forwarding
    send_batch_size: 1024
    timeout: 5s
  resourcedetection:        # stamp node/pod metadata
    detectors: [k8snode, env]
  filter/drop_healthchecks:
    spans:
      exclude:
        match_type: regexp
        attributes:
          - {key: http.route, value: ".*/health.*"}

exporters:
  otlp:
    # forward to gateway tier, not directly to backend
    endpoint: "otel-gateway:4317"
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 500

service:
  pipelines:
    traces:
      receivers:  [otlp]
      # batch runs last so fully processed spans are batched for export
      processors: [resourcedetection, filter/drop_healthchecks, batch]
      exporters:  [otlp]
Agent config stays thin. Anything heavier than batching and attribute stamping belongs in the gateway tier.

Tier 2: Collectors running as gateways

The gateway tier is where the interesting work happens: tail-based sampling, fan-out to multiple backends, and the routing logic that sends traces, metrics, and logs where they need to go. Once you introduce a gateway tier, it needs careful resource sizing. In practice, that means running at least two gateway collectors behind a load balancer to avoid single points of failure.

How you deploy them depends on your environment. In Kubernetes, that typically means a Deployment scaled by load rather than node count. In a VM-based setup, two or more collector processes behind a hardware or software load balancer works just as well. The important thing is that the gateway tier scales horizontally based on traffic, not based on how many hosts you have.

Two to four instances is a reasonable starting point for a deployment handling roughly 1,000 to 5,000 spans per second across 20 to 50 services. Beyond that, sizing should be driven primarily by your tail-based sampling configuration — specifically the decision_wait window and the num_traces value — which determine how much trace state each gateway must hold in memory.
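To make the division of labor concrete, here is a minimal sketch of a gateway-tier config: heavier processing plus fan-out to separate backends. The backend endpoints and exporter names are placeholders, and receivers/processors are trimmed to the essentials; a real gateway would also carry the tail_sampling processor shown later in this article.

```yaml
# Tier 2 gateway sketch: protect memory, batch aggressively, fan out
receivers:
  otlp:
    protocols:
      grpc: {endpoint: "0.0.0.0:4317"}

processors:
  memory_limiter:            # first in the pipeline: shed load before OOM
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  batch:
    send_batch_size: 8192
    timeout: 5s

exporters:
  otlp/traces:
    endpoint: "traces-backend:4317"    # placeholder
  otlp/metrics:
    endpoint: "metrics-backend:4317"   # placeholder

service:
  pipelines:
    traces:
      receivers:  [otlp]
      processors: [memory_limiter, batch]
      exporters:  [otlp/traces]
    metrics:
      receivers:  [otlp]
      processors: [memory_limiter, batch]
      exporters:  [otlp/metrics]
```

The memory_limiter sits first so the gateway refuses data instead of being OOM-killed when a spike outruns the exporters.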

Load Balancing: The Subtle Trap with Tail-Based Sampling

If you are using tail-based sampling and running multiple gateway collector instances, standard round-robin load balancing will silently break your sampling decisions. Tail-based sampling works by collecting all spans for a given trace and then making a single keep-or-drop decision once the trace is complete. With round-robin, spans for the same trace end up scattered across different collector instances. Each instance only sees a fragment, so no instance ever has enough context to make a valid decision.

The symptom is traces that look complete but are not. You will see traces that hit your sampling rate but are missing spans from certain services, because those spans went to a different collector instance that independently decided to drop its fragment. This is one of the harder things to debug because the data loss is structured rather than random.

The solution is trace-aware load balancing, where spans are routed to gateway instances based on their trace ID. The OTel Collector has a loadbalancing exporter built for exactly this. It consistently hashes trace IDs to the same downstream collector, which means all spans for a given trace always end up in the same place regardless of which agent they came from.

LOAD BALANCING EXPORTER CONFIG — AGENT TIER
exporters:
  loadbalancing:
    routing_key: "traceID"   # hash by trace ID, not round-robin
    resolver:
      k8s:                    # auto-discover gateway pods via the Kubernetes API
        service: "otel-gateway"
        ports: [4317]
    protocol:
      otlp:
        timeout: 1s
        sending_queue:
          enabled: true
          queue_size: 1000
The k8s resolver watches the gateway headless service and automatically updates routing when pods scale up or down.

Gateway restarts or scale-in events can occasionally produce incomplete traces; see the OTel Collector scaling documentation for details.

Sampling Strategies at Volume: Picking the Right One

At small scale, sampling feels like an optional optimization. At large scale, it is a financial and operational necessity. Sending 100 percent of traces from a service handling 10,000 requests per second generates a staggering volume of data, most of which you will never look at. This is not too different from logs – for example, Sematext’s log pipeline contains the Sampling Processor for the same reason. Getting sampling right means you keep the traces that help you debug real incidents and drop the ones that would just sit there consuming storage.

The tricky part is that “keep the useful traces” is not as simple as it sounds. The traces you most need to keep are the ones with errors and high latency, which are often a small fraction of total traffic. If you use pure random sampling at 1 percent, you will statistically drop 99 percent of your error traces along with everything else. That is the core tension that drives the choice between head-based and tail-based sampling.

SAMPLING STRATEGY COMPARISON

| Strategy      | Where          | Keeps errors | Memory cost | Best for                             | What it does                         |
|---------------|----------------|--------------|-------------|--------------------------------------|--------------------------------------|
| Always-on     | SDK            | Yes          | High        | Dev / staging only                   | Keep all spans, no sampling          |
| Parent-based  | SDK            | Inherits     | Low         | Consistent decisions across services | Keep/drop based on parent trace      |
| Probabilistic | SDK/Collector  | No           | Low         | Volume reduction on healthy traffic  | Randomly keep spans at a fixed rate  |
| Rate-limiting | Collector      | No           | Low         | Capping ingest cost during spikes    | Keep spans until a fixed rate limit  |
| Tail-based    | Collector (GW) | Yes          | High        | Error-aware sampling at scale        | Keep spans based on errors & latency |
Most production deployments combine parent-based sampling at the SDK with tail-based sampling at the gateway tier.

The combination that works at scale

Parent-based sampling means the sampling decision is made once at the root span (the first service that receives the request), and every downstream service in that trace inherits the same decision automatically. You never end up with a trace where some spans were kept and others were dropped by different services making independent choices.

Use parent-based sampling at the SDK level to reduce overall span volume before it even reaches the collector, then use tail-based sampling at the gateway tier to make intelligent keep-or-drop decisions on what makes it through. Two passes of selection — aggressive on volume, smart about what survives.

A concrete example: set parent-based sampling at 10 percent for general traffic at the SDK. At the gateway, keep 100 percent of error traces, 100 percent of traces exceeding your latency SLO, and 10 percent of everything else. You end up storing roughly 11 to 12 percent of total trace volume, but with near-complete coverage of the production incidents you actually need to investigate.
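At the SDK level, the 10 percent parent-based policy can usually be enabled without code changes through the standard OTel environment variables defined in the SDK configuration spec. The agent endpoint below is a placeholder for your own Tier 1 address:

```shell
export OTEL_TRACES_SAMPLER=parentbased_traceidratio        # parent-based, ratio at the root
export OTEL_TRACES_SAMPLER_ARG=0.1                         # sample 10% of root spans
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-agent:4317  # placeholder agent address
```

Because the decision is encoded in the propagated trace context, downstream services configured the same way inherit it automatically.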

TAIL SAMPLING POLICY CONFIG — GATEWAY TIER
processors:
  tail_sampling:
    decision_wait: 10s      # wait for all spans before deciding
    num_traces: 100000      # traces held in memory simultaneously
    expected_new_traces_per_sec: 1000
    policies:
      # always keep error traces
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}

      # always keep slow traces (adjust threshold to your SLO)
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 500}

      # keep 100% of checkout and payment — business critical
      - name: keep-critical-services
        type: string_attribute
        string_attribute:
          key: service.name
          values: [checkout-api, payment-service]

      # probabilistic baseline for everything else
      - name: baseline-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
Policies are evaluated in order. A trace is kept if any policy matches. The probabilistic baseline catches everything the specific policies did not select.

Memory sizing for tail-based sampling

The num_traces parameter is the one that will bite you if you undershoot it. It controls how many traces the gateway holds in memory simultaneously while waiting for all their spans to arrive. A rough formula: multiply your expected traces per second by your decision_wait value, then add 20 percent headroom. For 1,000 traces per second with a 10 second wait, you need at least 12,000 slots — not the 1,000 that most tutorial configs show.

The tail sampling processor documentation has the full parameter reference including the memory limiter integration, which you absolutely want enabled at the gateway tier to prevent OOM kills during traffic spikes.

Multi-Cluster Setups: When One Pipeline Is Not Enough

At some point, a single OTel pipeline stops being the right model. Maybe you operate in multiple regions with data residency requirements. Maybe you have a mix of Kubernetes clusters running different workloads with different SLOs. Whatever the reason, multi-cluster OTel setups introduce a layer of complexity that single-cluster thinking does not prepare you for.

The fundamental question is where aggregation happens. Aggregate within each cluster and forward summarized telemetry to a global backend, and you keep cross-region bandwidth low but lose the ability to do cross-cluster trace correlation. Forward raw telemetry to a central aggregation layer, and you get full correlation capability at significantly higher egress cost. Most organizations end up with a hybrid: metrics and logs aggregate locally, traces are forwarded to a central tier for correlation.
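A regional gateway implementing that hybrid might split its pipelines like this. The sketch assumes placeholder endpoints and omits receivers/processors for brevity; the point is only that traces and metrics take different exits:

```yaml
# Regional gateway sketch: traces go to the central tier for
# cross-cluster correlation, metrics stay in-region
exporters:
  otlp/central:
    endpoint: "central-gateway.example.com:4317"   # placeholder
  otlp/regional:
    endpoint: "regional-metrics-backend:4317"      # placeholder

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/central]
    metrics:
      receivers: [otlp]
      exporters: [otlp/regional]
```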

Getting trace context across cluster boundaries

Cross-cluster trace correlation only works if your services propagate the W3C traceparent header across cluster boundaries. Internal service mesh traffic usually handles this correctly. However, cross-cluster calls that pass through an API gateway, CDN, or any reverse proxy that strips unknown headers will break trace continuity at that boundary.

Diagnosing this is straightforward: if you see a trace starting at an API gateway span and the first downstream service shows a different root span with no parent, there’s a propagation break. To fix it, add traceparent and tracestate to your proxy’s header allowlist.

Here is what that looks like in the two most common cases:

PROXY HEADER CONFIG — NGINX AND ENVOY
# nginx — add inside your proxy_pass block
proxy_set_header traceparent $http_traceparent;
proxy_set_header tracestate  $http_tracestate;

---

# Envoy: forward the incoming values via request_headers_to_add
# (a header entry needs a value; %REQ(...)% echoes the inbound header)
route_config:
  request_headers_to_add:
    - header: { key: traceparent, value: "%REQ(traceparent)%" }
      append_action: OVERWRITE_IF_EXISTS_OR_ADD
    - header: { key: tracestate, value: "%REQ(tracestate)%" }
      append_action: OVERWRITE_IF_EXISTS_OR_ADD
One of these two covers the vast majority of cases. If you are behind a CDN, check their documentation for custom header passthrough settings.

Data residency and the GDPR headache

If you operate in the EU, forwarding raw traces containing user identifiers to a central tier outside the EU can be a compliance problem. The practical solution is to run attribute redaction in your regional gateway before any data leaves the region. The OTel Collector’s transform processor lets you hash, mask, or drop specific attributes before export.

PII REDACTION CONFIG — EU GATEWAY PROCESSOR
processors:
  transform/redact_pii:
    trace_statements:
      - context: span
        statements:
          # hash user IDs rather than drop
          - set(attributes["user.id"], SHA256(attributes["user.id"]))
          # drop email entirely
          - delete_key(attributes, "user.email")
          # truncate IP to /24 for geo without individual tracking
          - replace_pattern(attributes["net.peer.ip"], "\\d+$", "0")
Run PII redaction at the regional gateway, not the central tier. By the time data reaches central, sensitive attributes should already be gone.

Keeping the Pipeline Itself Observable

It would be ironic if the observability tooling itself could not be observed. The OTel Collector exposes its own internal metrics, by default on a Prometheus endpoint, which means you can route them to any backend or observability solution you are already using.

otelcol_processor_batch_timeout_trigger_send (gotta love this long property name!) tells you whether the batch processor is flushing because the timeout fired rather than because the batch was full. A high ratio of timeout-triggered flushes means your traffic volume is lower than your batch config expects, and you are adding unnecessary latency.

otelcol_exporter_queue_size is the canary for backpressure. When it climbs toward your configured maximum, your exporter is falling behind the ingest rate. If it hits the maximum, the collector starts dropping data. Set an alert at 80 percent of queue capacity and you will catch pressure building before it becomes data loss.
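As a sketch, that 80 percent threshold can be expressed as a Prometheus alerting rule. It assumes the collector also exports otelcol_exporter_queue_capacity (available in recent collector versions); if yours does not, substitute your configured queue_size as a constant:

```yaml
groups:
  - name: otel-collector
    rules:
      - alert: CollectorExporterQueueFilling
        # queue depth relative to configured capacity
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Exporter queue above 80% on {{ $labels.exporter }}"
```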

otelcol_processor_tail_sampling_sampling_decision_timer_latency (another awesome long name!) tells you how long the tail sampling processor is taking to make decisions. A sudden increase here usually means the number of active traces in memory has grown past what the processor can efficiently scan — either increase resources or tighten your sampling policy.

COLLECTOR SELF-MONITORING CONFIG
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector
          scrape_interval: 15s
          static_configs:
            - targets: ["localhost:8888"]

# Expose collector's own telemetry via its service config
service:
  telemetry:
    metrics:
      level: detailed   # basic | normal | detailed
      address: 0.0.0.0:8888
    logs:
      level: warn       # keep collector logs quiet in production

Set telemetry level to ‘detailed’ in staging to understand baseline behavior, then dial back to ‘normal’ in production.

Rolling This Out Without Breaking Everything

The migration path from a single collector to a tiered setup does not have to be a big-bang cutover. You could introduce the gateway tier first while keeping the existing single collector in place, route a small percentage of services to the new tier, and validate that data is flowing correctly before moving everything over.

I suggest you start with a non-critical service — one that has decent traffic but where gaps in telemetry during the migration window would not cause anyone to lose sleep. Verify spans arrive at the gateway, verify they arrive at the backend with the right resource attributes, and check that your tail sampling policies are making sensible decisions. That validation loop is worth running for a week before you touch any of your critical services.

The config change on the service side is usually just updating the OTLP endpoint to the new agent address. If you are using the OTel Operator for Kubernetes, you can inject the agent endpoint as an environment variable through the Instrumentation custom resource — no application code changes, no redeployment of service configs when the collector topology changes.
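For reference, a minimal Instrumentation resource might look like the following. The endpoint and sampler values are illustrative; adjust them to your agent address and the sampling policy discussed earlier:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
spec:
  exporter:
    endpoint: http://otel-agent:4317   # placeholder Tier 1 agent address
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"                    # 10% at the root, inherited downstream
```

Pods annotated for auto-instrumentation pick this up at injection time, so changing collector topology becomes a change to one resource instead of every service config.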

The pattern across all of this — tiered collectors, trace-aware load balancing, layered sampling strategies, regional pipelines — is that scaling OTel is fundamentally an architecture problem, not an instrumentation problem. The instrumentation is the relatively easy part. The hard part is building a pipeline that stays operational under load, degrades gracefully when individual components have problems, and gives you enough visibility into itself that you can tell when something is wrong before it starts affecting the data your engineers depend on during incidents.

Once your OpenTelemetry pipeline is running at scale, the next step is learning how to interpret the traces to identify performance bottlenecks and root causes. See Troubleshooting Microservices with OpenTelemetry Distributed Tracing for in-depth, practical guidance on that subject.
