Tracing Cost Optimization¶
Distributed tracing can generate significant data volumes, especially in high-traffic applications. This guide provides strategies to optimize your tracing costs while maintaining essential observability.
Note: Tracing costs are driven primarily by the volume of trace data ingested and stored, so any data you filter out during collection or drop through sampling directly lowers your costs. Sampling must be configured in your applications, because the Sematext Agent forwards every trace it receives.
Sampling Strategies¶
Sampling is the most effective way to reduce tracing costs while maintaining statistical significance. For detailed configuration instructions, see the Sampling Configuration Guide.
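To see how a sampling rate translates into ingested volume, a rough back-of-the-envelope estimate can help. The sketch below is plain arithmetic; the function name, the traffic figure, and the spans-per-trace figure are illustrative assumptions, not Sematext numbers.

```javascript
// Rough estimate of spans ingested per day after head-based sampling.
// All inputs are illustrative assumptions; substitute your own measurements.
function estimateDailySpans({ requestsPerSecond, spansPerTrace, samplingRate }) {
  const secondsPerDay = 24 * 60 * 60;
  const tracesPerDay = requestsPerSecond * secondsPerDay;
  return Math.round(tracesPerDay * spansPerTrace * samplingRate);
}

const unsampled = estimateDailySpans({ requestsPerSecond: 500, spansPerTrace: 8, samplingRate: 1 });
const sampled = estimateDailySpans({ requestsPerSecond: 500, spansPerTrace: 8, samplingRate: 0.05 });
console.log(unsampled); // 345600000
console.log(sampled);   // 17280000
```

Under these assumed inputs, a 5% rate cuts daily span volume by a factor of 20.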
Head-Based Sampling (Recommended Start)¶
Sample traces at the application level before they're sent:
Production Environments:
- High-traffic services: Keep 1-5% of traces (0.01-0.05 sampling rate)
- Medium-traffic services: Keep 5-10% of traces (0.05-0.1 sampling rate)
- Low-traffic services: Keep 10-50% of traces (0.1-0.5 sampling rate)
Development/Staging:
- Use 100% sampling for complete visibility during testing
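One way to apply these rates is through the standard OpenTelemetry sampler environment variables (`OTEL_TRACES_SAMPLER` and `OTEL_TRACES_SAMPLER_ARG`), for SDKs that honor them. The helper below is a minimal sketch: the function name, the tier names, and the specific rates picked from each range are assumptions made for illustration.

```javascript
// Maps an environment and traffic tier to OpenTelemetry sampler environment variables.
// Tier names and the chosen rates are assumptions taken from the ranges above.
const HEAD_SAMPLING_RATES = {
  high: 0.05,  // upper end of 1-5%
  medium: 0.1, // upper end of 5-10%
  low: 0.5,    // upper end of 10-50%
};

function samplerEnv(environment, trafficTier) {
  if (environment !== 'production') {
    // Development/staging: keep every trace.
    return { OTEL_TRACES_SAMPLER: 'always_on' };
  }
  const rate = HEAD_SAMPLING_RATES[trafficTier];
  if (rate === undefined) {
    throw new Error(`Unknown traffic tier: ${trafficTier}`);
  }
  return {
    // Root spans are sampled by trace ID ratio; child spans follow their parent.
    OTEL_TRACES_SAMPLER: 'parentbased_traceidratio',
    OTEL_TRACES_SAMPLER_ARG: String(rate),
  };
}

console.log(samplerEnv('production', 'high'));
```

The returned object would be exported into the service's process environment before the SDK starts.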
Service-Specific Sampling¶
Apply different sampling rates based on service importance:
Critical Services (Higher Sampling):
- Payment processing: 50-100%
- Authentication: 20-50%
- Core APIs: 10-20%
Support Services (Lower Sampling):
- Health checks: 100% (always sample - low volume, critical error detection)
- Static content: 0.1-1%
- Internal utilities: 1-5%
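Service-specific rates can be expressed as a small rule table. The following is a sketch only: the route prefixes, the rule format, and the function names are hypothetical, and in a real OpenTelemetry setup this logic would typically live inside a custom sampler rather than a standalone function.

```javascript
// Hypothetical rule table: the first matching rule wins.
// Rates are picked from the ranges listed above.
const SAMPLING_RULES = [
  { match: /^\/health/, rate: 1.0 },     // health checks: always keep
  { match: /^\/payments\//, rate: 0.5 }, // payment processing: 50-100%
  { match: /^\/auth\//, rate: 0.2 },     // authentication: 20-50%
  { match: /^\/static\//, rate: 0.01 },  // static content: 0.1-1%
];
const DEFAULT_RATE = 0.1;                // core APIs: 10-20%

function rateFor(path) {
  const rule = SAMPLING_RULES.find((r) => r.match.test(path));
  return rule ? rule.rate : DEFAULT_RATE;
}

// `random` is injectable so the decision can be tested deterministically.
function shouldSample(path, random = Math.random) {
  return random() < rateFor(path);
}

console.log(rateFor('/payments/charge')); // 0.5
console.log(rateFor('/api/users'));       // 0.1
```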
Attribute and Data Optimization¶
Remove Unnecessary Attributes¶
Filter out attributes that don't provide actionable insights:
High-Volume, Low-Value Attributes:
- Large request/response bodies
- Detailed stack traces for successful operations
- Verbose debug information
- Redundant metadata
OpenTelemetry Attribute Filtering:
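// Example: record a summary instead of a large attribute
const { trace } = require('@opentelemetry/api');
const tracer = trace.getTracer('example-service'); // tracer name is illustrative
const span = tracer.startSpan('operation');
// Don't add large request bodies
// span.setAttributes({ 'http.request.body': largeBody }); // Avoid this
// Instead, add size or summary information
// (largeBody is the request body string, assumed to be in scope)
span.setAttributes({
  'http.request.size': largeBody.length, // length in characters
  'http.request.type': 'json'
});
span.end(); // end the span so it can be exported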
Optimize Span Names and Operations¶
Use consistent, concise span names:
Good Examples:
- GET /api/users/{id}
- database.query
- payment.process
Avoid:
- GET /api/users/12345 (high cardinality)
- Very long descriptive operation name with lots of details
- Dynamic span names with timestamps or IDs
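When a framework does not supply a route template, ID-like path segments can be replaced with placeholders before the span is named. This is a minimal sketch: the function name and the two patterns (numeric IDs and UUIDs) are assumptions, and a route template provided by your framework is usually preferable when one is available.

```javascript
// Replaces ID-like path segments with placeholders to keep span names low-cardinality.
// The patterns below (numeric IDs and UUIDs) are assumptions; extend them for your routes.
const UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC = /^\d+$/;

function normalizeSpanName(method, path) {
  const pathOnly = path.split('?')[0]; // drop the query string
  const segments = pathOnly.split('/').map((segment) => {
    if (NUMERIC.test(segment)) return '{id}';
    if (UUID.test(segment)) return '{uuid}';
    return segment;
  });
  return `${method.toUpperCase()} ${segments.join('/')}`;
}

console.log(normalizeSpanName('get', '/api/users/12345')); // GET /api/users/{id}
```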
Infrastructure and Agent Optimization¶
Agent Configuration¶
For agent-specific optimizations, see the Agent OpenTelemetry Configuration guide.
Service and Environment Strategies¶
Environment Separation¶
Production:
- Lower sampling rates (1-10%)
- Focus on error and performance traces
- Longer retention for critical paths
Staging:
- Medium sampling rates (10-50%)
- Comprehensive error tracking
- Shorter retention periods
Development:
- Higher sampling rates or 100% sampling
- Full attribute collection for debugging
- Shortest retention periods
Service Prioritization¶
Tier 1 (Business Critical):
- Customer-facing APIs
- Payment processing
- Authentication services
- Sampling: Keep 10-50% of traces
Tier 2 (Important):
- Internal APIs
- Data processing
- Integration services
- Sampling: Keep 5-20% of traces
Tier 3 (Supporting):
- Health checks
- Metrics collection
- Background tasks
- Sampling: Keep 1-5% of traces (except health checks - keep 100%)
Retention and Plan Optimization¶
Data Retention Strategy¶
Short-Term (1-7 days):
- High-volume, low-priority traces
- Development environment traces
- Automated testing traces
Medium-Term (7-30 days):
- Production error traces
- Performance baseline traces
- Critical business flows
Long-Term (30+ days):
- Compliance-required traces
- Security audit trails
- Business intelligence traces
Plan Selection¶
Choose the right Sematext plan based on your needs:
Basic Plan:
- Small applications
- Low to medium traffic
- Basic retention needs
Standard Plan:
- Growing applications
- Medium to high traffic
- Enhanced analytics needs
Pro Plan:
- Large-scale applications
- Enterprise requirements
- Advanced analytics and longer retention
See detailed features and pricing at sematext.com/pricing
Monitoring and Alerting Cost Impact¶
Cost-Aware Alerting¶
Set up alerts that balance coverage with cost:
High-Value Alerts (Always Monitor):
- Error rate spikes
- Critical service failures
- SLA violations
Medium-Value Alerts (Sampled Monitoring):
- Performance degradation
- Unusual traffic patterns
- Service dependencies
Cost Monitoring¶
Track Key Metrics:
- Daily trace volume by service
- Sampling effectiveness
- Storage utilization
- Plan usage trends
Cost Optimization Alerts:
- Alert when trace volume increases unexpectedly
- Monitor sampling rate effectiveness
- Track storage growth trends
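An unexpected volume increase can be detected by comparing the current day against a trailing average. The sketch below assumes you can export daily trace counts as a plain array; the function name, the sample data, and the 1.5x threshold are illustrative, not a Sematext alert definition.

```javascript
// Flags a day whose trace volume exceeds the trailing average by a given factor.
// The threshold factor and the data shape are illustrative assumptions.
function isVolumeSpike(previousDailyCounts, todayCount, factor = 1.5) {
  if (previousDailyCounts.length === 0) return false; // no baseline yet
  const total = previousDailyCounts.reduce((sum, n) => sum + n, 0);
  const baseline = total / previousDailyCounts.length;
  return todayCount > baseline * factor;
}

// Seven previous days averaging 100,000 traces per day.
const lastWeek = [100000, 110000, 95000, 105000, 98000, 102000, 90000];
console.log(isVolumeSpike(lastWeek, 180000)); // true
console.log(isVolumeSpike(lastWeek, 120000)); // false
```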
Implementation Checklist¶
Phase 1: Assessment¶
- [ ] Analyze current trace volume by service
- [ ] Identify high-volume, low-value traces
- [ ] Evaluate service criticality tiers
- [ ] Review current sampling configuration
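For the assessment step, ranking services by their share of total volume shows quickly where sampling will pay off most. This sketch assumes span counts per service are available as a simple object; the input shape, function name, and service names are hypothetical.

```javascript
// Ranks services by their share of total span volume.
// The input shape ({ serviceName: spanCount }) is an assumption about exported usage data.
function volumeShareByService(spanCounts) {
  const total = Object.values(spanCounts).reduce((sum, n) => sum + n, 0);
  return Object.entries(spanCounts)
    .map(([service, count]) => ({
      service,
      count,
      share: total === 0 ? 0 : count / total,
    }))
    .sort((a, b) => b.count - a.count);
}

const ranked = volumeShareByService({
  'checkout-api': 200000,
  'static-assets': 700000,
  'auth': 100000,
});
console.log(ranked[0]); // { service: 'static-assets', count: 700000, share: 0.7 }
```

In this made-up example a low-value service produces 70% of the volume, which makes it the first candidate for a lower sampling rate.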
Phase 2: Quick Wins¶
- [ ] Implement basic head-based sampling (start at 1-5% for high-traffic services)
- [ ] Remove unnecessary attributes from spans
- [ ] Consider filtering static content traces (keep health checks for error detection)
- [ ] Optimize span naming conventions
Phase 3: Advanced Optimization¶
- [ ] Set up service-specific sampling rates
- [ ] Configure attribute filtering rules
- [ ] Optimize retention policies
Phase 4: Monitoring and Tuning¶
- [ ] Set up cost monitoring dashboards
- [ ] Implement cost-aware alerting
- [ ] Review and adjust sampling rates regularly
- [ ] Monitor for sampling bias in critical paths
Best Practices¶
Sampling Best Practices¶
- Start conservative: Begin with lower sampling rates and increase as needed
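- Preserve errors: Sample error traces at a higher rate than successful ones. Head-based sampling decides before the outcome is known, so this requires tail-based sampling or an error-aware sampler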
- Monitor bias: Ensure sampling doesn't hide important patterns
- Service correlation: Consider trace propagation when setting service-specific rates
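The service correlation point is easier to reason about with a concrete decision function. The sketch below is a simplified illustration, not the exact algorithm of any SDK: it derives a deterministic decision from the trace ID and lets child spans follow the upstream decision. The function names and the sample trace ID are hypothetical.

```javascript
// Simplified trace-ID ratio decision; NOT the exact algorithm used by any SDK.
// It maps the first 8 hex characters of the trace ID to [0, 1) and compares to the rate.
function sampledByTraceId(traceId, rate) {
  const bucket = parseInt(traceId.slice(0, 8), 16) / 0x100000000;
  return bucket < rate;
}

// Parent-based wrapper: a child span follows the decision made upstream.
function decide({ traceId, parentSampled, rate }) {
  if (parentSampled !== undefined) return parentSampled;
  return sampledByTraceId(traceId, rate);
}

const traceId = '3fffffff0000000000000000000000ab';
console.log(sampledByTraceId(traceId, 0.25)); // true (bucket is just under 0.25)
console.log(sampledByTraceId(traceId, 0.05)); // false
console.log(decide({ traceId, parentSampled: true, rate: 0.05 })); // true, follows the parent
```

Because the decision depends only on the trace ID, a service with a lower rate keeps a subset of what a higher-rate service keeps. With parent-based behavior, the rate at the entry service effectively governs the whole trace, so a low downstream rate only matters for traces that start there.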
Attribute Management¶
- Business value: Only collect attributes that provide actionable insights
- Cardinality control: Avoid high-cardinality attributes (user IDs, timestamps)
- Size limits: Set reasonable limits on attribute value sizes
- Sensitive data: Never include secrets, passwords, or PII in traces
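These rules can be enforced in one place by sanitizing attributes before they reach a span. The sketch below is an assumption-laden illustration: the deny list, the 256-character limit, and the function name are made up for this example. SDKs that implement the standard `OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT` environment variable can also enforce the size limit without custom code.

```javascript
// Drops denied keys and truncates long string values before they are attached to a span.
// The deny list and the 256-character limit are illustrative assumptions.
const DENIED_KEYS = new Set(['http.request.body', 'http.response.body', 'user.password']);
const MAX_VALUE_LENGTH = 256;

function sanitizeAttributes(attributes) {
  const clean = {};
  for (const [key, value] of Object.entries(attributes)) {
    if (DENIED_KEYS.has(key)) continue; // never record denied keys
    clean[key] =
      typeof value === 'string' && value.length > MAX_VALUE_LENGTH
        ? value.slice(0, MAX_VALUE_LENGTH)
        : value;
  }
  return clean;
}

const cleaned = sanitizeAttributes({
  'http.request.body': '{"card":"..."}',
  'http.request.size': 14,
  'app.note': 'x'.repeat(1000),
});
console.log(Object.keys(cleaned)); // [ 'http.request.size', 'app.note' ]
```

The result would then be passed to `span.setAttributes(...)` in place of the raw attribute object.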
Regular Reviews¶
- Monthly: Review trace volumes and costs
- Quarterly: Adjust sampling rates based on traffic patterns
- Annually: Evaluate plan needs and retention requirements
Troubleshooting Cost Issues¶
High Costs Checklist¶
- Check sampling rates - Are they too high for your traffic volume?
- Review service distribution - Are non-critical services generating most traces?
- Analyze trace sizes - Are spans containing unnecessary large attributes?
- Examine retention - Are you storing traces longer than needed?
- Service proliferation - Are test or temporary services generating traces?
Common Cost Drivers¶
- No sampling in production environments
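- High-frequency health check traces left untuned (keep them for error detection, but reduce probe frequency or span detail if their volume grows)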
- Large request/response bodies in span attributes
- High-cardinality span names (with IDs or timestamps)
- Development traces not properly separated