Tracing
Codename: Columbo — The Detective. "Just one more thing." Follows the request path wherever it leads.
Distributed tracing storage and visualization backends.
Overview
Tracing storage services receive traces from the OpenTelemetry Collector and provide trace visualization and analysis.
Jaeger - Distributed Tracing
Distributed tracing platform for monitoring microservices and troubleshooting performance.
Overview
Jaeger provides:
- Distributed trace collection
- Trace visualization and analysis
- Service dependency graphs
- Performance monitoring
- Root cause analysis
Ports
- 16686 - Web UI
- 14268 - HTTP collector (Jaeger native)
- 14250 - gRPC collector
- 14269 - Admin port (health check and metrics)
- 4317/4318 - OTLP gRPC/HTTP (traces forwarded by the OpenTelemetry Collector)
Configuration
Jaeger receives traces from the OpenTelemetry Collector.
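A minimal Collector pipeline that forwards traces to Jaeger's OTLP gRPC port could look like this (the `jaeger` hostname and local-only TLS setting are assumptions based on the Compose setup in this stack):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp/jaeger:
    # Recent Jaeger versions accept OTLP directly on 4317 (gRPC)
    endpoint: jaeger:4317
    tls:
      insecure: true  # local development only

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
```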
Key Features
- Trace Search - Find traces by service, operation, tags
- Service Map - Visualize service dependencies
- Span Details - Drill down into trace spans
- Performance Analysis - Identify bottlenecks
Architecture
Services → OpenTelemetry Collector → Jaeger → Web UI (visualization)
Usage
Start Service
make up-observability
# or
docker compose up jaeger
Access Web UI
open http://localhost:16686
Check Health
curl http://localhost:14269/
Using the UI
Search for Traces
- Open http://localhost:16686
- Select service from dropdown
- Choose operation (optional)
- Set time range
- Click "Find Traces"
Analyze a Trace
- Click on a trace in search results
- View span timeline
- Examine span details and tags
- Check logs attached to spans
Service Dependencies
- Go to "System Architecture" tab
- View service dependency graph
- See request flow between services
Trace Features
Trace Structure
Trace
└─ Root Span (e.g., HTTP request)
   ├─ Child Span (e.g., database query)
   ├─ Child Span (e.g., external API call)
   │  └─ Nested Span (e.g., auth check)
   └─ Child Span (e.g., cache lookup)
Span Attributes
- Operation name - What the span represents
- Duration - How long it took
- Tags - Key-value metadata
- Logs - Events that happened during span
- Trace ID - Links all spans in a trace
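As a sketch, the attributes above can be modeled with a small data structure (illustrative only, not Jaeger's internal span model):

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    """Illustrative span mirroring the attributes listed above."""
    operation: str                              # operation name
    duration_ms: float                          # how long it took
    trace_id: str                               # links all spans in a trace
    tags: dict = field(default_factory=dict)    # key-value metadata
    logs: list = field(default_factory=list)    # events during the span

# Every span in one trace shares the same trace ID.
trace_id = uuid.uuid4().hex
root = Span("GET /orders", 120.0, trace_id, tags={"http.status_code": 200})
child = Span("SELECT orders", 45.0, trace_id, tags={"db.system": "postgres"})
child.logs.append({"event": "rows_fetched", "count": 42})
```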
Common Analysis Patterns
Find Slow Requests
- Search with min duration filter
- Look for spans with long duration
- Identify bottlenecks in trace timeline
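The filter the UI applies with a min-duration search can be sketched over raw span summaries (field names and timings are made up for illustration):

```python
# Each dict stands in for a span's summary.
spans = [
    {"operation": "GET /orders", "duration_ms": 950},
    {"operation": "SELECT orders", "duration_ms": 700},
    {"operation": "cache lookup", "duration_ms": 3},
]

def slow_spans(spans, min_duration_ms):
    """Return spans at or above the threshold, slowest first."""
    hits = [s for s in spans if s["duration_ms"] >= min_duration_ms]
    return sorted(hits, key=lambda s: s["duration_ms"], reverse=True)

bottlenecks = slow_spans(spans, min_duration_ms=500)
# Here the database query dominates the request's time budget.
```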
Debug Errors
- Search for error tags
- Examine error span details
- Check logs attached to error span
- Trace error propagation
Compare Performance
- Search for same operation
- Compare trace durations
- Identify differences in span patterns
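Comparing durations of the same operation across many traces is easier with summary statistics; a minimal standard-library sketch (durations are made-up data):

```python
import statistics

# Durations (ms) of the same operation from two time windows.
before = [110, 120, 115, 130, 125, 118, 122, 119]
after  = [112, 118, 116, 900, 124, 120, 119, 121]  # one outlier trace

def p95(durations):
    """95th percentile via statistics.quantiles (n=20 gives 19 cut points)."""
    return statistics.quantiles(durations, n=20)[-1]

# The medians match, so the regression only shows up in the tail.
print(f"median before/after: {statistics.median(before)} / {statistics.median(after)}")
print(f"p95    before/after: {p95(before):.0f} / {p95(after):.0f}")
```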
Integration with Logs
Trace-Log Correlation
When logs include trace IDs, you can:
- Find trace in Jaeger
- Copy trace ID
- Search logs in Loki/Grafana by trace_id
- See logs in context of trace
Example log query:
{service_name="api"} | json | trace_id="abc123..."
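Correlation only works if services write the active trace ID into each log line. A minimal sketch of emitting such a JSON record (the field names are chosen to match the Loki query above; the helper is hypothetical):

```python
import json

def log_with_trace(message, trace_id, service_name="api", **fields):
    """Emit a JSON log line carrying the active trace ID (illustrative format)."""
    record = {"service_name": service_name, "trace_id": trace_id,
              "message": message, **fields}
    return json.dumps(record)

line = log_with_trace("payment failed", trace_id="abc123", level="error")
print(line)
# A LogQL filter like `| json | trace_id="abc123"` would match this line.
```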
Sampling
Sampling Strategies
Jaeger supports different sampling strategies:
Always Sample (Development)
# All traces captured
sampler:
  type: const
  param: 1
Probabilistic (Production)
# 1% of traces captured
sampler:
  type: probabilistic
  param: 0.01
Rate Limiting
# Max traces per second
sampler:
  type: ratelimiting
  param: 10
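The head-based decision the const and probabilistic strategies make can be sketched as a pure function of the trace ID (a simplified model, not Jaeger's implementation; rate limiting is omitted because it needs a per-second counter):

```python
def should_sample(trace_id_hex, sampler_type, param):
    """Simplified head-based sampling decision keyed on the trace ID."""
    if sampler_type == "const":
        return param >= 1  # param 1 → always sample, 0 → never
    if sampler_type == "probabilistic":
        # Map the trace ID onto [0, 1) and compare to the rate, so every
        # host makes the same decision for the same trace.
        bucket = int(trace_id_hex, 16) % 10_000
        return bucket < param * 10_000
    raise ValueError(f"unknown sampler type: {sampler_type}")

# const/1 keeps everything; probabilistic/0.01 keeps ~1% of trace IDs.
```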
Storage
In-Memory (Default)
- Fast, ephemeral
- Lost on restart
- Good for development
Elasticsearch (Production)
services:
  jaeger:
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
Cassandra (Production)
services:
  jaeger:
    environment:
      - SPAN_STORAGE_TYPE=cassandra
      - CASSANDRA_SERVERS=cassandra:9042
Performance Analysis
Identify Bottlenecks
- Look for long-duration spans
- Check span relationships
- Find sequential vs parallel execution
- Optimize critical path
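Whether child spans ran sequentially or in parallel falls out of their start/end timestamps; a small sketch over made-up timings:

```python
# (start_ms, end_ms) offsets within one trace; timings are made up.
spans = {
    "db query":  (0, 40),
    "api call":  (10, 90),   # overlaps the db query → parallel
    "cache set": (90, 95),   # starts after the api call → sequential
}

def overlaps(a, b):
    """Two spans ran in parallel if their time ranges intersect."""
    return a[0] < b[1] and b[0] < a[1]

parallel_pairs = [
    (x, y)
    for i, (x, a) in enumerate(spans.items())
    for y, b in list(spans.items())[i + 1:]
    if overlaps(a, b)
]
# The critical path is bounded by the latest end time, not the sum of
# durations, so only sequential work on that path is worth optimizing.
wall_time = max(end for _, end in spans.values())
```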
Service Dependencies
- View service map
- Identify chattiness (many calls)
- Find circular dependencies
- Optimize service communication
Monitoring Jaeger
Metrics
Jaeger exposes Prometheus metrics:
curl http://localhost:14269/metrics
Key Metrics
- Spans received/sec
- Traces received/sec
- Storage latency
- Query latency
Production Notes
- Persistent Storage - Use Elasticsearch or Cassandra
- Sampling - Use probabilistic sampling (1-10%)
- Retention - Set trace retention policy
- Resource Limits - Monitor memory and storage
- High Availability - Deploy multiple collectors
- Security - Enable authentication and TLS
Troubleshooting
No Traces Appearing
- Check OTel Collector is sending to Jaeger
- Verify service instrumentation
- Check network connectivity
- Review Jaeger logs
High Memory Usage
- Reduce sampling rate
- Shorten retention period
- Use persistent storage
- Add resource limits
Slow Queries
- Add indexes in storage backend
- Reduce time range of searches
- Use more specific filters
- Optimize storage configuration
Alternatives
If Jaeger doesn't fit your needs:
- Grafana Tempo - Designed for high volume, cheaper storage
- Zipkin - Simpler, fewer features
- Lightstep - SaaS, advanced analysis
- Honeycomb - SaaS, full observability platform
- AWS X-Ray - Cloud-native (AWS only)