
Tracing

Codename: Columbo — The Detective. "Just one more thing." Follows the request path wherever it leads.

Distributed tracing storage and visualization backends.

Overview

Tracing storage services receive traces from the OpenTelemetry Collector and provide trace visualization and analysis.


Jaeger - Distributed Tracing

Distributed tracing platform for monitoring microservices and troubleshooting performance.


Overview

Jaeger provides:

  • Distributed trace collection
  • Trace visualization and analysis
  • Service dependency graphs
  • Performance monitoring
  • Root cause analysis

Ports

  • 16686 - Web UI
  • 14268 - HTTP collector (Jaeger native)
  • 14250 - gRPC collector
  • 4317/4318 - OTLP gRPC/HTTP (traces forwarded by the OpenTelemetry Collector)
  • 14269 - Admin port (health check and Prometheus metrics)

Configuration

Jaeger receives traces from the OpenTelemetry Collector.
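
The collector's exporter section forwards traces to Jaeger over OTLP. A minimal sketch, assuming the collector and Jaeger share a compose network where Jaeger is reachable as jaeger:

receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317    # Jaeger's OTLP gRPC port
    tls:
      insecure: true         # plaintext inside the compose network

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]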

Key Features

  • Trace Search - Find traces by service, operation, tags
  • Service Map - Visualize service dependencies
  • Span Details - Drill down into trace spans
  • Performance Analysis - Identify bottlenecks

Architecture

Services → OpenTelemetry Collector → Jaeger → Visualization UI

Usage

Start Service

make up-observability
# or
docker compose up jaeger

Access Web UI

open http://localhost:16686

Check Health

curl http://localhost:14269/

Using the UI

Search for Traces

  1. Open http://localhost:16686
  2. Select service from dropdown
  3. Choose operation (optional)
  4. Set time range
  5. Click "Find Traces"
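
The same search can be scripted against the query service's HTTP API (an internal API, so parameters may differ between Jaeger versions); a sketch assuming a service named api:

# Last 20 traces for the "api" service
curl "http://localhost:16686/api/traces?service=api&limit=20"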

Analyze a Trace

  1. Click on a trace in search results
  2. View span timeline
  3. Examine span details and tags
  4. Check logs attached to spans
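
A single trace can also be fetched as JSON through the same internal API, with <trace-id> as a placeholder:

curl "http://localhost:16686/api/traces/<trace-id>"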

Service Dependencies

  1. Go to "System Architecture" tab
  2. View service dependency graph
  3. See request flow between services

Trace Features

Trace Structure

Trace
└─ Root Span (e.g., HTTP request)
   ├─ Child Span (e.g., database query)
   ├─ Child Span (e.g., external API call)
   │  └─ Nested Span (e.g., auth check)
   └─ Child Span (e.g., cache lookup)

Span Attributes

  • Operation name - What the span represents
  • Duration - How long it took
  • Tags - Key-value metadata
  • Logs - Timestamped events recorded during the span
  • Trace ID - Links all spans in a trace

Common Analysis Patterns

Find Slow Requests

  1. Search with min duration filter
  2. Look for spans with long duration
  3. Identify bottlenecks in trace timeline
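
The min-duration filter from step 1 is also available on the query API; a hedged sketch (parameter spelling may vary by version):

# Traces of the "api" service slower than 500 ms
curl "http://localhost:16686/api/traces?service=api&minDuration=500ms&limit=20"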

Debug Errors

  1. Search for error tags
  2. Examine error span details
  3. Check logs attached to error span
  4. Trace error propagation
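
In the search form's Tags field, error spans are typically matched with space-separated key=value filters; the exact tag names depend on your instrumentation:

error=true http.status_code=500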

Compare Performance

  1. Search for same operation
  2. Compare trace durations
  3. Identify differences in span patterns

Integration with Logs

Trace-Log Correlation

When logs include trace IDs, you can:

  1. Find trace in Jaeger
  2. Copy trace ID
  3. Search logs in Loki/Grafana by trace_id
  4. See logs in context of trace

Example log query:

{service_name="api"} | json | trace_id="abc123..."
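
If Grafana fronts Loki, the lookup can be automated with a derived field on the Loki data source. A provisioning sketch, assuming a Jaeger data source with UID jaeger and JSON logs carrying a trace_id field:

apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"'
          datasourceUid: jaeger      # UID of the Jaeger data source (assumed)
          url: '$${__value.raw}'     # the matched trace ID opens in Jaeger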

Sampling

Sampling Strategies

Jaeger supports different sampling strategies:

Always Sample (Development)

# All traces captured
sampler:
  type: const
  param: 1

Probabilistic (Production)

# 1% of traces captured
sampler:
  type: probabilistic
  param: 0.01

Rate Limiting

# Max traces per second
sampler:
  type: ratelimiting
  param: 10
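
The strategies above are applied client-side by the SDK. They can also be served centrally: the Jaeger collector/all-in-one can load a remote-sampling strategies file via SAMPLING_STRATEGIES_FILE. A compose sketch, assuming a local sampling.json:

services:
  jaeger:
    environment:
      - SAMPLING_STRATEGIES_FILE=/etc/jaeger/sampling.json
    volumes:
      - ./sampling.json:/etc/jaeger/sampling.json:ro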

Storage

In-Memory (Default)

  • Fast, ephemeral
  • Lost on restart
  • Good for development

Elasticsearch (Production)

services:
  jaeger:
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200

Cassandra (Production)

services:
  jaeger:
    environment:
      - SPAN_STORAGE_TYPE=cassandra
      - CASSANDRA_SERVERS=cassandra:9042

Performance Analysis

Identify Bottlenecks

  1. Look for long-duration spans
  2. Check span relationships
  3. Find sequential vs parallel execution
  4. Optimize critical path

Service Dependencies

  1. View service map
  2. Identify chattiness (many calls)
  3. Find circular dependencies
  4. Optimize service communication

Monitoring Jaeger

Metrics

Jaeger exposes Prometheus metrics:

curl http://localhost:14269/metrics

Key Metrics

  • Spans received/sec
  • Traces received/sec
  • Storage latency
  • Query latency
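
A scrape-config sketch for Prometheus, assuming it shares the compose network and reaches Jaeger as jaeger:

scrape_configs:
  - job_name: jaeger
    static_configs:
      - targets: ['jaeger:14269']   # admin port serves /metrics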

Production Notes

  1. Persistent Storage - Use Elasticsearch or Cassandra
  2. Sampling - Use probabilistic sampling (1-10%)
  3. Retention - Set trace retention policy
  4. Resource Limits - Monitor memory and storage
  5. High Availability - Deploy multiple collectors
  6. Security - Enable authentication and TLS
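
For the resource-limits point, a compose-level sketch (the values are illustrative, not tuned recommendations):

services:
  jaeger:
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 1G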

Troubleshooting

No Traces Appearing

  1. Check OTel Collector is sending to Jaeger
  2. Verify service instrumentation
  3. Check network connectivity
  4. Review Jaeger logs
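
Quick checks, assuming the compose service names used in this stack (otel-collector, jaeger):

# Did the collector report export errors?
docker compose logs --tail=100 otel-collector | grep -iE "jaeger|export|error"

# Is Jaeger's admin endpoint up?
curl -s http://localhost:14269/

# Review Jaeger's own logs
docker compose logs --tail=100 jaeger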

High Memory Usage

  1. Reduce sampling rate
  2. Shorten retention period
  3. Use persistent storage
  4. Add resource limits

Slow Queries

  1. Add indexes in storage backend
  2. Reduce time range of searches
  3. Use more specific filters
  4. Optimize storage configuration

Alternatives

If Jaeger doesn't fit your needs:

  • Grafana Tempo - Designed for high volume, cheaper storage
  • Zipkin - Simpler, with fewer features
  • Lightstep - SaaS, advanced analysis
  • Honeycomb - SaaS, full observability platform
  • AWS X-Ray - Cloud-native (AWS only)