Logging
Codename: Watson — The Chronicler. Writes down every messy detail for later deduction.
Hermes (Promtail) — The Messenger. Delivers the logs to Watson.
Log aggregation and storage backends.
Overview
Log storage services receive logs from the OpenTelemetry Collector and provide querying interfaces.
Active Implementation
Loki
Status: ✅ Active
Type: Cost-effective log aggregation
- Label-based indexing
- LogQL query language
- Native Grafana integration
- Optimized for Kubernetes
Loki - Log Aggregation
Log aggregation system optimized for Kubernetes and cloud-native applications.
Overview
Loki provides:
- Cost-effective log storage
- Label-based indexing (like Prometheus)
- Integration with Grafana
- LogQL query language
- High compression
- Multi-tenancy support
Ports
- 3100 - HTTP API and query interface
Configuration
See .env.example for configuration options.
Key Features
- Cost-Effective - Only indexes labels, not full text
- LogQL - Powerful query language similar to PromQL
- Grafana Integration - Native datasource in Grafana
- Compression - Excellent compression ratios
- Scalable - Horizontally scalable architecture
Architecture
Services → OpenTelemetry Collector → Loki → Grafana
Loki receives logs from:
- OpenTelemetry Collector (OTLP HTTP)
- Promtail (log shipper)
- Docker logging driver
- Direct HTTP API
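For the direct HTTP API path, the request body is a JSON document with one entry per stream. The sketch below shows the shape Loki's push endpoint expects. The helper names build_push_payload and push are hypothetical, and the labels are examples rather than values taken from this stack's configuration.

```python
import json
import time
import urllib.request


def build_push_payload(labels, lines, now_ns=None):
    """Build a JSON body for POST /loki/api/v1/push (a single stream)."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Timestamps are nanosecond Unix epochs encoded as strings. Each line
    # gets now_ns + its index so the entries keep their order.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


def push(base_url, payload):
    """Send the payload to Loki and return the HTTP status code."""
    req = urllib.request.Request(
        f"{base_url}/loki/api/v1/push",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Calling push("http://localhost:3100", build_push_payload({"service_name": "api"}, ["hello"])) would send one line, assuming Loki is reachable on the port listed above.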
Usage
Start Service
make up-observability
# or
docker compose up loki
Check Health
curl http://localhost:3100/ready
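In scripts it is often handy to wait for the readiness endpoint instead of checking it once. A minimal sketch, assuming Loki listens on localhost:3100. The function names and the default of 30 attempts are illustrative choices, not part of this repository.

```python
import time
import urllib.error
import urllib.request


def loki_ready(base_url="http://localhost:3100", timeout=2.0):
    """Return True if GET /ready answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503.
        return False


def wait_until_ready(check=loki_ready, attempts=30, delay=1.0):
    """Poll `check` until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```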
Query Logs
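# Via API (use the range endpoint for log queries; recent Loki versions accept only metric queries on /query)
curl -G -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={service_name="toolbox-go"}'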
# Via Grafana (recommended)
open http://localhost:3000/explore
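The same API call can be made from Python. This is a sketch under assumptions: it uses the query_range endpoint (which accepts log queries), relies on Loki's default time window, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_range_url(base_url, logql, limit=100):
    """URL-encode a LogQL query for GET /loki/api/v1/query_range."""
    params = urllib.parse.urlencode({"query": logql, "limit": limit})
    return f"{base_url}/loki/api/v1/query_range?{params}"


def extract_lines(response):
    """Flatten a 'streams' result into (timestamp_ns, line) tuples, oldest first."""
    entries = []
    for stream in response["data"]["result"]:
        for ts, line in stream["values"]:
            entries.append((int(ts), line))
    return sorted(entries)


def query_logs(base_url, logql, limit=100):
    """Run the query against Loki and return the flattened log lines."""
    url = build_query_range_url(base_url, logql, limit)
    with urllib.request.urlopen(url) as resp:
        return extract_lines(json.load(resp))
```

For example, query_logs("http://localhost:3100", '{service_name="toolbox-go"}') would return the most recent matching lines if the service is shipping logs.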
LogQL Query Examples
Basic Queries
# All logs from a service
{service_name="swiss-army-go"}
# Logs with specific level
{service_name="swiss-army-go"} |= "level=error"
# Logs matching regex
{service_name="swiss-army-go"} |~ "error|exception"
# Logs NOT containing text
{service_name="swiss-army-go"} != "healthcheck"
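When queries like these are assembled in code, quoting mistakes are easy to make. A small sketch of a builder that escapes backslashes and double quotes. The function names are hypothetical and only the equality matcher and the |= filter are covered.

```python
def _escape(value):
    """Escape backslashes and double quotes for a LogQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def logql_selector(labels):
    """Build a stream selector such as {service_name="swiss-army-go"}."""
    parts = [f'{name}="{_escape(value)}"' for name, value in sorted(labels.items())]
    return "{" + ", ".join(parts) + "}"


def line_contains(selector, text):
    """Append a |= line filter to an existing selector."""
    return f'{selector} |= "{_escape(text)}"'
```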
JSON Parsing
# Parse JSON and filter
{service_name="swiss-army-go"} | json | user_id="123"
# Extract field and aggregate
sum by (status_code) (
rate({service_name="api"} | json [5m])
)
Metrics from Logs
# Count logs per minute
count_over_time({service_name="swiss-army-go"}[1m])
# Error rate
sum(rate({service_name="swiss-army-go"} |= "level=error" [5m]))
# Average response time from JSON logs
avg_over_time(
{service_name="api"} | json | unwrap duration [5m]
)
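To make the arithmetic behind these functions concrete: count_over_time counts the entries inside the range window, and rate divides that count by the window length in seconds. The sketch below mimics that on a plain list of timestamps. It is an illustration, not Loki's implementation, and the exact handling of window boundaries is an assumption.

```python
def count_over_time(timestamps, now, window_s):
    """Count entries whose timestamp t satisfies now - window_s < t <= now."""
    return sum(1 for t in timestamps if now - window_s < t <= now)


def rate(timestamps, now, window_s):
    """Entries per second over the window."""
    return count_over_time(timestamps, now, window_s) / window_s
```

With entries at seconds 0, 10, 290, 299 and 300, a 60 second window ending at 300 contains three entries, so the rate is 3 / 60 = 0.05 lines per second.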
Trace Correlation
# Find logs for specific trace
{service_name="swiss-army-go"} | json | trace_id="abc123"
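The query above only works if the service writes trace_id into its JSON log lines. A minimal sketch of how a Python service could do that with the standard logging module. The field names level, message and trace_id are assumptions chosen to match the queries in this document.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            # trace_id is passed through `extra=`; empty if the caller omits it.
            "trace_id": getattr(record, "trace_id", ""),
        }
        return json.dumps(payload)


def make_logger(name="app"):
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as make_logger().info("payment failed", extra={"trace_id": "abc123"}) then produces a line that | json | trace_id="abc123" can match.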
Labels
Recommended Labels
- service_name - Service identifier
- level - Log level (info, warn, error)
- environment - Environment (dev, staging, prod)
Label Best Practices
- Low Cardinality - Limit unique label values
- Don't Over-Label - Too many labels hurt performance
- Use Filters - Parse content with LogQL, don't label everything
Bad Examples
user_id="123" # High cardinality
request_id="abc-def" # High cardinality
timestamp="..." # Already tracked
Good Examples
service_name="api"
level="error"
environment="prod"
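The reason high-cardinality labels hurt is that every distinct combination of label values becomes its own stream. A rough sketch of that multiplication follows. The threshold of 100 is an arbitrary illustration, not a Loki limit.

```python
from math import prod


def max_streams(label_values):
    """Upper bound on stream count: product of distinct values per label."""
    return prod(len(set(values)) for values in label_values.values())


def high_cardinality_labels(label_values, threshold=100):
    """Names of labels with more distinct values than `threshold`."""
    return sorted(
        name for name, values in label_values.items() if len(set(values)) > threshold
    )
```

Two services, three levels and two environments give at most 2 * 3 * 2 = 12 streams. Adding a user_id label with 1000 distinct values raises the bound to 12000.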
Storage & Retention
Configuration
# In loki config
limits_config:
  retention_period: 744h # 31 days
compactor:
  retention_enabled: true
  retention_delete_delay: 2h
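As a worked check of the numbers above, 744h / 24 = 31 days. The sketch below converts an hour-based period into a cutoff timestamp. It is illustrative only: it handles just the "h" suffix and does not model the compactor's delete delay.

```python
from datetime import datetime, timedelta, timezone


def parse_hours(value):
    """Convert a string such as '744h' into a timedelta."""
    if not value.endswith("h"):
        raise ValueError(f"expected an hour value like '744h', got {value!r}")
    return timedelta(hours=int(value[:-1]))


def retention_cutoff(now, retention_period):
    """Logs older than the returned instant are eligible for deletion."""
    return now - parse_hours(retention_period)
```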
Disk Usage
# Check storage size
du -sh /var/lib/loki
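# Delete ALL chunks for the default tenant ("fake"), not only old ones.
# Prefer compactor retention for routine cleanup. sh -c makes the glob expand
# inside the container (this requires a shell in the Loki image).
docker compose exec loki sh -c 'rm -rf /loki/chunks/fake/*'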
Integration with OpenTelemetry
Collector Configuration
exporters:
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    labels:
      attributes:
        service.name: 'service_name'
        severity: 'level'
Grafana Integration
Loki is auto-provisioned as a datasource in Grafana.
Access in Grafana
- Open http://localhost:3000
- Go to Explore
- Select "Loki" datasource
- Enter LogQL query
Example Dashboard Queries
# Error rate panel
sum(rate({service_name="api"} |= "level=error" [5m]))
# Log volume by service
sum by (service_name) (count_over_time({job="docker"}[1m]))
# Top error messages (message is parsed from the JSON log line, not stored as a label)
topk(10,
  sum by (message) (count_over_time({level="error"} | json [1h]))
)
Performance Tuning
Optimize Queries
# Good - Uses labels
{service_name="api", level="error"}
# Bad - Broad selector, so the text filter scans every stream from the job
{job="docker"} |= "error"
# Better - Label + filter
{service_name="api"} |= "error"
Query Limits
limits_config:
  max_query_length: 721h # Max time range
  max_query_lookback: 30d # Lookback limit
  max_entries_limit_per_query: 5000
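A client that needs a range longer than max_query_length can split it into several shorter queries. A sketch of that splitting, assuming the 721h limit shown above. The function name is hypothetical, and max_query_lookback still caps how far back any single query may reach.

```python
from datetime import datetime, timedelta


def split_range(start, end, max_length):
    """Split [start, end) into consecutive windows no longer than max_length."""
    if max_length <= timedelta(0):
        raise ValueError("max_length must be positive")
    windows = []
    cursor = start
    while cursor < end:
        upper = min(cursor + max_length, end)
        windows.append((cursor, upper))
        cursor = upper
    return windows
```

Each returned pair can be passed as the start and end of one range query.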
Monitoring
Key Metrics
- Ingestion rate (lines/second)
- Query latency
- Storage size
- Failed requests
Loki Metrics
# Loki exposes Prometheus-format metrics on /metrics; point a Prometheus scrape job at it, or inspect it manually
curl http://localhost:3100/metrics
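The endpoint returns the Prometheus text format, which is simple enough to read in a quick script. The sketch below sums one metric across all label sets. It assumes sample lines carry no trailing timestamp, and the metric name used in the usage note is given as an example rather than a guaranteed part of every Loki version.

```python
def sum_metric(exposition_text, metric_name):
    """Sum all samples of `metric_name` in Prometheus text-format output."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated field; labels may contain spaces.
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric_name:
            total += float(value)
    return total
```

For instance, sum_metric(text, "loki_distributor_lines_received_total") would give the total ingested line count across tenants, where text is the body returned by the curl command above.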
Production Notes
- Object Storage - Use S3/GCS for chunk storage
- Compactor - Enable compaction for retention
- Resource Limits - Set memory and CPU limits
- Query Limits - Prevent expensive queries
- Label Cardinality - Monitor and control
- Multi-Tenancy - Use tenant IDs for isolation
- Backup - Regular backups of index and chunks
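For the multi-tenancy note, Loki identifies the tenant through the X-Scope-OrgID request header, and with authentication disabled it falls back to the tenant ID "fake". A minimal sketch of attaching the header follows. The function name and the tenant value are hypothetical.

```python
import urllib.request


def tenant_request(url, tenant_id):
    """Build a request that carries the tenant ID in X-Scope-OrgID."""
    req = urllib.request.Request(url)
    req.add_header("X-Scope-OrgID", tenant_id)
    return req
```

Passing the result to urllib.request.urlopen, for example with the URL http://localhost:3100/loki/api/v1/labels, would list labels visible to that tenant only.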
Alternatives
If Loki doesn't fit your needs:
- Elasticsearch - Full-text search, resource-intensive
- Splunk - Enterprise features, expensive
- CloudWatch Logs - AWS-native
- Datadog - SaaS, full observability platform