Metrics
Codename: Dr. House — Diagnostics. Doesn't trust you; trusts the vitals. "It's never DNS."
Metrics storage and querying backends.
Overview
Metrics storage services receive metrics from the OpenTelemetry Collector and provide time-series database capabilities.
Prometheus - Metrics Storage
Time-series database for storing and querying metrics.
Overview
Prometheus provides:
- Time-series metrics storage
- PromQL query language
- Pull-based scraping model
- Built-in alerting
- Service discovery
Ports
- 9090 - Web UI and API
Configuration
Primary Config: prometheus.yaml
Key Features
- Scraping - Pull metrics from targets
- PromQL - Powerful query language
- Storage - Local time-series database
- Alerting - Alert manager integration
- Federation - Multi-cluster support
Architecture
Prometheus scrapes metrics from:
- OpenTelemetry Collector
- Service /metrics endpoints
- Node exporters
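The `/metrics` endpoints above serve the Prometheus text exposition format. A minimal sketch of rendering counters in that format — the `render_metrics` helper and the sample value are illustrative, not a real client library:

```python
def render_metrics(counters):
    """Render a dict of counter name -> value in the Prometheus text
    exposition format: a # TYPE hint plus one sample line per counter."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# What a hypothetical service's /metrics endpoint might return
body = render_metrics({"http_requests_total": 1027})
print(body)
```

In practice you would use an official Prometheus client library, which also handles labels, escaping, and content-type negotiation.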
Usage
Start Service
```bash
make up-observability
# or
docker compose up prometheus
```
Access Web UI
```bash
open http://localhost:9090
```
Check Health
```bash
curl http://localhost:9090/-/healthy
```
PromQL Query Examples
Basic Queries
```promql
# CPU usage (per-second rate over 5m)
rate(cpu_usage_seconds_total[5m])

# Request rate
rate(http_requests_total[5m])

# Error rate
rate(http_requests_total{status="500"}[5m])

# Request duration (P95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
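`histogram_quantile` works on cumulative buckets: it finds the bucket containing the requested rank and interpolates linearly inside it. A simplified Python sketch of that idea, with made-up bucket counts (real PromQL operates on `rate()`d buckets and has more edge-case handling):

```python
import math

def histogram_quantile(q, buckets):
    """Approximate PromQL's histogram_quantile for one series.
    buckets: (upper_bound, cumulative_count) pairs sorted by bound,
    ending with (inf, total). Finds the bucket holding the q-th
    observation and interpolates linearly within it."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into +Inf
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count

# P95 with 90 requests <= 0.1s, 99 <= 0.5s, 100 total
p95 = histogram_quantile(0.95, [(0.1, 90), (0.5, 99), (math.inf, 100)])
```

This is also why bucket boundaries matter: the answer is an interpolation within one bucket, so coarse buckets give coarse quantiles.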
RED Metrics
```promql
# Rate - requests per second
sum(rate(http_requests_total[5m])) by (service)

# Errors - error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)

# Duration - request latency P95
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
```
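`rate()` is conceptually the per-second increase between counter samples over the window; PromQL additionally extrapolates to the window edges and detects counter resets. A back-of-the-envelope sketch with made-up sample values:

```python
def per_second_rate(prev_sample, curr_sample, window_seconds):
    """Per-second counter rate over a window: (current - previous) / window.
    Simplified: no extrapolation, no counter-reset handling."""
    return (curr_sample - prev_sample) / window_seconds

# Hypothetical http_requests_total samples taken 5 minutes (300 s) apart
req_rate = per_second_rate(10_000, 10_900, 300)  # requests per second
err_rate = per_second_rate(200, 230, 300)        # 5xx responses per second
```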
Alerting
```promql
# High error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
  sum(rate(http_requests_total[5m])) > 0.05

# Service down
up{job="my-service"} == 0
```
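The high-error-rate expression reduces to a ratio check. A Python sketch of the same condition (the function name and threshold default are illustrative):

```python
def high_error_rate(error_rps, total_rps, threshold=0.05):
    """Same condition as the PromQL alert above: fire when the share
    of 5xx responses exceeds the threshold (5% by default)."""
    if total_rps == 0:
        return False  # no traffic, nothing to alert on
    return error_rps / total_rps > threshold
```

In production this logic lives in Prometheus alerting rules evaluated every `evaluation_interval`, with Alertmanager handling routing and notification.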
Metrics from OpenTelemetry
The OTel Collector's `prometheus` exporter exposes metrics on an endpoint for Prometheus to scrape:

```yaml
# In otel-collector-config.yml
exporters:
  prometheus:
    endpoint: '0.0.0.0:8889'
```
Prometheus scrapes from the collector:
```yaml
# In prometheus.yaml
scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']
```
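What Prometheus pulls from `otel-collector:8889` is the plain-text exposition format. A toy parser sketch — it ignores labels, timestamps, and escaping, so it only illustrates the shape of the data:

```python
def parse_exposition(text):
    """Parse Prometheus text exposition output into {metric: value}.
    Toy version: skips # HELP / # TYPE comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples

# Abridged example of what a scrape might return
scraped = """\
# TYPE http_requests_total counter
http_requests_total 1027
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 12.5
"""
metrics = parse_exposition(scraped)
```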
Data Retention
Configuration
```yaml
# In prometheus.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
```

Retention is configured via command-line flags rather than in prometheus.yaml:

```bash
# e.g. in the container's command arguments
--storage.tsdb.retention.time=15d   # keep data for 15 days
--storage.tsdb.retention.size=50GB  # or until 50GB of TSDB data
```
Check Storage
# Via API
curl http://localhost:9090/api/v1/status/tsdb
# Via container
docker compose exec prometheus df -h /prometheus
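Prometheus documents retention sizes such as `50GB` as powers of two (1KB = 1024B). A small sketch of that conversion — the helper is illustrative, not part of Prometheus or its clients:

```python
def parse_retention_size(size):
    """Convert a Prometheus retention size string like '50GB' to bytes,
    treating units as powers of two (1KB = 1024B)."""
    units = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}
    # Check longer suffixes first so 'GB' is not matched as 'B'
    for suffix in sorted(units, key=len, reverse=True):
        if size.endswith(suffix):
            return int(float(size[: -len(suffix)]) * units[suffix])
    raise ValueError(f"unrecognized size: {size!r}")
```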
Grafana Integration
Prometheus is auto-provisioned as a datasource in Grafana.
Use in Grafana
1. Open Grafana (http://localhost:3000)
2. Create a dashboard
3. Add a panel with the Prometheus datasource
4. Write PromQL queries
Performance Tuning
Scrape Configuration
```yaml
scrape_configs:
  - job_name: 'high-frequency'
    scrape_interval: 5s   # fast scraping
  - job_name: 'low-frequency'
    scrape_interval: 60s  # slower scraping
```
Resource Limits
```yaml
# In docker-compose
services:
  prometheus:
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '1.0'
```
Monitoring Prometheus
Prometheus monitors itself:
```bash
# Check targets
open http://localhost:9090/targets

# Check configuration
open http://localhost:9090/config

# Check service discovery
open http://localhost:9090/service-discovery
```
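The `/targets` page is backed by the `/api/v1/targets` API, whose JSON lists each active target with a `health` field. A sketch of pulling failing targets out of such a response (the sample payload is abridged and illustrative):

```python
def down_targets(targets_response):
    """Given the JSON body of GET /api/v1/targets, return the scrape
    URLs of active targets whose last scrape failed (health != 'up')."""
    active = targets_response["data"]["activeTargets"]
    return [t["scrapeUrl"] for t in active if t["health"] != "up"]

# Abridged, illustrative response shape
sample = {
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://otel-collector:8889/metrics", "health": "up"},
            {"scrapeUrl": "http://my-service:8080/metrics", "health": "down"},
        ]
    }
}
```

This pairs naturally with the `up{job="..."} == 0` alert above: both surface targets Prometheus can no longer scrape.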
Production Notes
- External Storage - Use remote write for long-term storage
- High Availability - Deploy multiple Prometheus instances
- Alertmanager - Set up alert routing and notifications
- Federation - Aggregate metrics across clusters
- Retention Policy - Balance storage vs data retention
- Backup - Regular snapshots of TSDB
Alternatives
If Prometheus doesn't fit your needs:
- VictoriaMetrics - Prometheus-compatible, faster and cheaper to run
- InfluxDB - Time-series DB with different query language
- Datadog - SaaS platform
- Thanos - Prometheus with long-term storage