Dashboards
Codename: Friday — The UI. The visual interface overlay for all your platform's vitals.
Dashboard and visualization platforms for observability data.
Overview
Visualization services provide unified dashboards for viewing logs, metrics, and traces from observability backends.
Grafana - Visualization Platform
Unified observability dashboards for metrics, logs, and traces.
Overview
Grafana provides:
- Multi-datasource dashboards
- Unified visualization
- Alerting and notifications
- Explore mode for ad-hoc queries
- Dashboard templates and sharing
Ports
- 3000 - Web UI and API
Configuration
Grafana is auto-provisioned with datasources for:
- Prometheus - Metrics
- Loki - Logs
- Jaeger - Traces
Datasource Config: provisioning/datasources/
Usage
Start Service
make up-observability
# or
docker compose up grafana
Access Web UI
open http://localhost:3000
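To confirm Grafana is up before opening the browser, a script can poll its health endpoint. This is a minimal Python sketch, assuming Grafana is reachable at http://localhost:3000 as above and serves GET /api/health; the helper names `is_healthy` and `fetch_health` are illustrative and not part of this repository.

```python
import json
import urllib.request


def is_healthy(payload: dict) -> bool:
    """Return True when a /api/health payload reports a working database."""
    return payload.get("database") == "ok"


def fetch_health(base_url: str = "http://localhost:3000", timeout: float = 5.0) -> dict:
    """Fetch and decode the health payload (performs a network request)."""
    with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
        return json.load(resp)
```

Once the container is running, `is_healthy(fetch_health())` should return True on a healthy instance.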
Default Credentials
Username: admin
Password: admin
Change the default password on first login.
Key Features
1. Dashboards
Create visual dashboards with:
- Time-series graphs
- Gauge panels
- Tables and lists
- Heatmaps
- Stat panels
- Bar charts
2. Explore Mode
Ad-hoc querying across datasources:
- Query metrics (PromQL)
- Search logs (LogQL)
- Find traces (Jaeger datasource)
- Correlate data across sources
3. Alerting
Set up alerts based on queries:
- Threshold alerts
- Query-based alerts
- Alert routing
- Notification channels
4. Unified Search
Search across all observability data:
- Find logs by trace ID
- Jump from metric to trace
- Correlate events across sources
Quick Start
1. Explore Metrics
- Open Grafana (http://localhost:3000)
- Go to Explore (compass icon)
- Select "Prometheus" datasource
- Enter PromQL query:
rate(http_requests_total[5m])
- Click "Run Query"
2. Search Logs
- Go to Explore
- Select "Loki" datasource
- Enter LogQL query:
{service_name="swiss-army-go"}
- Filter and search logs
3. View Traces
- Go to Explore
- Select "Jaeger" datasource
- Search by service or trace ID
- Click trace to view details
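The same queries can also be sent straight to the backends' HTTP APIs, which helps when checking a query outside Grafana. The sketch below only builds the request URLs for a Prometheus instant query and a Loki range query. It assumes the backends publish ports 9090 and 3100 on localhost, which this document does not state (only the in-network URLs are given); the function names are illustrative.

```python
from urllib.parse import urlencode


def prometheus_query_url(promql: str, base_url: str = "http://localhost:9090") -> str:
    """URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def loki_query_range_url(logql: str, base_url: str = "http://localhost:3100", limit: int = 100) -> str:
    """URL for a Loki range query (GET /loki/api/v1/query_range)."""
    return f"{base_url}/loki/api/v1/query_range?{urlencode({'query': logql, 'limit': limit})}"


metrics_url = prometheus_query_url("rate(http_requests_total[5m])")
logs_url = loki_query_range_url('{service_name="swiss-army-go"}')
```

Both URLs can be opened with curl or `urllib.request.urlopen`; the responses are JSON.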
Creating Dashboards
Basic Dashboard
- Click "+" then "Dashboard"
- Add panel
- Select datasource (Prometheus, Loki, Jaeger)
- Write query
- Choose visualization type
- Save dashboard
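The steps above can also be scripted by building the dashboard JSON and posting it to Grafana's dashboard API (POST /api/dashboards/db). The following is a rough sketch of the payload only. The panel fields shown are a commonly used subset and may differ between Grafana versions; referencing the datasource by its provisioned name is an assumption, since newer versions prefer a type/uid object. The helper names are hypothetical.

```python
import json


def timeseries_panel(panel_id: int, title: str, expr: str, y: int = 0) -> dict:
    """One time-series panel backed by a single Prometheus query."""
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        "datasource": "Prometheus",  # assumption: referenced by provisioned name
        "targets": [{"refId": "A", "expr": expr}],
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": y},
    }


def dashboard_payload(title: str, panels: list) -> dict:
    """Wrap panels in the request body shape used when creating a dashboard."""
    return {
        "dashboard": {"id": None, "uid": None, "title": title, "panels": panels},
        "overwrite": False,
    }


payload = dashboard_payload(
    "Service Overview (example)",
    [timeseries_panel(1, "Request Rate", "sum(rate(http_requests_total[5m])) by (service)")],
)
body = json.dumps(payload, indent=2)  # JSON text to send as the request body
```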
Example Panels
Request Rate (Prometheus)
sum(rate(http_requests_total[5m])) by (service)
Error Logs (Loki)
sum(count_over_time({service_name="api"} |= "level=error" [5m])) by (service_name)
Trace Count (Custom)
Query Jaeger API for trace statistics
Dashboard Templates
RED Metrics Dashboard
Monitor Rate, Errors, Duration:
# Rate
sum(rate(http_requests_total[5m])) by (service)
# Errors
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
# Duration (P95)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
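To make the P95 query less of a black box, here is a small Python sketch of the estimate `histogram_quantile` produces from cumulative buckets: find the bucket containing the target rank, then interpolate linearly inside it. It is a simplified illustration, not Prometheus's implementation, and the bucket bounds and counts are invented for the example.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The last bucket is expected to be the +Inf bucket holding the total count.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the open-ended bucket: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Invented cumulative counts for le="0.1", "0.5", "1", "+Inf"
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)  # about 0.81 seconds
```

Because the result is interpolated, its accuracy depends on the bucket boundaries: a P95 can never be reported more precisely than the width of the bucket it falls in.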
Service Overview Dashboard
- Request rate graph
- Error rate graph
- Latency percentiles (P50, P95, P99)
- Active instances
- Recent error logs
- Trace samples
Trace-Log-Metric Correlation
Drill-down Flow
1. See metric spike in dashboard
↓
2. Click to explore metrics
↓
3. Find high-latency traces
↓
4. Click trace ID to view in Jaeger
↓
5. Find error span in trace
↓
6. Copy trace ID
↓
7. Search logs for trace_id in Loki
↓
8. Find root cause in logs
Example Workflow
- Dashboard alert - High error rate
- Explore metrics - Which endpoint?
- Search logs - What errors?
- Find trace - Which request failed?
- Analyze trace - Where did it fail?
- Check logs - Why did it fail?
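The "search logs for trace_id" step comes down to a LogQL line filter on the trace ID. A minimal sketch of building that query follows, assuming the trace ID appears verbatim in the log line and that logs carry the `service_name` label used elsewhere in this document; `logql_for_trace` is a hypothetical helper.

```python
import re


def logql_for_trace(service_name: str, trace_id: str) -> str:
    """Build a LogQL query that keeps only log lines containing the trace ID."""
    if not re.fullmatch(r"[0-9a-fA-F]{1,32}", trace_id):
        raise ValueError("trace_id should be a hex string of at most 32 characters")
    if '"' in service_name:
        raise ValueError("service_name must not contain double quotes")
    return f'{{service_name="{service_name}"}} |= "{trace_id}"'


query = logql_for_trace("api", "4bf92f3577b34da6a3ce929d0e0e4736")
```

Paste the resulting query into Explore with the Loki datasource selected.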
Alerting
Create Alert
- Open dashboard panel
- Click "Alert" tab
- Define alert rule:
WHEN avg() OF query(A, 5m, now)
IS ABOVE 100
- Add notification channel
- Save alert
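Read literally, the rule above averages the values of query A over the last five minutes and fires when that average exceeds 100. The Python sketch below mimics that evaluation on in-memory samples to show the logic. It does not reflect Grafana's internals, and treating "no data" as not firing is an assumption (Grafana lets you configure that case).

```python
def rule_is_firing(samples, now, window_seconds=300, threshold=100.0):
    """samples: iterable of (unix_timestamp, value) pairs for query A."""
    recent = [value for ts, value in samples if now - window_seconds <= ts <= now]
    if not recent:
        return False  # assumption: an empty window does not fire
    return sum(recent) / len(recent) > threshold


samples = [(0, 500.0), (700, 90.0), (800, 120.0), (900, 130.0)]
firing = rule_is_firing(samples, now=1000)  # only the last three samples are in the window
```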
Notification Channels
- Slack
- PagerDuty
- Webhook
- Discord
- Teams
Data Source Configuration
Prometheus
# provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
Loki
# provisioning/datasources/loki.yml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
Jaeger
# provisioning/datasources/jaeger.yml
apiVersion: 1
datasources:
  - name: Jaeger
    type: jaeger
    url: http://jaeger:16686
Variables and Templating
Dashboard Variables
Create dynamic dashboards:
# Service selector
$service = label_values(service_name)
# Query using variable
rate(http_requests_total{service="$service"}[5m])
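Conceptually, Grafana replaces `$service` with the selected value before sending the query to the datasource. The sketch below imitates that for a single-valued variable using Python's `string.Template`, which happens to share the `$name` syntax. It is an analogy only: Grafana's own interpolation, especially for multi-value variables, is more involved.

```python
from string import Template

# Same panel query as above, with $service left as a placeholder.
PANEL_QUERY = Template('rate(http_requests_total{service="$service"}[5m])')


def render_query(service: str) -> str:
    """Substitute one selected service into the panel query."""
    return PANEL_QUERY.substitute(service=service)


rendered = render_query("api")
```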
Common Variables
- Service - Filter by service
- Environment - dev/staging/prod
- Time range - Quick time selection
- Instance - Filter by instance
Performance Tips
- Limit time ranges - Don't query years of data
- Use caching - Enable query caching where your Grafana edition supports it (a Grafana Enterprise and Grafana Cloud feature)
- Reduce refresh rate - Don't refresh every second
- Optimize queries - Use recording rules in Prometheus
- Dashboard organization - Separate dashboards by team/service
Production Notes
- Authentication - Enable proper auth (OAuth, LDAP, etc.)
- User Management - Set up teams and permissions
- Backup Dashboards - Export and version control dashboards
- High Availability - Deploy multiple Grafana instances
- Database - Use external database (PostgreSQL) instead of SQLite
- Security - Use HTTPS, secure datasource credentials
- Monitoring - Monitor Grafana itself
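For the dashboard backup note above, one possible approach is to pull every dashboard through Grafana's HTTP API and write it out as stable JSON that diffs cleanly in version control. This is a sketch under assumptions: it authenticates with a bearer token (for example a service account token) that you must create yourself, and the function names and file naming scheme are illustrative.

```python
import json
import urllib.request


def api_get(base_url: str, path: str, token: str):
    """GET a Grafana API path with bearer-token auth and decode the JSON body."""
    req = urllib.request.Request(
        f"{base_url}{path}", headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def serialize_dashboard(dashboard: dict) -> str:
    """Stable formatting (sorted keys, fixed indent) keeps diffs meaningful."""
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


def export_all(base_url: str, token: str) -> dict:
    """Map '<uid>.json' to serialized JSON for every dashboard found."""
    files = {}
    for hit in api_get(base_url, "/api/search?type=dash-db", token):
        full = api_get(base_url, f"/api/dashboards/uid/{hit['uid']}", token)
        files[f"{hit['uid']}.json"] = serialize_dashboard(full["dashboard"])
    return files
```

Writing the returned mapping to a directory and committing it gives a simple history of dashboard changes.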
Troubleshooting
Datasource Not Working
- Check datasource configuration
- Verify network connectivity
- Test datasource URL from Grafana container
- Check datasource logs
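A quick way to cover the connectivity checks above is a plain TCP probe against each backend, run from a shell inside the Grafana container or from any container on the same network. The sketch below uses the hostnames and ports from the datasource configuration in this document. It only tells you that a port accepts connections, not that the service behind it is healthy, and it assumes a Python interpreter is available wherever it runs.

```python
import socket

# Hostnames and ports taken from the provisioning files in this document.
DATASOURCES = {
    "Prometheus": ("prometheus", 9090),
    "Loki": ("loki", 3100),
    "Jaeger": ("jaeger", 16686),
}


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refused connections, and timeouts
        return False


def check_all() -> dict:
    """Probe every configured datasource and report reachability by name."""
    return {name: is_reachable(host, port) for name, (host, port) in DATASOURCES.items()}
```

A False result for one name usually points at a stopped container or a hostname mismatch between the compose service and the datasource URL.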
Dashboard Not Loading
- Check query syntax
- Verify time range
- Check datasource availability
- Review Grafana logs
Slow Performance
- Reduce time range
- Optimize queries
- Enable query caching, if your Grafana edition provides it
- Increase Grafana resources
Alternatives
If Grafana doesn't fit your needs:
- Kibana - For Elasticsearch/OpenSearch stack
- Datadog - SaaS, full platform
- Custom Dashboards - Build your own
- Prometheus UI - Basic metrics UI
- Jaeger UI - For traces only