The Problem
Marta runs a dozen microservices on Kubernetes for an e-commerce platform. When customers report slow checkouts, her team spends hours guessing which service is the bottleneck. Errors surface in Slack messages from frustrated developers grepping individual pod logs. There is no centralized view of system health, no distributed tracing to follow a request across services, and no alerting — the team finds out about outages from customer support tickets.
She needs metrics, tracing, and alerting working together so her team can detect issues before customers do, trace slow requests to their root cause, and get paged when something breaks at 3 AM instead of finding out the next morning.
The Solution
Deploy Prometheus for metrics collection, Grafana for dashboards, Jaeger for distributed tracing, and Alertmanager for routing notifications. Each tool handles one concern, and together they cover metrics, traces, and alerting across every service.
# Install the skills
npx terminal-skills install prometheus-monitoring grafana jaeger prometheus-alertmanager
Step-by-Step Walkthrough
1. Deploy Prometheus for Metrics Collection
Marta starts with Prometheus to scrape metrics from all services. Every microservice already exposes a /metrics endpoint using the Prometheus client library for its language.
# prometheus.yml — Scrape config for Kubernetes service discovery
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - '/etc/prometheus/rules/*.yml'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
She deploys Prometheus as a StatefulSet with persistent storage so metrics survive pod restarts.
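To make the relabeling rules concrete, here is a minimal Python sketch that mimics what the four relabel_configs entries do to one discovered pod. It is only an illustration of the semantics (Prometheus anchors relabel regexes at both ends, and /metrics is the default path), not how Prometheus is implemented. The function name `relabel` and the sample pods are hypothetical.

```python
import re
from typing import Optional

SCRAPE = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"
PATH = "__meta_kubernetes_pod_annotation_prometheus_io_path"


def relabel(meta: dict[str, str]) -> Optional[dict[str, str]]:
    """Apply the four relabel rules to one pod's discovery labels.

    Returns the resulting target labels, or None if the pod is dropped.
    """
    labels = {"__metrics_path__": "/metrics"}  # Prometheus default path

    # Rule 1 (keep): drop the pod unless the annotation is exactly "true".
    if not re.fullmatch("true", meta.get(SCRAPE, "")):
        return None

    # Rule 2 (replace): override the path only if the annotation is non-empty.
    path = meta.get(PATH, "")
    if re.fullmatch("(.+)", path):
        labels["__metrics_path__"] = path

    # Rules 3 and 4 (replace): copy namespace and pod name onto the target.
    labels["namespace"] = meta.get("__meta_kubernetes_namespace", "")
    labels["pod"] = meta.get("__meta_kubernetes_pod_name", "")
    return labels


# Hypothetical pods: one opted in with a custom path, one without annotations.
checkout = {
    SCRAPE: "true",
    PATH: "/internal/metrics",
    "__meta_kubernetes_namespace": "shop",
    "__meta_kubernetes_pod_name": "checkout-7d9f",
}
batch_job = {
    "__meta_kubernetes_namespace": "shop",
    "__meta_kubernetes_pod_name": "nightly-report",
}

print(relabel(checkout))
print(relabel(batch_job))
```

The practical consequence is that a pod is only scraped if it carries the prometheus.io/scrape: "true" annotation; pods without it are silently skipped.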
2. Instrument Services with OpenTelemetry for Tracing
Next, Marta adds distributed tracing so her team can follow a single checkout request from the API gateway through the order service, payment service, and inventory service. She uses OpenTelemetry to send traces to Jaeger.
# tracing.py — Shared tracing setup for Python services
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
def init_tracing(service_name: str):
    # Describe this service so Jaeger can group its spans.
    resource = Resource.create({
        "service.name": service_name,
        "deployment.environment": "production",
    })
    provider = TracerProvider(resource=resource)
    # Export spans in batches over OTLP/gRPC to the Jaeger collector.
    exporter = OTLPSpanExporter(endpoint="http://jaeger-collector:4317", insecure=True)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    # Auto-instrument incoming Flask requests and outgoing requests calls.
    FlaskInstrumentor().instrument()
    RequestsInstrumentor().instrument()
Each service calls init_tracing("service-name") once at startup. The Flask and requests instrumentation then propagates trace context on HTTP calls between services automatically, so Jaeger shows the complete request path.
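Propagation works through an HTTP header. By default OpenTelemetry uses the W3C Trace Context format, where a `traceparent` header carries the version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and flags. The sketch below shows, using only the standard library, what a service conceptually does when it calls the next service: keep the trace ID, mint a new span ID. The helper name `child_traceparent` is hypothetical; the real work is done inside the instrumentation libraries.

```python
import re
import secrets

# version "00" - trace id (32 hex) - parent span id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def child_traceparent(incoming: str) -> str:
    """Build the header to send downstream for a request that arrived
    with the given traceparent header."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        raise ValueError(f"not a version-00 traceparent header: {incoming!r}")
    trace_id, _parent_span_id, flags = match.groups()
    # Same trace ID, fresh span ID: the shared trace ID is what lets the
    # tracing backend stitch spans from different services together.
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{new_span_id}-{flags}"


# Example header taken from the W3C Trace Context specification.
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(outgoing)
```

If one service in the chain drops this header (for example, a client library that is not instrumented), the trace splits into two unrelated traces in Jaeger, which is the first thing to check when a request path looks incomplete.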
3. Deploy Jaeger for Trace Storage and Visualization
# jaeger-deployment.yml — Jaeger collector and query with Elasticsearch backend
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-collector
  namespace: observability
spec:
  replicas: 2
  selector:
    matchLabels:
      app: jaeger-collector
  template:
    metadata:
      labels:
        app: jaeger-collector  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: jaeger-collector
          image: jaegertracing/jaeger-collector:1.54
          env:
            - name: SPAN_STORAGE_TYPE
              value: elasticsearch
            - name: ES_SERVER_URLS
              value: http://elasticsearch:9200
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
          ports:
            - containerPort: 4317   # OTLP gRPC
            - containerPort: 14268  # Jaeger Thrift over HTTP
Marta's team can now open the Jaeger UI, search for a slow checkout request, and see every span — the API gateway took 50ms, the order service took 120ms, and the payment service took 4.2 seconds calling the external payment provider. The bottleneck is immediately visible.
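The reasoning the Jaeger UI supports here can be sketched in a few lines of Python: a parent span's duration includes its children, so the useful number is each service's self time (its duration minus its direct children). The span IDs, hyphenated service names, and total durations below are hypothetical, chosen so the self times match the figures above, and the sketch assumes child spans run sequentially rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float


def self_times(spans: list[Span]) -> dict[str, float]:
    """Return per-service time, excluding time spent in direct child spans."""
    # Sum the durations of each span's direct children.
    child_total: dict[str, float] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    # Self time = own duration minus children, accumulated per service.
    result: dict[str, float] = {}
    for span in spans:
        own = span.duration_ms - child_total.get(span.span_id, 0.0)
        result[span.service] = result.get(span.service, 0.0) + own
    return result


# Hypothetical checkout trace: gateway -> order service -> payment service.
checkout_trace = [
    Span("a", None, "api-gateway", 4370.0),
    Span("b", "a", "order-service", 4320.0),
    Span("c", "b", "payment-service", 4200.0),
]

times = self_times(checkout_trace)
bottleneck = max(times, key=times.get)
print(times, bottleneck)
```

Looking at raw durations alone would blame the API gateway, since its span is the longest; subtracting child time is what points at the payment service.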
4. Configure Alertmanager for Notification Routing
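With metrics flowing, Marta defines alert rules in Prometheus and routes them through Alertmanager to the right team channels. Both rules aggregate by a service label, so they assume each service attaches that label to its metrics; the scrape config above adds only namespace and pod.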
# alert-rules.yml — Critical alert rules
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% on {{ $labels.service }}"
          dashboard: "https://grafana.internal/d/services?var-service={{ $labels.service }}"
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 3s on {{ $labels.service }}"
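As a rough worked example of what the two expressions compute, the Python sketch below reproduces the arithmetic by hand. It is a simplification: real `rate()` also handles counter resets and extrapolation, and the sample numbers are made up. `histogram_quantile` is approximated with the same idea Prometheus uses, linear interpolation inside the cumulative `le` bucket that contains the target rank.

```python
import math


def error_ratio(errors_start: float, errors_end: float,
                total_start: float, total_end: float,
                window_s: float = 300.0) -> float:
    """Ratio of the 5xx request rate to the total request rate over a window."""
    error_rate = (errors_end - errors_start) / window_s
    total_rate = (total_end - total_start) / window_s
    # PromQL would yield NaN with no traffic; 0.0 keeps this sketch simple.
    return error_rate / total_rate if total_rate > 0 else 0.0


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile (0 < q <= 1) from cumulative buckets.

    buckets: (le upper bound, cumulative count), sorted by le, ending in +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # Rank falls in the +Inf bucket: report the last finite bound.
                return prev_le
            # Linear interpolation inside the bucket that contains the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan


# Made-up samples: 60 of 900 requests failed during the 5-minute window.
ratio = error_ratio(120, 180, 10_000, 10_900)
print(ratio > 0.05)  # HighErrorRate condition

# Made-up latency buckets in seconds.
latency = [(0.5, 80.0), (1.0, 90.0), (2.5, 95.0), (5.0, 100.0), (math.inf, 100.0)]
p99 = histogram_quantile(0.99, latency)
print(p99 > 3)  # HighLatencyP99 condition
```

Because the estimate interpolates within a bucket, its accuracy depends on bucket boundaries; a 3-second threshold is only meaningful if the histogram has a boundary reasonably close to 3 seconds.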
# alertmanager.yml — Route alerts to the right team
global:
  # Placeholder, like <PD_KEY> below. Slack receivers need a webhook URL,
  # either here or as api_url on each slack_configs entry.
  slack_api_url: '<SLACK_WEBHOOK_URL>'
route:
  receiver: 'default-slack'
  group_by: ['alertname', 'service']
  group_wait: 30s
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-oncall'
    - match_re:
        service: ^(payment|billing)$
      receiver: 'payments-slack'
receivers:
  - name: 'default-slack'
    slack_configs:
      - channel: '#ops-alerts'
        send_resolved: true
  - name: 'pagerduty-oncall'
    pagerduty_configs:
      - service_key: '<PD_KEY>'
  - name: 'payments-slack'
    slack_configs:
      - channel: '#payments-alerts'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'service']
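The routing tree above can be read as a first-match decision: Alertmanager checks child routes in order and stops at the first match unless a route sets `continue: true`. The small Python sketch below mirrors that logic for this particular config; the function name `pick_receiver` and the sample alerts are hypothetical.

```python
import re


def pick_receiver(labels: dict[str, str]) -> str:
    """Return the receiver this routing tree would choose for an alert."""
    # Route 1: every critical alert pages the on-call engineer.
    if labels.get("severity") == "critical":
        return "pagerduty-oncall"
    # Route 2: non-critical payment and billing alerts go to the payments channel.
    if re.fullmatch(r"(payment|billing)", labels.get("service", "")):
        return "payments-slack"
    # No child route matched: fall back to the top-level receiver.
    return "default-slack"


for alert in (
    {"alertname": "HighErrorRate", "severity": "critical", "service": "payment"},
    {"alertname": "HighLatencyP99", "severity": "warning", "service": "billing"},
    {"alertname": "HighLatencyP99", "severity": "warning", "service": "inventory"},
):
    print(alert["service"], alert["severity"], "->", pick_receiver(alert))
```

One thing this makes visible: a critical payment alert goes to PagerDuty only, not to the payments Slack channel, because the critical route matches first. If both should be notified, the first route would need `continue: true`.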
5. Build Grafana Dashboards
Finally, Marta creates Grafana dashboards that pull metrics from Prometheus and link to Jaeger traces. When someone sees a latency spike on the dashboard, they click through to the exact traces from that time window.
She configures two data sources in Grafana — Prometheus at http://prometheus:9090 and Jaeger at http://jaeger-query:16686 — and uses Grafana's trace-to-metrics correlation to link them.
The team now has a single service overview dashboard showing request rate, error rate, and latency (the RED method) for every service, with drill-down to traces for any anomaly.
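One way to keep the RED panels consistent across services is to generate them. The Python sketch below builds trimmed-down panel definitions in the shape of Grafana's dashboard JSON (a `timeseries` panel with Prometheus `targets`), reusing the metric names from the alert rules. It is an assumption-laden sketch: the function name `red_panels` is hypothetical, the metrics are assumed to carry a `service` label, and a real dashboard would also need fields such as `gridPos` and a data source reference.

```python
import json


def red_panels(service: str) -> list[dict]:
    """Build Rate, Errors, and Duration panels for one service."""
    selector = f'service="{service}"'
    queries = {
        "Request rate": f"sum(rate(http_requests_total{{{selector}}}[5m]))",
        "Error rate": (
            f'sum(rate(http_requests_total{{{selector},status=~"5.."}}[5m]))'
            f" / sum(rate(http_requests_total{{{selector}}}[5m]))"
        ),
        "P99 latency": (
            "histogram_quantile(0.99, "
            f"sum(rate(http_request_duration_seconds_bucket{{{selector}}}[5m])) by (le))"
        ),
    }
    return [
        {
            "type": "timeseries",
            "title": f"{title} ({service})",
            "targets": [{"expr": expr, "refId": "A"}],
        }
        for title, expr in queries.items()
    ]


panels = red_panels("payment")
print(json.dumps(panels, indent=2))
```

Using the same expressions for dashboards and alert rules means that when an alert fires, the panel the on-call engineer opens shows exactly the number that crossed the threshold.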
The Result
Within the first week, Marta's team catches a memory leak in the inventory service through a gradual latency increase visible on the dashboard, before any customer notices. When the payment provider has a brief outage, Alertmanager pages the on-call engineer within 30 seconds of the alert firing, and they confirm the issue through Jaeger traces in under a minute. Mean time to detection drops from hours to minutes.