The Problem
A platform team runs 18 microservices and learns about outages from customer support tickets. Logs are scattered across individual server files with no central aggregation. Metrics exist for CPU and memory but not for business-level indicators like order completion rate or payment success rate. The team has 340 alert rules, most inherited from a previous engineer, and 60% of them fire weekly without anyone acting on them. Real issues get buried in alert noise, and the mean time to detect a production problem is 47 minutes.
The Solution
Using the observability-setup skill to build a centralized logging and metrics pipeline, the log-analyzer skill to parse unstructured logs and surface actionable patterns, and the alert-optimizer skill to eliminate noisy alerts and create meaningful ones tied to user-facing impact.
Step-by-Step Walkthrough
1. Set up centralized log aggregation
Route all service logs to a central store with structured formatting.
Set up a centralized logging pipeline for our 18 microservices. Configure each service to emit structured JSON logs with consistent fields: timestamp, service name, request ID, log level, and duration. Route everything to a Loki instance with 30-day retention. Create a Grafana dashboard showing log volume by service and error rate trends.
Structured JSON logs replace the current mix of plain text formats. Every log line includes a request ID that traces a single user action across all 18 services. Querying "show me everything that happened for request abc-123" now takes seconds instead of SSHing into multiple servers and grepping timestamps.
2. Analyze log patterns and surface anomalies
Use log analysis to find recurring errors and performance degradation.
Analyze the last 7 days of production logs. Find the top 10 recurring error patterns, identify services with increasing error rates, and flag any log patterns that correlate with the three customer-reported outages last week. Show me the error timeline overlaid with deployment events.
The log analyzer produces a structured report ranking every recurring pattern by frequency and impact:
LOG PATTERN ANALYSIS — Last 7 Days
===================================
Top Error Patterns (by frequency):
# Pattern Count Services Trend
1 ConnectionPool exhausted 4,812 order-svc +340% daily at 14:00
2 Upstream timeout (5000ms) 3,291 payment-svc Steady
3 Retry storm (cascade) 2,877 payment,inventory Correlated w/ #2
4 JSON parse error (null body) 841 gateway Sporadic
5 Certificate expiry warning 623 auth-svc New this week
Outage Correlation:
Mon 09:14 order-svc deploy → pattern #1 spike (pool size reduced in config)
Wed 14:02 batch job start → pattern #1 + #2 cascade → customer reports
Fri 16:45 payment gateway 503 → pattern #2 + #3 → 12min full outage
The analysis reveals that 73% of errors come from just two sources: a retry storm between the payment and inventory services (one service timing out triggers cascading retries), and a database connection pool exhaustion that happens every day at 2 PM when a batch job runs. Both patterns were invisible when logs lived on individual servers.
3. Audit and optimize alert rules
Eliminate noisy alerts and create SLO-based alerting.
Audit our 340 alert rules. For each rule, show me: how many times it fired in the last 30 days, how many times someone acknowledged it, and how many led to an actual incident. Delete alerts that fired more than 10 times without acknowledgment. Replace individual service health checks with SLO-based alerts: order completion rate below 99.5%, payment success rate below 99.9%, and API p95 latency above 500ms.
The audit is brutal: 204 of 340 alerts fired in the last month with zero acknowledgments. These are pure noise. Deleting them and replacing with 12 SLO-based alerts reduces alert volume by 94% while improving detection of real issues. The team now gets alerted when users are actually impacted, not when a single pod restarts.
4. Create runbooks linked to alerts
Attach investigation steps to each remaining alert.
For each of the 12 remaining alert rules, create a runbook that includes: what the alert means in plain English, the first three diagnostic commands to run, likely root causes ranked by probability, and escalation contacts. Link each runbook to its alert in Grafana so the on-call engineer sees it immediately when paged.
Every alert now has a clear path forward. When the "order completion rate below 99.5%" alert fires at 3 AM, the on-call engineer sees the runbook immediately: check payment gateway status, check database connection pool, check the inventory service error rate. No more staring at a vague alert trying to figure out where to start.
Consider scheduling a monthly review of alert effectiveness. Track the ratio of alerts-to-incidents and the median time from alert to resolution. If an alert fires frequently but resolution always takes the same three steps, that is a candidate for automated remediation rather than human paging.
Real-World Example
Tanya's team detects production outages 47 minutes after they start -- usually when a customer complains. After building the observability pipeline, centralized logs reveal two chronic issues causing 73% of errors. The alert audit eliminates 204 noisy rules and replaces them with 12 SLO-based alerts tied to user impact. Mean time to detection drops from 47 minutes to 3 minutes. The next production issue -- a database connection pool exhaustion at 2 PM -- triggers an alert within 90 seconds, and the linked runbook guides the on-call engineer to the fix in under 5 minutes.