Reduce Alert Fatigue by Tuning Error Monitoring with AI — AI Workflow

Set up intelligent error monitoring and eliminate noisy alerts so your on-call team only gets paged for real incidents.

The Problem

Marcus is a backend engineer at a 20-person SaaS startup. His team of four developers rotates on-call weekly, and the pager has become a joke -- the kind nobody laughs at. Last month they logged 1,847 alert notifications. That's roughly 460 per person per week, or one every 22 minutes during waking hours.

The worst part isn't the volume -- it's what the volume hides. 80% of those alerts are duplicates, known issues, or low-severity noise that self-heals in seconds. Real incidents get buried in the flood. Last week, a payment processing outage went unnoticed for 23 minutes because the on-call engineer assumed it was just another noisy Redis timeout alert. It wasn't. That 23 minutes cost the company $4,200 in failed transactions and a very uncomfortable customer call.

The team set up Sentry months ago with the best intentions. Every error got an alert. Every alert went to PagerDuty. Now the on-call rotation is a punishment nobody volunteers for, and two engineers have privately mentioned it in their 1-on-1s as a reason they're considering leaving.

The Solution

Using the error-monitoring and alert-optimizer skills, the agent audits the existing error stream, groups errors by root cause, builds a severity model based on actual user impact, and generates new alert rules that route the right signal to the right channel. The goal: page only when it matters, Slack for things that can wait, and silence everything else.

Step-by-Step Walkthrough

Step 1: Audit the Error Landscape

Marcus exports 30 days of Sentry data and hands it to the agent.

Here's our Sentry project export (errors-export.json). Analyze the last 30 days of errors: group them by root cause, show me which ones fire most often, and flag any that are duplicates or already-resolved issues still triggering alerts.

The audit reveals where those 4,812 events are actually coming from:

Error Group	Events	% of Total	Impact
`TypeError: Cannot read property 'email' of undefined`	1,247	26%	Low -- only affects unauthenticated preview requests
`TimeoutError: Redis connection timed out`	891	19%	Medium -- causes ~200ms latency spike, auto-recovers in <5s
`PaymentProcessingError: Stripe webhook signature invalid`	312	6%	HIGH -- failed payments are silently dropped
7 duplicate groups mapping to the same root cause	1,089	23%	Mixed -- inflating alert count artificially
3 groups marked "resolved" but still re-triggering	284	6%	None -- these are ghosts
Everything else (94 error patterns)	989	20%	Mostly low

The picture is stark: one error responsible for a quarter of all alerts has no user impact at all, while the Stripe webhook issue -- the one that actually loses money -- accounts for just 6% of the noise. It's a needle-in-a-haystack problem, except the haystack pages you every 22 minutes.

Step 2: Define Severity Tiers Based on User Impact

Based on that audit, create a three-tier severity model: P1 (page immediately), P2 (Slack notify, respond within 1 hour), P3 (log only, review weekly). Use these criteria: P1 = revenue impact or data loss, P2 = degraded experience for >5% of users, P3 = everything else.

The severity model maps every error group into one of three buckets:

P1 -- Page immediately (8 error patterns): Payment failures, data corruption, auth system down, API returning 500s to >10% of requests. These are the ones worth waking someone up for.
P2 -- Slack notify, respond within 1 hour (15 error patterns): Elevated latency, degraded search results, third-party integration hiccups affecting >5% of users. Important, but not worth interrupting deep work.
P3 -- Log only, review weekly (94 error patterns): The TypeError on unauthenticated previews. The Redis timeout that self-heals. The deprecated API warnings. Everything that used to generate 80% of the noise.

The Redis timeout alone -- 19% of all pages -- drops to P3. It self-heals in under 5 seconds. Nobody needs to know about it at 3 AM.

Step 3: Generate New Alert Rules

Now generate Sentry alert rule configurations (as JSON) for each tier. P1 should page via PagerDuty, P2 should post to #engineering-alerts in Slack, P3 should just be captured with no notification. Merge the 7 duplicate groups.

The new configuration routes each tier to the right channel:

P1 rules route through PagerDuty with a 30-second acknowledgment window. The Stripe webhook error, which was previously indistinguishable from the noise, now pages immediately with a custom message that includes the failed payment amount and customer ID.
P2 rules post to #engineering-alerts in Slack with a 1-hour response SLA. Messages include a direct link to the Sentry issue and a suggested first diagnostic step.
P3 rules capture events silently. No notification, no Slack message, no page. A weekly digest email summarizes P3 trends for the Monday team review.

The 7 duplicate groups get merged into their root causes using Sentry's fingerprinting rules, so a single underlying issue produces a single alert instead of seven.

Step 4: Create On-Call Runbooks

For each P1 alert, generate a runbook entry: what the alert means, likely root cause, first three diagnostic steps, and escalation path. Format as markdown I can add to our wiki.

Each P1 alert now has a companion runbook. When the Stripe webhook alert fires at 2 AM, the on-call engineer doesn't have to reverse-engineer what went wrong -- the runbook says: check the webhook signing secret in production config, verify Stripe's status page, inspect the last 10 failed events in the Sentry dashboard. Escalation path: if unresolved in 15 minutes, page the payments team lead.

This turns a panicked investigation into a checklist. Mean time to resolution drops because the first responder isn't starting from zero.

Step 5: Set Up Proactive Anomaly Detection

Finally, create alert rules that detect anomalies -- not just thresholds. Set up a rule that pages when the error rate for any endpoint exceeds 3 standard deviations above its 7-day rolling average. This way we catch new issues we haven't seen before, not just known error types.

Three anomaly detection rules go live:

Unusual Error Spike: Fires when any endpoint's error count in a 5-minute window exceeds 3 standard deviations above its 7-day rolling average. Starts as P2, auto-promotes to P1 if the spike persists for more than 15 minutes. 30-minute cooldown to avoid re-alerting on the same incident.
New Error Type: Fires on the first occurrence of an error fingerprint not seen in the last 30 days. Routed to Slack with the full stack trace and affected endpoint. This catches the problems you haven't anticipated.
Error Rate Ratio: Monitors the ratio of 5xx responses to total responses per 5-minute window. P1 if the ratio exceeds 10%, P2 if between 5-10%. The baseline is typically 0.3%, so a 5% ratio means something is seriously wrong.

These rules catch the unknown unknowns -- the class of problems that threshold-based alerts miss entirely.

Real-World Example

Marcus applies the tiered model on a Tuesday morning. The numbers tell the story immediately: 62% of the previous month's 1,847 alerts came from just three known, low-impact issues. Those all drop to P3.

Within the first week, the on-call engineer's alert volume drops from 460 per week to 35 -- and every single one is actionable. On Thursday, the anomaly detection rule catches a new error pattern from a deploy that would have been buried in the old noise: a database connection pool leak causing intermittent timeouts. It gets fixed in 20 minutes.

Mean time to acknowledge real incidents drops from 23 minutes to 4 minutes. The on-call rotation stops being a punishment. Two months later, nobody has mentioned leaving over pager fatigue in their 1-on-1s. The team actually volunteers for on-call now -- partly because the alerts are manageable, and partly because the runbooks make incidents feel solvable instead of terrifying.