Terminal.skills
Use Cases/Set Up Canary Deployments with Automated Rollback Using AI

Set Up Canary Deployments with Automated Rollback Using AI

Configure canary deployment pipelines that gradually roll out changes and automatically roll back on error spikes.

DevOps#cicd#github-actions#gitlab-ci#pipeline#deployment
Works with:claude-codeopenai-codexgemini-clicursor
$

The Problem

Tomas runs DevOps at a 20-person fintech startup. They deploy straight to production with a basic blue-green setup, and every other week a deploy causes a spike in 5xx errors. It takes 15-30 minutes for someone to notice, then another 10 minutes to roll back manually. During the last incident, a bad deploy ran for 25 minutes before anyone realized the error rate had tripled -- 40% of active users hit payment processing errors. For a fintech app that handles payments, 25 minutes of elevated errors means failed transactions, confused customers, and support tickets that take days to resolve.

The team knows they need gradual rollouts with automatic rollback. But between researching service mesh traffic splitting, configuring health checks with the right Prometheus queries, writing rollback triggers that do not false-alarm on normal traffic noise, and wiring it all into the CI pipeline, nobody has the bandwidth to set it up from scratch. The irony is that every bad deploy makes the team more cautious, which slows down releases, which means bigger batches of changes per deploy, which makes each deploy riskier.

The Solution

Using the cicd-pipeline, coding-agent, and analytics-tracking skills, this walkthrough builds a canary deployment system on Kubernetes: traffic shifts from 5% to 25% to 100% over defined hold periods, with Prometheus health checks that trigger automatic rollback if error rates or latency cross thresholds.

Step-by-Step Walkthrough

Step 1: Define the Canary Strategy

Tomas starts by describing what he needs:

Set up canary deployments for our Node.js API running on Kubernetes. I want traffic to shift gradually: 5% for 5 minutes, 25% for 10 minutes, then 100%. Auto-rollback if error rate exceeds 1% or p99 latency goes above 500ms during any canary phase. We use Prometheus for metrics.

Four files come out of this:

FilePurpose
k8s/canary-deployment.yamlCanary pod spec with resource limits matching production
k8s/canary-service.yamlService mesh traffic splitting rules
k8s/canary-analysis-template.yamlPrometheus metric queries for health checks
.github/workflows/canary-deploy.ymlCI pipeline with staged rollout

The traffic schedule is deliberately conservative:

  • Stage 1: 5% of traffic for 5 minutes, then check metrics
  • Stage 2: 25% of traffic for 10 minutes, then check metrics
  • Stage 3: 100% -- promoted to stable

If anything goes wrong at Stage 1, only 5% of users are affected, and rollback happens in under 90 seconds. Compare that to the current state: 100% of users affected for 25+ minutes.

An important detail in the canary deployment spec: the canary pods have the exact same resource limits as production pods. If the canary runs with different CPU or memory limits, latency measurements are meaningless -- a canary that looks fine with generous resources might fall over under production constraints.

Step 2: Configure Health Metric Queries

The canary needs to know what "healthy" looks like. Three Prometheus queries evaluate canary health at each stage:

Error rate -- the percentage of 5xx responses over a 2-minute window:

promql
rate(http_requests_total{status=~"5.."}[2m]) / rate(http_requests_total[2m])

Threshold: must stay below 1%.

Latency p99 -- the 99th percentile response time:

promql
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[2m]))

Threshold: must stay below 500ms.

Success rate -- the inverse of error rate, as a sanity check:

promql
1 - (rate(http_requests_total{status=~"5.."}[2m]) / rate(http_requests_total[2m]))

Threshold: must stay above 99%.

Evaluation runs every 60 seconds during canary phases. If any threshold is breached for 2 consecutive checks, rollback triggers automatically. The 2-consecutive-check requirement is important -- a single noisy data point (a spike from one slow downstream call, for example) should not trigger a false rollback. But two consecutive breaches indicate a real problem.

The 2-minute query window ([2m]) is calibrated for the 5% traffic stage. At low traffic volumes, a wider window smooths out noise. At the 25% stage, the same window provides more statistical significance because more requests flow through.

Step 3: Validate with a Dry Run

Before trusting this with a real release, Tomas runs the pipeline with the current stable version:

Run a dry-run canary deployment using our current stable version to validate the pipeline works end-to-end without risk.

The dry run deploys the exact same version as a "canary" to verify three things: traffic splitting works correctly (5% of requests actually hit the canary pods), metrics flow into the analysis template and evaluate correctly, and the rollback trigger fires when manually provoked by injecting artificial errors.

This step catches configuration issues before they matter. In Tomas's case, the dry run reveals that the Prometheus service discovery labels don't match the canary pod labels -- the health checks were evaluating against zero data points and silently passing. Without the dry run, the first real canary deploy would have had no health protection at all.

Step 4: Wire Up Deployment Notifications

Every stage transition posts to Slack so the team knows what is happening without watching a dashboard:

  • Canary started: "Deploying v2.4.1 -- canary at 5%"
  • Stage promoted: "Canary v2.4.1 promoted to 25% -- metrics healthy"
  • Rollback triggered: "Canary v2.4.1 rolled back -- p99 latency 720ms exceeds 500ms threshold"
  • Deployment complete: "v2.4.1 fully promoted to production"

The rollback notification includes the specific metric that triggered it, the value at the time, and a link to the Grafana dashboard for the canary analysis period. No digging through dashboards to figure out what happened -- the notification tells the whole story.

Step 5: Review Rollback History

After a few weeks of running, Tomas wants to spot patterns in what goes wrong:

Show me all canary rollbacks from the last 30 days with the metric that triggered each one and how long the canary ran before rollback.

The deployment logs produce a summary like this:

DateVersionRollback StageTrigger MetricTime to Rollback
Feb 3v2.3.8Stage 1 (5%)p99 latency 720ms3m 12s
Feb 10v2.4.0Stage 1 (5%)Error rate 2.3%1m 48s
Feb 14v2.4.2Stage 2 (25%)Error rate 1.4%6m 30s

The summary reveals which types of changes tend to fail, which health metrics catch problems first, and whether the hold times are calibrated correctly. If most rollbacks happen in the first 2 minutes, the 5-minute hold at Stage 1 is providing plenty of safety margin. If they happen at Stage 2 (like the Feb 14 deploy), it means some issues only surface under higher traffic -- and the 10-minute hold is earning its keep.

The rollback history also becomes a feedback loop for the development team. A pattern of latency regressions from a particular service component points to an architectural issue worth addressing, not just a series of individual bad deploys.

Real-World Example

On the first real canary deploy after setup, the new version shows elevated p99 latency at the 5% stage -- 720ms against a 500ms threshold. The system automatically rolls back in under 90 seconds. Instead of 40% of users hitting payment errors for 25 minutes (like the last incident), only 5% of users experienced slightly slow responses for less than 2 minutes.

Tomas reviews the rollback report, identifies a missing database index in the new code path, and the developer fixes it in an hour. The next deploy sails through all three stages cleanly.

Over the following month, two more bad deploys are caught at the 5% stage. The first is a memory leak that would have eventually crashed the pods -- caught because the canary pod's memory usage climbed during the hold period, triggering a resource alert. The second is an error rate spike from a misconfigured feature flag. Both are caught before they reach more than 5% of users.

User-facing incidents drop from biweekly to zero. The team deploys with confidence instead of dread -- and the on-call engineer stops waking up at 2 AM to manually roll back broken releases. Perhaps most importantly, the team starts deploying more frequently. Smaller, more frequent deploys mean less risk per deploy, which means fewer rollbacks, which means more confidence. The canary system breaks the vicious cycle of "big scary deploys" that was slowing the team down.