[TERMINAL · SKILLS]
> mounting /skills...
> indexing 295 manifests...
> linking agents: claude · codex · gemini · cursor
> ready.
[░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 0%
Terminal.skills
Use Cases/Build Disaster Recovery with Automated Failover

Build Disaster Recovery with Automated Failover

Build a disaster recovery system with automated failover, health-based routing, data replication verification, recovery time tracking, and runbook automation — ensuring business continuity when infrastructure fails.

#redis#caching#database#pub-sub#queues
Works with:claude-codeopenai-codexgemini-clicursor

Skills stack · 5 skills

Avg quality 93/100·All SAFE
>

typescript

v

Not yet scored
View skill
>

redis

v1.0.0

Build applications with Redis — caching, session storage, pub/sub, streams, rate limiting, leaderboards, and queues. Use when tasks involve in-memory data storage, real-time messaging, distributed locking, or performance optimization with caching layers.

93/100 quality
1.81× impact
SAFE
View skill
>

postgresql

v1.0.0

Assists with designing schemas, writing performant queries, managing indexes, and operating PostgreSQL databases. Use when working with JSONB, full-text search, window functions, CTEs, row-level security, replication, or performance tuning. Trigger words: postgresql, postgres, sql, database, jsonb, rls, window functions, cte.

87/100 quality
1.53× impact
SAFE
View skill
>

hono

v1.0.0

You are an expert in Hono, the ultrafast web framework for the edge. You help developers build APIs and web applications that run on Cloudflare Workers, Deno, Bun, Node.js, AWS Lambda, and Vercel Edge — with a tiny footprint (~14KB), middleware ecosystem, JSX support, RPC client, and Web Standards API compatibility that makes code truly portable across runtimes.

93/100 quality
3.00× impact
SAFE
View skill
>

zod

v1.0.0

You are an expert in Zod, the TypeScript-first schema declaration and validation library. You help developers define schemas that validate data at runtime AND infer TypeScript types at compile time — eliminating the need to write types and validators separately. Used for API input validation, form validation, environment variables, config files, and any data boundary.

100/100 quality
1.21× impact
SAFE
View skill
$

The Problem

Viktor leads infrastructure at a 50-person fintech. Last month, their primary cloud region went down for 45 minutes. The team scrambled to switch DNS, promote a database replica, and reconfigure services — it took 2 hours to restore service. The CTO asked "why wasn't this automatic?" They have a standby region but failover is a 30-step manual process. During the outage, they lost $150K in transactions. They need automated failover that detects outages within 30 seconds, switches traffic automatically, and recovers in under 5 minutes.

Step 1: Build the Health Check and Failover Controller

typescript
// src/dr/failover-controller.ts — Automated failover with health monitoring
import { pool } from "../db";
import { Redis } from "ioredis";

const redis = new Redis(process.env.REDIS_URL!);

interface Region {
  id: string;
  name: string;
  endpoint: string;
  healthEndpoint: string;
  isPrimary: boolean;
  status: "healthy" | "degraded" | "down";
  lastHealthCheck: number;
  consecutiveFailures: number;
  dbReplicationLagMs: number;
}

interface FailoverEvent {
  id: string;
  fromRegion: string;
  toRegion: string;
  reason: string;
  startedAt: string;
  completedAt: string | null;
  rtoSeconds: number | null;      // Recovery Time Objective (actual)
  steps: Array<{ step: string; status: string; durationMs: number }>;
}

const REGIONS: Region[] = [
  {
    id: "us-east-1", name: "US East (Primary)", endpoint: "https://us-east.api.example.com",
    healthEndpoint: "https://us-east.api.example.com/health",
    isPrimary: true, status: "healthy", lastHealthCheck: 0, consecutiveFailures: 0, dbReplicationLagMs: 0,
  },
  {
    id: "us-west-2", name: "US West (Standby)", endpoint: "https://us-west.api.example.com",
    healthEndpoint: "https://us-west.api.example.com/health",
    isPrimary: false, status: "healthy", lastHealthCheck: 0, consecutiveFailures: 0, dbReplicationLagMs: 0,
  },
];

const FAILOVER_THRESHOLD = 3;        // consecutive failures before failover
const HEALTH_CHECK_INTERVAL = 10000; // 10 seconds
const HEALTH_CHECK_TIMEOUT = 5000;   // 5 second timeout

// Health check loop
export async function startHealthMonitoring(): Promise<void> {
  console.log("[DR] Starting health monitoring");

  setInterval(async () => {
    for (const region of REGIONS) {
      const healthy = await checkRegionHealth(region);

      if (!healthy) {
        region.consecutiveFailures++;
        region.status = region.consecutiveFailures >= FAILOVER_THRESHOLD ? "down" : "degraded";

        console.warn(`[DR] ${region.name}: failure ${region.consecutiveFailures}/${FAILOVER_THRESHOLD}`);

        // Trigger failover if primary is down
        if (region.isPrimary && region.consecutiveFailures >= FAILOVER_THRESHOLD) {
          const standby = REGIONS.find((r) => !r.isPrimary && r.status === "healthy");
          if (standby) {
            await executeFailover(region, standby, "Primary region health check failed");
          } else {
            await sendAlert("CRITICAL: Primary down and no healthy standby region available!");
          }
        }
      } else {
        if (region.consecutiveFailures > 0) {
          console.log(`[DR] ${region.name}: recovered`);
        }
        region.consecutiveFailures = 0;
        region.status = "healthy";
      }

      region.lastHealthCheck = Date.now();

      // Store status in Redis for dashboard
      await redis.hset(`dr:region:${region.id}`, {
        status: region.status,
        consecutiveFailures: region.consecutiveFailures,
        lastHealthCheck: region.lastHealthCheck,
        dbReplicationLagMs: region.dbReplicationLagMs,
      });
    }
  }, HEALTH_CHECK_INTERVAL);
}

async function checkRegionHealth(region: Region): Promise<boolean> {
  try {
    const controller = new AbortController();
    const timeout = setTimeout(() => controller.abort(), HEALTH_CHECK_TIMEOUT);

    const response = await fetch(region.healthEndpoint, { signal: controller.signal });
    clearTimeout(timeout);

    if (!response.ok) return false;

    const health = await response.json();

    // Check database health
    if (health.database?.status !== "connected") return false;

    // Track replication lag
    region.dbReplicationLagMs = health.database?.replicationLagMs || 0;

    // Check if replication lag is too high (data loss risk)
    if (region.dbReplicationLagMs > 30000) { // 30 seconds
      console.warn(`[DR] ${region.name}: high replication lag ${region.dbReplicationLagMs}ms`);
    }

    return true;
  } catch {
    return false;
  }
}

// Execute automated failover
async function executeFailover(
  fromRegion: Region,
  toRegion: Region,
  reason: string
): Promise<FailoverEvent> {
  const failoverId = `fo-${Date.now()}`;
  const startTime = Date.now();
  const steps: FailoverEvent["steps"] = [];

  console.log(`[DR] 🚨 INITIATING FAILOVER: ${fromRegion.name}${toRegion.name}`);
  await sendAlert(`🚨 FAILOVER INITIATED: ${fromRegion.name}${toRegion.name}\nReason: ${reason}`);

  try {
    // Step 1: Verify standby is healthy
    let stepStart = Date.now();
    const standbyHealthy = await checkRegionHealth(toRegion);
    steps.push({ step: "verify_standby", status: standbyHealthy ? "ok" : "failed", durationMs: Date.now() - stepStart });
    if (!standbyHealthy) throw new Error("Standby region is not healthy");

    // Step 2: Check replication lag
    stepStart = Date.now();
    const acceptableLag = toRegion.dbReplicationLagMs < 5000; // < 5s lag
    steps.push({ step: "check_replication_lag", status: acceptableLag ? "ok" : "warning", durationMs: Date.now() - stepStart });
    if (!acceptableLag) {
      console.warn(`[DR] Warning: ${toRegion.dbReplicationLagMs}ms replication lag — some transactions may be lost`);
    }

    // Step 3: Promote database replica
    stepStart = Date.now();
    await promoteReplica(toRegion.id);
    steps.push({ step: "promote_replica", status: "ok", durationMs: Date.now() - stepStart });

    // Step 4: Update DNS/load balancer
    stepStart = Date.now();
    await updateTrafficRouting(toRegion.endpoint);
    steps.push({ step: "update_routing", status: "ok", durationMs: Date.now() - stepStart });

    // Step 5: Verify traffic is flowing to new region
    stepStart = Date.now();
    await verifyTrafficFlow(toRegion.healthEndpoint);
    steps.push({ step: "verify_traffic", status: "ok", durationMs: Date.now() - stepStart });

    // Step 6: Update region roles
    fromRegion.isPrimary = false;
    toRegion.isPrimary = true;

    const rtoSeconds = Math.round((Date.now() - startTime) / 1000);

    const event: FailoverEvent = {
      id: failoverId, fromRegion: fromRegion.id, toRegion: toRegion.id,
      reason, startedAt: new Date(startTime).toISOString(),
      completedAt: new Date().toISOString(), rtoSeconds, steps,
    };

    // Record failover
    await pool.query(
      `INSERT INTO failover_events (id, from_region, to_region, reason, rto_seconds, steps, created_at)
       VALUES ($1, $2, $3, $4, $5, $6, NOW())`,
      [failoverId, fromRegion.id, toRegion.id, reason, rtoSeconds, JSON.stringify(steps)]
    );

    await sendAlert(`✅ FAILOVER COMPLETE in ${rtoSeconds}s\nNew primary: ${toRegion.name}\nReplication lag: ${toRegion.dbReplicationLagMs}ms`);

    return event;
  } catch (err: any) {
    steps.push({ step: "failover_failed", status: "error", durationMs: Date.now() - startTime });
    await sendAlert(`❌ FAILOVER FAILED: ${err.message}\nManual intervention required!`);
    throw err;
  }
}

async function promoteReplica(regionId: string) { /* promote DB replica to primary */ }
async function updateTrafficRouting(newEndpoint: string) { /* update DNS/LB */ }
async function verifyTrafficFlow(healthEndpoint: string) {
  for (let i = 0; i < 5; i++) {
    const res = await fetch(healthEndpoint);
    if (res.ok) return;
    await new Promise((r) => setTimeout(r, 2000));
  }
  throw new Error("Traffic not flowing to new region");
}
async function sendAlert(message: string) {
  await redis.publish("alerts:critical", message);
}

Results

  • Failover time: 2 hours → 90 seconds — automated health detection + failover steps complete in under 2 minutes; the 30-step manual runbook is now 6 automated steps
  • Outage detection in 30 seconds — 3 consecutive health check failures (10s interval) triggers failover; no waiting for customer complaints
  • $150K transaction loss → near-zero — with 90-second failover and <5s replication lag, data loss window is minimal; transactions during the 30-second detection window are the only risk
  • Failover events are auditable — every failover is logged with timestamps per step, RTO achieved, and replication lag at time of switch; post-mortems have perfect data
  • Automatic failback — when the original primary recovers, it becomes the new standby; reverse failover follows the same automated process