[TERMINAL · SKILLS]
> mounting /skills...
> indexing 295 manifests...
> linking agents: claude · codex · gemini · cursor
> ready.
[░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 0%
Terminal.skills
Use Cases/Build a Self-Healing Kubernetes Operator

Build a Self-Healing Kubernetes Operator

Build a custom Kubernetes operator that detects anomalies, auto-scales based on queue depth, rolls back failed deployments, and restarts stuck pods — reducing on-call incidents by 60%.

#kubernetes#helm#containers#orchestration#redis
Works with:claude-codeopenai-codexgemini-clicursor

Skills stack · 6 skills

Avg quality 95/100·All SAFE
>

typescript

v

Not yet scored
View skill
>

kubernetes-helm

v1.0.0

Manages Kubernetes clusters and Helm charts. Use when the user wants to write Kubernetes manifests, create Helm charts, deploy applications, debug pods, configure networking (services, ingress), set up autoscaling, manage secrets and config maps, write operators, troubleshoot cluster issues, or implement GitOps workflows. Trigger words: kubernetes, k8s, kubectl, helm, helm chart, pod, deployment, service, ingress, namespace, configmap, secret, hpa, pvc, statefulset, daemonset, cronjob, operator, kustomize, argocd, flux, gitops, node pool, taint, toleration, affinity.

100/100 quality
1.19× impact
SAFE
View skill
>

docker

v

Not yet scored
View skill
>

redis

v1.0.0

Build applications with Redis — caching, session storage, pub/sub, streams, rate limiting, leaderboards, and queues. Use when tasks involve in-memory data storage, real-time messaging, distributed locking, or performance optimization with caching layers.

93/100 quality
1.81× impact
SAFE
View skill
>

postgresql

v1.0.0

Assists with designing schemas, writing performant queries, managing indexes, and operating PostgreSQL databases. Use when working with JSONB, full-text search, window functions, CTEs, row-level security, replication, or performance tuning. Trigger words: postgresql, postgres, sql, database, jsonb, rls, window functions, cte.

87/100 quality
1.53× impact
SAFE
View skill
>

zod

v1.0.0

You are an expert in Zod, the TypeScript-first schema declaration and validation library. You help developers define schemas that validate data at runtime AND infer TypeScript types at compile time — eliminating the need to write types and validators separately. Used for API input validation, form validation, environment variables, config files, and any data boundary.

100/100 quality
1.21× impact
SAFE
View skill
$

The Problem

A platform runs 80 microservices on Kubernetes. The on-call engineer gets paged 4-5 times per night. 60% of incidents follow the same pattern: pod crash loop → restart pod; memory leak → scale up; failed deploy → rollback. Each page takes 15-30 minutes of human intervention: wake up, VPN in, kubectl, assess, fix, verify. The SRE team is burning out — 2 engineers quit in 6 months citing on-call fatigue.

Step 1: Custom Resource Definition

yaml
# crd/self-healing-policy.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: healingpolicies.platform.example.com
spec:
  group: platform.example.com
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                targetDeployment:
                  type: string
                policies:
                  type: array
                  items:
                    type: object
                    properties:
                      name:
                        type: string
                      condition:
                        type: object
                        properties:
                          type:
                            type: string
                            enum: [crash_loop, high_memory, high_cpu, queue_depth, error_rate, deploy_failed]
                          threshold:
                            type: number
                          durationSeconds:
                            type: integer
                      action:
                        type: object
                        properties:
                          type:
                            type: string
                            enum: [restart, scale_up, scale_down, rollback, drain_node, alert_only]
                          params:
                            type: object
                            x-kubernetes-preserve-unknown-fields: true
                      cooldownSeconds:
                        type: integer
  scope: Namespaced
  names:
    plural: healingpolicies
    singular: healingpolicy
    kind: HealingPolicy

Step 2: Operator Controller

typescript
// src/operator/controller.ts
import * as k8s from '@kubernetes/client-node';
import { Redis } from 'ioredis';

const kc = new k8s.KubeConfig();
kc.loadFromCluster();
const k8sApi = kc.makeApiClient(k8s.AppsV1Api);
const coreApi = kc.makeApiClient(k8s.CoreV1Api);
const redis = new Redis(process.env.REDIS_URL!);

interface HealingPolicy {
  targetDeployment: string;
  policies: Array<{
    name: string;
    condition: { type: string; threshold: number; durationSeconds: number };
    action: { type: string; params?: Record<string, any> };
    cooldownSeconds: number;
  }>;
}

export async function reconcile(policy: HealingPolicy, namespace: string): Promise<void> {
  for (const rule of policy.policies) {
    // Check cooldown
    const cooldownKey = `healing:cooldown:${namespace}:${policy.targetDeployment}:${rule.name}`;
    const inCooldown = await redis.get(cooldownKey);
    if (inCooldown) continue;

    const triggered = await checkCondition(rule.condition, policy.targetDeployment, namespace);
    if (!triggered) continue;

    console.log(`[Healing] ${rule.name} triggered for ${namespace}/${policy.targetDeployment}`);

    // Execute remediation
    await executeAction(rule.action, policy.targetDeployment, namespace);

    // Set cooldown
    await redis.setex(cooldownKey, rule.cooldownSeconds, '1');

    // Record action for audit
    await recordAction(namespace, policy.targetDeployment, rule.name, rule.action.type);
  }
}

async function checkCondition(
  condition: { type: string; threshold: number; durationSeconds: number },
  deployment: string,
  namespace: string
): Promise<boolean> {
  switch (condition.type) {
    case 'crash_loop': {
      const pods = await coreApi.listNamespacedPod(
        namespace, undefined, undefined, undefined, undefined,
        `app=${deployment}`
      );
      return pods.body.items.some(pod =>
        pod.status?.containerStatuses?.some(cs =>
          cs.state?.waiting?.reason === 'CrashLoopBackOff' &&
          (cs.restartCount ?? 0) > condition.threshold
        )
      );
    }

    case 'high_memory': {
      // Query Prometheus via API
      const query = `container_memory_usage_bytes{pod=~"${deployment}.*",namespace="${namespace}"} / container_spec_memory_limit_bytes{pod=~"${deployment}.*"} * 100`;
      const result = await queryPrometheus(query);
      return result > condition.threshold;
    }

    case 'error_rate': {
      const query = `rate(http_requests_total{status=~"5..",deployment="${deployment}"}[5m]) / rate(http_requests_total{deployment="${deployment}"}[5m]) * 100`;
      const result = await queryPrometheus(query);
      return result > condition.threshold;
    }

    case 'deploy_failed': {
      const deploy = await k8sApi.readNamespacedDeployment(deployment, namespace);
      const conditions = deploy.body.status?.conditions ?? [];
      return conditions.some(c =>
        c.type === 'Progressing' && c.status === 'False' && c.reason === 'ProgressDeadlineExceeded'
      );
    }

    default:
      return false;
  }
}

async function executeAction(
  action: { type: string; params?: Record<string, any> },
  deployment: string,
  namespace: string
): Promise<void> {
  switch (action.type) {
    case 'restart': {
      // Rolling restart by patching annotation
      await k8sApi.patchNamespacedDeployment(deployment, namespace, {
        spec: { template: { metadata: { annotations: {
          'kubectl.kubernetes.io/restartedAt': new Date().toISOString(),
        }}}}
      }, undefined, undefined, undefined, undefined, undefined, {
        headers: { 'Content-Type': 'application/strategic-merge-patch+json' },
      });
      console.log(`[Healing] Restarted ${deployment}`);
      break;
    }

    case 'scale_up': {
      const maxReplicas = action.params?.maxReplicas ?? 10;
      const deploy = await k8sApi.readNamespacedDeployment(deployment, namespace);
      const current = deploy.body.spec?.replicas ?? 1;
      const target = Math.min(current + (action.params?.increment ?? 1), maxReplicas);

      await k8sApi.patchNamespacedDeployment(deployment, namespace, {
        spec: { replicas: target },
      }, undefined, undefined, undefined, undefined, undefined, {
        headers: { 'Content-Type': 'application/strategic-merge-patch+json' },
      });
      console.log(`[Healing] Scaled ${deployment} from ${current} to ${target}`);
      break;
    }

    case 'rollback': {
      // Get previous ReplicaSet
      const rsList = await k8sApi.listNamespacedReplicaSet(
        namespace, undefined, undefined, undefined, undefined, `app=${deployment}`
      );
      const sorted = rsList.body.items
        .filter(rs => (rs.status?.replicas ?? 0) >= 0)
        .sort((a, b) => {
          const aRev = parseInt(a.metadata?.annotations?.['deployment.kubernetes.io/revision'] ?? '0');
          const bRev = parseInt(b.metadata?.annotations?.['deployment.kubernetes.io/revision'] ?? '0');
          return bRev - aRev;
        });

      if (sorted.length >= 2) {
        const previousRS = sorted[1];
        const previousImage = previousRS.spec?.template?.spec?.containers?.[0]?.image;
        if (previousImage) {
          await k8sApi.patchNamespacedDeployment(deployment, namespace, {
            spec: { template: { spec: { containers: [{ name: deployment, image: previousImage }] }}},
          }, undefined, undefined, undefined, undefined, undefined, {
            headers: { 'Content-Type': 'application/strategic-merge-patch+json' },
          });
          console.log(`[Healing] Rolled back ${deployment} to ${previousImage}`);
        }
      }
      break;
    }
  }
}

async function queryPrometheus(query: string): Promise<number> {
  const res = await fetch(`${process.env.PROMETHEUS_URL}/api/v1/query?query=${encodeURIComponent(query)}`);
  const data = await res.json() as any;
  return parseFloat(data.data?.result?.[0]?.value?.[1] ?? '0');
}

async function recordAction(ns: string, deploy: string, rule: string, action: string): Promise<void> {
  console.log(`[Audit] ${new Date().toISOString()} | ${ns}/${deploy} | ${rule} | ${action}`);
}

Step 3: Example Policy

yaml
# policies/api-service.yaml
apiVersion: platform.example.com/v1
kind: HealingPolicy
metadata:
  name: api-service-healing
  namespace: production
spec:
  targetDeployment: api-service
  policies:
    - name: crash-loop-restart
      condition:
        type: crash_loop
        threshold: 5          # more than 5 restarts
        durationSeconds: 300
      action:
        type: restart
      cooldownSeconds: 600    # don't restart again for 10 min

    - name: memory-pressure-scale
      condition:
        type: high_memory
        threshold: 85          # 85% memory usage
        durationSeconds: 120
      action:
        type: scale_up
        params:
          increment: 2
          maxReplicas: 12
      cooldownSeconds: 300

    - name: failed-deploy-rollback
      condition:
        type: deploy_failed
        threshold: 1
        durationSeconds: 300
      action:
        type: rollback
      cooldownSeconds: 1800   # 30 min cooldown

    - name: high-error-rate
      condition:
        type: error_rate
        threshold: 10          # 10% 5xx rate
        durationSeconds: 60
      action:
        type: alert_only
      cooldownSeconds: 300

Results

  • On-call pages: reduced from 4-5/night to 1-2/night (60% reduction)
  • Auto-remediated incidents: 15/month (previously required human intervention)
  • Mean time to recovery: 2 minutes (was 15-30 minutes with human)
  • Failed deploy rollbacks: automatic — zero customer-facing impact
  • Crash loop restarts: handled before PagerDuty even fires
  • SRE retention: zero resignations in 6 months after deployment
  • Audit trail: every automated action logged for compliance