The Problem
Kai is a platform engineer at a 40-person startup running 22 microservices on Kubernetes. Service-to-service communication is a mess. Some services use HTTP, others gRPC. Timeout values are hardcoded differently in every service -- the order service waits 30 seconds for a payment response, which is 25 seconds too long. There's no mutual TLS, so any pod in the cluster can call any service. And when one service slows down, it cascades into a full outage because nothing trips the circuit.
Last month, the payment service had a latency spike. The order service kept retrying with its 30-second timeout, exhausting its connection pool. The API gateway backed up waiting on orders. Within 4 minutes, every service in the cluster was returning 503s. A single slow service took down the entire platform for 12 minutes.
The team decided to adopt Istio, but the configuration surface is enormous: VirtualServices, DestinationRules, PeerAuthentications, AuthorizationPolicies -- each service needs multiple YAML files, and getting one wrong means routing failures or broken auth. After two weeks of manual configuration, the team has three services onboarded and 19 to go. At this rate, the full rollout will take three months. Another cascading failure will happen before then.
The Solution
Use the coding-agent skill to analyze the existing service topology, generate Istio configuration for all 22 services, and wire up traffic policies. Use the docker-helper skill to verify sidecar injection and container configurations. Apply the security-audit skill to validate mTLS coverage and authorization policies across the mesh before anything goes live.
Step-by-Step Walkthrough
Step 1: Map the Current Service Topology
Before generating any Istio config, the full picture of what calls what needs to be clear. Many teams don't actually know their complete service dependency graph -- it exists as tribal knowledge spread across 22 different repos.
Read all Kubernetes manifests in /k8s/deployments/. For each service, extract: service name, port, protocol (HTTP/gRPC), dependencies (which other services it calls), and current timeout/retry settings if any. Output a service dependency graph.
The topology analysis produces the dependency graph:
api-gateway -> [user-service, order-service, product-service, notification-service]
order-service -> [payment-service, inventory-service, user-service]
payment-service -> [stripe-webhook, ledger-service]
notification-service -> [email-worker, sms-worker]
... (18 more services)
Protocol breakdown: 16 services communicate over HTTP, 6 use gRPC (user-service, product-service, search-service, and three internal data services).
The timeout analysis reveals the real problem:
| Category | Count | Issue |
|---|---|---|
| No timeout configured | 8 services | Will wait indefinitely for a response -- the root cause of cascading failures |
| Hardcoded 30s timeout | 5 services | Far too long for synchronous calls; allows connection pool exhaustion |
| Reasonable timeouts (2-10s) | 9 services | These are fine |
Eight services with no timeout at all. If any dependency hangs, those 8 services hang with it, and their callers hang waiting for them, and so on until the whole cluster is down. This is exactly what happened in last month's incident.
Step 2: Generate Istio Configurations for All 22 Services
Each service needs three Istio resources: a VirtualService for routing and timeouts, a DestinationRule for circuit breaking, and a PeerAuthentication for mTLS.
For each of the 22 services, generate: (1) a VirtualService with appropriate timeouts and retries based on the service type, (2) a DestinationRule with circuit breaker settings, (3) enable mTLS via PeerAuthentication in STRICT mode. Use 3s timeout for synchronous calls, 10s for batch operations, and 2 retries with exponential backoff.
Sixty-six YAML files generated in /k8s/istio/, organized by service:
# k8s/istio/user-service/virtual-service.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: user-service
spec:
hosts:
- user-service
http:
- route:
- destination:
host: user-service
timeout: 3s
retries:
attempts: 2
perTryTimeout: 2s
retryOn: 5xx,reset,connect-failure
Timeout values are calibrated to each service's role:
- Synchronous API calls (user-service, product-service): 3-second timeout. If a user lookup takes more than 3 seconds, something is wrong -- fail fast and let the caller handle it.
- Multi-downstream callers (order-service): 5-second timeout, because it calls multiple services sequentially and needs room for the chain.
- External integration (payment-service): 10-second timeout, because Stripe's API can legitimately take a few seconds under load. Only 1 retry -- payment operations shouldn't be retried aggressively.
- Background workers (email-worker, sms-worker): 10-second timeout with lenient circuit breaker settings, because a slow email delivery isn't user-facing.
Circuit breaker settings on every DestinationRule: eject a service instance after 5 consecutive 5xx errors, keep it ejected for 30 seconds, then let it back in. This is what prevents cascading failures -- when the payment service slows down, Istio ejects it instead of letting the order service pile up retries.
Step 3: Lock Down Service-to-Service Authorization
mTLS encrypts traffic, but it doesn't restrict who can call whom. AuthorizationPolicies enforce the dependency graph: a service only accepts traffic from its known callers.
Based on the service dependency graph, generate AuthorizationPolicies that restrict each service to only accept traffic from its known callers. For example, payment-service should only accept requests from order-service and stripe-webhook. Deny all other traffic.
Twenty-two authorization policies, one per service:
# k8s/istio/payment-service/authorization-policy.yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: payment-service
spec:
selector:
matchLabels:
app: payment-service
rules:
- from:
- source:
principals:
- "cluster.local/ns/default/sa/order-service"
- "cluster.local/ns/default/sa/stripe-webhook"
The payment-service only accepts traffic from order-service and stripe-webhook. Deny all others. If a compromised pod in the cluster tries to call the payment service directly, the request gets rejected at the mesh level before it ever reaches application code.
This is defense in depth -- mTLS ensures the connection is encrypted and authenticated, and the authorization policy ensures only the right services are making that connection. A lateral movement attack (compromising one service to reach others) gets significantly harder.
Step 4: Validate the Configuration
Sixty-six YAML files and 22 authorization policies is a lot of surface area for mistakes. Validation catches problems before they reach the cluster.
Dry-run all generated Istio configs against our cluster. Check for: YAML syntax errors, port mismatches, missing sidecar injection labels, and circular dependency issues. Also verify that mTLS is STRICT on all services with no PERMISSIVE exceptions.
The validation results:
| Check | Result |
|---|---|
| YAML syntax across 66 files | All pass |
| Namespace injection labels | All 22 namespaces have istio-injection=enabled |
| Sidecar annotations | 1 warning: email-worker missing explicit annotation (auto-injection will handle it, but explicit is safer) |
| mTLS mode | 1 exception needed: stripe-webhook requires PERMISSIVE because Stripe sends plain HTTP webhooks with no client certificate |
| Circular dependencies | None detected |
| Authorization policy references | All service accounts valid |
Two adjustments: add the explicit sidecar annotation to email-worker (a one-line change), and set stripe-webhook's PeerAuthentication to PERMISSIVE mode. The Stripe exception is unavoidable -- external webhook sources can't present mTLS certificates -- but it's scoped to a single service, not mesh-wide.
Step 5: Plan a Phased Rollout
Applying 66 YAML files to a production cluster at once is how outages happen. A phased rollout lets the team validate each layer before moving to the next.
Generate a phased rollout plan. Start with non-critical services, then internal services, then the api-gateway last. Include kubectl commands and rollback steps for each phase.
Four phases over one week:
Phase 1 (Monday) -- Background workers: email-worker, sms-worker, and 3 internal data processors. These are non-customer-facing, so any misconfiguration has minimal blast radius. Smoke test: send a test email, verify delivery. Rollback: kubectl delete -f k8s/istio/email-worker/.
Phase 2 (Wednesday) -- Internal services: ledger-service, inventory-service, search-service, and 4 more. These support the core services but aren't directly customer-facing. Smoke test: verify inventory queries return correct data, search results are accurate. Rollback: delete the Istio resources for the affected service.
Phase 3 (Thursday) -- Core services: user-service, order-service, payment-service, product-service, notification-service. This is where it matters most. Deploy during low-traffic hours. Smoke test: complete a full purchase flow end-to-end. Rollback: single-command deletion per service.
Phase 4 (Friday) -- API gateway: the front door. Apply Istio config, verify all routes respond, test with synthetic traffic. This goes last because every other service needs to be meshed first -- the gateway depends on all of them.
Each phase includes a 4-hour monitoring window before proceeding to the next. If any service shows elevated error rates or latency during its window, the phase gets rolled back and investigated before continuing.
Real-World Example
Kai starts the rollout on Monday with the background workers. Email delivery works, SMS goes through, no issues. Wednesday, internal services come online with circuit breakers. The team immediately sees the benefit: a test where they artificially slow down the inventory service shows the circuit breaker ejecting it after 5 errors and the order service gracefully degrading instead of hanging.
Thursday is the real test. Core services get meshed during the 6 AM low-traffic window. A full purchase flow completes end-to-end with mTLS active. The authorization policies are working -- a quick test confirms that an unauthorized service account gets denied when trying to reach the payment service.
The validation step caught something critical before deployment: the stripe-webhook service was configured with STRICT mTLS, which would have blocked all incoming Stripe webhooks. That would have silently stopped payment confirmations from reaching the system. Catching it in validation instead of production saved what could have been hours of revenue loss.
Friday, the API gateway goes live. By end of day, all 22 services are meshed with mTLS, circuit breakers, proper timeouts, and authorization policies. What was projected to take three months of manual configuration finished in one week. More importantly, the cascading failure scenario from last month is now impossible -- the circuit breakers eject slow services in seconds instead of letting them drag down the entire cluster.