Terminal.skills
Use Cases/Automate Infrastructure Drift Detection with AI

Automate Infrastructure Drift Detection with AI

Detect and resolve infrastructure drift between your IaC definitions and actual cloud state before it causes outages.

Development#coding-agent#automation#sub-agent#background-process#orchestration
Works with:claude-codeopenai-codexgemini-clicursor
$

The Problem

Alex is a DevOps engineer at a 25-person SaaS team. They manage 120 cloud resources with Terraform. Over the past year, engineers have made 40+ manual changes through the AWS console during incidents — a security group rule here, a scaling parameter there, an RDS instance resize at 2 AM when the database was melting. Every change was justified in the moment and forgotten the next day.

Now terraform plan shows 23 unexpected changes, and nobody has run terraform apply in three months because the last time someone tried, the plan included reverting the ASG min_size from 4 back to 2 — which would have halved the web tier's capacity during the busiest month of the year.

The problem is not the drift itself. It is that nobody knows which manual changes were intentional fixes that need to be kept versus mistakes that should be reverted. Applying the plan blindly might undo the scaling fix that is keeping the site alive. Ignoring the plan means every future Terraform change carries 23 unknowns. And somewhere in those 23 changes, there might be an open port or an overly permissive IAM policy that nobody remembers creating. The team is stuck, afraid to touch infrastructure in either direction.

The Solution

Using the coding-agent skill to analyze drift and reconcile differences, the cicd-pipeline skill to set up automated detection on a schedule, and the report-generator to produce clear drift reports, the agent categorizes every drifted resource by risk, updates Terraform code to match intentional changes, flags unauthorized modifications for review, and sets up ongoing monitoring so drift never accumulates this badly again.

Step-by-Step Walkthrough

Step 1: Capture the Current Drift

First, get the full picture of what has drifted and why it might have changed:

Run terraform plan on our ./infrastructure directory and parse the output. For each drifted resource, classify the change as: security-related, performance-related, configuration-related, or unknown. Show me what changed and when it might have been changed manually.

Step 2: Categorize the Drift

The 23 drifted resources fall into four buckets. The categorization immediately reveals what needs urgent attention versus what can wait:

Security (fix in IaC immediately) — 3 resources:

  • aws_security_group.api — port 9090 opened, not in Terraform. This is a debug endpoint that was likely enabled during an incident and never closed.
  • aws_iam_role.lambda_exec — extra policy attached: AmazonS3FullAccess. Someone could not figure out the minimum permissions and took a shortcut.
  • aws_s3_bucket.uploads — public access block disabled. This is the scariest one — the uploads bucket may have been publicly accessible for weeks or months.

Performance (likely intentional hotfixes) — 3 resources:

  • aws_autoscaling_group.webmin_size changed from 2 to 4 (almost certainly the January traffic spike incident)
  • aws_rds_instance.primaryinstance_class changed from db.r5.large to db.r5.2xlarge (the slow query cascade from two weeks later)
  • aws_elasticache.redisnode_type changed from cache.t3.medium to cache.r6g.large (capacity planning decision, tracked in OPS-445)

Configuration — 14 resources with tag changes, description updates, and minor parameter adjustments that have no operational impact.

Unknown — 3 resources with changes that do not correlate with any incident ticket or Slack conversation.

The security bucket is the most concerning. An open port, an overly permissive IAM policy, and a public S3 bucket all look like rushed incident response where someone took shortcuts and never cleaned up. These are exactly the kind of silent risks that compound when drift goes unmanaged.

Step 3: Reconcile Drift into Code

The performance changes are almost certainly intentional — someone scaled up the database during a slow query cascade and the ASG during a traffic spike. These need to be codified in Terraform so the next terraform apply does not accidentally revert them.

For the 6 security and performance changes: update the Terraform code to match the current cloud state where the change was intentional. For the 3 security issues (port 9090, S3 public access, extra IAM policy), flag them as likely unauthorized and create a plan to revert.

Three Terraform files get updated to codify the intentional performance changes, each with a comment linking to the incident or ticket that motivated the change:

hcl
# infrastructure/autoscaling.tf
resource "aws_autoscaling_group" "web" {
  min_size = 4  # Updated from 2 — INC-2024-089 (Jan 15 traffic spike)
  # ...
}
hcl
# infrastructure/rds.tf
resource "aws_rds_instance" "primary" {
  instance_class = "db.r5.2xlarge"  # Updated from db.r5.large — INC-2024-091 (slow query cascade)
  # ...
}
hcl
# infrastructure/elasticache.tf
resource "aws_elasticache_cluster" "redis" {
  node_type = "cache.r6g.large"  # Updated from cache.t3.medium — OPS-445 capacity planning
  # ...
}

For the three security issues, a separate targeted terraform apply plan reverts the risky changes: remove port 9090 from the API security group, detach AmazonS3FullAccess from the Lambda role (replaced with a minimal policy that grants access to only the specific bucket the function needs), and re-enable the public access block on the uploads bucket.

Step 4: Generate Compliance Documentation

SOC 2 auditors want a paper trail for every infrastructure change, including changes made during incidents and later reconciled. Each reconciled change gets a formal record:

For each reconciled change, generate a change record documenting: what changed, why (incident reference if applicable), who approved the reconciliation, and the date codified. We need this for our SOC 2 audit trail.
RecordResourceChangeReferenceAction
CR-001ASG min_size 2 to 4Scaling increaseINC-2024-089 (Jan 15)Codified
CR-002RDS instance upgradePerformanceINC-2024-091Codified
CR-003ElastiCache upgradeCapacity planningOPS-445Codified
CR-004Security group port 9090UnauthorizedNo referenceReverted
CR-005IAM policy over-permissionUnauthorizedNo referenceReverted
CR-006S3 public accessUnauthorizedNo referenceReverted, escalated to security

The records are exported to infrastructure/drift-records/2025-02-17.md. For the SOC 2 audit, each record includes the before/after state, the date of the original change (estimated from CloudTrail), and the date of reconciliation.

Step 5: Set Up Automated Drift Detection

The one-time cleanup is valuable, but the real win is preventing drift from accumulating for three months again. Without ongoing detection, the same problem reappears after the next incident.

Create a GitHub Actions workflow that runs terraform plan every 6 hours, compares against last known state, and posts a Slack notification if any new drift is detected. Include the drift classification in the alert.

The workflow runs terraform plan -detailed-exitcode four times a day. Exit code 0 means no changes — the infrastructure matches the code. Exit code 2 means drift detected. When drift appears, the alert includes the automatic classification:

  • Security changes get an urgent Slack tag with @channel — open ports and IAM changes cannot wait
  • Performance changes get a @here review tag — likely an incident response that needs codifying
  • Config changes get an informational tag that does not ping anyone — these can be reviewed in the next business day

The alert includes the specific resource name and the nature of the change, so the on-call engineer can assess urgency without running terraform plan themselves.

The next time someone makes a console change during an incident, Alex gets a Slack alert within 6 hours instead of discovering the drift three months later. He can codify it the same day while the context is still fresh — while the engineer who made the change still remembers why they changed it and can verify the codification is correct.

Real-World Example

Alex applies the reconciled Terraform on a Wednesday afternoon. The plan shows exactly the changes he expects: 3 performance codifications, 3 security reverts, and 14 minor config alignments. For the first time in three months, terraform plan shows zero drift.

The security reverts trigger a brief but important conversation with the team. The open port on 9090 turns out to be a debug endpoint someone enabled during an incident six weeks ago — it was publicly accessible the entire time. The AmazonS3FullAccess policy was attached because a developer could not figure out the minimum IAM permissions for a Lambda function and took the expedient shortcut at 1 AM during an incident. The S3 public access block was disabled to debug a file upload issue and never re-enabled. All three are exactly the kind of silent risks that accumulate when drift goes unmanaged.

Two weeks later, the automated drift check catches its first new change: a developer resized an ECS task definition during a memory pressure incident. Alex gets the Slack alert at 9 AM the next morning, opens a PR to codify the change with a link to the incident ticket, and has it merged by lunch. The drift backlog never grows past a single change again.