Build an Automated Data Anonymization Pipeline with AI — AI Workflow

Use AI to detect PII in databases and files, apply anonymization rules, and validate that no sensitive data leaks into lower environments.

The Problem

A fintech team copies production data into staging every week so developers can test with realistic datasets. The database has 2 million customer records with names, emails, phone numbers, national IDs, and bank account numbers scattered across 45 tables. Someone manually wrote anonymization scripts two years ago, but new columns get added regularly and the scripts silently skip them. The original script author left the company eight months ago.

Last month a developer noticed real customer emails in staging error logs — a Sentry alert showed a customer's full name, email, and the last four digits of their bank account. The compliance team is furious, and the next SOC 2 audit is in six weeks. Nobody knows which tables contain PII anymore. The documentation says "see the anonymization scripts" but those scripts only cover the columns that existed two years ago. Every schema migration since then added columns that may or may not contain personal data, and nobody updated the scripts.

The gap between "we think staging is anonymized" and "we can prove it to an auditor" is getting wider with every deploy.

The Solution

Using the coding-agent, data-analysis, and sql-optimizer skills, the workflow scans every table and column for PII patterns by sampling actual data (not just reading column names), classifies each finding by sensitivity level, generates optimized SQL scripts that apply masking and synthetic data without breaking foreign key relationships, validates that zero real PII remains, and produces an audit-ready compliance report. The whole pipeline then runs automatically after every production-to-staging data copy.

Step-by-Step Walkthrough

Step 1: Scan the Database for PII

The first step is getting visibility into what's actually in the database, not what the documentation says is there:

Connect to our staging database (PostgreSQL, connection details in .env). Scan all 45 tables — for each column, sample 100 rows and classify whether it contains PII. Check for: email addresses, phone numbers, full names, national ID patterns, bank account numbers, IP addresses, and free-text fields that might contain personal info. Output a report grouped by sensitivity level.

The scan inspects column names, data types, and actual content — because a column named metadata can contain email addresses in its JSON payload, and a column named reference_code might store data that looks like a national ID:

HIGH sensitivity (must anonymize):

Table.Column	Pattern Match	Confidence
users.email	Email pattern	100%
users.phone	Phone pattern	98%
users.national_id	National ID pattern	100%
payments.bank_account	IBAN pattern	100%
support_tickets.body	Free text with names/emails	34% of samples

MEDIUM sensitivity (review needed):

orders.shipping_address — contains street addresses
users.display_name — may contain real names (or may contain usernames)
audit_log.actor_name — contains full names in every row

LOW sensitivity: 280 columns with no PII detected.

The free-text field is the sneaky one. support_tickets.body doesn't look like PII by its column name or data type, but sampling the actual data reveals that a third of the entries contain customer names and email addresses embedded in the text — things like "Customer jane.doe@gmail.com reported that..." The old manual scripts missed it entirely because they worked from a list of column names, not from actual data inspection.

Step 2: Generate Anonymization Rules

Each column type needs a different strategy. Blanking everything would break the application — foreign keys would fail, email validation would reject empty strings, and developers couldn't test realistic workflows:

For each HIGH and MEDIUM sensitivity column, propose an anonymization strategy. Use deterministic hashing for foreign key columns so relationships are preserved. Use synthetic data for names and addresses. Mask emails to keep the domain but replace the local part. Redact free-text PII with pattern replacement.

Table.Column	Strategy	Before	After
users.email	Hash local part	jane.doe@gmail.com	j3k9x@gmail.com
users.phone	Format-preserving fake	+1-415-555-0198	+1-555-0142
users.national_id	SHA256 hash	123-45-6789	a8f3e...
payments.bank_account	Format-preserving mask	DE89370400440532013000	DE89****4321
support_tickets.body	Regex redact names/emails	Jane filed a complaint...	[REDACTED] filed a complaint...
orders.shipping_address	Synthetic address	123 Main St, SF	742 Evergreen Terrace, Springfield
users.display_name	Faker name	Jane Doe	Alex Johnson

The deterministic hashing on users.email is critical. If the same email appears in users.email, audit_log.actor_email, and notifications.recipient, the hash must produce the same output everywhere. Otherwise, join queries return empty results, notification logs can't be traced to users, and developers can't debug realistic workflows. Deterministic means the same input always produces the same output — but the output is irreversible, so you can't recover the original email from the hash.

Step 3: Build Optimized SQL Scripts

Two million rows in users and 8 million in orders means the anonymization scripts need to be fast and safe. A naive UPDATE users SET email = ... on 2 million rows would lock the table for minutes:

Generate SQL scripts that apply these anonymization rules. Optimize for batch processing — we have 2 million rows in users and 8 million in orders. Use UPDATE ... FROM with batch commits of 10,000 rows to avoid lock contention. Preserve referential integrity by hashing users.email consistently everywhere it appears.

sql

-- Anonymize users.email (deterministic so FK references match)
DO $$
DECLARE batch_size INT := 10000; max_id INT;
BEGIN
  SELECT MAX(id) INTO max_id FROM users;
  FOR i IN 0..max_id BY batch_size LOOP
    UPDATE users SET email = encode(digest(email, 'sha256'), 'hex')
      || '@' || split_part(email, '@', 2)
    WHERE id BETWEEN i AND i + batch_size - 1;
    COMMIT;
  END LOOP;
END $$;

The batched approach processes 10,000 rows per commit, avoiding the lock contention that would grind the database to a halt with a single 2-million-row UPDATE. The domain part of the email is preserved — j3k9x@gmail.com still looks like a valid email for application code that validates format, but the personal identifier is gone. The full 2 million user records anonymize in about 4 minutes.

Similar batched scripts are generated for every HIGH and MEDIUM sensitivity column, each with its own strategy and batch size tuned to the table size. The scripts run in dependency order: users first (since other tables reference it via foreign keys), then orders, then payments, then log tables. Running them out of order would break the deterministic hashing — if audit_log.actor_email gets hashed before users.email, the same email produces a different hash because the salt context differs.

The full pipeline — 22 columns across 12 tables, 10+ million total rows — completes in about 12 minutes. Fast enough to run after every weekly production-to-staging copy without blocking the team's morning.

Step 4: Validate — Trust But Verify

Running the scripts is only half the job. Proving they worked is what matters for the audit:

After running the anonymization scripts, re-scan all HIGH and MEDIUM columns. Verify that no real PII remains. Also check logs tables and any materialized views. Produce a compliance report showing before/after sample data.

Post-anonymization validation:

Column	Real PII Remaining	Status
users.email	0 (was 2,000,000)	Clean
users.phone	0	Clean
users.national_id	0	Clean
payments.bank_account	0	Clean
support_tickets.body	0	Clean
email_logs.recipient	45,000 real emails	MISSED

The validation scan catches what the initial PII scan didn't cover — email_logs.recipient wasn't in the original 45-table scan because it lives in a separate schema that the anonymization scripts never knew about. This is exactly why the validation step exists as a separate pass: every anonymization pipeline misses something on the first run. The validation scan checks the actual data without relying on the list of known PII columns.

Step 5: Fix Gaps and Automate

Add email_logs.recipient to the anonymization rules. Then create a pre-copy hook script that runs the full anonymization pipeline automatically after every production-to-staging data copy. Include the PII scan as a validation gate that blocks access until it passes.

The pre-copy hook runs the full pipeline in sequence: copy production data to staging, run all anonymization scripts, run the PII validation scan across every table and schema, and only then unlock staging access. If the validation scan finds any real PII — in any column, including new ones that didn't exist when the scripts were written — staging stays locked and an alert fires to the engineering channel.

This is the critical difference from the old approach: the validation scan checks actual data content, not a hardcoded list of column names. When a developer adds a users.backup_email column next month, the validation scan catches it automatically because it pattern-matches on email addresses, not on column names.

Real-World Example

Leo is a backend engineer at a fintech team of 18. After the incident where customer emails appeared in staging Sentry alerts, he asks the agent to scan the 45-table database. The scan finds PII in 22 columns — including three that the old manual scripts missed entirely, plus the free-text support ticket field where a third of entries contain customer names embedded in complaint descriptions.

The agent generates optimized SQL scripts that anonymize 2 million user records in 4 minutes with batched updates. The validation step catches one additional table the initial scan missed — email_logs in a separate schema — which gets added to the pipeline immediately.

Leo sets up the pipeline as a post-copy hook with a validation gate. The compliance team gets a clean audit report showing zero PII leakage in staging, with before/after data samples as evidence. The SOC 2 auditor reviews the validation report and the automated pipeline, and staging passes the audit for the first time. The whole setup took an afternoon instead of the two weeks the team originally estimated — and unlike the old manual scripts, it catches new columns automatically because the PII scan runs on actual data, not a hardcoded list that goes stale the moment someone runs a migration.