Optimize Docker Image Sizes for Production with AI — AI Workflow

Reduce Docker image sizes by 60-90% through multi-stage builds, layer optimization, and base image selection guided by AI analysis.

The Problem

The Docker images have quietly ballooned to 1.5 GB each. Nobody noticed until Kubernetes autoscaling became painfully slow — new pods take 90 seconds to become ready, with 60 of those seconds spent pulling the image. During a flash sale last month, this meant dropped requests while nodes scaled up. The CI pipeline spends more time pushing images than running tests. The registry storage bill is climbing.

The Dockerfiles evolved organically over two years. Dev dependencies ship in production images. Entire build toolchains are baked in. node_modules includes native bindings that the app doesn't use. The base image was chosen because someone copied it from a Stack Overflow answer. Everyone knows multi-stage builds exist, but nobody has had time to refactor 4 Dockerfiles that "work fine" — until the autoscaling problem made them not fine at all.

The Solution

Using the docker-optimizer and docker-helper skills, the approach is to audit every Dockerfile for wasted space, rewrite them with multi-stage builds, choose minimal base images, optimize layer caching, and validate that nothing breaks in the process.

Step-by-Step Walkthrough

Step 1: Audit Current Images

Analyze all Dockerfiles in this repository. For each one, report: current base
image and tag, estimated image size, number of layers, identified issues (dev
dependencies in production, missing .dockerignore, unnecessary files copied,
inefficient layer ordering). Show me the total registry footprint.

The audit reveals the full picture:

Service	Base Image	Est. Size	Layers	Issues
api	node:18	1.4 GB	23	Dev deps in prod, no .dockerignore
worker	python:3.11	1.1 GB	18	Build tools in final image
frontend	node:18	980 MB	15	Source code shipped in prod image
migrations	node:18	1.4 GB	12	Full Node image for a CLI tool

Total registry footprint: ~4.9 GB per tag. Every deployment pushes nearly 5 GB. Every new node pulls nearly 5 GB. The numbers explain the autoscaling latency perfectly.

The worst offender is the API service. Its Dockerfile does COPY . . before npm install, which means every source code change invalidates the dependency layer and triggers a full npm install on rebuild. Layer caching is essentially useless.

Step 2: Rewrite Dockerfiles with Multi-Stage Builds

Rewrite the api Dockerfile using a multi-stage build. Stage 1: install all
dependencies and compile TypeScript. Stage 2: production-only node_modules
with npm ci --omit=dev. Stage 3: copy compiled JS and prod node_modules
into a node:18-alpine base. Pin the Alpine version. Add proper HEALTHCHECK,
non-root USER, and labels.

The rewritten Dockerfile has three stages that separate concerns cleanly:

dockerfile

# Stage 1: Install all deps and compile TypeScript
FROM node:18.19-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY tsconfig.json ./
COPY src/ ./src/
RUN npm run build

# Stage 2: Production dependencies only
FROM node:18.19-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --omit=dev && npm cache clean --force

# Stage 3: Minimal runtime image
FROM node:18.19-alpine3.19
RUN addgroup -S app && adduser -S app -G app
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist
COPY package.json ./
USER app
HEALTHCHECK --interval=30s CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["node", "dist/index.js"]

The TypeScript compiler, dev dependencies, source .ts files, and everything else that isn't needed at runtime never makes it into the final image. The API image drops from 1.4 GB to 180 MB.

Step 3: Optimize .dockerignore and Layer Caching

Create or update .dockerignore files for all services. Then reorder the COPY
instructions in each Dockerfile to maximize layer caching — dependencies file
first, install, then source code. Show me the expected cache hit scenarios.

Each service gets a .dockerignore that excludes .git, node_modules, test files, documentation, and local config:

.git
node_modules
**/*.test.ts
**/*.spec.ts
*.md
.env*
coverage/

Layer ordering matters enormously. By copying package.json and package-lock.json first, running npm install, and only then copying source code, a source-only change rebuilds just the final 2 layers instead of 8. In practice, most deployments are source-only changes — which means most builds push about 15 MB of changed layers instead of the full image.

Step 4: Choose Minimal Base Images

For each service, recommend the smallest viable base image. Compare node:18-alpine
vs node:18-slim vs distroless for the Node services, and python:3.11-slim vs
python:3.11-alpine vs distroless/python3 for the worker. Consider native dependency
compatibility (bcrypt, sharp, psycopg2) and debugging needs.

The size differences between base images are dramatic:

Base Image	Size	Notes
node:18	992 MB	Full Debian, includes build tools
node:18-slim	243 MB	Minimal Debian, no build tools
node:18-alpine	175 MB	musl libc — test native modules carefully
distroless/nodejs18	128 MB	No shell, no package manager

The recommendation: node:18-alpine for the API and worker (both need a shell for healthcheck scripts), and distroless for the frontend (static nginx serving, no shell needed). The worker's Python image moves from python:3.11 (full Debian at 1.1 GB) to python:3.11-slim at 155 MB — Alpine is risky here because psycopg2 needs glibc.

Step 5: Validate and Measure Results

Build all optimized images and compare sizes against the originals. Run the
test suite against the optimized images to verify nothing is broken. Check
that native modules (bcrypt, sharp) work correctly on Alpine.

All tests pass. Native modules work on Alpine (bcrypt compiles from source, sharp uses the prebuilt musl binary). The final numbers:

Service	Before	After	Reduction
api	1.4 GB	180 MB	-87%
worker	1.1 GB	145 MB	-87%
frontend	980 MB	42 MB	-96%
migrations	1.4 GB	95 MB	-93%
Total	4.9 GB	462 MB	-91%

Real-World Example

The impact shows up immediately in the metrics that matter. Pod startup drops from 90 seconds to 18 seconds — the image pull that used to dominate startup time is now a fraction of it. The CI pipeline shaves 4 minutes off every build because it's pushing 462 MB instead of 4.9 GB. The monthly registry storage bill drops from $45 to $6.

But the real payoff comes during the next flash sale. Traffic spikes 4x, Kubernetes autoscaler spins up new pods, and they're ready to serve traffic in under 20 seconds. Zero dropped requests. The autoscaling that was "too slow" with 1.5 GB images works perfectly with 180 MB ones — the infrastructure didn't need to change, just the images running on it.