Overview
GKE is Google Cloud's managed Kubernetes platform. It runs the control plane for you, automates upgrades, and offers two operating modes: Autopilot (Google manages nodes; you're billed per pod resource requests) and Standard (you manage node pools; you're billed per node VM). Default to Autopilot — it eliminates node-level toil and is the recommended golden path for production.
Instructions
Autopilot vs Standard
| | Autopilot | Standard |
|---|---|---|
| Node management | Google | You |
| Billing model | Per-pod resource requests | Per-node VM |
| Node pool config | None | You configure |
| Best for | Most workloads | Privileged pods/DaemonSets, GPUs with custom drivers |
| Workload Identity | Required | Recommended |
Use Standard only when you genuinely need node-level access (custom kernel, certain GPU configs, privileged DaemonSets). Otherwise, Autopilot.
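One way to check which mode an existing cluster runs: the cluster resource exposes an autopilot.enabled field.
# Prints True for Autopilot; empty output means Standard
gcloud container clusters describe prod-cluster --region=us-central1 \
  --format="value(autopilot.enabled)"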
Quick Start (Autopilot)
# Enable the GKE API
gcloud services enable container.googleapis.com
# Create a regional Autopilot cluster with private nodes on the default network
gcloud container clusters create-auto prod-cluster \
  --region=us-central1 \
  --release-channel=regular \
  --enable-private-nodes \
  --network=default --subnetwork=default
# Fetch credentials, then deploy and expose a sample app
gcloud container clusters get-credentials prod-cluster --region=us-central1
kubectl create deployment hello \
  --image=us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
kubectl expose deployment hello --port=80 --target-port=8080 --type=LoadBalancer
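To verify, wait for the Service to get an external IP, then curl it (hello-app replies with its version and pod hostname):
kubectl get service hello --watch   # Ctrl-C once EXTERNAL-IP appears
curl http://EXTERNAL_IP             # substitute the address from the previous command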
Production Cluster Defaults
# Hardened Autopilot cluster: private nodes, restricted API access, Workload Identity
gcloud container clusters create-auto prod-cluster \
  --region=us-central1 \
  --release-channel=regular \
  --enable-private-nodes \
  --enable-master-authorized-networks \
  --master-authorized-networks=10.0.0.0/8,YOUR_OFFICE_CIDR \
  --network=prod-vpc --subnetwork=prod-subnet \
  --cluster-secondary-range-name=pods \
  --services-secondary-range-name=services \
  --workload-pool=my-project.svc.id.goog \
  --enable-shielded-nodes
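A quick sanity check that the hardening flags took effect, using a projection over the relevant fields (expect True and my-project.svc.id.goog):
gcloud container clusters describe prod-cluster --region=us-central1 \
  --format="value(privateClusterConfig.enablePrivateNodes,workloadIdentityConfig.workloadPool)"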
Workload Identity (Pods → GCP APIs without keys)
# Create a Google Service Account for the workload
gcloud iam service-accounts create orders-api
gcloud projects add-iam-policy-binding my-project \
--member="serviceAccount:orders-api@my-project.iam.gserviceaccount.com" \
--role="roles/cloudsql.client"
# Bind the Kubernetes ServiceAccount to the GSA
gcloud iam service-accounts add-iam-policy-binding \
orders-api@my-project.iam.gserviceaccount.com \
--role="roles/iam.workloadIdentityUser" \
--member="serviceAccount:my-project.svc.id.goog[default/orders-api]"
# Kubernetes ServiceAccount annotated to impersonate the GSA
apiVersion: v1
kind: ServiceAccount
metadata:
  name: orders-api
  namespace: default
  annotations:
    iam.gke.io/gcp-service-account: orders-api@my-project.iam.gserviceaccount.com
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  replicas: 3
  selector:
    matchLabels: { app: orders-api }
  template:
    metadata:
      labels: { app: orders-api }
    spec:
      serviceAccountName: orders-api  # → maps to GSA via Workload Identity
      containers:
      - name: api
        image: us-central1-docker.pkg.dev/my-project/repo/orders-api:v1.4.2
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "1Gi"
Autoscaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target: { type: Utilization, averageUtilization: 70 }
  - type: Resource
    resource:
      name: memory
      target: { type: Utilization, averageUtilization: 80 }
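Once the HPA is applied, you can watch its decisions live:
kubectl get hpa orders-api --watch   # shows current vs target utilization and replica count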
# PodDisruptionBudget — keep service available during voluntary disruptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-api
spec:
  minAvailable: 2
  selector:
    matchLabels: { app: orders-api }
Gateway API (Modern Ingress)
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: external-gateway
spec:
  gatewayClassName: gke-l7-global-external-managed
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - name: api-cert
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: orders-route
spec:
  parentRefs: [{ name: external-gateway }]
  hostnames: ["api.example.com"]
  rules:
  - matches: [{ path: { type: PathPrefix, value: "/orders" } }]
    backendRefs:
    - name: orders-api
      port: 80
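The backendRef assumes a Service named orders-api in the same namespace, and certificateRefs assumes a TLS Secret api-cert already exists. A sketch of the Service, assuming the container listens on 8080 (adjust targetPort to your app):
apiVersion: v1
kind: Service
metadata:
  name: orders-api
spec:
  selector:
    app: orders-api
  ports:
  - port: 80          # matches the HTTPRoute backendRef
    targetPort: 8080  # assumption: the container's serving port
Once the Gateway is programmed, kubectl get gateway external-gateway shows the provisioned IP in the ADDRESS column; point your DNS record there.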
GPU Inference Workload (Standard Mode)
# Create a node pool with L4 GPUs and Spot VMs for cheap inference
gcloud container node-pools create inference-l4 \
--cluster=ml-cluster --region=us-central1 \
--machine-type=g2-standard-8 \
--accelerator=type=nvidia-l4,count=1,gpu-driver-version=LATEST \
--num-nodes=0 --enable-autoscaling --min-nodes=0 --max-nodes=10 \
--spot --node-taints=workload=inference:NoSchedule
# vLLM deployment pinned to the L4 pool via the taint toleration + node selector
apiVersion: apps/v1
kind: Deployment
metadata: { name: vllm-llama }
spec:
  replicas: 1
  selector: { matchLabels: { app: vllm } }
  template:
    metadata: { labels: { app: vllm } }
    spec:
      tolerations:
      - key: workload
        operator: Equal
        value: inference
        effect: NoSchedule
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args: ["--model", "meta-llama/Meta-Llama-3-8B-Instruct", "--port", "8000"]
        env:
        # Llama 3 weights are gated on Hugging Face; the secret name is an assumption
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef: { name: hf-token, key: token }
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "24Gi"
            cpu: "4"
Cost Optimization
# Spot VMs for fault-tolerant batch
gcloud container node-pools create batch-spot \
--cluster=prod-cluster --region=us-central1 \
--machine-type=e2-standard-4 \
--spot --num-nodes=0 --enable-autoscaling --max-nodes=20
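This pool carries no taint, so any pod can land on it. To steer only batch work onto Spot capacity, select on the cloud.google.com/gke-spot label that GKE sets on Spot nodes; a minimal Job sketch (name and image are placeholders):
apiVersion: batch/v1
kind: Job
metadata: { name: batch-example }
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"   # label GKE applies to Spot nodes
      restartPolicy: OnFailure
      containers:
      - name: work
        image: busybox
        command: ["sh", "-c", "echo processing"]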
# Compute Class definition (newer alternative to node pools)
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata: { name: spot-burst }
spec:
  priorities:
  - machineFamily: n4
    spot: true
  - machineFamily: n2
    spot: true
  - machineFamily: n2  # on-demand fallback
  nodePoolAutoCreation: { enabled: true }
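Workloads opt in by node-selecting the class name; a fragment for the Pod template spec:
# In the Pod template spec of the workload
nodeSelector:
  cloud.google.com/compute-class: spot-burst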
Observability — Managed Prometheus
# Scrape app metrics with Google Cloud Managed Service for Prometheus
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata: { name: orders-api }
spec:
  selector:
    matchLabels: { app: orders-api }
  endpoints:
  - port: metrics
    interval: 30s
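PodMonitoring matches port: metrics by name, so the target container must declare a matching named port; a fragment for the orders-api container, assuming metrics are served on 9090 (an assumption, use your app's port):
ports:
- name: metrics        # must match the PodMonitoring endpoint port
  containerPort: 9090  # assumption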
Examples
Example 1 — Stand up a production Autopilot cluster
User wants a hardened GKE cluster for a new service. Create an Autopilot cluster on the regular release channel with private nodes, master authorized networks, Workload Identity enabled, and Shielded Nodes. Wire the app's Kubernetes ServiceAccount to a GSA with the minimum IAM roles, deploy via a Deployment + HPA + PDB, and front it with the Gateway API for managed TLS.
Example 2 — Run a vLLM Llama 3 inference service on Spot L4s
User needs cheap LLM inference. Create a Standard cluster (Autopilot manages GPU drivers itself, so custom driver setups need Standard), add an L4 node pool with --spot and autoscaling 0→10, deploy vLLM with a node selector + toleration so only inference pods schedule there, and add Managed Prometheus scraping for token-throughput metrics.
Guidelines
- Default to Autopilot — most workloads should never see a node pool config
- Use the regular release channel in production; rapid for staging; stable only for highly conservative orgs
- Always enable private nodes + master authorized networks; never expose the API server publicly
- Workload Identity is mandatory — never put service account JSON keys in Secrets
- Set resource requests AND limits explicitly; Autopilot bills by requests and substitutes defaults for pods that omit them
- Add a PodDisruptionBudget to every Deployment that serves traffic
- Use Spot VMs / Compute Classes for batch and inference workloads to cut compute cost 60–90%
- Use Gateway API for new ingress; the legacy Ingress resource is feature-frozen
- Enable Managed Prometheus instead of self-hosting Prometheus
- For multi-tenant clusters, isolate teams by namespace + RBAC + ResourceQuota