Most engineering teams can get a Kubernetes cluster running in an afternoon with a managed platform like EKS, AKS, or GKE. Few of those clusters are production-ready, and in our experience the gap between running and production-ready is where most Kubernetes incidents originate.
Production readiness has four dimensions that are frequently incomplete when a cluster first receives traffic: observability, security posture, cost governance, and recovery capability.
On observability: having metrics, logs, and traces configured is table stakes. Production readiness requires that those signals are actually used: alert thresholds are set and tested, runbooks exist for the alerts that fire, and dashboards are reviewed, not just built. We have audited clusters where the observability tooling was fully configured and P95 latency had been elevated for three months without anyone noticing.
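One way to make "alert thresholds are set and tested" concrete is to replay known-good and known-bad latency windows through the alert condition and confirm it fires only for the bad one. The following is a minimal sketch, not a real alerting pipeline: the function names, the 500 ms threshold, and the sample windows are all hypothetical, and the nearest-rank percentile is one of several common definitions.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # 1-based nearest-rank index: ceil(0.95 * n), done in integer arithmetic
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_alert_fires(samples_ms, threshold_ms):
    """Return True when the P95 of the window exceeds the threshold."""
    return p95(samples_ms) > threshold_ms


# Hypothetical threshold and replayed windows (values in milliseconds).
THRESHOLD_MS = 500
healthy = [120] * 95 + [900] * 5     # 5% slow requests: P95 stays at 120
degraded = [120] * 90 + [900] * 10   # 10% slow requests: P95 jumps to 900

healthy_fires = latency_alert_fires(healthy, THRESHOLD_MS)
degraded_fires = latency_alert_fires(degraded, THRESHOLD_MS)
```

If the degraded window does not trip the alert, the threshold is not protecting anything. The same replay idea applies to whatever rule language the monitoring stack actually uses.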
On security posture: Kubernetes ships with permissive defaults that are not appropriate for production. Pod Security Standards need to be enforced. RBAC policies need to be scoped to the minimum necessary permissions. Secrets need to come from a secrets manager, not environment variables in pod specs. Container images need to be scanned in the CI pipeline.
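Some of these checks can be automated against manifests before they reach the cluster. The sketch below assumes a pod spec has already been parsed into a plain Python dict. The `audit_pod_spec` helper and the secret-name pattern are hypothetical, and it covers only three findings: literal credentials in `env`, privileged containers, and `allowPrivilegeEscalation` not explicitly set to false. It is not a substitute for an admission controller enforcing the Pod Security Standards.

```python
import re

# Assumption: env var names matching this pattern are likely credentials.
SECRET_NAME = re.compile(r"(PASSWORD|SECRET|TOKEN|API_?KEY)", re.IGNORECASE)


def audit_pod_spec(pod_spec):
    """Return a list of human-readable findings for one pod spec dict."""
    findings = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        for env in container.get("env", []):
            # A literal "value" on a secret-looking name means the
            # credential is stored in the manifest itself.
            if "value" in env and SECRET_NAME.search(env.get("name", "")):
                findings.append(f"{name}: env {env['name']} has a literal value")
        security = container.get("securityContext") or {}
        if security.get("privileged") is True:
            findings.append(f"{name}: runs privileged")
        if security.get("allowPrivilegeEscalation") is not False:
            findings.append(f"{name}: allowPrivilegeEscalation is not set to false")
    return findings


# Hypothetical manifest fragment used only for illustration.
example_spec = {
    "containers": [
        {
            "name": "api",
            "env": [
                {"name": "DB_PASSWORD", "value": "hunter2"},
                {"name": "LOG_LEVEL", "value": "info"},
            ],
            "securityContext": {"privileged": True},
        }
    ]
}
findings = audit_pod_spec(example_spec)
```

The name-based pattern will miss credentials with unusual names and says nothing about variables populated through `valueFrom`. Whether those are acceptable is a policy decision for the team.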
On cost governance: resource requests and limits need to be set on every workload. Without requests, the scheduler cannot make good placement decisions and cost attribution becomes guesswork. The cluster autoscaler needs to be configured and tested. Spot or preemptible node groups need to be configured for interruptible workloads.
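Checking that every container declares requests and limits is simple enough to run in CI. This is a minimal sketch under the assumption that the pod spec is available as a parsed dict. `missing_resources` is a hypothetical helper, and requiring all four fields (CPU and memory, requests and limits) is one possible policy, not the only reasonable one.

```python
def missing_resources(pod_spec):
    """Map container name -> list of resource fields that are not set."""
    gaps = {}
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        missing = [
            f"{kind}.{resource}"
            for kind in ("requests", "limits")
            for resource in ("cpu", "memory")
            if resource not in (resources.get(kind) or {})
        ]
        if missing:
            gaps[container.get("name", "<unnamed>")] = missing
    return gaps


# Hypothetical pod spec: one partially configured container, one bare.
example_spec = {
    "containers": [
        {
            "name": "web",
            "resources": {
                "requests": {"cpu": "250m", "memory": "256Mi"},
                "limits": {"memory": "512Mi"},
            },
        },
        {"name": "sidecar"},
    ]
}
report = missing_resources(example_spec)
```

Here `report` flags `limits.cpu` for `web` and all four fields for `sidecar`. A non-empty report could fail the pipeline, or start as a warning while teams backfill values.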
On recovery capability: a cluster that has never been restored from backup is not a cluster with a backup. Disaster recovery for Kubernetes means tested restoration of persistent volume claims, tested recreation of the cluster configuration from infrastructure-as-code, and tested failover to a secondary region if availability requirements demand it.
We run Kubernetes production-readiness reviews as a standalone four-week engagement, producing a finding register across the four dimensions and a remediation roadmap the client's platform team can execute.