
While Kubernetes adoption is nearly universal, a new study from Komodor reveals that slow recovery, inflated cloud bills, and customer-facing outages are pushing up costs and adding risk.
According to Komodor, the key finding is that while Kubernetes is mature, organizational operations are not.
“Organizations have made Kubernetes their standard, but our report shows the real challenge is operational, not architectural,” Itiel Shwartz, CTO and Co-founder of Komodor, said in a statement. “Even as practices like GitOps and platform engineering gain traction, enterprises still grapple with change management, cost control, and skills gaps. At the same time, the growth of AI/ML workloads and AIOps marks the next frontier, reinforcing Kubernetes as the backbone of enterprise infrastructure.”
According to the report, keeping Kubernetes environments stable and controlling costs remain persistent challenges for organizations. Respondent data shows that roughly 8 in 10 incidents can be tied directly to system changes, and that when outages occur, they take close to an hour to find and fix. Further, nearly three-quarters of workloads run at less than half their requested CPU or memory, which leads to massive overspending on infrastructure.
Among the problems the survey found organizations continue to face:
- System changes are responsible for 79% of production issues.
- Mean time to discovery is 40 minutes, and mean time to repair is more than 50 minutes.
- Businesses estimate major downtime costs at $1 million per hour, and 38% report weekly high-impact outages.
- Operations teams spend more than 60% of their time troubleshooting, and only 20% of incidents are resolved without escalation.

Other reported issues include:
- Overspending is widespread, with 82% of Kubernetes workloads overprovisioned.
- The typical enterprise runs more than 20 clusters, and half operate across more than four environments, which compounds risk.
- AI adoption is rising, but skills gaps put pressure on troubleshooting, cost management, and policy enforcement.
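To make the utilization finding concrete, an overprovisioned workload is one whose observed CPU and memory usage sit well below what it requests from the cluster. The check can be sketched in a few lines of Python; the workload names and figures below are hypothetical illustrations, not data from the Komodor report:

```python
# Sketch: flag overprovisioned workloads by comparing observed usage to
# requested resources. The 50% threshold mirrors the report's framing
# (workloads running below half their requests); the sample data is invented.

def utilization(used, requested):
    """Fraction of the requested resource actually consumed."""
    return used / requested if requested else 0.0

def is_overprovisioned(workload, threshold=0.5):
    """True when both CPU and memory utilization fall below the threshold."""
    return (utilization(workload["cpu_used"], workload["cpu_requested"]) < threshold
            and utilization(workload["mem_used"], workload["mem_requested"]) < threshold)

workloads = [
    {"name": "api",    "cpu_used": 120, "cpu_requested": 1000, "mem_used": 256,  "mem_requested": 2048},
    {"name": "worker", "cpu_used": 900, "cpu_requested": 1000, "mem_used": 1800, "mem_requested": 2048},
]

flagged = [w["name"] for w in workloads if is_overprovisioned(w)]
print(flagged)  # only "api" runs below half of both its requests
```

In practice this comparison is driven by metrics from the cluster (e.g., metrics-server or Prometheus) rather than hard-coded numbers, but the arithmetic behind the "running below half their requests" statistic is just this ratio.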
In a news release announcing the survey findings, Komodor offered several best practices for mitigating those challenges:
- Harden the change pipeline. Enforce policy-as-code and admission controllers to block unsafe configs at deploy time. Pair GitOps with automated drift detection and rollback to keep multi-cluster environments consistent.
- Embed AI into observability. Unify metrics, logs, traces, and events in a single pipeline. Use AI-powered anomaly detection, root cause analysis, and auto-remediation to cut MTTD and MTTR.
- Codify and automate incident workflows. Version-control runbooks, standardize escalation policies, and rehearse cross-cluster failover. Let automated remediation handle common issues.
- Continuously rightsize. Apply CPU/memory limits through admission policies, extend autoscaling coverage, and integrate predictive scaling to prevent both overspend and resource starvation.
- Tie reliability to business outcomes. Correlate SLOs with revenue and customer metrics so improvements in uptime and recovery compete fairly with feature delivery.
- Build golden paths. Provide developers with pre-vetted templates, operator bundles, and guardrails so they can deploy safely without deep Kubernetes expertise.
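To illustrate the first and fourth practices above, "blocking unsafe configs at deploy time" typically means an admission controller (such as Kyverno or OPA Gatekeeper) rejecting workloads that omit resource limits. The core validation logic can be sketched in plain Python over a dict shaped like the Kubernetes Pod spec; this is an illustrative sketch, not how any specific controller is implemented:

```python
# Sketch of the kind of check an admission policy performs: reject any
# container that does not declare both CPU and memory limits. In production
# this would be enforced by an admission controller, not application code.

def validate_pod(pod_spec):
    """Return (allowed, reasons) for a dict shaped like the Kubernetes
    Pod spec (spec.containers[].resources.limits)."""
    reasons = []
    for container in pod_spec.get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        for resource in ("cpu", "memory"):
            if resource not in limits:
                reasons.append(f"{container['name']}: missing {resource} limit")
    return (not reasons, reasons)

# Hypothetical pod: declares a CPU limit but no memory limit.
pod = {"containers": [
    {"name": "app", "resources": {"limits": {"cpu": "500m"}}},
]}

allowed, reasons = validate_pod(pod)
print(allowed, reasons)  # rejected: the "app" container lacks a memory limit
```

Enforcing the rule at admission time rather than in CI is what makes it a guardrail: misconfigured workloads never reach the cluster, regardless of which pipeline deployed them.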
 
