Surge upgrades should wait for nodes to be ready
Anton Engelhardt
Currently, upgrading a Kubernetes cluster with surge upgrade enabled on DigitalOcean still results in downtime, even though the feature is intended to prevent this. We are requesting improvements to the upgrade process to ensure zero-downtime deployments as the cluster scales.
Current Behavior:
When performing a cluster upgrade:
1. Upgrade is initiated via the dashboard.
2. New nodes are created (takes 1–2 minutes).
3. New nodes appear in management tools (e.g., Lens) after 20–30 seconds.
4. New nodes are tainted as not ready.
5. Old nodes are immediately marked as SchedulingDisabled; this kills all pods before new nodes are ready.
6. New nodes become ready 20–30 seconds later and begin running pods.
7. Full recovery takes several minutes, resulting in noticeable downtime.
Expected/Ideal Behavior:
The upgrade process should ensure that old nodes are only drained after new nodes are fully ready, minimizing or eliminating downtime. A more robust upgrade flow would be:
1. Start upgrade in the dashboard.
2. New nodes are created (ideally, this should also be faster).
3. New nodes appear in management tools.
4. New nodes are tainted as not ready.
5. Wait until new nodes are fully ready.
6. Only then drain the old nodes, and wait for the drain process to complete (or time out after 5 minutes).
7. Once drained, remove old nodes from the cluster.
8. Mark the update as complete.
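To make the requested ordering concrete, here is a minimal Python sketch of the flow above. It is only an illustration, not DigitalOcean's implementation: the function `surge_upgrade` and the callbacks `is_ready`, `drain`, and `remove` are hypothetical placeholders for whatever the platform uses internally, and the 10-minute readiness timeout is our assumption. Only the 5-minute drain timeout comes from the steps above.

```python
import time
from typing import Callable, Iterable, List


def surge_upgrade(
    old_nodes: Iterable[str],
    new_nodes: Iterable[str],
    is_ready: Callable[[str], bool],      # hypothetical: True once a node reports Ready
    drain: Callable[[str, float], bool],  # hypothetical: True if drained within the timeout
    remove: Callable[[str], None],        # hypothetical: deletes the node from the cluster
    ready_timeout: float = 600.0,         # assumed value, not from the request
    drain_timeout: float = 300.0,         # 5 minutes, as in step 6
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run the upgrade in the requested order and return the actions taken."""
    log: List[str] = []
    surge = list(new_nodes)

    # Step 5: block until every surge node is Ready before touching old nodes.
    deadline = time.monotonic() + ready_timeout
    while not all(is_ready(node) for node in surge):
        if time.monotonic() >= deadline:
            raise TimeoutError("surge nodes did not become ready in time")
        sleep(poll_interval)
    log.append("new-nodes-ready")

    # Steps 6 and 7: drain each old node (bounded by the timeout), then remove it.
    for node in old_nodes:
        drained = drain(node, drain_timeout)
        log.append(f"drain:{node}:{'ok' if drained else 'timeout'}")
        remove(node)
        log.append(f"remove:{node}")

    # Step 8: mark the update as complete.
    log.append("upgrade-complete")
    return log
```

The key point is that nothing in the loop over `old_nodes` can run until the readiness check has passed, which is the guarantee we are missing today.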
Why this matters:
Zero-downtime upgrades are a critical expectation for Kubernetes users, especially as clusters scale and host production workloads. The current process undermines the value of surge upgrades and may force teams to consider alternative providers.
Request:
• Please adjust the upgrade workflow so that old nodes are only drained after new surge nodes are fully ready.
• Consider optimizing node provisioning speed.
• Provide more transparency or documentation if there are best practices or settings that can help achieve zero-downtime upgrades.
After a chat with support, we were told about Pod Disruption Budgets; however, we still strongly encourage prioritizing this feature.
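For anyone else landing here, this is roughly what such a budget looks like. The sketch below builds a `policy/v1` PodDisruptionBudget manifest in Python and prints it as JSON. The name `web-pdb`, the label `app: web`, and `minAvailable: 1` are placeholder values we made up for illustration, so adjust them to your own workload. As far as we understand, a budget only limits voluntary evictions such as a node drain, and it only helps if the workload runs more than one replica.

```python
import json


def pod_disruption_budget(name: str, app_label: str, min_available: int) -> dict:
    """Build a PodDisruptionBudget manifest for pods labelled app=<app_label>."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # Keep at least this many matching pods running during evictions.
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


# Placeholder values; replace with your own workload's name and label.
manifest = pod_disruption_budget("web-pdb", "web", 1)
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON manifests, the printed output can be saved to a file and applied with `kubectl apply -f`.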
Thank you!