PRODENG-3442: fall back to forced swarm dissolution when uninstall-ucp times out #628
Open
james-nesbitt wants to merge 1 commit into
Pull request overview
Adds an automatic recovery path for uninstall-ucp timeouts during MKE reset by detecting the known timeout error and force-dissolving the Swarm so the reset process can continue.
Changes:
- Detect `uninstall-ucp` "took too long" failures via `isUninstallTimeout()`.
- Add `dissolveSwarm()` fallback that removes stuck uninstall-agent services and forces nodes to leave the swarm (leader last).
- Add unit tests for the timeout detector.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/product/mke/phase/uninstall_mke.go | Adds timeout detection and forced swarm dissolution fallback when uninstall-ucp times out. |
| pkg/product/mke/phase/uninstall_mke_test.go | Adds unit tests for isUninstallTimeout(). |
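A minimal sketch of what a detector like `isUninstallTimeout()` might look like, assuming it is a plain substring match on the message quoted in this PR. The constant name and the sample strings are hypothetical, and the actual implementation in `uninstall_mke.go` may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// uninstallTimeoutMarker is the message quoted in this PR. Matching it with a
// substring check is an assumption made for this sketch.
const uninstallTimeoutMarker = "Uninstalling UCP took too long"

// isUninstallTimeout reports whether the captured bootstrapper output
// contains the known timeout message.
func isUninstallTimeout(output string) bool {
	return strings.Contains(output, uninstallTimeoutMarker)
}

func main() {
	// Made-up sample strings for illustration; real MKE log lines will look different.
	samples := []string{
		"level=error msg=\"Uninstalling UCP took too long\"",
		"some other uninstall failure",
		"",
	}
	for _, s := range samples {
		fmt.Printf("%q -> timeout=%v\n", s, isUninstallTimeout(s))
	}
}
```

A substring match keeps the detector independent of log formatting around the message, at the cost of breaking silently if MKE ever rewords it.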
567899b to
ae09db7
Compare
The uninstall-ucp bootstrapper deploys ucp-uninstall-agent as a global
Swarm service, then waits (~2 min hardcoded) for every node to report
back. On large clusters or hosts with cold image caches this deadline is
missed, causing Reset() to fail.
Observed in CI:
smoke-modern (MKE 3.9.2, 7 nodes): all nodes missed the deadline
smoke-windows (MKE 3.8.8, Win2025): Win2025 missed the deadline
MKE documents the recovery path: remove the stuck ucp-uninstall-agent
service, then force every node to leave the swarm.
pkg/product/mke/phase/uninstall_mke.go:
- Capture Bootstrap output (not just error): the timeout message
'Uninstalling UCP took too long' is logged at error level by MKE and
appears only in the output stream, not in the Bootstrap error value
(which only aggregates fatal-level log lines).
- isUninstallTimeout(output string) detects the timeout from the output.
- dissolveSwarm() removes ucp-uninstall-agent/ucp-uninstall-agent-win
from the leader (best-effort), force-leaves all non-leader nodes in
parallel (per-node failures are warnings), then force-leaves the
leader last (hard failure if this fails).
- Non-timeout uninstall-ucp errors still propagate as hard failures.
pkg/mcr/mcr.go (DrainNode):
- Empty NodeID guard: after forced swarm dissolution every node returns
an empty NodeID from 'docker info'; previously this caused DrainNode
to run 'docker node update --availability drain <empty>' which fails.
Now treated as a no-op (node is already out of the swarm).
- Also removed a pre-existing duplicate drainCmd execution (the command
was being run twice on the happy path).
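A rough sketch of the DrainNode guard described above, assuming the node ID has already been read from 'docker info'. The `runner` callback and the `drainNode` signature are hypothetical; the real function in pkg/mcr/mcr.go operates on launchpad's host type.

```go
package main

import (
	"fmt"
	"strings"
)

// runner is a hypothetical stand-in for whatever executes a command on a host.
type runner func(cmd string) error

// drainNode skips the drain entirely when nodeID is empty, since an empty ID
// means the node is no longer part of a swarm. Otherwise it runs the drain
// command exactly once.
func drainNode(nodeID string, run runner) error {
	if strings.TrimSpace(nodeID) == "" {
		return nil // already out of the swarm: nothing to drain
	}
	return run(fmt.Sprintf("docker node update --availability drain %s", nodeID))
}

func main() {
	calls := 0
	run := func(cmd string) error {
		calls++
		fmt.Println("would run:", cmd)
		return nil
	}

	_ = drainNode("", run)       // no-op after forced dissolution
	_ = drainNode("abc123", run) // normal path, single execution
	fmt.Println("commands executed:", calls)
}
```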
pkg/product/mke/phase/uninstall_mke_test.go:
- Updated tests to match the new isUninstallTimeout(string) signature.
Signed-off-by: James Nesbitt <jnesbitt@mirantis.com>
ae09db7 to
f26b07e
Compare
Jira: https://mirantis.jira.com/browse/PRODENG-3442
Problem
The `uninstall-ucp` bootstrapper deploys `ucp-uninstall-agent` as a global Swarm service and waits ~2 minutes (hardcoded) for every node to acknowledge completion. On large clusters or hosts with cold image caches (fresh CI runners) the deadline is missed, causing `Reset()` to fail even though the cluster and infrastructure are otherwise healthy.
Observed in CI:
- smoke-modern (MKE 3.9.2, 7 nodes): all nodes missed the deadline
- smoke-windows (MKE 3.8.8, Win2025): Win2025 missed the deadline
The timeout is internal to the MKE container; there is no `--timeout` flag on `uninstall-ucp`.
Fix
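MKE documents the manual recovery path for this exact error: remove the stuck `ucp-uninstall-agent` service, then force every node to leave the swarm.
`UninstallMKE.Run()` now detects the specific `"Uninstalling UCP took too long"` message and automatically executes that recovery:
- Remove the `ucp-uninstall-agent`/`ucp-uninstall-agent-win` services from the leader (best-effort).
- Force-leave all non-leader nodes in parallel (per-node failures are warnings).
- Force-leave the leader last (hard failure if this fails).
All other `uninstall-ucp` errors continue to propagate as hard failures unchanged. The `UninstallMCR` phase that follows handles MCR cleanup on each host regardless of how the swarm was dissolved.
Changes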
- `pkg/product/mke/phase/uninstall_mke.go`: captures Bootstrap output (not just error); `isUninstallTimeout(output string)` detects the timeout from the output stream; `dissolveSwarm()` fallback
- `pkg/mcr/mcr.go`: `DrainNode` no-ops when `NodeID` is empty (node already left the swarm after dissolution); also removes a pre-existing duplicate `drainCmd` execution
- `pkg/product/mke/phase/uninstall_mke_test.go`: unit tests for `isUninstallTimeout`
Copilot review fixes
Copilot identified that after `dissolveSwarm()` succeeds, `UninstallMCR` would call `DrainNode`, which runs `docker node update --availability drain <empty>` and fails because the node is no longer in a swarm. Fixed in `DrainNode`: an empty `NodeID` is now treated as a no-op.
Copilot also surfaced that the initial `isUninstallTimeout` implementation checked the Bootstrap error string, but the timeout message `"Uninstalling UCP took too long"` is emitted at `error` level by MKE (not `fatal`), so it only appears in Bootstrap's output string. Fixed by capturing the output and checking it instead.
CI results
Confirmed from run `25922740499` logs: `Reset()` returned nil, with no non-fatal warning, and the full reset completed successfully.