PRODENG-3442: fall back to forced swarm dissolution when uninstall-ucp times out by james-nesbitt · Pull Request #628 · Mirantis/launchpad

james-nesbitt · 2026-05-11T17:31:57Z

Jira: https://mirantis.jira.com/browse/PRODENG-3442

Problem

The uninstall-ucp bootstrapper deploys ucp-uninstall-agent as a global Swarm service and waits ~2 minutes (hardcoded) for every node to acknowledge completion. On large clusters or hosts with cold image caches (fresh CI runners) the deadline is missed, causing Reset() to fail even though the cluster and infrastructure are otherwise healthy.

Observed in CI:

smoke-modern (MKE 3.9.2, 7 nodes): all nodes missed the deadline
smoke-windows (MKE 3.8.8, Win2025 worker): Win2025 missed the deadline
smoke-legacy (MKE 3.8.8, 6 Linux nodes): passes cleanly — confirms size/platform dependency

The timeout is internal to the MKE container; there is no --timeout flag on uninstall-ucp.

Fix

MKE documents the manual recovery path for this exact error:

"Remove the ucp-uninstall-agent and ucp-uninstall-agent-win services from a swarm manager, then force each node to leave the swarm."

UninstallMKE.Run() now detects the specific "Uninstalling UCP took too long" message and automatically executes that recovery:

1. Remove the stuck ucp-uninstall-agent / ucp-uninstall-agent-win services from the leader (best-effort).
2. Force all non-leader nodes to leave the swarm in parallel (per-node failures logged as warnings, not fatal).
3. Force the leader to leave last (hard failure if this fails).

All other uninstall-ucp errors continue to propagate as hard failures unchanged. The UninstallMCR phase that follows handles MCR cleanup on each host regardless of how the swarm was dissolved.
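The ordering of the recovery steps can be sketched as follows. This is a minimal illustration, not the code from the PR: the `swarmHost` interface, the `fakeHost` double, and the return shape of `dissolveSwarm` are hypothetical, and the use of `docker service rm` and `docker swarm leave --force` as the underlying commands is an assumption based on the manual recovery path quoted above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// swarmHost is a hypothetical abstraction over a cluster node (not launchpad's real host type).
type swarmHost interface {
	Name() string
	Exec(cmd string) error
}

// dissolveSwarm sketches the three recovery steps. It returns the non-fatal
// warnings it collected and an error only if the leader cannot leave.
func dissolveSwarm(leader swarmHost, others []swarmHost) ([]string, error) {
	var (
		mu       sync.Mutex
		warnings []string
	)
	warn := func(format string, args ...any) {
		mu.Lock()
		defer mu.Unlock()
		warnings = append(warnings, fmt.Sprintf(format, args...))
	}

	// Step 1: remove the stuck agent services from the leader (best-effort).
	for _, svc := range []string{"ucp-uninstall-agent", "ucp-uninstall-agent-win"} {
		if err := leader.Exec("docker service rm " + svc); err != nil {
			warn("%s: removing service %s: %v", leader.Name(), svc, err)
		}
	}

	// Step 2: force every non-leader node out in parallel; failures are warnings.
	var wg sync.WaitGroup
	for _, h := range others {
		wg.Add(1)
		go func(h swarmHost) {
			defer wg.Done()
			if err := h.Exec("docker swarm leave --force"); err != nil {
				warn("%s: force-leaving swarm: %v", h.Name(), err)
			}
		}(h)
	}
	wg.Wait()

	// Step 3: the leader leaves last; a failure here is fatal.
	if err := leader.Exec("docker swarm leave --force"); err != nil {
		return warnings, fmt.Errorf("%s: leader could not leave swarm: %w", leader.Name(), err)
	}
	return warnings, nil
}

// fakeHost is a test double that records commands and can be told to fail.
type fakeHost struct {
	name string
	fail bool
	mu   sync.Mutex
	cmds []string
}

func (f *fakeHost) Name() string { return f.name }

func (f *fakeHost) Exec(cmd string) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.cmds = append(f.cmds, cmd)
	if f.fail {
		return errors.New("simulated failure")
	}
	return nil
}

func main() {
	leader := &fakeHost{name: "leader"}
	workers := []swarmHost{&fakeHost{name: "worker-1"}, &fakeHost{name: "worker-2", fail: true}}
	warnings, err := dissolveSwarm(leader, workers)
	fmt.Println("warnings:", warnings)
	fmt.Println("error:", err)
}
```

Waiting on the `sync.WaitGroup` before touching the leader is what guarantees the "leader last" ordering in this sketch, and collecting warnings behind a mutex keeps the parallel step safe without turning a single slow worker into a hard failure.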

Changes

pkg/product/mke/phase/uninstall_mke.go — captures Bootstrap output (not just error); isUninstallTimeout(output string) detects the timeout from the output stream; dissolveSwarm() fallback
pkg/mcr/mcr.go — DrainNode no-ops when NodeID is empty (node already left the swarm after dissolution); also removes a pre-existing duplicate drainCmd execution
pkg/product/mke/phase/uninstall_mke_test.go — unit tests for isUninstallTimeout

Copilot review fixes

Copilot identified that after dissolveSwarm() succeeds, UninstallMCR would call DrainNode which runs docker node update --availability drain <empty> — failing because the node is no longer in a swarm. Fixed in DrainNode: an empty NodeID is now treated as a no-op.
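A minimal sketch of the guard described above, assuming a simplified signature: the `runner` callback and the free function `drainNode` are hypothetical stand-ins, since the real `DrainNode` in pkg/mcr/mcr.go is not shown here. Only the `docker node update --availability drain` command comes from the description.

```go
package main

import "fmt"

// runner is a hypothetical stand-in for executing a command on a host.
type runner func(cmd string) error

// drainNode sketches the guard: an empty nodeID means the node has already
// left the swarm, so there is nothing to drain and no command is run.
func drainNode(nodeID string, run runner) error {
	if nodeID == "" {
		return nil
	}
	return run(fmt.Sprintf("docker node update --availability drain %s", nodeID))
}

func main() {
	echo := func(cmd string) error {
		fmt.Println("would run:", cmd)
		return nil
	}
	// Empty ID: nothing is executed.
	fmt.Println("empty ID error:", drainNode("", echo))
	// Non-empty ID: the drain command is issued exactly once.
	fmt.Println("real ID error:", drainNode("abc123", echo))
}
```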

Copilot also surfaced that the initial isUninstallTimeout implementation checked the Bootstrap error string, but the timeout message "Uninstalling UCP took too long" is emitted at error level by MKE (not fatal), so it only appears in Bootstrap's output string. Fixed by capturing the output and checking it instead.
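The detection step could look roughly like the sketch below. The function name and its `output string` parameter come from the description; the substring match and the `uninstallTimeoutMarker` constant are assumptions, since the PR's actual matching logic is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// uninstallTimeoutMarker is the message quoted in the PR description.
// The constant name is hypothetical.
const uninstallTimeoutMarker = "Uninstalling UCP took too long"

// isUninstallTimeout reports whether the captured bootstrapper output
// contains the timeout message. It inspects the output stream rather than
// the error value, because the message is logged at error level, not fatal.
func isUninstallTimeout(output string) bool {
	return strings.Contains(output, uninstallTimeoutMarker)
}

func main() {
	fmt.Println(isUninstallTimeout("MKE uninstall-ucp: Uninstalling UCP took too long!"))
	fmt.Println(isUninstallTimeout("some unrelated uninstall failure"))
}
```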

CI results

Run	Result	Notes
25912008675	✅ passed	After Copilot fix #1 (DrainNode guard) — fallback still did not fire (wrong detection)
25922740499	✅ passed	After Copilot fix #2 (output detection) — fallback fired and completed correctly

Confirmed from run 25922740499 logs:

MKE uninstall-ucp: Uninstalling UCP took too long!
[ssh] 44.192.39.182:22: force-leaving swarm
[ssh] 3.80.1.151:22: force-leaving swarm
[ssh] 32.197.224.118:22: force-leaving swarm
[ssh] 44.197.188.102:22: force-leaving swarm
[ssh] 100.48.222.71:22: force-leaving swarm
[ssh] 32.192.34.185:22: force-leaving swarm
[ssh] 32.197.48.207:22: force-leaving swarm (leader)
[ssh] 32.197.48.207:22: swarm dissolved; continuing with MCR uninstall

Reset() returned nil with no non-fatal warnings; the full reset completed successfully.

Copilot

Pull request overview

Adds an automatic recovery path for uninstall-ucp timeouts during MKE reset by detecting the known timeout error and force-dissolving the Swarm so the reset process can continue.

Changes:

Detect uninstall-ucp “took too long” failures via isUninstallTimeout().
Add dissolveSwarm() fallback that removes stuck uninstall-agent services and forces nodes to leave the swarm (leader last).
Add unit tests for the timeout detector.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
`pkg/product/mke/phase/uninstall_mke.go`	Adds timeout detection and forced swarm dissolution fallback when `uninstall-ucp` times out.
`pkg/product/mke/phase/uninstall_mke_test.go`	Adds unit tests for `isUninstallTimeout()`.


The uninstall-ucp bootstrapper deploys ucp-uninstall-agent as a global Swarm service, then waits (~2 min hardcoded) for every node to report back. On large clusters or hosts with cold image caches this deadline is missed, causing Reset() to fail.

Observed in CI:
- smoke-modern (MKE 3.9.2, 7 nodes): all nodes missed the deadline
- smoke-windows (MKE 3.8.8, Win2025): Win2025 missed the deadline

MKE documents the recovery path: remove the stuck ucp-uninstall-agent service, then force every node to leave the swarm.

pkg/product/mke/phase/uninstall_mke.go:
- Capture Bootstrap output (not just error): the timeout message 'Uninstalling UCP took too long' is logged at error level by MKE and appears only in the output stream, not in the Bootstrap error value (which only aggregates fatal-level log lines).
- isUninstallTimeout(output string) detects the timeout from the output.
- dissolveSwarm() removes ucp-uninstall-agent/ucp-uninstall-agent-win from the leader (best-effort), force-leaves all non-leader nodes in parallel (per-node failures are warnings), then force-leaves the leader last (hard failure if this fails).
- Non-timeout uninstall-ucp errors still propagate as hard failures.

pkg/mcr/mcr.go (DrainNode):
- Empty NodeID guard: after forced swarm dissolution every node returns an empty NodeID from 'docker info'; previously this caused DrainNode to run 'docker node update --availability drain <empty>' which fails. Now treated as a no-op (node is already out of the swarm).
- Also removed a pre-existing duplicate drainCmd execution (the command was being run twice on the happy path).

pkg/product/mke/phase/uninstall_mke_test.go:
- Updated tests to match the new isUninstallTimeout(string) signature.

Signed-off-by: James Nesbitt <jnesbitt@mirantis.com>

james-nesbitt added the smoke-test Run all smoke tests label May 11, 2026

pgedara requested a review from Copilot May 14, 2026 12:27

Copilot started reviewing on behalf of pgedara May 14, 2026 12:28

Copilot AI reviewed May 14, 2026


Comment thread pkg/product/mke/phase/uninstall_mke.go Outdated

james-nesbitt force-pushed the PRODENG-3442-reset-swarm-dissolution-fallback branch from 567899b to ae09db7 May 15, 2026 09:48

james-nesbitt added smoke-modern Run modern smoke test and removed smoke-test Run all smoke tests labels May 15, 2026

james-nesbitt force-pushed the PRODENG-3442-reset-swarm-dissolution-fallback branch from ae09db7 to f26b07e May 15, 2026 14:16

james-nesbitt added smoke-modern Run modern smoke test and removed smoke-modern Run modern smoke test labels May 15, 2026

james-nesbitt wants to merge 1 commit into main from PRODENG-3442-reset-swarm-dissolution-fallback
