
PRODENG-3442: fall back to forced swarm dissolution when uninstall-ucp times out#628

Open: james-nesbitt wants to merge 1 commit into main from PRODENG-3442-reset-swarm-dissolution-fallback

@james-nesbitt james-nesbitt commented May 11, 2026

Jira: https://mirantis.jira.com/browse/PRODENG-3442

Problem

The uninstall-ucp bootstrapper deploys ucp-uninstall-agent as a global Swarm service and waits ~2 minutes (hardcoded) for every node to acknowledge completion. On large clusters or hosts with cold image caches (fresh CI runners) the deadline is missed, causing Reset() to fail even though the cluster and infrastructure are otherwise healthy.

Observed in CI:

  • smoke-modern (MKE 3.9.2, 7 nodes): all nodes missed the deadline
  • smoke-windows (MKE 3.8.8, Win2025 worker): Win2025 missed the deadline
  • smoke-legacy (MKE 3.8.8, 6 Linux nodes): passes cleanly — confirms size/platform dependency

The timeout is internal to the MKE container; there is no --timeout flag on uninstall-ucp.

Fix

MKE documents the manual recovery path for this exact error:

"Remove the ucp-uninstall-agent and ucp-uninstall-agent-win services from a swarm manager, then force each node to leave the swarm."

UninstallMKE.Run() now detects the specific "Uninstalling UCP took too long" message and automatically executes that recovery:

  1. Remove the stuck ucp-uninstall-agent / ucp-uninstall-agent-win services from the leader (best-effort).
  2. Force all non-leader nodes to leave the swarm in parallel (per-node failures logged as warnings, not fatal).
  3. Force the leader to leave last (hard failure if this fails).

All other uninstall-ucp errors continue to propagate as hard failures unchanged. The UninstallMCR phase that follows handles MCR cleanup on each host regardless of how the swarm was dissolved.
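
The three recovery steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the Host interface, fakeHost, recorder, and dissolveSwarmSketch are hypothetical names, and the exact commands (docker service rm, docker swarm leave --force) are assumed from the documented manual recovery rather than taken from the diff.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Host is a hypothetical stand-in for a managed node; Exec runs one
// command on that node and returns an error if it fails.
type Host interface {
	Address() string
	Exec(cmd string) error
}

// dissolveSwarmSketch returns per-node warnings, plus a non-nil error
// only when the leader itself cannot leave the swarm.
func dissolveSwarmSketch(leader Host, others []Host) (warnings []error, err error) {
	// Step 1: best-effort removal of the stuck uninstall-agent services.
	for _, svc := range []string{"ucp-uninstall-agent", "ucp-uninstall-agent-win"} {
		if rmErr := leader.Exec("docker service rm " + svc); rmErr != nil {
			warnings = append(warnings, fmt.Errorf("%s: remove %s: %w", leader.Address(), svc, rmErr))
		}
	}

	// Step 2: non-leader nodes leave in parallel; failures are warnings.
	var (
		wg sync.WaitGroup
		mu sync.Mutex
	)
	for _, h := range others {
		wg.Add(1)
		go func(h Host) {
			defer wg.Done()
			if leaveErr := h.Exec("docker swarm leave --force"); leaveErr != nil {
				mu.Lock()
				warnings = append(warnings, fmt.Errorf("%s: force-leave: %w", h.Address(), leaveErr))
				mu.Unlock()
			}
		}(h)
	}
	wg.Wait()

	// Step 3: the leader leaves last; this is the only hard failure.
	if leaveErr := leader.Exec("docker swarm leave --force"); leaveErr != nil {
		return warnings, fmt.Errorf("%s: leader force-leave: %w", leader.Address(), leaveErr)
	}
	return warnings, nil
}

// recorder collects executed commands across hosts in the order they ran.
type recorder struct {
	mu  sync.Mutex
	log []string
}

// fakeHost is an in-memory Host used only to exercise the sketch.
type fakeHost struct {
	addr string
	fail bool // when true, every command on this host fails
	rec  *recorder
}

func (f *fakeHost) Address() string { return f.addr }

func (f *fakeHost) Exec(cmd string) error {
	f.rec.mu.Lock()
	f.rec.log = append(f.rec.log, f.addr+": "+cmd)
	f.rec.mu.Unlock()
	if f.fail {
		return errors.New("command failed")
	}
	return nil
}
```

Running the sketch against fakeHost values shows the intended shape: a failing worker yields a warning while the call still succeeds, and the leader's force-leave is always the last command recorded. Collecting warnings rather than returning on the first worker failure mirrors the "logged as warnings, not fatal" behaviour described above.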

Changes

  • pkg/product/mke/phase/uninstall_mke.go: captures Bootstrap output (not just the error); isUninstallTimeout(output string) detects the timeout from the output stream; dissolveSwarm() implements the fallback
  • pkg/mcr/mcr.go: DrainNode no-ops when NodeID is empty (the node already left the swarm after dissolution); also removes a pre-existing duplicate drainCmd execution
  • pkg/product/mke/phase/uninstall_mke_test.go: unit tests for isUninstallTimeout

Copilot review fixes

Copilot identified that after dissolveSwarm() succeeds, UninstallMCR would call DrainNode which runs docker node update --availability drain <empty> — failing because the node is no longer in a swarm. Fixed in DrainNode: an empty NodeID is now treated as a no-op.
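
A minimal sketch of that guard, for illustration only. The runner type and drainNodeSketch are hypothetical names; the real DrainNode in pkg/mcr/mcr.go has its own signature and host plumbing, which are not shown in this description.

```go
package main

import "fmt"

// runner is a hypothetical hook that executes one command on the host.
type runner func(cmd string) error

// drainNodeSketch drains a swarm node, treating an empty node ID as a
// no-op because the node is assumed to have already left the swarm.
func drainNodeSketch(nodeID string, run runner) error {
	if nodeID == "" {
		// Nothing to drain: the node is no longer a swarm member.
		return nil
	}
	cmd := fmt.Sprintf("docker node update --availability drain %s", nodeID)
	// Run the drain command exactly once.
	return run(cmd)
}
```

With an empty ID the runner is never invoked, so no malformed docker node update command reaches the host; with a non-empty ID the command runs a single time.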

Copilot also surfaced that the initial isUninstallTimeout implementation checked the Bootstrap error string, but the timeout message "Uninstalling UCP took too long" is emitted at error level by MKE (not fatal), so it only appears in Bootstrap's output string. Fixed by capturing the output and checking it instead.
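
The detection and the resulting control flow might look roughly like this. It is a sketch under assumptions: the substring match, the handleUninstallSketch helper, the errSampleBootstrap value, and the dissolve callback are all hypothetical, and the real isUninstallTimeout may match the message differently.

```go
package main

import (
	"errors"
	"strings"
)

// uninstallTimeoutMessage is the message text quoted in this description.
const uninstallTimeoutMessage = "Uninstalling UCP took too long"

// errSampleBootstrap is a placeholder error used only in examples.
var errSampleBootstrap = errors.New("uninstall-ucp failed")

// isUninstallTimeoutSketch inspects the captured Bootstrap output, not the
// error value, because the message is logged at error level (not fatal).
func isUninstallTimeoutSketch(output string) bool {
	return strings.Contains(output, uninstallTimeoutMessage)
}

// handleUninstallSketch runs the fallback only for the timeout case;
// every other bootstrap error is returned unchanged.
func handleUninstallSketch(output string, bootstrapErr error, dissolve func() error) error {
	if bootstrapErr == nil {
		return nil
	}
	if !isUninstallTimeoutSketch(output) {
		return bootstrapErr
	}
	return dissolve()
}
```

Keeping the detector a pure function of the output string makes it straightforward to cover with table-driven unit tests, independent of any running cluster.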

CI results

| Run | Result | Notes |
| --- | --- | --- |
| 25912008675 | ✅ passed | After Copilot fix #1 (DrainNode guard); fallback still did not fire (wrong detection) |
| 25922740499 | ✅ passed | After Copilot fix #2 (output detection); fallback fired and completed correctly |

Confirmed from run 25922740499 logs:

MKE uninstall-ucp: Uninstalling UCP took too long!
[ssh] 44.192.39.182:22: force-leaving swarm
[ssh] 3.80.1.151:22: force-leaving swarm
[ssh] 32.197.224.118:22: force-leaving swarm
[ssh] 44.197.188.102:22: force-leaving swarm
[ssh] 100.48.222.71:22: force-leaving swarm
[ssh] 32.192.34.185:22: force-leaving swarm
[ssh] 32.197.48.207:22: force-leaving swarm (leader)
[ssh] 32.197.48.207:22: swarm dissolved; continuing with MCR uninstall

Reset() returned nil with no non-fatal warnings; the full reset completed successfully.

@james-nesbitt james-nesbitt added the smoke-test Run all smoke tests label May 11, 2026
@pgedara pgedara requested a review from Copilot May 14, 2026 12:27
Copilot AI left a comment

Pull request overview

Adds an automatic recovery path for uninstall-ucp timeouts during MKE reset by detecting the known timeout error and force-dissolving the Swarm so the reset process can continue.

Changes:

  • Detect uninstall-ucp “took too long” failures via isUninstallTimeout().
  • Add dissolveSwarm() fallback that removes stuck uninstall-agent services and forces nodes to leave the swarm (leader last).
  • Add unit tests for the timeout detector.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| pkg/product/mke/phase/uninstall_mke.go | Adds timeout detection and forced swarm dissolution fallback when uninstall-ucp times out. |
| pkg/product/mke/phase/uninstall_mke_test.go | Adds unit tests for isUninstallTimeout(). |

Comment thread pkg/product/mke/phase/uninstall_mke.go Outdated
@james-nesbitt james-nesbitt force-pushed the PRODENG-3442-reset-swarm-dissolution-fallback branch from 567899b to ae09db7 Compare May 15, 2026 09:48
@james-nesbitt james-nesbitt added smoke-modern Run modern smoke test and removed smoke-test Run all smoke tests labels May 15, 2026
The uninstall-ucp bootstrapper deploys ucp-uninstall-agent as a global
Swarm service, then waits (~2 min hardcoded) for every node to report
back. On large clusters or hosts with cold image caches this deadline is
missed, causing Reset() to fail.

Observed in CI:
  smoke-modern (MKE 3.9.2, 7 nodes): all nodes missed the deadline
  smoke-windows (MKE 3.8.8, Win2025): Win2025 missed the deadline

MKE documents the recovery path: remove the stuck ucp-uninstall-agent
service, then force every node to leave the swarm.

pkg/product/mke/phase/uninstall_mke.go:
  - Capture Bootstrap output (not just error): the timeout message
    'Uninstalling UCP took too long' is logged at error level by MKE and
    appears only in the output stream, not in the Bootstrap error value
    (which only aggregates fatal-level log lines).
  - isUninstallTimeout(output string) detects the timeout from the output.
  - dissolveSwarm() removes ucp-uninstall-agent/ucp-uninstall-agent-win
    from the leader (best-effort), force-leaves all non-leader nodes in
    parallel (per-node failures are warnings), then force-leaves the
    leader last (hard failure if this fails).
  - Non-timeout uninstall-ucp errors still propagate as hard failures.

pkg/mcr/mcr.go (DrainNode):
  - Empty NodeID guard: after forced swarm dissolution every node returns
    an empty NodeID from 'docker info'; previously this caused DrainNode
    to run 'docker node update --availability drain <empty>' which fails.
    Now treated as a no-op (node is already out of the swarm).
  - Also removed a pre-existing duplicate drainCmd execution (the command
    was being run twice on the happy path).

pkg/product/mke/phase/uninstall_mke_test.go:
  - Updated tests to match the new isUninstallTimeout(string) signature.

Signed-off-by: James Nesbitt <jnesbitt@mirantis.com>
@james-nesbitt james-nesbitt force-pushed the PRODENG-3442-reset-swarm-dissolution-fallback branch from ae09db7 to f26b07e Compare May 15, 2026 14:16
@james-nesbitt james-nesbitt added smoke-modern Run modern smoke test and removed smoke-modern Run modern smoke test labels May 15, 2026
Labels

smoke-modern Run modern smoke test

2 participants