Instances randomly get stopped during live migration triggered by host maintenance #13010

@jgotteswinter

Description

While enabling maintenance mode, I see random instances getting stopped while the host is being evacuated. The majority are migrated without any issues, but occasionally an instance that should have been live migrated is stopped instead.

The management server logs the following:

2026-04-13 10:26:54,986 INFO [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-1:[ctx-7abbe53d, work-3314]) (logid:5ce65c99) Migration attempt: for VM VM instance {"id":4930,"instanceName":"i-55-4930-VM","state":"Running","type":"User","uuid":"cf19-00b6-465e-98f1-c63b4860498d"}from host Host {"id":18,"name":"XXXch02","type":"Routing","uuid":"dc51-a18d-4f7d-9a2e-7dfbb7a1b908"}. Starting attempt: 1/5 times.
2026-04-13 10:42:32,197 INFO [c.c.v.ClusteredVirtualMachineManagerImpl] (Work-Job-Executor-21:[ctx-1e4a3543, job-742712/job-743693, ctx-ff1f267b]) (logid:279e8d1b) Migrating VM instance {"id":4930,"instanceName":"i-55-4930-VM","state":"Running","type":"User","uuid":"cf19-00b6-465e-98f1-c63b4860498d"} to Dest[Zone(Id)-Pod(Id)-Cluster(Id)-Host(Id)-Storage(Volume(Id|Type-->Pool(Id))] : Dest[Zone(3)-Pod(3)-Cluster(3)-Host(18)-Storage()]
2026-04-13 10:42:32,349 WARN [c.c.v.ClusteredVirtualMachineManagerImpl] (Work-Job-Executor-21:[ctx-1e4a3543, job-742712/job-743693, ctx-ff1f267b]) (logid:279e8d1b) Unable to migrate VM instance {"id":4930,"instanceName":"i-55-4930-VM","state":"Running","type":"User","uuid":"cf19-00b6-465e-98f1-c63b4860498d"} to Host {"id":18,"name":"XXXch02","type":"Routing","uuid":"dc51-a18d-4f7d-9a2e-7dfbb7a1b908"} due to [Resource [Host:18] is unreachable: Host 18: Operation timed out]
com.cloud.exception.AgentUnavailableException: Resource [Host:18] is unreachable: Host 18: Operation timed out
2026-04-13 10:43:27,247 INFO [c.c.r.ResourceManagerImpl] (AgentMonitor-1:[ctx-6e6b2b3f]) (logid:afd387b5) Attempting maintenance for Host {"id":21,"name":"XXXch03","type":"Routing","uuid":"eacf-b3e7-4aa9-b4ae-ff5a41862c06"} found pending migration for VM instance {"id":4930,"instanceName":"i-55-4930-VM","state":"Stopping","type":"User","uuid":"cf19-00b6-465e-98f1-c63b4860498d"}.
2026-04-13 10:43:40,248 ERROR [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-21:[ctx-1e4a3543, job-742712/job-743693, ctx-ff1f267b]) (logid:279e8d1b) Invocation exception, caused by: com.cloud.utils.exception.CloudRuntimeException: Unable to migrate VM instance {"id":4930,"instanceName":"i-55-4930-VM","state":"Running","type":"User","uuid":"cf19-00b6-465e-98f1-c63b4860498d"}
2026-04-13 10:43:40,248 INFO [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-21:[ctx-1e4a3543, job-742712/job-743693, ctx-ff1f267b]) (logid:279e8d1b) Rethrow exception com.cloud.utils.exception.CloudRuntimeException: Unable to migrate VM instance {"id":4930,"instanceName":"i-55-4930-VM","state":"Running","type":"User","uuid":"cf19-00b6-465e-98f1-c63b4860498d"}
2026-04-13 10:43:40,248 ERROR [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-21:[ctx-1e4a3543, job-742712/job-743693]) (logid:279e8d1b) Unable to complete AsyncJob {"accountId":1,"cmd":"com.cloud.vm.VmWorkMigrateAway","cmdInfo":"rO0ABXNyAB5jb20uY2xvdWQudm0uVm1Xb3JrTWlncmF0ZUF3YXmt4MX4jtcEmwIAAUoACXNyY0hvc3RJZHhyABNjb20uY2xvdWQudm0uVm1Xb3Jrn5m2VvAlZ2sCAARKAAlhY2NvdW50SWRKAAZ1c2VySWRKAAR2bUlkTAALaGFuZGxlck5hbWV0ABJMamF2YS9sYW5nL1N0cmluZzt4cAAAAAAAAAABAAAAAAAAAAEAAAAAAAATQnQAGVZpcnR1YWxNYWNoaW5lTWFuYWdlckltcGwAAAAAAAAAFQ","cmdVersion":0,"completeMsid":null,"created":"Mon Apr 13 10:42:31 UTC 2026","id":743693,"initMsid":90520733699643,"instanceId":null,"instanceType":null,"lastPolled":null,"lastUpdated":null,"processStatus":0,"removed":null,"result":null,"resultCode":0,"status":"IN_PROGRESS","userId":1,"uuid":"1401-8cf9-4276-ab57-c6a844371dd2"}, job origin: 742712
com.cloud.utils.exception.CloudRuntimeException: Unable to migrate VM instance {"id":4930,"instanceName":"i-55-4930-VM","state":"Running","type":"User","uuid":"cf19-00b6-465e-98f1-c63b4860498d"}
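
To make the state change easier to spot in a longer log, here is a minimal Python sketch that pulls out the timestamp, log level, and reported VM state for every entry mentioning a given instance ID. It is not part of CloudStack: the function name `vm_state_timeline` and the regular expressions are assumptions based only on the entry format shown in the log above, and the sample text in the demo is abbreviated from those entries.

```python
import re

# Start of a log entry as it appears above: "YYYY-MM-DD HH:MM:SS,mmm LEVEL ".
# The list of levels is an assumption; only INFO, WARN and ERROR occur in this report.
ENTRY_START = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (INFO|WARN|ERROR|DEBUG|TRACE) "
)


def vm_state_timeline(log_text, vm_id):
    """Return a list of (timestamp, level, state) for entries mentioning the VM."""
    # Split at every timestamp so entries that were pasted onto one line
    # are still treated separately.
    starts = [m.start() for m in ENTRY_START.finditer(log_text)]
    starts.append(len(log_text))
    vm_pattern = re.compile(r'VM instance \{"id":%d,.*?"state":"(\w+)"' % vm_id)

    timeline = []
    for begin, end in zip(starts, starts[1:]):
        entry = log_text[begin:end]
        header = ENTRY_START.match(entry)
        found = vm_pattern.search(entry)
        if found:
            timeline.append((header.group(1), header.group(2), found.group(1)))
    return timeline


if __name__ == "__main__":
    # Abbreviated copies of two entries from the log above.
    sample = (
        '2026-04-13 10:42:32,197 INFO [c.c.v.ClusteredVirtualMachineManagerImpl] '
        'Migrating VM instance {"id":4930,"instanceName":"i-55-4930-VM",'
        '"state":"Running","type":"User"}\n'
        '2026-04-13 10:43:27,247 INFO [c.c.r.ResourceManagerImpl] '
        'found pending migration for VM instance {"id":4930,'
        '"instanceName":"i-55-4930-VM","state":"Stopping","type":"User"}\n'
    )
    for row in vm_state_timeline(sample, 4930):
        print(row)
```

Run against the full management-server log, this should show at which entry instance 4930 is first reported as "Stopping" relative to the failed migration attempt.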

I would expect the instance to be left alone, up and running on its origin host, and the maintenance mode request to fail instead.

Versions

ACS 4.22
Ubuntu 24.04
KVM
