[AMORO-4172] Self-heal stuck RUNNING optimizing process in TableRuntimeRefreshExecutor #4195

Closed
fightBoxing wants to merge 1 commit into apache:master from fightBoxing:fix/4172-self-heal-stuck-committing

@fightBoxing

Fixes #4172.

Problem

When a table's optimizing process reaches the post-execution phase (all tasks
are `SUCCESS` and the code is about to transition the table to `COMMITTING`),
a transient failure of `TableRuntime#beginCommitting()` (e.g. a DB lock wait
timeout in the underlying `UPDATE table_runtime`) causes `invokeConsistency`
to roll back the in-memory status. The process stays `RUNNING` while the table
status remains `*_OPTIMIZING` indefinitely, until AMS is restarted. The issue
reporter confirmed exactly this symptom.

Root cause: there is no code path that retries or re-drives
`beginCommitting()` when it transiently fails. `OptimizingCommitExecutor`
only schedules when the status is already `COMMITTING`, so the stuck state
is invisible to it.
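
The failure mode can be reproduced in miniature. The sketch below is not Amoro code: `Status`, `MiniTableRuntime`, and the `Consumer`-based persister are hypothetical stand-ins. It assumes only what the description above states, namely that a failed persist rolls the in-memory status back and nothing re-drives the transition afterwards.

```java
import java.util.function.Consumer;

/** Hypothetical status enum; only the names this description mentions. */
enum Status { MINOR_OPTIMIZING, COMMITTING }

/** Hypothetical stand-in for the table runtime; not the real Amoro class. */
class MiniTableRuntime {
  private Status status = Status.MINOR_OPTIMIZING;
  private final Consumer<Status> persister; // assumed to throw on a DB failure

  MiniTableRuntime(Consumer<Status> persister) {
    this.persister = persister;
  }

  Status getStatus() {
    return status;
  }

  /** Updates memory, persists, and rolls memory back if persisting throws. */
  void beginCommitting() {
    Status previous = status;
    status = Status.COMMITTING;
    try {
      persister.accept(status);
    } catch (RuntimeException e) {
      status = previous; // rollback keeps memory consistent with the DB
      throw e;
    }
  }
}

class RollbackDemo {
  public static void main(String[] args) {
    MiniTableRuntime runtime = new MiniTableRuntime(s -> {
      throw new IllegalStateException("lock wait timeout (simulated)");
    });
    try {
      runtime.beginCommitting();
    } catch (IllegalStateException e) {
      System.out.println("beginCommitting failed: " + e.getMessage());
    }
    // No caller retries, so the status stays at its pre-commit value.
    System.out.println("status after failure: " + runtime.getStatus());
  }
}
```

After the simulated failure the status is back at `MINOR_OPTIMIZING`, which is the state an executor that only watches for `COMMITTING` never sees.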

Approach (A1: narrow, targeted self-heal)

Leverage the existing `TableRuntimeRefreshExecutor`, which already scans every
table periodically. At the end of `execute()`, detect the stuck pattern and
re-drive the transition:

```
process != null
&& process.getStatus() == RUNNING
&& tableRuntime.getOptimizingStatus() != COMMITTING
&& tableRuntime.getOptimizingStatus().isProcessing()
&& process.allTasksPrepared()
```

When all conditions hold we call `tableRuntime.beginCommitting()`. On
success the normal `handleTableChanged` → `OptimizingCommitExecutor` path
resumes. On failure we log and retry on the next refresh cycle (≤ 1 minute
by default).
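
As a rough illustration of the detection and retry logic described above, here is a minimal, self-contained sketch. All types in it (`ProcessView`, `TableRuntimeView`, `StuckCommitHealer`, `FlakyRuntime`, `FinishedProcess`) are hypothetical and trimmed down for the example; the enum values and the `isProcessing()` flags are assumptions, not the real Amoro definitions.

```java
/** Hypothetical, trimmed-down stand-ins; the real Amoro types have more members. */
enum ProcessStatus { RUNNING, SUCCESS, FAILED }

enum OptimizingStatus {
  IDLE(false), MINOR_OPTIMIZING(true), COMMITTING(true); // flags are assumed

  private final boolean processing;
  OptimizingStatus(boolean processing) { this.processing = processing; }
  boolean isProcessing() { return processing; }
}

interface ProcessView {
  ProcessStatus getStatus();
  boolean allTasksPrepared();
}

interface TableRuntimeView {
  OptimizingStatus getOptimizingStatus();
  void beginCommitting();
}

final class StuckCommitHealer {
  private StuckCommitHealer() {}

  /** Mirrors the five-clause condition listed in this description. */
  static boolean isStuck(ProcessView process, TableRuntimeView table) {
    return process != null
        && process.getStatus() == ProcessStatus.RUNNING
        && table.getOptimizingStatus() != OptimizingStatus.COMMITTING
        && table.getOptimizingStatus().isProcessing()
        && process.allTasksPrepared();
  }

  /** Returns true only if this call moved the table to COMMITTING. */
  static boolean tryHeal(ProcessView process, TableRuntimeView table) {
    if (!isStuck(process, table)) {
      return false;
    }
    try {
      table.beginCommitting();
      return true;
    } catch (RuntimeException e) {
      // Swallow and report; the next refresh cycle will try again.
      System.err.println("self-heal failed, retrying next cycle: " + e.getMessage());
      return false;
    }
  }
}

/** Test double: beginCommitting() fails a fixed number of times, then succeeds. */
class FlakyRuntime implements TableRuntimeView {
  private OptimizingStatus status = OptimizingStatus.MINOR_OPTIMIZING;
  private int failuresLeft;
  int attempts = 0;

  FlakyRuntime(int failuresLeft) { this.failuresLeft = failuresLeft; }

  @Override public OptimizingStatus getOptimizingStatus() { return status; }

  @Override public void beginCommitting() {
    attempts++;
    if (failuresLeft-- > 0) {
      throw new IllegalStateException("lock wait timeout (simulated)");
    }
    status = OptimizingStatus.COMMITTING;
  }
}

/** Test double: a RUNNING process whose tasks have all succeeded. */
class FinishedProcess implements ProcessView {
  @Override public ProcessStatus getStatus() { return ProcessStatus.RUNNING; }
  @Override public boolean allTasksPrepared() { return true; }
}

class SelfHealDemo {
  public static void main(String[] args) {
    ProcessView process = new FinishedProcess();
    FlakyRuntime runtime = new FlakyRuntime(1);
    for (int cycle = 1; cycle <= 3; cycle++) {
      boolean healed = StuckCommitHealer.tryHeal(process, runtime);
      System.out.println("cycle " + cycle + ": healed=" + healed
          + ", status=" + runtime.getOptimizingStatus());
    }
  }
}
```

In this simulation the first cycle fails and leaves the status untouched, the second cycle heals, and the third is a no-op because the table is already `COMMITTING`, so `beginCommitting()` is not called again.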

Why this location

  • `TableRuntimeRefreshExecutor` is already enabled (`enabled() == true`) for
    every table, so no new thread pool or scheduling mechanism is needed.
  • `OptimizingCommitExecutor` is intentionally kept as-is; it is only notified
    via the existing `handleTableChanged` event once the self-heal succeeds.
  • `OptimizingQueue.acceptResult()` hot path (concurrent task completion) is
    not changed — we only expose an existing read method.

Changes

| File | Change |
| --- | --- |
| `OptimizingProcess.java` | Add `boolean allTasksPrepared()` to the interface. |
| `OptimizingQueue.java` | `TableOptimizingProcess.allTasksPrepared()` becomes `@Override public` and lock-protected for cross-thread visibility. |
| `TableRuntimeRefreshExecutor.java` | New private `tryHealStuckCommitting()` invoked at the end of `execute()`. |
| `TestDefaultOptimizingService.java` | New regression test `testRefreshExecutorHealsStuckRunningProcess` that uses reflection to simulate the stuck state, then verifies `execute()` transitions back to `COMMITTING`. |
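
To illustrate what "lock-protected for cross-thread visibility" means for `allTasksPrepared()`, here is a minimal sketch. `TaskBoard` and `TaskStatus` are hypothetical names, and treating an empty task map as "not prepared" is an assumption of this example rather than a statement about the real `TableOptimizingProcess`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical task states for the example. */
enum TaskStatus { PLANNED, ACKED, SUCCESS, FAILED }

/** Hypothetical holder of a task map guarded by a single lock. */
class TaskBoard {
  private final Lock lock = new ReentrantLock();
  private final Map<Integer, TaskStatus> taskMap = new LinkedHashMap<>();

  /** Writers (e.g. task-completion callbacks) update the map under the lock. */
  void put(int taskId, TaskStatus status) {
    lock.lock();
    try {
      taskMap.put(taskId, status);
    } finally {
      lock.unlock();
    }
  }

  /**
   * Readers on other threads take the same lock, so they see a consistent
   * snapshot of the map instead of a half-applied update.
   */
  boolean allTasksPrepared() {
    lock.lock();
    try {
      return !taskMap.isEmpty()
          && taskMap.values().stream().allMatch(s -> s == TaskStatus.SUCCESS);
    } finally {
      lock.unlock();
    }
  }
}

class TaskBoardDemo {
  public static void main(String[] args) {
    TaskBoard board = new TaskBoard();
    board.put(1, TaskStatus.SUCCESS);
    board.put(2, TaskStatus.ACKED);
    System.out.println("prepared while task 2 is ACKED: " + board.allTasksPrepared());
    board.put(2, TaskStatus.SUCCESS);
    System.out.println("prepared after task 2 succeeds: " + board.allTasksPrepared());
  }
}
```

Sharing one lock between writers and the new reader is what gives the refresh thread a happens-before relationship with task-completion updates; a plain unsynchronized read of a non-concurrent map would carry no such guarantee.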

Verification

Target test passes locally:

```
mvn -pl amoro-ams -am test \
  -Dtest='TestDefaultOptimizingService#testRefreshExecutorHealsStuckRunningProcess' \
  -Dsurefire.failIfNoSpecifiedTests=false -DskipITs

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
BUILD SUCCESS
```

Self-heal path logs are observable during the test:

```
WARN TableRuntimeRefresher: ...detected stuck RUNNING optimizing process
(processId=..., status=MINOR_OPTIMIZING): all tasks have succeeded but the
table never transitioned to COMMITTING. Self-healing by re-driving
beginCommitting() (issue #4172).
```

Risk

  • No change to concurrent hot paths; only one additional idempotent read
    (`allTasksPrepared`) per refresh cycle, short-circuited early for
    tables that are not processing.
  • Worst case if the underlying DB stays unhealthy: self-heal keeps logging
    at `WARN` and retrying every refresh cycle — exactly the same cost as
    before plus one cheap read. Strictly better than today's
    "stuck until restart".
  • Opened as draft first to let CI confirm, then I'll flip to ready
    for review if green.

…meRefreshExecutor

When all tasks of a RUNNING optimizing process have succeeded but
beginCommitting() fails transiently (e.g. DB lock wait timeout), the table
is left stuck in *_OPTIMIZING and never recovers until AMS is restarted.

Add a lightweight self-heal path in TableRuntimeRefreshExecutor.execute:
if it detects a RUNNING process whose table status is not yet COMMITTING
and allTasksPrepared() == true, it re-invokes beginCommitting(). On success
the normal OptimizingCommitExecutor pipeline resumes via handleTableChanged;
on failure it logs and retries on the next refresh cycle.

Changes:
- OptimizingProcess: add allTasksPrepared() to the interface
- OptimizingQueue.TableOptimizingProcess: make allTasksPrepared() public
  and lock-protected so other threads observe a consistent taskMap view
- TableRuntimeRefreshExecutor: invoke tryHealStuckCommitting at the end
  of execute()
- Test: regression test using reflection to simulate the stuck state

Fixes apache#4172
github-actions bot added the `module:ams-server` (Ams server module) label on Apr 25, 2026
@fightBoxing
Author

Superseded: master has since been refactored (scheduler framework, DefaultTableRuntime, ProcessStatus). Closing this branch-stale PR and re-opening a fresh one against current master. Root-cause analysis carries over unchanged (see issue #4172 for status update).

Development

Successfully merging this pull request may close these issues.

[Bug]: status of process hang on running