[AMORO-4172] Self-heal stuck RUNNING optimizing process in TableRuntimeRefreshExecutor by fightBoxing · Pull Request #4195 · apache/amoro

fightBoxing · 2026-04-25T13:36:23Z

Problem

When a table's optimizing process reaches the post-execution phase (all tasks
are `SUCCESS` and the code is about to transition the table to `COMMITTING`),
a transient failure of `TableRuntime#beginCommitting()` (e.g. a DB lock wait
timeout in the underlying `UPDATE table_runtime`) causes `invokeConsistency`
to roll back the in-memory status. The process stays `RUNNING` and the table
status remains `*_OPTIMIZING` indefinitely, until AMS is restarted. The issue
reporter confirmed exactly this symptom.
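
To make the rollback behaviour concrete, here is a toy model of "apply in memory, persist, restore on failure". It is only a sketch: `ConsistencySketch` and `updateWithRollback` are hypothetical names and do not reproduce the real `invokeConsistency` implementation.

```java
import java.util.function.Consumer;

// Hypothetical toy model; not the real Amoro invokeConsistency code.
final class ConsistencySketch {
    private String status = "MINOR_OPTIMIZING";

    String status() {
        return status;
    }

    // Applies the new status in memory, then runs the persistence step.
    // If persistence throws, the previous in-memory status is restored.
    void updateWithRollback(String newStatus, Consumer<String> persist) {
        String previous = status;
        status = newStatus;
        try {
            persist.accept(newStatus);
        } catch (RuntimeException e) {
            status = previous; // roll back, leaving the table in *_OPTIMIZING
            throw e;
        }
    }

    public static void main(String[] args) {
        ConsistencySketch table = new ConsistencySketch();
        try {
            // Simulate the transient DB failure described above.
            table.updateWithRollback("COMMITTING", s -> {
                throw new IllegalStateException("simulated lock wait timeout");
            });
        } catch (IllegalStateException e) {
            System.out.println("persist failed: " + e.getMessage());
        }
        System.out.println(table.status()); // prints MINOR_OPTIMIZING
    }
}
```

After the failed call nothing in this model re-attempts the transition, which is the gap the PR addresses.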

Root cause: there is no code path that retries or re-drives
`beginCommitting()` when it transiently fails. `OptimizingCommitExecutor`
only schedules when the status is already `COMMITTING`, so the stuck state
is invisible to it.

Approach (A1: narrow, targeted self-heal)

Leverage the existing `TableRuntimeRefreshExecutor`, which already scans every
table periodically. At the end of `execute()`, detect the stuck pattern and
re-drive the transition:

```
process != null
&& process.getStatus() == RUNNING
&& tableRuntime.getOptimizingStatus() != COMMITTING
&& tableRuntime.getOptimizingStatus().isProcessing()
&& process.allTasksPrepared()
```

When all conditions hold we call `tableRuntime.beginCommitting()`. On
success the normal `handleTableChanged` → `OptimizingCommitExecutor` path
resumes. On failure we log and retry on the next refresh cycle (≤ 1 minute
by default).
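
The detection and retry logic can be sketched as follows. This is a minimal, self-contained illustration, not the PR's code: the enums and interfaces are simplified stand-ins for the Amoro types, `FlakyTableRuntime` is a hypothetical test double, and the assumption that `COMMITTING` counts as a processing status is mine.

```java
// Hypothetical, simplified stand-ins for the Amoro types named in this PR.
enum ProcessStatus { RUNNING, SUCCESS, FAILED }

enum OptimizingStatus {
    IDLE(false), MINOR_OPTIMIZING(true), MAJOR_OPTIMIZING(true), COMMITTING(true);

    private final boolean processing;

    OptimizingStatus(boolean processing) { this.processing = processing; }

    boolean isProcessing() { return processing; }
}

interface OptimizingProcess {
    ProcessStatus getStatus();
    boolean allTasksPrepared();
}

interface TableRuntime {
    OptimizingStatus getOptimizingStatus();
    void beginCommitting();
}

// Test double: beginCommitting() fails a fixed number of times, then succeeds.
final class FlakyTableRuntime implements TableRuntime {
    private OptimizingStatus status = OptimizingStatus.MINOR_OPTIMIZING;
    private int failuresLeft;

    FlakyTableRuntime(int failuresLeft) { this.failuresLeft = failuresLeft; }

    @Override
    public OptimizingStatus getOptimizingStatus() { return status; }

    @Override
    public void beginCommitting() {
        if (failuresLeft > 0) {
            failuresLeft--;
            throw new IllegalStateException("simulated lock wait timeout");
        }
        status = OptimizingStatus.COMMITTING;
    }
}

final class SelfHealSketch {
    // Mirrors the five conditions listed above.
    static boolean isStuck(OptimizingProcess process, TableRuntime tableRuntime) {
        if (process == null || process.getStatus() != ProcessStatus.RUNNING) {
            return false;
        }
        OptimizingStatus status = tableRuntime.getOptimizingStatus();
        return status != OptimizingStatus.COMMITTING
            && status.isProcessing()
            && process.allTasksPrepared();
    }

    // Returns true only if the transition was re-driven successfully.
    static boolean tryHealStuckCommitting(OptimizingProcess process, TableRuntime tableRuntime) {
        if (!isStuck(process, tableRuntime)) {
            return false;
        }
        try {
            tableRuntime.beginCommitting();
            return true;
        } catch (RuntimeException e) {
            // Swallow and log; the next refresh cycle will try again.
            System.err.println("self-heal failed, retrying next cycle: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        OptimizingProcess process = new OptimizingProcess() {
            @Override public ProcessStatus getStatus() { return ProcessStatus.RUNNING; }
            @Override public boolean allTasksPrepared() { return true; }
        };
        FlakyTableRuntime table = new FlakyTableRuntime(1);
        System.out.println(tryHealStuckCommitting(process, table)); // false: first cycle fails
        System.out.println(tryHealStuckCommitting(process, table)); // true: second cycle heals
        System.out.println(table.getOptimizingStatus());            // COMMITTING
    }
}
```

Once the table reaches `COMMITTING`, `isStuck` returns `false`, so repeated refresh cycles do not call `beginCommitting()` again in this model.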

Why this location

- `TableRuntimeRefreshExecutor` already runs for every table (its `enabled()`
  returns `true`), so no new thread pool or scheduling mechanism is needed.
- `OptimizingCommitExecutor` is intentionally kept as-is; it is only notified
  via the existing `handleTableChanged` event once the self-heal succeeds.
- The `OptimizingQueue.acceptResult()` hot path (concurrent task completion) is
  not changed; we only expose an existing read method.

Changes

| File | Change |
| --- | --- |
| `OptimizingProcess.java` | Add `boolean allTasksPrepared()` to the interface. |
| `OptimizingQueue.java` | `TableOptimizingProcess.allTasksPrepared()` becomes `@Override public` and lock-protected for cross-thread visibility. |
| `TableRuntimeRefreshExecutor.java` | New private `tryHealStuckCommitting()` invoked at the end of `execute()`. |
| `TestDefaultOptimizingService.java` | New regression test `testRefreshExecutorHealsStuckRunningProcess` that uses reflection to simulate the stuck state, then verifies `execute()` transitions back to `COMMITTING`. |
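
As an illustration of the "lock-protected for cross-thread visibility" change, the sketch below guards both the writer and the reader of a task map with the same lock. `TaskTrackerSketch`, `TaskStatus`, and the rule that an empty map is not "prepared" are assumptions made for this example, not the actual `TableOptimizingProcess` code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical task states for the sketch.
enum TaskStatus { PLANNED, ACKED, SUCCESS, FAILED }

final class TaskTrackerSketch {
    private final Lock lock = new ReentrantLock();
    private final Map<Integer, TaskStatus> taskMap = new LinkedHashMap<>();

    // Writer side: called as task results arrive (possibly from many threads).
    void acceptResult(int taskId, TaskStatus status) {
        lock.lock();
        try {
            taskMap.put(taskId, status);
        } finally {
            lock.unlock();
        }
    }

    // Reader side: taking the same lock gives the refresh thread a
    // consistent view of taskMap. An empty map is treated as not prepared
    // (an assumption of this sketch).
    public boolean allTasksPrepared() {
        lock.lock();
        try {
            return !taskMap.isEmpty()
                && taskMap.values().stream().allMatch(s -> s == TaskStatus.SUCCESS);
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        TaskTrackerSketch tracker = new TaskTrackerSketch();
        tracker.acceptResult(1, TaskStatus.SUCCESS);
        tracker.acceptResult(2, TaskStatus.ACKED);
        System.out.println(tracker.allTasksPrepared()); // false: task 2 not done
        tracker.acceptResult(2, TaskStatus.SUCCESS);
        System.out.println(tracker.allTasksPrepared()); // true
    }
}
```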

Verification

Target test passes locally:

```
mvn -pl amoro-ams -am test \
  -Dtest='TestDefaultOptimizingService#testRefreshExecutorHealsStuckRunningProcess' \
  -Dsurefire.failIfNoSpecifiedTests=false -DskipITs

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
BUILD SUCCESS
```

Self-heal path logs are observable during the test:

```
WARN TableRuntimeRefresher: ...detected stuck RUNNING optimizing process
(processId=..., status=MINOR_OPTIMIZING): all tasks have succeeded but the
table never transitioned to COMMITTING. Self-healing by re-driving
beginCommitting() (issue #4172).
```

Risk

- No change to concurrent hot paths; only one additional idempotent read
  (`allTasksPrepared`) per refresh cycle, short-circuited early for
  tables that are not processing.
- Worst case if the underlying DB stays unhealthy: self-heal keeps logging
  at `WARN` and retrying every refresh cycle, which costs the same as
  before plus one cheap read. Strictly better than today's
  "stuck until restart".
- Opened as draft first to let CI confirm; I'll flip to ready
  for review if green.

…meRefreshExecutor

When a RUNNING optimizing process has all tasks succeeded but a transient failure of beginCommitting() (e.g. DB lock wait timeout) leaves the table stuck in *_OPTIMIZING indefinitely, the status is never recovered until AMS is restarted.

Add a lightweight self-heal path in TableRuntimeRefreshExecutor.execute: if it detects a process in RUNNING state whose status != COMMITTING and allTasksPrepared() == true, it re-invokes beginCommitting(). On success the normal OptimizingCommitExecutor pipeline resumes via handleTableChanged; on failure it logs and retries on the next refresh cycle.

Changes:
- OptimizingProcess: add allTasksPrepared() to the interface
- OptimizingQueue.TableOptimizingProcess: make allTasksPrepared() public and lock-protected so other threads observe a consistent taskMap view
- TableRuntimeRefreshExecutor: invoke tryHealStuckCommitting at the end of execute()
- Test: regression test using reflection to simulate the stuck state

Fixes apache#4172

fightBoxing · 2026-04-25T14:04:27Z

Superseded: master has since been refactored (scheduler framework, DefaultTableRuntime, ProcessStatus). Closing this branch-stale PR and re-opening a fresh one against current master. Root-cause analysis carries over unchanged (see issue #4172 for status update).

github-actions bot added the `module:ams-server` (Ams server module) label Apr 25, 2026

fightBoxing mentioned this pull request Apr 25, 2026

[Bug]: status of process hang on running #4172

Open

2 tasks

fightBoxing closed this Apr 25, 2026

fightBoxing mentioned this pull request Apr 25, 2026

[AMORO-4172] Self-heal stuck RUNNING optimizing process in TableRuntimeRefreshExecutor #4196

Open

fightBoxing wants to merge 1 commit into apache:master from fightBoxing:fix/4172-self-heal-stuck-committing
