[AMORO-4172] Self-heal stuck RUNNING optimizing process in TableRuntimeRefreshExecutor #4195
Closed
fightBoxing wants to merge 1 commit into apache:master
…meRefreshExecutor

When a RUNNING optimizing process has all tasks succeeded but a transient failure of beginCommitting() (e.g. DB lock wait timeout) leaves the table stuck in *_OPTIMIZING indefinitely, the status is never recovered until AMS is restarted.

Add a lightweight self-heal path in TableRuntimeRefreshExecutor.execute: if it detects a process in RUNNING state whose status != COMMITTING and allTasksPrepared() == true, it re-invokes beginCommitting(). On success the normal OptimizingCommitExecutor pipeline resumes via handleTableChanged; on failure it logs and retries on the next refresh cycle.

Changes:
- OptimizingProcess: add allTasksPrepared() to the interface
- OptimizingQueue.TableOptimizingProcess: make allTasksPrepared() public and lock-protected so other threads observe a consistent taskMap view
- TableRuntimeRefreshExecutor: invoke tryHealStuckCommitting at the end of execute()
- Test: regression test using reflection to simulate the stuck state

Fixes apache#4172
Superseded: master has since been refactored (scheduler framework, DefaultTableRuntime, ProcessStatus). Closing this branch-stale PR and re-opening a fresh one against current master. Root-cause analysis carries over unchanged (see issue #4172 for status update).
Fixes #4172.
Problem
When a table's optimizing process reaches the post-execution phase (all tasks
are `SUCCESS` and the code is about to transition the table to `COMMITTING`),
a transient failure of `TableRuntime#beginCommitting()` (e.g. DB lock wait
timeout in the underlying `UPDATE table_runtime`) causes `invokeConsistency`
to roll back the in-memory status. The process stays `RUNNING` but the table
status remains `*_OPTIMIZING`, forever, until AMS is restarted. The issue
reporter confirmed exactly this symptom.
Root cause: there is no code path that retries or re-drives
`beginCommitting()` when it transiently fails. `OptimizingCommitExecutor`
only schedules when the status is already `COMMITTING`, so the stuck state
is invisible to it.
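To make the failure mode concrete, here is a minimal model of it, not the actual AMS code. `TableState`, its `persist` callback, and the snapshot-and-restore logic are hypothetical stand-ins for `TableRuntime`, the `UPDATE table_runtime` statement, and `invokeConsistency`; the only behavior taken from the description above is that a failed persistence step rolls the in-memory status back.

```java
import java.util.function.Consumer;

public class RollbackSketch {

  public enum Status { MINOR_OPTIMIZING, COMMITTING }

  /** Hypothetical stand-in for the table runtime's in-memory state. */
  public static class TableState {
    private Status status = Status.MINOR_OPTIMIZING;

    public Status getStatus() {
      return status;
    }

    /**
     * Flips the status in memory, then persists it. If persistence throws,
     * the in-memory status is restored and the exception is rethrown.
     * The persist callback is an assumption standing in for the DB update.
     */
    public void beginCommitting(Consumer<Status> persist) {
      Status previous = status;
      try {
        status = Status.COMMITTING;
        persist.accept(status);
      } catch (RuntimeException e) {
        status = previous; // rollback: the table looks like it is still optimizing
        throw e;
      }
    }
  }

  public static void main(String[] args) {
    TableState table = new TableState();
    try {
      table.beginCommitting(s -> {
        throw new IllegalStateException("Lock wait timeout exceeded");
      });
    } catch (IllegalStateException e) {
      System.out.println("beginCommitting failed: " + e.getMessage());
    }
    // Nothing calls beginCommitting() again, so the status stays where it was.
    System.out.println("status after failure: " + table.getStatus());
  }
}
```

In this model the caller sees one exception and the state is internally consistent, which is exactly why the stuck table goes unnoticed: nothing is corrupt, the transition is simply never attempted again.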
Approach (A1: narrow, targeted self-heal)
Leverage the existing `TableRuntimeRefreshExecutor`, which already scans every
table periodically. At the end of `execute()`, detect the stuck pattern and
re-drive the transition:
```
process != null
&& process.getStatus() == RUNNING
&& tableRuntime.getOptimizingStatus() != COMMITTING
&& tableRuntime.getOptimizingStatus().isProcessing()
&& process.allTasksPrepared()
```
When all conditions hold we call `tableRuntime.beginCommitting()`. On
success the normal `handleTableChanged` → `OptimizingCommitExecutor` path
resumes. On failure we log and retry on the next refresh cycle (≤ 1 minute
by default).
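The detect-and-retry logic described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the nested enums and interfaces are trimmed-down, hypothetical stand-ins for the AMS types, `FakeTable` exists only for the demo, and only the predicate and the method name `tryHealStuckCommitting` come from the description.

```java
public class StuckCommitHealSketch {

  // Hypothetical stand-ins for the AMS types; the real ones carry more states and methods.
  public enum ProcessStatus { RUNNING, SUCCESS, FAILED }

  public enum OptimizingStatus {
    IDLE(false), MINOR_OPTIMIZING(true), COMMITTING(true);

    private final boolean processing;
    OptimizingStatus(boolean processing) { this.processing = processing; }
    public boolean isProcessing() { return processing; }
  }

  public interface OptimizingProcess {
    ProcessStatus getStatus();
    boolean allTasksPrepared();
  }

  public interface TableRuntime {
    OptimizingProcess getOptimizingProcess();
    OptimizingStatus getOptimizingStatus();
    void beginCommitting();
  }

  /** The detection predicate from the description; cheap checks run first. */
  public static boolean isStuck(TableRuntime table) {
    OptimizingProcess process = table.getOptimizingProcess();
    return process != null
        && process.getStatus() == ProcessStatus.RUNNING
        && table.getOptimizingStatus() != OptimizingStatus.COMMITTING
        && table.getOptimizingStatus().isProcessing()
        && process.allTasksPrepared();
  }

  /** Returns true if the transition was re-driven; a failure is logged and left for the next cycle. */
  public static boolean tryHealStuckCommitting(TableRuntime table) {
    if (!isStuck(table)) {
      return false;
    }
    try {
      table.beginCommitting();
      return true;
    } catch (RuntimeException e) {
      System.err.println("WARN self-heal failed, retrying on next refresh: " + e.getMessage());
      return false;
    }
  }

  /** Demo-only fake: beginCommitting() fails a fixed number of times, then succeeds. */
  public static class FakeTable implements TableRuntime, OptimizingProcess {
    private OptimizingStatus status = OptimizingStatus.MINOR_OPTIMIZING;
    private int failuresLeft;

    public FakeTable(int failuresLeft) { this.failuresLeft = failuresLeft; }

    @Override public OptimizingProcess getOptimizingProcess() { return this; }
    @Override public OptimizingStatus getOptimizingStatus() { return status; }
    @Override public ProcessStatus getStatus() { return ProcessStatus.RUNNING; }
    @Override public boolean allTasksPrepared() { return true; }

    @Override public void beginCommitting() {
      if (failuresLeft-- > 0) {
        throw new IllegalStateException("Lock wait timeout exceeded");
      }
      status = OptimizingStatus.COMMITTING;
    }
  }

  public static void main(String[] args) {
    FakeTable table = new FakeTable(1);
    System.out.println(tryHealStuckCommitting(table)); // first refresh cycle: transient failure
    System.out.println(tryHealStuckCommitting(table)); // second refresh cycle: transition succeeds
    System.out.println(table.getOptimizingStatus());
  }
}
```

Swallowing the exception in the sketch is deliberate: the refresh cycle is the retry loop, so a failed attempt costs nothing beyond a log line.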
Why this location
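- `TableRuntimeRefreshExecutor` already scans every table, so there is no new thread pool and no new scheduling mechanism.
- The commit pipeline resumes via the existing `handleTableChanged` event once the self-heal succeeds.
- The optimizing process's behavior is not changed; we only expose an existing read method.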
Changes
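The commit message describes `allTasksPrepared()` as made public and lock-protected so that other threads observe a consistent `taskMap` view. A minimal sketch of that pattern, assuming a `ReentrantLock` guarding a plain `HashMap`; the class name, the `TaskStatus` values, the `updateTask` helper, and the choice to return `false` for an empty map are illustrative assumptions rather than the actual `OptimizingQueue.TableOptimizingProcess` code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

public class TaskMapSketch {

  // Hypothetical task states; the real task runtime has its own status type.
  public enum TaskStatus { PLANNED, SCHEDULED, SUCCESS, FAILED }

  private final ReentrantLock lock = new ReentrantLock();
  private final Map<String, TaskStatus> taskMap = new HashMap<>();

  /** Writers take the same lock, so readers never see a half-applied update. */
  public void updateTask(String taskId, TaskStatus status) {
    lock.lock();
    try {
      taskMap.put(taskId, status);
    } finally {
      lock.unlock();
    }
  }

  /**
   * True only when there is at least one task and every task has succeeded.
   * Treating an empty map as "not prepared" is an assumption of this sketch.
   */
  public boolean allTasksPrepared() {
    lock.lock();
    try {
      return !taskMap.isEmpty()
          && taskMap.values().stream().allMatch(s -> s == TaskStatus.SUCCESS);
    } finally {
      lock.unlock();
    }
  }

  public static void main(String[] args) {
    TaskMapSketch process = new TaskMapSketch();
    process.updateTask("task-1", TaskStatus.SUCCESS);
    process.updateTask("task-2", TaskStatus.SCHEDULED);
    System.out.println(process.allTasksPrepared());
    process.updateTask("task-2", TaskStatus.SUCCESS);
    System.out.println(process.allTasksPrepared());
  }
}
```

Because the refresh executor runs on a different thread from the one that completes tasks, reading the map under the writers' lock is what makes the extra call safe.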
Verification
Target test passes locally:
```
mvn -pl amoro-ams -am test \
  -Dtest='TestDefaultOptimizingService#testRefreshExecutorHealsStuckRunningProcess' \
  -Dsurefire.failIfNoSpecifiedTests=false -DskipITs
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
BUILD SUCCESS
```
Self-heal path logs are observable during the test:
```
WARN TableRuntimeRefresher: ...detected stuck RUNNING optimizing process
(processId=..., status=MINOR_OPTIMIZING): all tasks have succeeded but the
table never transitioned to COMMITTING. Self-healing by re-driving
beginCommitting() (issue #4172).
```
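The regression test is described as using reflection to simulate the stuck state. The general technique can be sketched like this, with a hypothetical `Table` class and a hypothetical private field name; the real test targets AMS internals that are not reproduced here.

```java
import java.lang.reflect.Field;

public class ReflectionStuckStateSketch {

  /** Hypothetical class whose status has no public setter. */
  public static class Table {
    private String optimizingStatus = "COMMITTING";

    public String getOptimizingStatus() {
      return optimizingStatus;
    }
  }

  /** Overwrites the private field directly, bypassing the normal transition methods. */
  public static void forceStatus(Table table, String status) {
    try {
      Field field = Table.class.getDeclaredField("optimizingStatus");
      field.setAccessible(true);
      field.set(table, status);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("could not set field via reflection", e);
    }
  }

  public static void main(String[] args) {
    Table table = new Table();
    // Put the table back into an optimizing status, as if the transition had been rolled back.
    forceStatus(table, "MINOR_OPTIMIZING");
    System.out.println(table.getOptimizingStatus());
  }
}
```

Reflection suits this case because the stuck state is, by construction, one that the public API is not supposed to let you reach.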
Risk
- One extra read (`allTasksPrepared`) per refresh cycle, short-circuited early for
  tables that are not processing.
- If `beginCommitting()` keeps failing, the self-heal path keeps logging
  at `WARN` and retrying every refresh cycle: exactly the same cost as
  before plus one cheap read. Strictly better than today's
  "stuck until restart".
for review if green.