HBASE-30112 Fix for flapping TestAsyncTableScan#8120
Open
haridsv wants to merge 1 commit intoapache:masterfrom
Open
HBASE-30112 Fix for flapping TestAsyncTableScan#8120haridsv wants to merge 1 commit intoapache:masterfrom
haridsv wants to merge 1 commit intoapache:masterfrom
Conversation
Tool: Claude Code Model: Opus 4.6 (1M context) Prompt: You need to diagnose the HBase test TestAsyncTableScan that is flapping. It had multiple failures and errors in the PR validation build, but none when I ran locally. The source code is at @../../src/apache/hbase/ and the logs for good run is at @Good-Run and bad run is at @bad-run. Can you compare the logs and try to understand why the run failed and look at the code for a potential fix by focussing solely on these logs alone? Summary of the fix by AI: ## Summary **Root cause**: A race condition between span export and test span isolation causes stale `TracedScanResultConsumer#onComplete` spans from a previous test to appear in the current test's span exporter. **The race**: In `TracedScanResultConsumer.onComplete()`, the `TraceUtil.trace()` wrapper calls `delegate.onComplete()` (which sets `finished=true` and `notifyAll()`) **before** calling `span.end()` (which exports the span). The test thread can wake up from `getAll()`, finish the current test, start the next test (which calls `clearSpans()`), and only then does the ForkJoinPool thread call `span.end()` — exporting the stale span into the next test's freshly cleared exporter. **Why it flaps**: The race is timing-dependent. On CI (slower machines), the ForkJoinPool thread is more likely to be preempted between `delegate.onComplete()` and `span.end()`, giving the test thread time to proceed to the next test. Locally (faster machines), the window is too small to hit. **Why only `scan=batch`**: The batch scan mode produces more results (each result is split in two), making each test slightly longer and increasing the probability that the ForkJoinPool thread scheduling creates the right timing for the race. **The fix**: Modified `assertTraceContinuity()` and `assertTraceError()` in `TestAsyncTableScan.java` to include `hasParentSpanId(scanOperationSpanId)` in the matchers used for both filtering and `waitForSpan`. This ensures the assertions only consider spans that belong to the current test's SCAN operation, ignoring any stale spans from previous tests. The `waitForSpan` with the more specific matcher also correctly waits for the current test's span rather than being satisfied by a stale one.
Contributor
Author
|
Bar run logs: bad-run.tar.gz |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Tool: Claude Code
Model: Opus 4.6 (1M context)
Prompt: You need to diagnose the HBase test TestAsyncTableScan that is flapping. It had multiple failures and errors in the PR validation build, but none when I ran locally. The source code is at @../../src/apache/hbase/ and the logs for good run is at @Good-Run and bad run is at @bad-run. Can you compare the logs and try to understand why the run failed and look at the code for a potential fix by focussing solely on these logs alone?
Summary of the fix by AI:
Summary
Root cause: A race condition between span export and test span isolation causes stale
TracedScanResultConsumer#onCompletespans from a previous test to appear in the current test's span exporter.The race: In
TracedScanResultConsumer.onComplete(), theTraceUtil.trace()wrapper callsdelegate.onComplete()(which setsfinished=trueandnotifyAll()) before callingspan.end()(which exports the span). The test thread can wake up fromgetAll(), finish the current test, start the next test (which callsclearSpans()), and only then does the ForkJoinPool thread callspan.end()— exporting the stale span into the next test's freshly cleared exporter.Why it flaps: The race is timing-dependent. On CI (slower machines), the ForkJoinPool thread is more likely to be preempted between
delegate.onComplete()andspan.end(), giving the test thread time to proceed to the next test. Locally (faster machines), the window is too small to hit.Why only
scan=batch: The batch scan mode produces more results (each result is split in two), making each test slightly longer and increasing the probability that the ForkJoinPool thread scheduling creates the right timing for the race.The fix: Modified
assertTraceContinuity()andassertTraceError()inTestAsyncTableScan.javato includehasParentSpanId(scanOperationSpanId)in the matchers used for both filtering andwaitForSpan. This ensures the assertions only consider spans that belong to the current test's SCAN operation, ignoring any stale spans from previous tests. ThewaitForSpanwith the more specific matcher also correctly waits for the current test's span rather than being satisfied by a stale one.