HBASE-30112 Fix for flapping TestAsyncTableScan by haridsv · Pull Request #8120 · apache/hbase

haridsv · 2026-04-23T14:34:12Z

Tool: Claude Code
Model: Opus 4.6 (1M context)
Prompt: You need to diagnose the HBase test TestAsyncTableScan that is flapping. It had multiple failures and errors in the PR validation build, but none when I ran locally. The source code is at @../../src/apache/hbase/ and the logs for good run is at @Good-Run and bad run is at @bad-run. Can you compare the logs and try to understand why the run failed and look at the code for a potential fix by focussing solely on these logs alone?

Summary of the fix by AI:

Summary

Root cause: A race condition between span export and test span isolation causes stale TracedScanResultConsumer#onComplete spans from a previous test to appear in the current test's span exporter.

The race: In TracedScanResultConsumer.onComplete(), the TraceUtil.trace() wrapper calls delegate.onComplete() (which sets finished=true and calls notifyAll()) before calling span.end() (which exports the span). The test thread can therefore wake up from getAll(), finish the current test, and start the next test (which calls clearSpans()) before the ForkJoinPool thread calls span.end(). When that happens, the stale span is exported into the next test's freshly cleared exporter.
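To make the ordering concrete, here is a minimal sketch that replays the interleaving described above with plain JDK threads. It does not use HBase or OpenTelemetry: the `exportedSpans` list, the latches, and the class name are hypothetical stand-ins, and the latches force the preemption window that only happens occasionally on CI.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

class StaleSpanRaceSketch {
  // Hypothetical stand-in for the in-memory span exporter shared across tests.
  static final List<String> exportedSpans = new CopyOnWriteArrayList<>();

  /** Replays the interleaving and returns what the "next test" sees in the exporter. */
  static List<String> replay() {
    exportedSpans.clear();
    // Models delegate.onComplete(): finished=true plus notifyAll().
    CountDownLatch scanFinished = new CountDownLatch(1);
    // Models the next test having already called clearSpans().
    CountDownLatch nextTestCleared = new CountDownLatch(1);

    Thread callbackThread = new Thread(() -> {
      scanFinished.countDown(); // step 1: wake the test thread
      try {
        nextTestCleared.await(); // forced "preemption" between step 1 and step 2
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      // step 2: models span.end(), which exports the span too late
      exportedSpans.add("test1:TracedScanResultConsumer#onComplete");
    });
    callbackThread.start();

    try {
      scanFinished.await();        // test 1 returns from getAll() and finishes
      exportedSpans.clear();       // test 2 starts and clears the exporter
      nextTestCleared.countDown(); // only now does the callback thread resume
      callbackThread.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while replaying the interleaving", e);
    }
    return new ArrayList<>(exportedSpans);
  }

  public static void main(String[] args) {
    // prints [test1:TracedScanResultConsumer#onComplete]
    System.out.println(replay());
  }
}
```

The span recorded for the first test survives the second test's clear because it is exported after the clear, which is the state the failing assertions ran against.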

Why it flaps: The race is timing-dependent. On CI (slower machines), the ForkJoinPool thread is more likely to be preempted between delegate.onComplete() and span.end(), giving the test thread time to proceed to the next test. Locally (faster machines), the window is too small to hit.

Why only scan=batch: The batch scan mode produces more results (each result is split in two), making each test slightly longer and increasing the probability that the ForkJoinPool thread scheduling creates the right timing for the race.

The fix: Modified assertTraceContinuity() and assertTraceError() in TestAsyncTableScan.java to include hasParentSpanId(scanOperationSpanId) in the matchers used for both filtering and waitForSpan. This ensures the assertions only consider spans that belong to the current test's SCAN operation, ignoring any stale spans from previous tests. The waitForSpan with the more specific matcher also correctly waits for the current test's span rather than being satisfied by a stale one.
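The effect of the extra matcher can be illustrated with a small, dependency-free sketch. The `SpanRecord` class and the `hasName` / `hasParentSpanId` predicates below are hypothetical stand-ins written for this example; they are not the Hamcrest matchers or the OpenTelemetry span type used in the actual test, and the span IDs are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class ParentScopedSpanFilterSketch {
  /** Hypothetical stand-in for an exported span. */
  static final class SpanRecord {
    final String name;
    final String parentSpanId;

    SpanRecord(String name, String parentSpanId) {
      this.name = name;
      this.parentSpanId = parentSpanId;
    }
  }

  static final String ON_COMPLETE = "TracedScanResultConsumer#onComplete";

  // Plain predicates that mirror the intent of the matchers named in the summary.
  static Predicate<SpanRecord> hasName(String name) {
    return span -> span.name.equals(name);
  }

  static Predicate<SpanRecord> hasParentSpanId(String parentSpanId) {
    return span -> parentSpanId.equals(span.parentSpanId);
  }

  /** Exporter contents after the race: one stale span plus the current test's span. */
  static List<SpanRecord> exporterContents() {
    List<SpanRecord> spans = new ArrayList<>();
    spans.add(new SpanRecord(ON_COMPLETE, "scan-span-test1")); // stale, from the previous test
    spans.add(new SpanRecord(ON_COMPLETE, "scan-span-test2")); // belongs to the current test
    return spans;
  }

  static List<SpanRecord> select(List<SpanRecord> spans, Predicate<SpanRecord> matcher) {
    return spans.stream().filter(matcher).collect(Collectors.toList());
  }

  /** Name-only filtering also picks up the stale span. */
  static int countByNameOnly() {
    return select(exporterContents(), hasName(ON_COMPLETE)).size();
  }

  /** Adding the parent constraint keeps only spans under the current SCAN span. */
  static int countByNameAndParent(String scanOperationSpanId) {
    return select(exporterContents(),
        hasName(ON_COMPLETE).and(hasParentSpanId(scanOperationSpanId))).size();
  }

  public static void main(String[] args) {
    System.out.println(countByNameOnly());                       // prints 2
    System.out.println(countByNameAndParent("scan-span-test2")); // prints 1
  }
}
```

The same reasoning applies to waiting: a wait condition keyed on the name alone is already satisfied by the stale span, whereas one that also requires the current parent span ID is only satisfied once the current test's span has been exported.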


haridsv · 2026-04-23T14:38:54Z

Bad run logs: bad-run.tar.gz
Full chat history: chat-history.md

haridsv wants to merge 1 commit into apache:master from haridsv:HBASE-30112
