HBASE-30111 Fix for flapper TestBlockBytesScannedQuota by AI#8117
HBASE-30111 Fix for flapper TestBlockBytesScannedQuota by AI#8117haridsv wants to merge 1 commit intoapache:masterfrom
Conversation
AI tool: Claude Code AI Model: Sonnet 4.5 (reverified using Opus 4.6 (1M context)) Prompt: You need to diagnose the HBase test TestBlockBytesScannedQuota that is flapping. It had multiple failures and errors in the PR validation build, but none when I ran locally. The source code is at @../../src/apache/hbase/ and the logs for good run is at @Good-Run and bad run is at @bad-run. Can you compare the logs and try to understand why the run failed and look at the code for a potential fix? Summary by AI: The test is flapping due to a **timing/race condition** in the quota system: 1. **5-second timeout too short**: The `testTraffic` method only waited 5 seconds for quotas to take effect 2. **Quota cache not fully propagated**: On slower systems (like CI), the quota cache refresh can be asynchronous and may not fully propagate in time 3. **Quotas bypassed**: When cache isn't refreshed, the logs show `"bypass expected false, actual true"`, meaning all requests succeed instead of being throttled 4. **Insufficient retries**: Each iteration takes ~1.3 seconds, so only 3-4 retries fit in 5 seconds, not enough for the quota system to stabilize **Bad run pattern:** - Test expects 1 successful request but gets 5 (all succeed because quotas not enforced) - Retries every ~1.3 seconds for 4 attempts - Times out after 5 seconds with "Waiting timed out after [5,000] msec" **Good run pattern:** - Quotas enforced immediately - Tests pass quickly (36.97s total vs 63.14s for failed run) Increased the timeout in `testTraffic()` from **5,000ms to 30,000ms** (line 263). This gives the quota system sufficient time to: - Complete cache refresh - Propagate quota settings across all components - Handle slower CI environments This is a conservative fix that maintains the retry logic while allowing adequate time for the distributed quota system to stabilize. The 30-second timeout is still reasonable for a test and should handle the asynchronous nature of quota enforcement.
|
Here is a more detailed reasoning by AI. Attaching the log files from bad run. Here are the key pieces of evidence: 1. The timeout error (bad run)File: All 3 failures share this same 5-second timeout at 2. Quota cache stuck in bypass mode (bad run)File: The test expects only 1 request to succeed (because quotas should throttle), but 5 succeed every time because the limiter is still in 3. Compare with good runFile: Quota enforcement kicks in immediately — no "not refreshed" messages at all for these assertions. 4. The 5-second timeout in source codeFile: TEST_UTIL.waitFor(5_000, () -> {5. The retry penalty eats the budgetFile: if (!success) {
triggerUserCacheRefresh(TEST_UTIL, false, TABLE_NAME);
waitMinuteQuota();
}Each failed attempt calls On CI (slower), the quota cache takes longer to propagate from bypass to enforcing. The good run on your local machine was fast enough that it resolved on the first attempt. |
AI tool: Claude Code
AI Model: Sonnet 4.5 (reverified using Opus 4.6 (1M context))
Prompt: You need to diagnose the HBase test TestBlockBytesScannedQuota that is flapping. It had multiple failures and errors in the PR validation build, but none when I ran locally. The source code is at @../../src/apache/hbase/ and the logs for good run is at @Good-Run and bad run is at @bad-run. Can you compare the logs and try to understand why the run failed and look at the code for a potential fix?
Summary by AI:
The test is flapping due to a timing/race condition in the quota system:
testTrafficmethod only waited 5 seconds for quotas to take effect"bypass expected false, actual true", meaning all requests succeed instead of being throttledBad run pattern:
Good run pattern:
Increased the timeout in
testTraffic()from 5,000ms to 30,000ms (line 263). This gives the quota system sufficient time to:This is a conservative fix that maintains the retry logic while allowing adequate time for the distributed quota system to stabilize. The 30-second timeout is still reasonable for a test and should handle the asynchronous nature of quota enforcement.