
HBASE-30111 Fix for flapper TestBlockBytesScannedQuota by AI#8117

Open
haridsv wants to merge 1 commit into apache:master from haridsv:HBASE-30111

Conversation

Contributor

@haridsv haridsv commented Apr 23, 2026

AI tool: Claude Code
AI Model: Sonnet 4.5 (reverified using Opus 4.6 (1M context))
Prompt: You need to diagnose the HBase test TestBlockBytesScannedQuota that is flapping. It had multiple failures and errors in the PR validation build, but none when I ran locally. The source code is at @../../src/apache/hbase/ and the logs for good run is at @Good-Run and bad run is at @bad-run. Can you compare the logs and try to understand why the run failed and look at the code for a potential fix?

Summary by AI:

The test is flapping due to a timing/race condition in the quota system:

  1. 5-second timeout too short: The testTraffic method only waited 5 seconds for quotas to take effect
  2. Quota cache not fully propagated: On slower systems (like CI), the quota cache refresh can be asynchronous and may not fully propagate in time
  3. Quotas bypassed: When cache isn't refreshed, the logs show "bypass expected false, actual true", meaning all requests succeed instead of being throttled
  4. Insufficient retries: Each iteration takes ~1.3 seconds, so only 3-4 retries fit in 5 seconds, not enough for the quota system to stabilize

Bad run pattern:

  • Test expects 1 successful request but gets 5 (all succeed because quotas not enforced)
  • Retries every ~1.3 seconds for 4 attempts
  • Times out after 5 seconds with "Waiting timed out after [5,000] msec"

Good run pattern:

  • Quotas enforced immediately
  • Tests pass quickly (36.97s total vs 63.14s for failed run)

Increased the timeout in testTraffic() from 5,000ms to 30,000ms (line 263). This gives the quota system sufficient time to:

  • Complete cache refresh
  • Propagate quota settings across all components
  • Handle slower CI environments

This is a conservative fix that maintains the retry logic while allowing adequate time for the distributed quota system to stabilize. The 30-second timeout is still reasonable for a test and should handle the asynchronous nature of quota enforcement.
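The effect of widening the `waitFor` budget can be sketched with a minimal stand-in for the test utility's poll loop (hypothetical code, not the HBase implementation; time scaled down from seconds to milliseconds so it runs fast — the actual fix only changes the first argument at line 263 from `5_000` to `30_000`):

```java
import java.util.function.BooleanSupplier;

public class WaitForSketch {
  // Hypothetical stand-in for the test utility's waitFor: polls `condition`
  // every pollMs until it returns true or timeoutMs elapses.
  static boolean waitFor(long timeoutMs, long pollMs, BooleanSupplier condition)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (condition.getAsBoolean()) {
        return true;
      }
      Thread.sleep(pollMs);
    }
    return false;
  }

  public static void main(String[] args) throws InterruptedException {
    // Simulate a quota cache that only leaves bypass mode after ~800ms:
    // the short budget times out, the longer one succeeds.
    long start = System.currentTimeMillis();
    BooleanSupplier quotaEnforced = () -> System.currentTimeMillis() - start > 800;
    System.out.println("short budget: " + waitFor(500, 100, quotaEnforced));
    System.out.println("long budget: " + waitFor(3000, 100, quotaEnforced));
  }
}
```

With the short budget the condition never becomes true before the deadline, mirroring the CI failure; the longer budget simply gives the same polling loop room to observe the state change.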

Contributor Author

haridsv commented Apr 23, 2026

Here is more detailed reasoning from the AI. Attaching the log files from the bad run.

bad-run.tar.gz

Here are the key pieces of evidence:

1. The timeout error (bad run)

File: bad-run/org.apache.hadoop.hbase.quotas.TestBlockBytesScannedQuota.txt, line 6:

java.lang.AssertionError: Waiting timed out after [5,000] msec

All 3 failures share this same 5-second timeout at TestBlockBytesScannedQuota.java:263.

2. Quota cache stuck in bypass mode (bad run)

File: bad-run/org.apache.hadoop.hbase.quotas.TestBlockBytesScannedQuota-output.txt, repeated pattern starting around timestamp 18:29:34:

Traffic test yielded 5 successful requests. Expected 1 +/- 0
User limiter for user=runner ... not refreshed, bypass expected false, actual true

The test expects only 1 request to succeed (because quotas should throttle), but 5 succeed every time because the limiter is still in bypass=true (no quota enforced). The retry loop hits this same state 4 times in ~4 seconds, then times out at 5 seconds.

3. Compare with good run

File: good-run/org.apache.hadoop.hbase.quotas.TestBlockBytesScannedQuota-output.txt:

Traffic test yielded 1 successful requests. Expected 1 +/- 1

Quota enforcement kicks in immediately — no "not refreshed" messages at all for these assertions.

4. The 5-second timeout in source code

File: TestBlockBytesScannedQuota.java, line 263:

TEST_UTIL.waitFor(5_000, () -> {

5. The retry penalty eats the budget

File: TestBlockBytesScannedQuota.java, lines 274-277:

if (!success) {
    triggerUserCacheRefresh(TEST_UTIL, false, TABLE_NAME);
    waitMinuteQuota();
}

Each failed attempt calls triggerUserCacheRefresh, which in ThrottleQuotaTestUtil.java line 224 does a Thread.sleep(250) plus a waitFor(60000, 250, ...) poll loop, and waitMinuteQuota() (line 316) calls envEdge.incValue(70000). The real wall-clock cost of each retry is ~1.3 seconds (visible in the bad-run log timestamps: 18:29:34, 18:29:35, 18:29:36, 18:29:38), so only 3-4 attempts fit within the 5-second window.
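The retry-budget arithmetic above can be checked with a quick back-of-envelope calculation (hypothetical standalone class, not part of the test):

```java
// How many ~1.3s retry attempts fit in the old 5s budget vs the new 30s budget.
public class RetryBudget {
  public static void main(String[] args) {
    long perAttemptMs = 1_300; // observed: cache-refresh sleep + poll + minute-quota bump
    System.out.println("attempts in 5s budget: " + (5_000 / perAttemptMs));
    System.out.println("attempts in 30s budget: " + (30_000 / perAttemptMs));
  }
}
```

The 5-second budget admits only 3 full attempts, consistent with the 4 partial attempts seen in the bad-run log, while 30 seconds allows over 20.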

On CI (slower), the quota cache takes longer to propagate from bypass to enforcing. The good run on your local machine was fast enough that it resolved on the first attempt.
