
fix(flow-02): poll node-ready + eRPC /rpc instead of one-shot #479

Closed
bussyjd wants to merge 1 commit into main from fix/release-smoke-flow02-cold-start-polling

Conversation

@bussyjd
Collaborator

@bussyjd bussyjd commented May 12, 2026

Summary

Two cold-start races surfaced on spark1's smoke run:

  1. Node registration race. `obol stack up` returns once the k3s API responds, but the k3d node can take another few seconds to appear in `kubectl get nodes`. The old `run_step_grep "Nodes ready" "Ready" kubectl get nodes` raced that window and FAILed on "No resources found". Switched to `poll_step_grep` with a 12×5s = 60s ceiling. Also tightened the pattern to ` Ready ` (surrounding spaces) so the word "NotReady" in the status column does not satisfy the match.
  2. eRPC upstream-pool warmup race. eRPC's HTTP listener becomes reachable before its upstream pool has fully resolved every alias. A one-shot `GET /rpc` moments after pods report Running often returns a partial list missing `base-sepolia`, which then cascades into the chains-OK and JSON-RPC checks. The fix polls the first eRPC assertion until `base-sepolia` appears (or 60s elapses); the subsequent assertions then have a stable list to reason about.
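
For readers unfamiliar with the pattern, the polling approach in both fixes can be sketched roughly as below. This is a minimal illustration, not the helper from `flows/lib.sh`: the function name `poll_until_grep` and its argument order are hypothetical, and the real `poll_step_grep` may differ in signature and output.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the poll_step_grep helper named above; the real
# helper's signature and messages are not shown in this PR and may differ.
# Usage: poll_until_grep <label> <pattern> <attempts> <delay_seconds> <command...>
poll_until_grep() {
  local label=$1 pattern=$2 attempts=$3 delay=$4
  shift 4
  local i output
  for ((i = 1; i <= attempts; i++)); do
    # Capture stdout and stderr; a failing command counts as one more miss.
    output=$("$@" 2>&1) || true
    if grep -q -- "$pattern" <<<"$output"; then
      echo "PASS: $label (attempt $i/$attempts)"
      return 0
    fi
    # Skip the sleep after the final attempt so the ceiling stays tight.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  echo "FAIL: $label (no match for '$pattern' after $attempts attempts)"
  return 1
}

# Example call mirroring the node-ready step (12 attempts, 5s apart):
#   poll_until_grep "Nodes ready" " Ready " 12 5 kubectl get nodes
```

The spaces around ` Ready ` matter: in the `NotReady` status the letters `Ready` are preceded by `t`, not a space, so the padded pattern only matches a node whose status column reads exactly `Ready`.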

Test plan

  • `bash -n flows/flow-02-stack-init-up.sh` is clean
  • Re-run release-smoke on spark1; expect flow-02 steps 5, 11, 12, 14, 15 to no longer race the cluster bring-up

@bussyjd bussyjd force-pushed the fix/release-smoke-flow02-cold-start-polling branch from 190c716 to b5f7628 Compare May 12, 2026 09:34
bussyjd added a commit that referenced this pull request May 13, 2026
…490)

Integration branch that takes the release-smoke gate from "broken at flow-11 step 43" to 13/13 PASS on spark1 against the production facilitator (Base Sepolia + x402.gcp.obol.tech).

Folds in the in-flight smoke fixes (#476 runner refactor, #477 ERE alternation, #478 wallet check + verifier readiness, #479 flow-02 cold-start polling, #483 sell-inference flag align, #484 frontend digest-pin v0.1.23) plus eight additional root-cause fixes uncovered while driving the gate green:

- internal/x402/setup.go EnsureVerifier rewrites image pins in-memory before kubectl apply so OBOL_DEVELOPMENT=true source changes actually reach the cluster
- internal/x402/chains.go ResolveChainInfo accepts both legacy aliases and CAIP-2 ids
- flows/flow-10-anvil-facilitator.sh drops --prune-history (which was enable-pruning, not retention) and adds --host 0.0.0.0 + cluster-reachability preflight
- internal/defaults/defaults.go combo-form image-pin regex now lists longest first
- flows/lib.sh paid-RPC support (BASE_SEPOLIA_RPC, ALCHEMY_BASE_SEPOLIA_API_KEY) + Bob top-up preflight + secret scrubbing collapsing paid-RPC URLs to TLD-only
- flows/flow-07-sell-verify.sh and flow-08-buy.sh wrap 402-body fetch in 12x5s retry to absorb first-request flake on freshly-deployed verifier
- cmd/obol/network.go redactRPCURL host-anchored against parsed URL (CodeQL fix, no unanchored regex)
- internal/x402/verifier.go drops debug log that leaked user-controlled path (CodeQL log-injection fix)
- .agents/skills/obol-stack-dev rebuilt: 1750 -> 882 lines, 8-row symptom->fix table indexed at the top of SKILL.md
- CLAUDE.md refreshed: stale CLI surface, added six release-smoke pitfalls, generalized personal-path Related Codebases

Validated: RELEASE_SMOKE_INCLUDE_OBOL=true RELEASE_SMOKE_INCLUDE_OBOL_FORK=true bash flows/release-smoke.sh on spark1 against commit 4082961 (and reverified on each subsequent commit) -> 13/13 PASS, RC=0, "Release smoke passed".

Full retrospective: plans/release-smoke-hardening-20260513.md.

Closes #476 #477 #478 #479 #483 #484.
@bussyjd
Collaborator Author

bussyjd commented May 13, 2026

Superseded by #490 (merged in 6f1f9ed). The integration branch carried this change plus the rest of the in-flight smoke fixes and validated the bundle against a green release-smoke (13/13 PASS, RC=0) on spark1 against the production facilitator. Closing in favor of the merged integration.

@bussyjd bussyjd closed this May 13, 2026