test(flow-08): tighten buy-side correctness assertions #466
Smoke evidence: assertion fires exactly as designed
Ran ``` The default
Comparison vs main (same .env, same host)
Cascade after [6] (expected, derivative, not new findings)
Once [6] fails, the next four checks in flow-08 inevitably fail because they all assume a properly funded buyer: ```
All four are noise around the same root cause. Optionally we could short-circuit after [6] to keep the artifact cleaner, but I'd leave them as-is: they're useful diagnostic context if [6] passes someday and one of them still fails.
Other smoke notes
Suggested merge order
Follow-up: same 503 symptom on flow-08; root cause is Anvil staleness
Reproduced the flow-11 step [43]
Root cause
Facilitator ( The facilitator's
Fix (verified)
Restarted Anvil + facilitator with a fresh fork (new base block
The 3 remaining FAILs (steps 8/11/16) are stale-state artifacts from the first run's auths still being in the sidecar pool, not flow correctness issues.
Suggested follow-up (separate PR, not this one)
flow-10 currently reuses a running Anvil if port 8545 is bound. The reuse path is what put us in the pruned-state window. Options:
Also worth a tickbox:
842473e to e35d872
… review
Three correctness gaps surfaced by an audit against the named payment
invariants in references/live-obol-qa.md and references/paid-commerce.md:
1. Buyer-wallet invariant. flow-08 previously funded whatever wallet the
default obol-agent happened to generate at stack init — the exact
"do not fund a generated signer" anti-pattern named in the live-OBOL
QA reference. Now derives the deterministic Bob address from
.env REMOTE_SIGNER_PRIVATE_KEY (the canonical keccak-of-abi-encode
pattern used by flow-11/13/14) and asserts AGENT_WALLET == BOB_WALLET
before funding. The flow header documents the upstream pre-seed
requirement.
2. Exact balance deltas. Replaces "seller balance increased" + missing
buyer-side check with strict pre/post deltas on both sides:
post_seller - pre_seller == PAID_AMOUNT AND
pre_buyer - post_buyer == PAID_AMOUNT. Also removes a swallowed-
failure else-branch that emitted `pass` when the seller balance had
neither increased nor stayed equal (i.e. decrease was reported as
pass).
3. Decouple paid-inference correctness from model wording. The pre-
existing assertion required the model to return the verbatim string
"USDC payment smoke test passed." Replaced with a structural check:
HTTP 200 + non-empty TEXT. The verbatim match is kept as a separate
informational `pass` line. Aligns with paid-commerce.md ("do not
rely on agent wording").
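The exact-delta rule in item 2 can be sketched as a small helper. This is a minimal illustration, not the code in flow-08: `assert_exact_delta` and the sample balances are hypothetical, and it assumes balances fit in bash's 64-bit integer arithmetic.

```shell
#!/usr/bin/env bash
# Sketch only: assert_exact_delta and the balances below are hypothetical,
# not copied from flows/flow-08-buy.sh.

# Succeeds only when (after - before) equals the expected amount exactly.
assert_exact_delta() {
  local label=$1 before=$2 after=$3 expected=$4
  local delta=$(( after - before ))
  if [ "$delta" -eq "$expected" ]; then
    echo "pass: $label delta == $expected"
  else
    echo "fail: $label delta $delta != $expected"
    return 1
  fi
}

PAID_AMOUNT=1000                    # micro-USDC, assumed per-request price
PRE_SELLER_BAL=0
POST_SELLER_BAL=1000
PRE_BUYER_BAL=1000000000
POST_BUYER_BAL=999999000

# Seller gains exactly PAID_AMOUNT: post_seller - pre_seller.
seller_out=$(assert_exact_delta seller "$PRE_SELLER_BAL" "$POST_SELLER_BAL" "$PAID_AMOUNT")
# Buyer loses exactly PAID_AMOUNT: pre_buyer - post_buyer, so the arguments are swapped.
buyer_out=$(assert_exact_delta buyer "$POST_BUYER_BAL" "$PRE_BUYER_BAL" "$PAID_AMOUNT")
echo "$seller_out"
echo "$buyer_out"
```

Because there is no catch-all branch, a decreased seller balance or an over-charged buyer returns non-zero instead of being reported as `pass`.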
Secondary correctness tightenings rolled in:
- Fail-fast on empty PAID_AMOUNT from the 402 body (previously silent;
only surfaced much later at the settlement-receipt step).
- Master-key read failure now `emit_metrics; exit 1` instead of
continuing with an empty bearer token.
- x402-buyer auth-pool assertion now requires the exact expected count
(EXPECTED_AUTHS, derived from BUY_BUDGET_USDC / per-request price)
instead of the loose `remaining=[1-9]` (single-digit) pattern.
- New post-call step asserts remaining decremented by exactly 1 — the
spend-proof half of the sidecar contract.
- Anvil funding poll regex broadened from exact `^1000000000 ` to
`^[1-9][0-9]{8,} ` so a re-run with pre-existing balance doesn't
fail the poll.
The unused BUY_AUTH_COUNT=5 declaration is removed; the same value is
now derived and asserted via EXPECTED_AUTHS.
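To illustrate why the exact-count assertion is stricter than the old loose pattern, here is a hedged sketch. The budget and price values, their micro-USDC unit, the `/status` excerpts, and both helper functions are assumptions made for the example; only the names `BUY_BUDGET_USDC`, `PAID_AMOUNT`, and `EXPECTED_AUTHS` come from the commit message.

```shell
#!/usr/bin/env bash
BUY_BUDGET_USDC=5000   # assumed unit: micro-USDC
PAID_AMOUNT=1000       # assumed per-request price in micro-USDC
EXPECTED_AUTHS=$(( BUY_BUDGET_USDC / PAID_AMOUNT ))

# Exact check: the derived count must be the whole number after "remaining=".
has_exact_remaining() {
  printf '%s\n' "$1" | grep -qE "remaining=${EXPECTED_AUTHS}([^0-9]|\$)"
}
# Loose check, in the spirit of the old assertion: any count starting with 1-9 passes.
has_loose_remaining() {
  printf '%s\n' "$1" | grep -qE 'remaining=[1-9]'
}

short_pool='remaining=3'   # hypothetical /status excerpt with too few auths
full_pool='remaining=5'    # hypothetical /status excerpt with the expected count

has_loose_remaining "$short_pool" && loose_short=accepted || loose_short=rejected
has_exact_remaining "$short_pool" && exact_short=accepted || exact_short=rejected
has_exact_remaining "$full_pool"  && exact_full=accepted  || exact_full=rejected
echo "EXPECTED_AUTHS=$EXPECTED_AUTHS loose/short=$loose_short exact/short=$exact_short exact/full=$exact_full"
```

The loose pattern accepts a pool of 3 even though 5 authorizations were expected; the exact pattern rejects it.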
… race)
Three independent bugs were keeping flow-08 red against the new buyer-wallet
invariants — fixing all three takes the buy-side smoke from 10/16 to 16/16:
1. Drop publicnode.com from the Base Sepolia fork-RPC candidates and lead
with archive-capable endpoints (drpc, sepolia.base.org, tenderly,
onfinality, sentio, pocket). publicnode is non-archive, so once the
Anvil fork drifted past its retention window the facilitator's
`eth_getStorageAt` for USDC balances returned `state at block #N is
pruned` and every paid call failed with `Payment verification failed`.
List source: chainlist.org/rpcs.json, archive-tested against USDC.
2. Switch run_step_grep / poll_step_grep from `grep -q` (BRE) to
`grep -qE` (ERE). Step [8]'s pattern `^[1-9][0-9]{8,} ` uses an ERE
quantifier; under BRE the braces are literal and the pattern can never
match, so the step was silently timing out for 120s on every run even
though the underlying `cast call balanceOf` was returning the expected
1e9 USDC value. Plain-substring callers (step [11], etc.) are not
affected.
3. Wrap step [16] (`x402-buyer auth pool decremented by 1`) in
poll_step_grep. The buyer sidecar persists the spent-auth state
asynchronously after the upstream returns, so a one-shot read of
/status could still report `remaining=EXPECTED_AUTHS` for a few
seconds even when settlement, the on-chain Transfer, and the
buyer/seller balance deltas had already cleared.
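The BRE/ERE difference in item 2 is easy to reproduce in isolation. The sample line below is an illustrative stand-in for `cast call balanceOf` output, not captured from a real run.

```shell
#!/usr/bin/env bash
sample='1000000000 [1e9]'     # illustrative balance line, assumed format
pattern='^[1-9][0-9]{8,} '

# BRE (plain grep): the braces are literal characters, so this cannot match.
if printf '%s\n' "$sample" | grep -q "$pattern"; then bre=match; else bre=no-match; fi
# ERE (grep -E): {8,} means "eight or more of the preceding item".
if printf '%s\n' "$sample" | grep -qE "$pattern"; then ere=match; else ere=no-match; fi

echo "BRE: $bre, ERE: $ere"   # prints "BRE: no-match, ERE: match"
```

Under BRE the interval would have to be written `\{8,\}`, which is why the unescaped pattern timed out silently rather than raising an error.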
Also silence Foundry's nightly-build stderr warning globally for flow
runs (FOUNDRY_DISABLE_NIGHTLY_WARNING=1). Nightly is what we want for
Base Sepolia archive-lookup support, but its per-invocation warning
contaminated cast output and triggered exactly the kind of pattern-match
false-FAIL that (2) above was already vulnerable to.
Skill update: add a step-6 rule to the obol-stack-dev skill that
dev-branch work must use OBOL_DEVELOPMENT=true on obolup.sh and
obol stack up — without it, the installer pulls the latest release
binary and local branch changes are never exercised.
Smoke evidence: clean flow-08 run on this branch posts 16/16 PASS with
on-chain settlement tx 0x8da4bc3990853fce60b942fd6bc435ed0c373cdb44b228c8e5dea92a83da75b8,
exact ±1000 micro-USDC deltas on buyer and seller, and the sidecar
correctly decremented to remaining=4.
…grep -E
Pull the lessons from the flow-08 green-up into the obol-stack-dev skill so future sessions don't rediscover them:
- paid-commerce.md: Anvil must be nightly (stable lags ~5mo behind on Base Sepolia archive lookups); fork-RPC must be archive (publicnode is out, drpc/base/tenderly/onfinality/sentio/pocket are in); long-lived Anvil drifts past upstream retention; FOUNDRY_DISABLE_NIGHTLY_WARNING=1 is load-bearing; poll_step_grep / run_step_grep use grep -E so ERE quantifiers work; sidecar /status is asynchronously consistent with the spent-auth count.
- dev-environment.md: the OBOL_DEVELOPMENT=true obolup wrapper is `go run` and its per-invocation rebuild trips short port-forward polls; build a real binary into .workspace/bin/obol before running flows. Foundry isn't managed by obolup; install nightly via foundryup.
- troubleshooting.md: three new entries with concrete diagnoses and fix commands: facilitator "state pruned" 503, the silent ERE-quantifier pattern timeout, and the PurchaseRequest tombstone-cleanup ritual when the controller's finalizer doesn't fire.
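Since sidecar /status only becomes consistent with the spent-auth count after a delay, a one-shot read can miss the decrement. A minimal polling sketch follows; `poll_until_match` is a hypothetical stand-in (the real `poll_step_grep` signature is not shown in this PR), and `fake_status` simulates a reader that returns the stale count twice before the decremented one.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command until its output matches an ERE pattern.
poll_until_match() {
  local pattern=$1 attempts=$2 delay=$3
  shift 3
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@" 2>/dev/null | grep -qE "$pattern"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Simulated /status reader. It runs in a pipeline subshell, so the call
# count is kept in a temp file rather than a shell variable.
calls_file=$(mktemp)
fake_status() {
  echo x >> "$calls_file"
  if [ "$(( $(wc -l < "$calls_file") ))" -lt 3 ]; then
    echo 'remaining=5'     # stale: spent auth not persisted yet
  else
    echo 'remaining=4'     # decremented by exactly 1
  fi
}

if poll_until_match 'remaining=4([^0-9]|$)' 5 0.1 fake_status; then
  poll_result=pass
else
  poll_result=fail
fi
calls=$(( $(wc -l < "$calls_file") ))
rm -f "$calls_file"
echo "poll_result=$poll_result after $calls reads"
```

The pattern is passed to `grep -E`, matching the ERE switch described above, and the bounded attempt count keeps a genuinely stuck sidecar from hanging the flow.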
d8fa369 to b011fd2
Summary
A specialist audit of `flows/flow-08-buy.sh` against the named payment invariants in `.claude/skills/obol-stack-dev/references/live-obol-qa.md` and `references/paid-commerce.md` surfaced three correctness gaps. This PR addresses all three plus secondary precision issues from the same review.

1. Buyer-wallet invariant: `L211–217` (was)
flow-08 previously funded whatever wallet `obol agent wallet list obol-agent` returned. If `obol stack up` generated a random agent wallet, flow-08 happily funded that with `anvil_setStorageAt` and the test passed: exactly the "do not fund a generated signer to make the test pass" anti-pattern named in `live-obol-qa.md`.
Now derives the canonical Bob address from `.env REMOTE_SIGNER_PRIVATE_KEY` using the keccak-of-abi-encode pattern that `flow-11-dual-stack.sh` already uses (line 794), and asserts `AGENT_WALLET == BOB_WALLET` before funding.

2. Exact balance deltas: `L300–330` (was)
The old assertion was "seller balance increased" (post > pre), with no buyer-side check and a swallowed-failure else-branch at L322–323 that emitted `pass` even when the seller balance decreased. Now both sides are checked strictly:
- `post_seller - pre_seller == PAID_AMOUNT`
- `pre_buyer - post_buyer == PAID_AMOUNT`
with no catch-all `pass`. Adds a `PRE_BUYER_BAL` capture next to the existing `PRE_SELLER_BAL`.

3. Decouple paid-inference correctness from verbatim model wording: `L274–281` (was)
The old check required the model to return the literal string `"USDC payment smoke test passed."`. Payment correctness should not depend on the model's instruction-following (`paid-commerce.md`: "Do not rely on agent wording"). Replaced with a structural assertion (HTTP 200 + non-empty `TEXT`). The verbatim match is preserved as a separate informational `pass` line.

Secondary fixes rolled in
- Fail-fast on empty `PAID_AMOUNT` parse (was silent).
- `LITELLM_MASTER_KEY` empty now `emit_metrics; exit 1` instead of continuing with empty bearer token.
- x402-buyer auth-pool assertion now requires exact `remaining=$EXPECTED_AUTHS` instead of loose `remaining=[1-9]`.
- New post-call step asserts `remaining` decremented by exactly 1 after the paid call.
- Anvil funding poll regex broadened from `^1000000000` to `^[1-9][0-9]{8,}` so a re-run with pre-existing balance doesn't fail.
- `BUY_AUTH_COUNT=5` removed; now derived as `EXPECTED_AUTHS` and actually asserted.

Test plan
- `bash -n flows/flow-08-buy.sh`: syntax clean
- Release smoke (`flows/release-smoke.sh`) on spark1: currently in flight against main as tmux qa-release-20260511-193603. Will re-run against this branch and report.
- Where `obol-agent` was created with `--private-key-file <bob-derived>`: confirm flow-08 passes end-to-end with exact deltas and the new sidecar-decrement step.
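
For reviewers who want the buyer-wallet invariant from section 1 in runnable form, the equality gate could look roughly like this. It is a sketch under assumptions: the addresses are made up, `same_address` is a hypothetical helper, and the case-insensitive comparison (so checksummed and lowercase hex compare equal) is an illustrative choice rather than something this PR states. Deriving `BOB_WALLET` from `REMOTE_SIGNER_PRIVATE_KEY` is left out because it needs Foundry.

```shell
#!/usr/bin/env bash
# Hypothetical helper: true when both arguments are the same non-empty address,
# ignoring hex letter case.
same_address() {
  local a b
  a=$(printf '%s' "$1" | tr 'A-F' 'a-f')
  b=$(printf '%s' "$2" | tr 'A-F' 'a-f')
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Made-up addresses for illustration only.
BOB_WALLET='0xAbCdEf0000000000000000000000000000000001'
AGENT_WALLET='0xabcdef0000000000000000000000000000000001'

if same_address "$AGENT_WALLET" "$BOB_WALLET"; then
  wallet_check="pass: AGENT_WALLET == BOB_WALLET"
else
  # In a real flow this branch would stop before any funding call.
  wallet_check="fail: agent wallet is not the deterministic Bob address"
fi
echo "$wallet_check"
```

Rejecting empty input matters here: if the wallet lookup fails and both variables are empty, a plain string comparison would pass and the flow would go on to fund nothing in particular.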