Skip to content

fix: surface per-block execution timeouts instead of silently swallowing them#790

Merged
sduchesneau merged 2 commits into
streamingfast:developfrom
YaroShkvorets:fix/swallowed-deadline-exceeded
May 28, 2026
Merged

fix: surface per-block execution timeouts instead of silently swallowing them#790
sduchesneau merged 2 commits into
streamingfast:developfrom
YaroShkvorets:fix/swallowed-deadline-exceeded

Conversation

@YaroShkvorets
Copy link
Copy Markdown
Contributor

@YaroShkvorets YaroShkvorets commented May 27, 2026

Summary

Since 5b3f59b2 if block execution times out (like with non-deterministic eth_call call loop), the substreams engine quietly swallows that timeout error and skips the block as if it doesn't have any output.
This PR should fix it.

Symptom

Running in development mode against tier1 node with --substreams-block-execution-timeout=90s:

❯ substreams run https://github.com/pinax-network/substreams-evm/releases/download/evm-balances-v0.3.3/evm-clickhouse-balances-v0.3.3.spkg db_out -e hyperevm.substreams.pinax.network:443 -s 35129899 -t +2 --limit-processed-blocks 0 --noop-mode
2026-05-27T16:43:48.966-0400 WARN (substreams) noop-mode used without production-mode: server will execute in development mode without sending the data, this is probably not what you want
Connecting...


----------- BLOCK #35,129,899 (8c85da1b36f703305b32e4d5128b7582c749c3b22decac5b4e4359ddde13d8df) age=307h6m26.769813s ---------------
----------- BLOCK #35,129,900 (7fe17e72c0412fb100795cfc9de4567949755e085c635354ee3d873a758cff1d) age=307h8m1.02479s ---------------
📊 Usage Report
 • Egress Bytes (uncompressed): 638 B
 • Processed Blocks: 1 blocks
 • Received Blocks: 2 blocks
Completed successfully

  • Tier1 logs flood with retrying RPCCall on non-deterministic RPC call error for ~90s on a stuck eth_call.
  • After 90s, INFO (rpc-cache) stopping rpc calls here, context is canceled fires (rpccalls.go:467) — proving the per-block ctx hit its deadline.
  • 1 second later, the request successfully ends with substreams request stats ... "error": "". The client sees Completed successfully. Egress stats show Processed Blocks: 1, Received Blocks: 2 — one block silently dropped.

Root cause

Two combined bugs in pipeline/process_block.go:

1. Dead code in recoverExecutionPanic:

// If recovering from context cancel, simply return the execution error as-is
if ctx.Err() != nil || errors.Is(recoveredErr, context.Canceled) {
    return executionError
}

// If the context deadline was exceeded, return a deadline exceeded error RPC error
if ctx.Err() == context.DeadlineExceeded {            // ← unreachable
    return connect.NewError(connect.CodeDeadlineExceeded, ...)
}

The catch-all ctx.Err() != nil matches DeadlineExceeded, so the deadline-specific branch never fires. Function returns executionError (typically nil) → executeModules returns nil → block silently skipped.

2. Broken error chain in Pipeline.execute's re-panic:

panic(fmt.Errorf("unknown error: %s", r))

Uses %s instead of %w, severing the wrap chain. Even if the dead code were fixed, downstream callers couldn't errors.Is(err, context.DeadlineExceeded) on the result.

Path through the code (for wasmtime)

  1. Per-block ctx fires at 90s.
  2. rpccalls.go:476 returns "timeout while doing eth_call ... cause: %w" wrapping context.DeadlineExceeded.
  3. wasm/wasmtime/instance.go:49 panic(fmt.Errorf("running wasm extension ...: %w", err)).
  4. wasmtime propagates the panic as a Go panic through entrypoint.Call() (no recover in wasm/wasmtime/module.go).
  5. Propagates through wasmCallmapexec.runRunModule to Pipeline.execute's deferred recover.
  6. Doesn't match ErrWasmDeterministicExec / ErrFoundationalStoreCanceled → re-panic with broken %s wrap.
  7. Caught by executeModules's deferred recover → recoverExecutionPanic(ctx, nil, ...).
  8. ctx.Err() != nil (DeadlineExceeded) → returns nil.
  9. Pipeline marches on. Once block.Number > stopBlockNum, handleStepNew returns io.EOF → stream "succeeds".

Wazero is unaffected: it converts host-function panics to error returns at f.Call(), so the recover path isn't entered.

Fix

  • recoverExecutionPanic: check errors.Is(ctx.Err(), context.DeadlineExceeded) before the catch-all so deadline-exceeded gets the intended CodeDeadlineExceeded connect error.
  • Pipeline.execute's re-panic: wrap with %w so the chain is preserved for any callers that inspect it.

Bug introduced

Commit 5b3f59b (2025-10-21, "Improved panic handling and make clearer the intent when throwing errors in the WASM handlers") added ctx.Err() != nil || to a previously-correct condition. Present in v1.17.0 and every release since.

Test plan

  • go vet ./pipeline/... clean
  • go test -short ./pipeline/... passes
  • Manual repro: stuck eth_call on a non-deterministic error (e.g. Reth StackUnderflow) with --substreams-block-execution-timeout=90s. Before this PR: client sees Completed successfully with Processed: N-1 / Received: N. After: client sees DeadlineExceeded gRPC error with execution timed out at block ....

🤖 Generated with Claude Code

YaroShkvorets and others added 2 commits May 27, 2026 16:29
…ing them

When a WASM host function (e.g. eth_call) panics in wasmtime because the
per-block execution context's deadline has fired, the panic propagates up
to Pipeline.execute's deferred recover, which re-panics with "unknown
error: %s" (chain broken) up to executeModules's deferred recover, which
calls recoverExecutionPanic.

recoverExecutionPanic had two bugs that combined to silently drop the
offending block:

1. The DeadlineExceeded handler was dead code: the catch-all
   `if ctx.Err() != nil || errors.Is(recoveredErr, context.Canceled)`
   above it matched DeadlineExceeded too, so the deadline-specific branch
   below was never reached. The function would return `executionError`
   (typically nil), making executeModules return nil and the pipeline
   move on to the next block as if nothing happened.

2. Pipeline.execute's re-panic used `%s` instead of `%w` when wrapping
   the recovered error, severing the wrap chain so callers couldn't
   `errors.Is(err, context.DeadlineExceeded)` even when the swallow
   condition was avoided.

Fix:
- Check `errors.Is(ctx.Err(), context.DeadlineExceeded)` first so the
  intended `CodeDeadlineExceeded` connect error is returned. The
  catch-all for Canceled remains below.
- Wrap the recovered error with `%w` on re-panic so the deadline
  information survives all the way to the gRPC error mapper.

Bug introduced in commit 5b3f59b (2025-10-21, "Improved panic handling
and make clearer the intent when throwing errors in the WASM handlers"),
present in v1.17.0 onward. Affects wasmtime (default runtime for
Rust-built spkgs); wazero is unaffected because it converts host panics
to error returns at f.Call(), bypassing the recover path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Five focused unit tests covering each branch of recoverExecutionPanic.
The deadline-exceeded test asserts:

  - result is non-nil (catches the silent-swallow regression directly)
  - result is a connect.Error with CodeDeadlineExceeded
  - the original context.DeadlineExceeded survives the wrap chain so
    callers can errors.Is() on it
  - the block ref is in the message

Verified: this test fails on the pre-fix code with the exact assertion
"deadline-exceeded panic must not be silently swallowed", and passes
with the fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@YaroShkvorets YaroShkvorets marked this pull request as ready for review May 27, 2026 20:49
Copy link
Copy Markdown
Contributor

@sduchesneau sduchesneau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! thanks

@sduchesneau sduchesneau merged commit 5a7f57b into streamingfast:develop May 28, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants