feat(checkpoint): wire checkpointing into agent event loop by JackYPCOnline · Pull Request #2190 · strands-agents/sdk-python

JackYPCOnline · 2026-04-22T20:38:45Z

Description

Wires the Checkpoint data model (landed in #2181) into the agent runtime so an opt-in checkpointing=True agent pauses at ReAct cycle boundaries and emits a serializable pause-point marker. The marker travels through AgentResult.checkpoint and is passed back to a fresh agent through a checkpointResume block. The design mirrors the existing interrupt pattern: stop_reason="checkpoint" to signal the pause, content-block resume, no new method surface to learn.

A Checkpoint is a position marker (which boundary fired and which cycle index), not a state snapshot. State persistence is the caller's responsibility. The recommended pairing is a SessionManager for state continuity plus checkpointing=True for boundary signalling. Callers who want to own persistence themselves can put state in Checkpoint.snapshot or Checkpoint.app_data.
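To make the "position marker, not state snapshot" distinction concrete, here is a minimal sketch of what such a marker could look like. `MarkerSketch` is a hypothetical stand-in written for illustration, not the SDK's `Checkpoint` class; only the field names `position`, `cycle_index`, and `app_data` and the two boundary names are taken from this PR, and the real schema may differ.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Literal

# Hypothetical stand-in for illustration only; not the SDK's Checkpoint class.
Position = Literal["after_model", "after_tools"]


@dataclass(frozen=True)
class MarkerSketch:
    position: Position  # which cycle boundary fired
    cycle_index: int  # which ReAct cycle the pause belongs to
    app_data: dict[str, Any] = field(default_factory=dict)  # optional caller-owned payload

    def to_dict(self) -> dict[str, Any]:
        """Return a plain dict suitable for JSON encoding."""
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "MarkerSketch":
        """Rebuild the marker from a previously serialized dict."""
        return cls(
            position=data["position"],
            cycle_index=data["cycle_index"],
            app_data=dict(data.get("app_data", {})),
        )


marker = MarkerSketch(position="after_tools", cycle_index=2)
restored = MarkerSketch.from_dict(marker.to_dict())
assert restored == marker  # the marker survives a dict round trip
```

Note that the marker carries no messages or tool results. Under the recommended pairing, that state lives in the SessionManager.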

User-facing API (zero breaking changes — opt-in only):

import asyncio

from strands import Agent
from strands.session import FileSessionManager


async def main() -> None:
    agent = Agent(
        tools=[...],
        session_manager=FileSessionManager(session_id="run-1", storage_dir="..."),
        checkpointing=True,
    )

    result = await agent.invoke_async("do the thing")

    # Each pause returns stop_reason == "checkpoint"; loop until the run completes.
    while result.stop_reason == "checkpoint":
        # save_somewhere / load_somewhere are caller-supplied persistence helpers.
        save_somewhere(result.checkpoint.to_dict())
        # later, possibly in a fresh process / activity:
        fresh = Agent(
            tools=[...],
            session_manager=FileSessionManager(session_id="run-1", storage_dir="..."),
            checkpointing=True,
        )
        result = await fresh.invoke_async(
            {"checkpointResume": {"checkpoint": load_somewhere()}}
        )

    print(result.message)  # stop_reason == "end_turn"


asyncio.run(main())
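The `save_somewhere` and `load_somewhere` placeholders in the snippet above are left to the caller. One possible shape is sketched below, assuming the dict returned by `to_dict()` is JSON-serializable and that a local file is an acceptable store. The helper names `save_checkpoint` and `load_checkpoint` and the file layout are hypothetical and not part of the SDK.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def save_checkpoint(path: Path, checkpoint_dict: dict[str, Any]) -> None:
    """Persist the marker by writing a temp file, then renaming it into place."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(checkpoint_dict), encoding="utf-8")
    # Path.replace overwrites the target, so a crash mid-write leaves the old file intact.
    tmp.replace(path)


def load_checkpoint(path: Path) -> dict[str, Any]:
    """Read back the dict that save_checkpoint stored."""
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "run-1.checkpoint.json"
    save_checkpoint(target, {"position": "after_tools", "cycle_index": 0})
    assert load_checkpoint(target) == {"position": "after_tools", "cycle_index": 0}
```

In a durable-execution setting the store would more likely be the workflow engine's own state or a database row; the file is only the simplest thing that demonstrates the round trip.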

V0 known limitations:

Metrics reset on each resume call.
OpenAIResponsesModel(stateful=True) not supported.
BeforeInvocationEvent / AfterInvocationEvent fire on every resume (same as interrupts).
Per-tool granularity within a cycle requires a custom ToolExecutor. The SDK checkpoint operates at cycle boundaries.
Streaming callbacks do not re-emit on replay.

Related Issues

Documentation PR

Type of Change

New feature

Testing

Verified the changes do not break functionality or introduce warnings in consuming repositories.

I ran hatch run prepare

Evidence from fresh runs:

hatch test — 3026 passed, 4 skipped, 0 failed.
hatch run hatch-static-analysis:lint-check — ruff check and mypy both clean.
hatch run hatch-static-analysis:format-check — all files formatted.

Checklist

I have read the CONTRIBUTING document
I have added any necessary tests that prove my fix is effective or my feature works
I have updated the documentation accordingly (user-guide page is a follow-up PR in agent-docs; module-level docstring in checkpoint.py covers V0 limitations, precedence rules, and the recommended SessionManager pairing)
I have added an appropriate example to the documentation to outline the feature, or no new docs are needed (reference Temporal / Dapr / Step Functions examples are the next milestone in the durable-execution tracking plan)
My changes generate no new warnings
Any dependent changes have been merged and published (Part A — feat: introduce checkpoint in experimental #2181 — is merged on main; this PR builds on it)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

codecov · 2026-04-22T20:41:11Z

Codecov Report

❌ Patch coverage is 90.62500% with 6 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
strands-py/src/strands/agent/agent.py	80.00%	3 Missing and 2 partials ⚠️
strands-py/src/strands/event_loop/event_loop.py	96.42%	0 Missing and 1 partial ⚠️

github-actions · 2026-04-22T20:53:59Z

Assessment: Comment

This is a well-structured PR that wires checkpoint functionality into the agent loop with a clean opt-in design. The state machine is carefully reasoned and the integration tests (especially the crash-after-tools test) are compelling. Four themes warrant attention before merge:

Review Themes

API Review Required: This introduces meaningful new public API surface (Agent parameter, AgentResult field, new StopReason, content block types). Per the API Bar Raising process, it needs a needs-api-review label and reviewer sign-off. Key design questions: is checkpointing: bool the right level of configurability, and should there be a high-level resume_from_checkpoint() method alongside the content-block primitive?
Error Contract Consistency: The resume validation comments claim to mirror _InterruptState.resume() conventions (TypeError/KeyError/ValueError), but Checkpoint.from_dict now raises CheckpointException. The exception hierarchy should be consistent and documented.
Coupling Pattern: The event loop directly accesses private agent attributes (_checkpointing, _checkpoint_resume_context). This mirrors the existing interrupt pattern but extends the coupling surface. Consider exposing checkpoint config as an explicit parameter or read-only property.
Test Coverage Gap: Missing a test for the checkpointing=True + end_turn (no tool use) path, and the Codecov report shows 1 partial line in event_loop.py.
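For the coupling concern above, the read-only property option could look roughly like the sketch below. `AgentSketch`, `checkpointing_enabled`, and `should_emit_checkpoint` are hypothetical names chosen for illustration; the PR itself reads `agent._checkpointing` directly, and nothing here reflects an agreed API.

```python
class AgentSketch:
    """Hypothetical agent showing read-only exposure of a private flag."""

    def __init__(self, checkpointing: bool = False) -> None:
        self._checkpointing = checkpointing

    @property
    def checkpointing_enabled(self) -> bool:
        # No setter is defined, so callers cannot flip the flag after construction.
        return self._checkpointing


def should_emit_checkpoint(agent: AgentSketch) -> bool:
    """Event-loop-side check that goes through the public property."""
    return agent.checkpointing_enabled


assert should_emit_checkpoint(AgentSketch(checkpointing=True))
assert not should_emit_checkpoint(AgentSketch())
```

The trade-off is one more public name on the agent in exchange for the event loop no longer depending on a private attribute.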

The feature design, state-machine logic, and durability proof are solid. The integration tests are particularly well-designed.

github-actions · 2026-04-22T21:34:56Z

Assessment: Comment

Good progress since the last round — the frozen=True dataclass, _build_checkpoint_stop_event extraction, and updated error convention documentation address several prior concerns. A few new items surfaced:

New Review Items

Docstring accuracy: event_loop_cycle Yields docstring still documents a 4-element tuple but the actual event is now 7 elements. The cancel() docstring uses "checkpoint" in a way that now conflicts with the durable-execution Checkpoint concept introduced here.
AGENTS.md update: The directory structure section needs to be updated to include experimental/checkpoint/ per the repo's own guidelines.
Cancel + checkpoint interaction: When both checkpointing=True and cancel_signal are set, checkpoint emission takes precedence over cancel. This is probably correct but should be documented or tested.

The prior-round items around API review (needs-api-review label, high-level resume method question) remain open for maintainer decision. The core state-machine logic and test coverage are solid.

github-actions · 2026-04-30T18:20:54Z

Assessment: Request Changes

The checkpoint wiring itself remains solid (approved in R4). However, the PR has expanded to include two additional unrelated features that weren't present in prior review rounds and aren't documented in the PR description.

Review Details

PR scope expansion: The PR now bundles three unrelated changes — checkpoint wiring, token estimation (_estimate_input_tokens / projected_input_tokens), and context offloader improvements (default change + S3 URI). These should be separated or at minimum documented.
Breaking default change: ContextOffloader(include_retrieval_tool=...) changed from False to True, affecting existing users on upgrade without any code change. This conflicts with the "Pay for Play" decision record.
Minor robustness: _estimate_input_tokens directly indexes usage["inputTokens"] without fallback — caught by outer try/except but could be more defensive.
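The "more defensive" access suggested in the last item might look like the sketch below. `input_tokens_or_zero` is a hypothetical helper, not the body of `_estimate_input_tokens`, whose real signature is not shown in this thread. Only the `inputTokens` key comes from the review comment.

```python
from typing import Any, Mapping, Optional


def input_tokens_or_zero(usage: Optional[Mapping[str, Any]]) -> int:
    """Return usage["inputTokens"] when present and an int, otherwise 0."""
    if not usage:
        return 0
    value = usage.get("inputTokens", 0)
    # Guard against providers that report None or a non-integer value.
    return value if isinstance(value, int) else 0


assert input_tokens_or_zero({"inputTokens": 12}) == 12
assert input_tokens_or_zero({}) == 0
assert input_tokens_or_zero(None) == 0
```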

The checkpoint feature itself is ready to merge. The additional features need separate review or at minimum PR description documentation.

JackYPCOnline · 2026-04-30T18:27:23Z

/strands I rebased onto main; this PR is not adding those features. Can you review the PR against current main? The breaking change is not introduced by checkpointing.

github-actions · 2026-04-30T18:30:05Z

Final Review (Post-Rebase) — No Blocking Issues

I've re-reviewed the PR diff against current main and confirmed the R5 automated review was a false positive — _estimate_input_tokens, projected_input_tokens, and the BeforeModelCallEvent changes all already exist on main. The rebase brought those in; this PR does not introduce them.

What this PR actually changes (vs. current main):

File	Change
`agent.py`	+`checkpointing` param, `_try_consume_checkpoint_resume` helper, cancel docstring fix
`agent_result.py`	+`checkpoint` field, round-trip serialization, typo fix (`invalid@` → `invalid`)
`event_loop.py`	+`_build_checkpoint_stop_event` helper, checkpoint emission at `after_model`/`after_tools`, 6→7 tuple
`_events.py`	+`checkpoint` kwarg on `EventLoopStopEvent`
`checkpoint.py`	Docstring rewrite, `ValueError` → `CheckpointException`, `frozen=True`
`exceptions.py`	+`CheckpointException`
Tests	New checkpoint unit/integration tests, mechanical 6→7 tuple unpackers in existing tests

Verification against main:

✅ EventLoopStopEvent on main is a 6-tuple → PR correctly extends to 7 with checkpoint as final element
✅ _estimate_input_tokens already on main — not introduced by this PR
✅ StopReason literal on main already includes "checkpoint" (from Part A, feat: introduce checkpoint in experimental #2181)
✅ No breaking changes for checkpointing=False (default) — all new paths gated on agent._checkpointing

Open review threads from @pgrayy:

These are design discussion items, not bugs:

TemporalToolExecutor pattern recommendation — open question for future design
"Precedence" vs "order" terminology — stylistic
model_state capture — acknowledged as planned follow-up

Conclusion:

No critical bugs, gaps, or new risks. The PR is correctly scoped to checkpoint wiring only. The prior R5 "Request Changes" assessment was based on incorrectly attributing rebased-from-main code to this PR.

github-actions · 2026-05-27T17:45:44Z

Assessment: Approve

The PR is correctly scoped to checkpoint wiring after the rebase onto current main. The new tests_integ/test_agent_checkpoint.py is a solid addition — it exercises the full pause/resume/durability contract against real Bedrock, with proper skip markers for CI without credentials. The _drive_to_completion helper pattern is clean and reusable.

One trivial nit: duplicate @pytest.mark.asyncio decorator in test_model.py (harmless but should be removed).

No other issues. The pgrayy design threads are open discussions for future direction, not blockers for this PR.

github-actions · 2026-05-29T18:09:20Z

Assessment: Comment

The checkpoint wiring is clean and well-designed. The evolution from snapshot-based state capture to lightweight pause-point markers (paired with SessionManager for state continuity) is a good simplification that keeps the SDK surface area small. The state machine logic, cancel/interrupt precedence, and one-shot resume consumption are all correct.

Review Details

Integration test skip guard: tests_integ/test_agent_checkpoint.py is missing the pytestmark skip used by analogous Bedrock tests (test_cancellation.py), which will cause failures rather than skips in CI without AWS credentials.
Dead test setup: Six take_snapshot mock assignments in checkpoint event_loop tests are leftover from the previous snapshot-based design and never get called — they should be cleaned up to avoid confusion.
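For the skip guard mentioned above, a module-level marker needs some way to decide whether credentials are available. The sketch below assumes credentials arrive through environment variables, which is only one of several ways AWS credentials can be resolved. `aws_credentials_available` is a hypothetical helper, and the guard actually used in test_cancellation.py may be different.

```python
import os
from typing import Mapping


def aws_credentials_available(env: Mapping[str, str] = os.environ) -> bool:
    """Heuristic check: static keys or a named profile are set in the environment."""
    has_keys = bool(env.get("AWS_ACCESS_KEY_ID")) and bool(env.get("AWS_SECRET_ACCESS_KEY"))
    return has_keys or bool(env.get("AWS_PROFILE"))


# In a pytest module, the result would feed a module-level marker, for example:
# pytestmark = pytest.mark.skipif(
#     not aws_credentials_available(), reason="AWS credentials not configured"
# )

assert not aws_credentials_available({})
assert aws_credentials_available({"AWS_PROFILE": "dev"})
```

With a guard like this, a CI run without credentials reports the tests as skipped instead of failed.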

The core implementation is solid and ready for merge once the skip guard is added.

Add an opt-in `checkpointing=True` flag on Agent that pauses the event loop at ReAct cycle boundaries (after_model and after_tools) and returns a serializable Checkpoint via AgentResult. Resume by passing the persisted checkpoint back as a `checkpointResume` block (dict or single-element list).

The Checkpoint itself is a pause-point marker (position + cycle_index). State persistence is the caller's job. The recommended pattern is to pair `checkpointing=True` with a `SessionManager`: SessionManager handles the conversation state, Checkpoint signals the boundary. Callers who need to own the persistence boundary themselves can put state in `Checkpoint.snapshot` or `Checkpoint.app_data`.

Non-checkpointing behavior is unchanged: emission sites and the post-tool cancel fast-path are gated on agent._checkpointing, so callers who do not opt in see identical runtime behavior to main.

Includes:

- Checkpoint dataclass (Part A) integration with the event loop
- Agent._try_consume_checkpoint_resume helper that validates and consumes the checkpointResume block (TypeError for shape, KeyError for lookup, ValueError for misconfig, CheckpointException for schema mismatch). Accepts both dict and list shapes for the resume block.
- _build_checkpoint_stop_event helper factoring the two emission sites
- AgentResult.checkpoint field and round-trip in to_dict/from_dict
- EventLoopStopEvent extended with the checkpoint kwarg (7-tuple)
- Unit tests for emission, resume, cancel/interrupt precedence
- Integration tests against Bedrock that demonstrate the recommended Checkpoint + SessionManager pattern end-to-end
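The error contract described for the resume helper (TypeError for shape, KeyError for lookup, ValueError for misconfig, CheckpointException for schema mismatch) can be sketched as follows. This is not the body of `Agent._try_consume_checkpoint_resume`: the function name, the `CheckpointSchemaError` stand-in, and the exact condition that triggers each exception are assumptions made to illustrate the four categories. In particular, treating "resume on an agent without checkpointing=True" as the misconfiguration case is a guess.

```python
from typing import Any


class CheckpointSchemaError(Exception):
    """Hypothetical stand-in for the SDK's CheckpointException."""


def extract_checkpoint_payload(prompt: Any, checkpointing: bool) -> dict[str, Any]:
    """Validate a resume prompt and return the checkpoint dict it carries."""
    # Shape: accept a single dict block or a one-element list wrapping it.
    if isinstance(prompt, list):
        if len(prompt) != 1:
            raise TypeError("resume prompt list must contain exactly one block")
        prompt = prompt[0]
    if not isinstance(prompt, dict):
        raise TypeError("resume prompt must be a dict or a single-element list")

    # Lookup: a missing key surfaces as KeyError.
    payload = prompt["checkpointResume"]["checkpoint"]

    # Misconfiguration (assumed meaning): resuming on an agent that did not opt in.
    if not checkpointing:
        raise ValueError("checkpointResume requires checkpointing=True")

    # Schema: the payload must carry the marker fields named in this PR.
    if not isinstance(payload, dict) or not {"position", "cycle_index"} <= payload.keys():
        raise CheckpointSchemaError("checkpoint payload does not match the expected schema")
    return payload


block = {"checkpointResume": {"checkpoint": {"position": "after_tools", "cycle_index": 1}}}
assert extract_checkpoint_payload(block, checkpointing=True)["cycle_index"] == 1
assert extract_checkpoint_payload([block], checkpointing=True)["position"] == "after_tools"
```

Keeping the checks in a fixed order (shape, lookup, configuration, schema) means each malformed input maps to exactly one exception type, which is what makes the contract testable.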

github-actions Bot added the size/l label Apr 22, 2026

JackYPCOnline had a problem deploying to auto-approve April 22, 2026 20:38 — with GitHub Actions Failure

JackYPCOnline temporarily deployed to manual-approval April 22, 2026 20:39 — with GitHub Actions Inactive

JackYPCOnline marked this pull request as draft April 22, 2026 20:39

github-actions Bot added the strands-running label Apr 22, 2026