
feat(checkpoint): wire checkpointing into agent event loop #2190

Open
JackYPCOnline wants to merge 1 commit into strands-agents:main from JackYPCOnline:checkpoint_1

Conversation

@JackYPCOnline
Contributor

@JackYPCOnline JackYPCOnline commented Apr 22, 2026

Description

Wires the Checkpoint data model (landed in #2181) into the agent runtime so an opt-in checkpointing=True agent pauses at ReAct cycle boundaries and emits a serializable pause-point marker. The marker travels through AgentResult.checkpoint and is passed back to a fresh agent through a checkpointResume block. The design mirrors the existing interrupt pattern: stop_reason="checkpoint" to signal the pause, content-block resume, no new method surface to learn.

A Checkpoint is a position marker (which boundary fired and which cycle index), not a state snapshot. State persistence is the caller's responsibility. The recommended pairing is a SessionManager for state continuity plus checkpointing=True for boundary signalling. Callers who want to own persistence themselves can put state in Checkpoint.snapshot or Checkpoint.app_data.
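To make the "position marker, not state snapshot" distinction concrete, here is a minimal sketch of what such a marker could look like. It is not the SDK's Checkpoint class: `CheckpointSketch` is a hypothetical stand-in, the field names are taken from this description, and the dict types for `snapshot` and `app_data` are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class CheckpointSketch:
    """Hypothetical stand-in for the PR's Checkpoint; not the real class."""

    position: str  # which boundary fired, e.g. "after_model" or "after_tools"
    cycle_index: int  # which ReAct cycle the boundary belongs to
    # Optional caller-owned payloads (types assumed for this sketch).
    snapshot: Dict[str, Any] = field(default_factory=dict)
    app_data: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "CheckpointSketch":
        return cls(
            position=data["position"],
            cycle_index=data["cycle_index"],
            snapshot=dict(data.get("snapshot", {})),
            app_data=dict(data.get("app_data", {})),
        )


# Round-trip through JSON, as a caller persisting the marker might do.
original = CheckpointSketch(
    position="after_tools", cycle_index=2, app_data={"workflow_id": "wf-1"}
)
restored = CheckpointSketch.from_dict(json.loads(json.dumps(original.to_dict())))
```

The marker carries no conversation history; in the recommended pairing that lives in the SessionManager, and the marker only says where to pick up.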

User-facing API (zero breaking changes — opt-in only):

import asyncio

from strands import Agent
from strands.session import FileSessionManager


async def main() -> None:
    agent = Agent(
        tools=[...],
        session_manager=FileSessionManager(session_id="run-1", storage_dir="..."),
        checkpointing=True,
    )

    result = await agent.invoke_async("do the thing")

    # save_somewhere / load_somewhere are placeholders for caller-owned persistence.
    while result.stop_reason == "checkpoint":
        save_somewhere(result.checkpoint.to_dict())
        # later, possibly in a fresh process / activity:
        fresh = Agent(
            tools=[...],
            session_manager=FileSessionManager(session_id="run-1", storage_dir="..."),
            checkpointing=True,
        )
        result = await fresh.invoke_async(
            {"checkpointResume": {"checkpoint": load_somewhere()}}
        )

    print(result.message)  # stop_reason == "end_turn"


asyncio.run(main())
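The save_somewhere and load_somewhere calls in the example are left to the caller. One possible shape for them is a JSON file written atomically, sketched below. The names `save_checkpoint` and `load_checkpoint` are hypothetical, and the sketch assumes the dict returned by to_dict() is JSON-serializable.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, checkpoint_dict: dict) -> None:
    """Write the checkpoint dict to `path`, replacing any previous marker."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the target,
    # so a crash mid-write never leaves a half-written marker behind.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(checkpoint_dict, handle)
    os.replace(tmp_name, path)


def load_checkpoint(path: Path) -> dict:
    """Read back the most recently saved checkpoint dict."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

A durable-execution host would more likely keep the marker in its own workflow state, so the file is only a stand-in for that store.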

V0 known limitations:

  • Metrics reset on each resume call.
  • OpenAIResponsesModel(stateful=True) not supported.
  • BeforeInvocationEvent / AfterInvocationEvent fire on every resume (same as interrupts).
  • Per-tool granularity within a cycle requires a custom ToolExecutor. The SDK checkpoint operates at cycle boundaries.
  • Streaming callbacks do not re-emit on replay.

Related Issues

Documentation PR

Type of Change

New feature

Testing

Verified the changes do not break functionality or introduce warnings in consuming repositories.

  • I ran hatch run prepare

Evidence from fresh runs:

  • hatch test: 3026 passed, 4 skipped, 0 failed.
  • hatch run hatch-static-analysis:lint-check: ruff check and mypy both clean.
  • hatch run hatch-static-analysis:format-check: all files formatted.

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly (user-guide page is a follow-up PR in agent-docs; module-level docstring in checkpoint.py covers V0 limitations, precedence rules, and the recommended SessionManager pairing)
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed (reference Temporal / Dapr / Step Functions examples are the next milestone in the durable-execution tracking plan)
  • My changes generate no new warnings
  • Any dependent changes have been merged and published (Part A — feat: introduce checkpoint in experimental #2181 — is merged on main; this PR builds on it)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@JackYPCOnline JackYPCOnline marked this pull request as draft April 22, 2026 20:39
@codecov

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 90.62500% with 6 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| strands-py/src/strands/agent/agent.py | 80.00% | 3 Missing and 2 partials ⚠️ |
| strands-py/src/strands/event_loop/event_loop.py | 96.42% | 0 Missing and 1 partial ⚠️ |


@github-actions
Contributor

Assessment: Comment

This is a well-structured PR that wires checkpoint functionality into the agent loop with a clean opt-in design. The state machine is carefully reasoned and the integration tests (especially the crash-after-tools test) are compelling. Four themes warrant attention before merge:

Review Themes
  • API Review Required: This introduces meaningful new public API surface (Agent parameter, AgentResult field, new StopReason, content block types). Per the API Bar Raising process, it needs a needs-api-review label and reviewer sign-off. Key design questions: is checkpointing: bool the right level of configurability, and should there be a high-level resume_from_checkpoint() method alongside the content-block primitive?

  • Error Contract Consistency: The resume validation comments claim to mirror _InterruptState.resume() conventions (TypeError/KeyError/ValueError), but Checkpoint.from_dict now raises CheckpointException. The exception hierarchy should be consistent and documented.

  • Coupling Pattern: The event loop directly accesses private agent attributes (_checkpointing, _checkpoint_resume_context). This mirrors the existing interrupt pattern but extends the coupling surface. Consider exposing checkpoint config as an explicit parameter or read-only property.

  • Test Coverage Gap: Missing a test for the checkpointing=True + end_turn (no tool use) path, and the Codecov report shows 1 partial line in event_loop.py.

The feature design, state-machine logic, and durability proof are solid. The integration tests are particularly well-designed.
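As an illustration of the read-only property option raised under Coupling Pattern, a minimal sketch follows. `AgentSketch` and `should_emit_checkpoint` are hypothetical names, not code from this PR; the point is only that the event loop could read a public attribute instead of reaching for `_checkpointing`.

```python
class AgentSketch:
    """Hypothetical agent exposing checkpoint config without a setter."""

    def __init__(self, checkpointing: bool = False) -> None:
        self._checkpointing = checkpointing

    @property
    def checkpointing(self) -> bool:
        # Read-only view: there is no setter, so callers cannot flip it mid-run.
        return self._checkpointing


def should_emit_checkpoint(agent: AgentSketch) -> bool:
    """Event-loop-side check that depends only on the public property."""
    return agent.checkpointing
```

Whether this is worth the extra surface compared with mirroring the interrupt pattern is the maintainers' call.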

@github-actions
Contributor

Assessment: Comment

Good progress since the last round — the frozen=True dataclass, _build_checkpoint_stop_event extraction, and updated error convention documentation address several prior concerns. A few new items surfaced:

New Review Items
  • Docstring accuracy: event_loop_cycle Yields docstring still documents a 4-element tuple but the actual event is now 7 elements. The cancel() docstring uses "checkpoint" in a way that now conflicts with the durable-execution Checkpoint concept introduced here.
  • AGENTS.md update: The directory structure section needs to be updated to include experimental/checkpoint/ per the repo's own guidelines.
  • Cancel + checkpoint interaction: When both checkpointing=True and cancel_signal are set, checkpoint emission takes precedence over cancel. This is probably correct but should be documented or tested.

The prior-round items around API review (needs-api-review label, high-level resume method question) remain open for maintainer decision. The core state-machine logic and test coverage are solid.

@github-actions
Contributor

Assessment: Request Changes

The checkpoint wiring itself remains solid (approved in R4). However, the PR has expanded to include two additional unrelated features that weren't present in prior review rounds and aren't documented in the PR description.

Review Details
  • PR scope expansion: The PR now bundles three unrelated changes — checkpoint wiring, token estimation (_estimate_input_tokens / projected_input_tokens), and context offloader improvements (default change + S3 URI). These should be separated or at minimum documented.
  • Breaking default change: ContextOffloader(include_retrieval_tool=...) changed from False to True, affecting existing users on upgrade without any code change. This conflicts with the "Pay for Play" decision record.
  • Minor robustness: _estimate_input_tokens directly indexes usage["inputTokens"] without fallback — caught by outer try/except but could be more defensive.
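For the robustness point, a defensive read could look like the sketch below. `read_input_tokens` is a hypothetical helper, not the body of `_estimate_input_tokens`; only the `"inputTokens"` key comes from the review comment.

```python
from typing import Any, Mapping, Optional


def read_input_tokens(usage: Optional[Mapping[str, Any]], default: int = 0) -> int:
    """Return usage["inputTokens"] when present and an int, else `default`."""
    if not usage:
        return default
    value = usage.get("inputTokens", default)
    # Guard against a present-but-malformed value, e.g. None or a string.
    return value if isinstance(value, int) else default
```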

The checkpoint feature itself is ready to merge. The additional features need separate review or at minimum PR description documentation.

@JackYPCOnline
Contributor Author

/strands I rebased from main; this PR is not adding new features. Can you review the PR against current main? The breaking change is not introduced by checkpointing.

@github-actions
Contributor

Final Review (Post-Rebase) — No Blocking Issues

I've re-reviewed the PR diff against current main and confirmed the R5 automated review was a false positive: _estimate_input_tokens, projected_input_tokens, and the BeforeModelCallEvent changes all already exist on main. The rebase brought those in; this PR does not introduce them.

What this PR actually changes (vs. current main):

| File | Change |
| --- | --- |
| agent.py | +checkpointing param, _try_consume_checkpoint_resume helper, cancel docstring fix |
| agent_result.py | +checkpoint field, round-trip serialization, typo fix (invalid@invalid) |
| event_loop.py | +_build_checkpoint_stop_event helper, checkpoint emission at after_model/after_tools, 6→7 tuple |
| _events.py | +checkpoint kwarg on EventLoopStopEvent |
| checkpoint.py | Docstring rewrite, ValueError → CheckpointException, frozen=True |
| exceptions.py | +CheckpointException |
| Tests | New checkpoint unit/integration tests, mechanical 6→7 tuple unpackers in existing tests |

Verification against main:

  • EventLoopStopEvent on main is a 6-tuple → PR correctly extends to 7 with checkpoint as final element
  • _estimate_input_tokens already on main — not introduced by this PR
  • StopReason literal on main already includes "checkpoint" (from Part A, feat: introduce checkpoint in experimental #2181)
  • ✅ No breaking changes for checkpointing=False (default) — all new paths gated on agent._checkpointing

Open review threads from @pgrayy:

These are design discussion items, not bugs:

  1. TemporalToolExecutor pattern recommendation — open question for future design
  2. "Precedence" vs "order" terminology — stylistic
  3. model_state capture — acknowledged as planned follow-up

Conclusion:

No crucial bugs, gaps, or new risks. The PR is correctly scoped to checkpoint wiring only. The prior R5 "Request Changes" assessment was based on incorrectly attributing rebased-from-main code to this PR.

@JackYPCOnline JackYPCOnline enabled auto-merge (squash) May 1, 2026 21:38
@yonib05 yonib05 added the area-persistence Session management or checkpointing label May 27, 2026
@github-actions github-actions Bot added size/xl and removed size/xl labels May 27, 2026
@github-actions
Contributor

Assessment: Approve

The PR is correctly scoped to checkpoint wiring after the rebase onto current main. The new tests_integ/test_agent_checkpoint.py is a solid addition — it exercises the full pause/resume/durability contract against real Bedrock, with proper skip markers for CI without credentials. The _drive_to_completion helper pattern is clean and reusable.

One trivial nit: duplicate @pytest.mark.asyncio decorator in test_model.py (harmless but should be removed).

No other issues. The pgrayy design threads are open discussions for future direction, not blockers for this PR.

@github-actions
Contributor

Assessment: Comment

The checkpoint wiring is clean and well-designed. The evolution from snapshot-based state capture to lightweight pause-point markers (paired with SessionManager for state continuity) is a good simplification that keeps the SDK surface area small. The state machine logic, cancel/interrupt precedence, and one-shot resume consumption are all correct.

Review Details
  • Integration test skip guard: tests_integ/test_agent_checkpoint.py is missing the pytestmark skip used by analogous Bedrock tests (test_cancellation.py), which will cause failures rather than skips in CI without AWS credentials.
  • Dead test setup: Six take_snapshot mock assignments in checkpoint event_loop tests are leftover from the previous snapshot-based design and never get called — they should be cleaned up to avoid confusion.

The core implementation is solid and ready for merge once the skip guard is added.
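A skip guard of the kind requested might be sketched as follows. The actual guard used in test_cancellation.py is not shown in this thread, so the environment-variable heuristic and the name `aws_credentials_configured` are assumptions; only `pytest.mark.skipif` is standard pytest.

```python
import os
from typing import Mapping


def aws_credentials_configured(env: Mapping[str, str] = os.environ) -> bool:
    """Heuristic: static keys or a named profile are present in the environment."""
    has_keys = bool(env.get("AWS_ACCESS_KEY_ID")) and bool(env.get("AWS_SECRET_ACCESS_KEY"))
    has_profile = bool(env.get("AWS_PROFILE"))
    return has_keys or has_profile


SKIP_REASON = "AWS credentials not configured; skipping Bedrock integration tests"

# At module level in the integration test file, with pytest imported:
#   pytestmark = pytest.mark.skipif(not aws_credentials_configured(), reason=SKIP_REASON)
```

This heuristic misses credentials that come from instance roles or SSO caches, so a real guard may need a broader check.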

Add an opt-in `checkpointing=True` flag on Agent that pauses the event
loop at ReAct cycle boundaries (after_model and after_tools) and returns
a serializable Checkpoint via AgentResult. Resume by passing the persisted
checkpoint back as a `checkpointResume` block (dict or single-element list).

The Checkpoint itself is a pause-point marker (position + cycle_index).
State persistence is the caller's job. The recommended pattern is to pair
`checkpointing=True` with a `SessionManager`: SessionManager handles the
conversation state, Checkpoint signals the boundary. Callers who need to
own the persistence boundary themselves can put state in
`Checkpoint.snapshot` or `Checkpoint.app_data`.

Non-checkpointing behavior is unchanged: emission sites and the post-tool
cancel fast-path are gated on agent._checkpointing, so callers who do not
opt in see identical runtime behavior to main.
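The boundary behavior described above can be sketched with a toy loop. This is not the SDK event loop: `run_until_pause` and `drive_to_completion` are hypothetical, the model is replaced by a scripted list of decisions, and the resume semantics (after_model resumes into the tool step, after_tools resumes into the next cycle) are an assumption read from this message.

```python
from typing import List, Optional, Tuple


def run_until_pause(
    script: List[str], checkpointing: bool, resume: Optional[dict] = None
) -> Tuple[str, Optional[dict], List[str]]:
    """Run scripted cycles until the run ends or a checkpoint boundary fires.

    script[i] is the model's decision in cycle i: "tool_use" or "end_turn".
    Returns (stop_reason, checkpoint_or_None, steps_run_in_this_call).
    """
    log: List[str] = []
    cycle = 0
    skip_model = False
    if resume is not None:
        cycle = resume["cycle_index"]
        if resume["position"] == "after_model":
            skip_model = True  # the model already ran; continue with tools
        else:
            cycle += 1  # tools already ran; start the next cycle

    while True:
        if not skip_model:
            log.append(f"model[{cycle}]")
            if script[cycle] == "end_turn":
                return "end_turn", None, log
            if checkpointing:  # emission site 1, gated on the opt-in flag
                return "checkpoint", {"position": "after_model", "cycle_index": cycle}, log
        skip_model = False
        log.append(f"tools[{cycle}]")
        if checkpointing:  # emission site 2, gated on the opt-in flag
            return "checkpoint", {"position": "after_tools", "cycle_index": cycle}, log
        cycle += 1


def drive_to_completion(script: List[str]) -> Tuple[str, List[str], List[dict]]:
    """Resume after every pause, collecting all steps and markers."""
    steps: List[str] = []
    checkpoints: List[dict] = []
    stop, marker, log = run_until_pause(script, checkpointing=True)
    steps += log
    while stop == "checkpoint":
        checkpoints.append(marker)
        stop, marker, log = run_until_pause(script, checkpointing=True, resume=marker)
        steps += log
    return stop, steps, checkpoints
```

With the flag off, the same script runs straight through in one call, which is the "identical runtime behavior" property in miniature.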

Includes:
- Checkpoint dataclass (Part A) integration with the event loop
- Agent._try_consume_checkpoint_resume helper that validates and consumes
  the checkpointResume block (TypeError for shape, KeyError for lookup,
  ValueError for misconfig, CheckpointException for schema mismatch).
  Accepts both dict and list shapes for the resume block.
- _build_checkpoint_stop_event helper factoring the two emission sites
- AgentResult.checkpoint field and round-trip in to_dict/from_dict
- EventLoopStopEvent extended with the checkpoint kwarg (7-tuple)
- Unit tests for emission, resume, cancel/interrupt precedence
- Integration tests against Bedrock that demonstrate the recommended
  Checkpoint + SessionManager pattern end-to-end
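
The error contract listed for the resume helper can be sketched as a standalone validator. This is not `Agent._try_consume_checkpoint_resume`: the function name, the order of checks, and the specific fields treated as a schema mismatch are assumptions, and `CheckpointExceptionSketch` is a stand-in for the PR's CheckpointException.

```python
from typing import Any, Optional

VALID_POSITIONS = {"after_model", "after_tools"}


class CheckpointExceptionSketch(Exception):
    """Stand-in for the PR's CheckpointException."""


def consume_checkpoint_resume(prompt: Any, checkpointing: bool) -> Optional[dict]:
    """Return the checkpoint dict from a resume prompt, or None for ordinary prompts."""
    # Accept the single-element list shape by unwrapping it to the dict shape.
    if isinstance(prompt, list):
        if not any(isinstance(b, dict) and "checkpointResume" in b for b in prompt):
            return None
        if len(prompt) != 1:
            raise TypeError("checkpointResume must be the only content block")
        prompt = prompt[0]
    if not (isinstance(prompt, dict) and "checkpointResume" in prompt):
        return None
    if not checkpointing:  # misconfiguration
        raise ValueError("checkpointResume requires an agent with checkpointing=True")
    resume = prompt["checkpointResume"]
    if not isinstance(resume, dict):  # wrong shape
        raise TypeError("checkpointResume must be a dict")
    checkpoint = resume["checkpoint"]  # KeyError when the key is missing
    if not isinstance(checkpoint, dict):
        raise TypeError("checkpoint must be a dict")
    if checkpoint.get("position") not in VALID_POSITIONS or not isinstance(
        checkpoint.get("cycle_index"), int
    ):
        raise CheckpointExceptionSketch("checkpoint does not match the expected schema")
    return checkpoint
```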

Labels

area-persistence: Session management or checkpointing
enhancement: New feature or request
python: Pull requests that update python code
size/xl


4 participants