feat(checkpoint): wire checkpointing into agent event loop#2190
feat(checkpoint): wire checkpointing into agent event loop#2190JackYPCOnline wants to merge 1 commit into
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
Assessment: Comment This is a well-structured PR that wires checkpoint functionality into the agent loop with a clean opt-in design. The state machine is carefully reasoned and the integration tests (especially the crash-after-tools test) are compelling. Two themes warrant attention before merge: Review Themes
The feature design, state-machine logic, and durability proof are solid. The integration tests are particularly well-designed. |
|
Assessment: Comment Good progress since the last round — the New Review Items
The prior-round items around API review ( |
|
Assessment: Request Changes The checkpoint wiring itself remains solid (approved in R4). However, the PR has expanded to include two additional unrelated features that weren't present in prior review rounds and aren't documented in the PR description. Review Details
The checkpoint feature itself is ready to merge. The additional features need separate review or at minimum PR description documentation. |
|
/strands I did rebase from main, it is not adding new feature, can you review the PR against current main. Breaking change is not introduced by checkpointing. |
Final Review (Post-Rebase) — No Blocking IssuesI've re-reviewed the PR diff against current What this PR actually changes (vs. current main):
Verification against main:
Open review threads from @pgrayy:These are design discussion items, not bugs:
Conclusion:No crucial bugs, gaps, or new risks. The PR is correctly scoped to checkpoint wiring only. The prior R5 "Request Changes" assessment was based on incorrectly attributing rebased-from-main code to this PR. |
d61e9f0 to
d89ed65
Compare
|
Assessment: Approve The PR is correctly scoped to checkpoint wiring after the rebase onto current main. The new One trivial nit: duplicate No other issues. The pgrayy design threads are open discussions for future direction, not blockers for this PR. |
d89ed65 to
9e6e091
Compare
9e6e091 to
1325af5
Compare
|
Assessment: Comment The checkpoint wiring is clean and well-designed. The evolution from snapshot-based state capture to lightweight pause-point markers (paired with SessionManager for state continuity) is a good simplification that keeps the SDK surface area small. The state machine logic, cancel/interrupt precedence, and one-shot resume consumption are all correct. Review Details
The core implementation is solid and ready for merge once the skip guard is added. |
Add an opt-in `checkpointing=True` flag on Agent that pauses the event loop at ReAct cycle boundaries (after_model and after_tools) and returns a serializable Checkpoint via AgentResult. Resume by passing the persisted checkpoint back as a `checkpointResume` block (dict or single-element list). The Checkpoint itself is a pause-point marker (position + cycle_index). State persistence is the caller's job. The recommended pattern is to pair `checkpointing=True` with a `SessionManager`: SessionManager handles the conversation state, Checkpoint signals the boundary. Callers who need to own the persistence boundary themselves can put state in `Checkpoint.snapshot` or `Checkpoint.app_data`. Non-checkpointing behavior is unchanged: emission sites and the post-tool cancel fast-path are gated on agent._checkpointing, so callers who do not opt in see identical runtime behavior to main. Includes: - Checkpoint dataclass (Part A) integration with the event loop - Agent._try_consume_checkpoint_resume helper that validates and consumes the checkpointResume block (TypeError for shape, KeyError for lookup, ValueError for misconfig, CheckpointException for schema mismatch). Accepts both dict and list shapes for the resume block. - _build_checkpoint_stop_event helper factoring the two emission sites - AgentResult.checkpoint field and round-trip in to_dict/from_dict - EventLoopStopEvent extended with the checkpoint kwarg (7-tuple) - Unit tests for emission, resume, cancel/interrupt precedence - Integration tests against Bedrock that demonstrate the recommended Checkpoint + SessionManager pattern end-to-end
Description
Wires the
Checkpointdata model (landed in #2181) into the agent runtime so an opt-incheckpointing=Trueagent pauses at ReAct cycle boundaries and emits a serializable pause-point marker. The marker travels throughAgentResult.checkpointand is passed back to a fresh agent through acheckpointResumeblock. The design mirrors the existing interrupt pattern:stop_reason="checkpoint"to signal the pause, content-block resume, no new method surface to learn.A
Checkpointis a position marker (which boundary fired and which cycle index), not a state snapshot. State persistence is the caller's responsibility. The recommended pairing is aSessionManagerfor state continuity pluscheckpointing=Truefor boundary signalling. Callers who want to own persistence themselves can put state inCheckpoint.snapshotorCheckpoint.app_data.User-facing API (zero breaking changes — opt-in only):
V0 known limitations:
OpenAIResponsesModel(stateful=True)not supported.BeforeInvocationEvent/AfterInvocationEventfire on every resume (same as interrupts).ToolExecutor. The SDK checkpoint operates at cycle boundaries.Related Issues
Documentation PR
Type of Change
New feature
Testing
Verified the changes do not break functionality or introduce warnings in consuming repositories.
hatch run prepareEvidence from fresh runs:
hatch test— 3026 passed, 4 skipped, 0 failed.hatch run hatch-static-analysis:lint-check—ruff checkandmypyboth clean.hatch run hatch-static-analysis:format-check— all files formatted.\Checklist
agent-docs; module-level docstring incheckpoint.pycovers V0 limitations, precedence rules, and the recommended SessionManager pairing)main; this PR builds on it)By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.