Add standby compression start delay#184

Merged
sjmiller609 merged 18 commits into main from codex/standby-compression-delay
Apr 17, 2026

Conversation

@sjmiller609 (Collaborator) commented Apr 4, 2026

Summary

  • add a standby-only compression delay override on POST /instances/{id}/standby and a per-instance default in snapshot_policy
  • keep delayed standby compression jobs cancelable before start and distinguish pending-delay skips from active compression cancellation
  • add metrics, logs, traces, OpenAPI updates, and tests for the new standby compression delay behavior

Testing

  • go test ./lib/instances
  • go test ./cmd/api/api -run 'Test(CreateInstance_MapsStandbyCompressionDelayInSnapshotPolicy|CreateInstance_InvalidStandbyCompressionDelayInSnapshotPolicy|InstanceToOAPI_EmitsStandbyCompressionDelayInSnapshotPolicy|StandbyInstance_MapsCompressionDelay|StandbyInstance_InvalidCompressionDelay|StandbyInstance_InvalidRequest)$'

Notes

  • go test ./cmd/api/api still hits unrelated environment-dependent volume tests on this machine because mkfs.ext4 is not available in $PATH.

Note

Medium Risk
Medium risk because it changes standby/snapshot compression job orchestration, persistence, cancellation semantics, and adds startup recovery behavior that could affect restore/delete/snapshot flows and background job accounting.

Overview
Adds a standby-only snapshot compression delay configurable per request (POST /instances/{id}/standby compression_delay) and as an instance default (snapshot_policy.standby_compression_delay), including API parsing/validation and OAPI spec updates.

Refactors snapshot compression jobs to support a pending-delay state: delayed jobs wait on a timer, can be canceled and recorded as skipped (distinct from canceling an active compression), and persist a PendingStandbyCompression plan in instance metadata so delayed jobs are recovered on manager startup.

Updates restore/snapshot/delete/fork paths to clear pending plans and avoid copying them into derived metadata, adds new compression wait and active-vs-pending metrics and traces, and hardens integration/network tests against flakes (iptables -w, guest exec retry).

Reviewed by Cursor Bugbot for commit af8e7e3. Bugbot is set up for automated code reviews on this repo.

github-actions bot commented Apr 4, 2026

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

feat: Add standby compression start delay

hypeman-openapi
Your SDK build had at least one "note" diagnostic.
generate ✅

hypeman-go
Your SDK build had at least one "note" diagnostic.
generate ✅ build ⏭️ lint ✅ test ✅
go get github.com/stainless-sdks/hypeman-go@bf5811644658e76684892345a2485b0425c07bb4

hypeman-typescript
Your SDK build had at least one "note" diagnostic.
generate ✅ build ✅ lint ✅ test ✅

Last updated: 2026-04-17 15:54:48 UTC

@sjmiller609 sjmiller609 marked this pull request as ready for review April 8, 2026 13:12
Comment thread lib/instances/fork.go Outdated
Comment thread lib/instances/snapshot_compression.go
Comment thread lib/instances/network_test.go
Comment thread lib/network/bridge_linux.go
Comment thread skills/test-agent/agents/test-agent/NOTES.md
@sjmiller609 sjmiller609 requested a review from hiroTamada April 8, 2026 14:56
@sjmiller609 (Collaborator, Author) commented

Waiting until data or a use case justifies this.

@sjmiller609 sjmiller609 marked this pull request as draft April 9, 2026 19:09
@sjmiller609 sjmiller609 marked this pull request as ready for review April 15, 2026 15:38
@firetiger-agent

I'll monitor this standby compression delay feature for Hypeman. The change adds timing parameters for delaying compression after standby, with persistence across server restarts.

Key things I'll watch:

  • Hypeman invocation spawn success rate — baseline is ~99% for prod-jfk-hypeman-0/1. Any significant drop indicates the new compression state management may be affecting instance operations.
  • Compression job leaks — the new pending state and recovery logic could fail to clean up jobs properly. I'll watch for growing pending counts without matching completions.
  • Restore latency — changes to ensureSnapshotMemoryReady could slow instance restores if the new preemption logic has issues.

The new metrics (hypeman_snapshot_compression_wait_duration_seconds, hypeman_snapshot_compression_pending_total) will provide direct visibility once they appear in telemetry. I'll post updates as the deployment progresses.

Comment thread lib/instances/compression_integration_linux_test.go
@sjmiller609 sjmiller609 requested a review from hiroTamada April 17, 2026 15:03
@hiroTamada (Contributor) left a comment

Reviewed incrementally - looks good overall.

Summary:

  • Clean implementation of delayed standby compression
  • Well-designed crash recovery via persisted PendingStandbyCompression
  • Good distinction between preemption (running) vs skipped (pending)
  • Comprehensive metrics and test coverage

Minor notes:

  • Left one nit on execCommandWithRetry re: potential shared test util
  • iptables -w flag change is unrelated but reasonable to bundle

Ship it!

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit ecaa47f.

Comment thread lib/instances/fork.go Outdated
@sjmiller609 sjmiller609 merged commit 968c7aa into main Apr 17, 2026
9 of 11 checks passed
@sjmiller609 sjmiller609 deleted the codex/standby-compression-delay branch April 17, 2026 15:53