
feat(server): add bundled docker compute driver#888

Draft
drew wants to merge 5 commits into main from drew/creating-a-docker-driver-like-the-vm-driver

drew (Collaborator) commented Apr 20, 2026

Summary

Add a bundled Docker compute driver to the gateway on top of the supervisor-relay base. This lets the gateway provision sandboxes directly through the local Docker daemon without exposing sandbox ports or spawning a separate Docker driver binary.

Related Issue

N/A

Changes

  • add docker as a supported compute driver kind
  • add an in-process Docker backend in openshell-server using Bollard
  • wire Docker-related CLI flags and env config for the supervisor and TLS bind mounts
  • keep Docker sandboxes as long-lived containers with no /sandbox volume
  • update architecture and support docs for the Docker backend
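For reviewers, a minimal sketch of what "docker as a supported compute driver kind" could look like when a driver name is parsed from config. The `ComputeDriverKind` enum, its variants, and the lowercase spellings are assumptions for illustration, not the actual openshell-core type; any other driver kinds are omitted.

```rust
use std::fmt;
use std::str::FromStr;

/// Hypothetical driver-kind enum; the real type may be named and shaped differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComputeDriverKind {
    Vm,
    Docker,
}

impl FromStr for ComputeDriverKind {
    type Err = String;

    /// Accepts case-insensitive names with surrounding whitespace trimmed.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "vm" => Ok(Self::Vm),
            "docker" => Ok(Self::Docker),
            other => Err(format!("unsupported compute driver kind: {other}")),
        }
    }
}

impl fmt::Display for ComputeDriverKind {
    /// Round-trips with `from_str` so config values can be echoed back in logs.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(match self {
            Self::Vm => "vm",
            Self::Docker => "docker",
        })
    }
}
```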

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

mise run pre-commit currently fails in python:proto because grpc_tools.protoc is not installed in the local Python environment. Rust formatting and the touched crate test suites passed:

  • cargo fmt --all
  • cargo test -p openshell-server --lib
  • cargo test -p openshell-core --lib

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

drew self-assigned this Apr 20, 2026
drew requested a review from a team as a code owner April 20, 2026 00:48
copy-pr-bot (bot) commented Apr 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

drew marked this pull request as draft April 20, 2026 02:16
Base automatically changed from feat/supervisor-session-grpc-data to main April 21, 2026 15:38
drew added 2 commits April 21, 2026 20:56
Signed-off-by: Drew Newberry <anewberry@nvidia.com>

# Conflicts:
#	architecture/gateway.md
Signed-off-by: Drew Newberry <anewberry@nvidia.com>
drew force-pushed the drew/creating-a-docker-driver-like-the-vm-driver branch from 2f26563 to 8bf06db April 22, 2026 04:01
drew added 3 commits April 21, 2026 21:26
- Preserve sandbox id suffix in container name when sandbox name is long,
  preventing collisions and confusing 'already exists' errors.
- Use container id (not name) in delete_sandbox_inner so transient
  ContainerSummary entries without names still get cleaned up.
- Subscribe to the watch broadcast before snapshotting so events that fire
  between snapshot and subscribe aren't missed by new watchers.
- Apply exponential backoff to the Docker poll loop on consecutive
  failures, capping at 30s to avoid log floods on daemon outages.
- Reject --docker-tls-* flags when the gRPC endpoint is plaintext http://
  instead of silently discarding them.
- Add e2e:docker mise task and e2e/rust/e2e-docker.sh harness that boots a
  standalone gateway with the docker driver and runs the existing smoke
  test against it.
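A minimal sketch of the container-naming fix in the first bullet: truncate the human-readable sandbox name, never the id suffix, so two long names that share a prefix still map to distinct containers. The `openshell-` prefix, the 63-character cap, and the `container_name` helper are hypothetical; the commit does not state the real values.

```rust
/// Hypothetical name prefix and length cap; the PR does not state the real ones.
const PREFIX: &str = "openshell-";
const MAX_CONTAINER_NAME_LEN: usize = 63;

/// Build a container name that always ends in `-<sandbox_id>`.
/// Only the sandbox name portion is shortened when the result would be too long.
pub fn container_name(sandbox_name: &str, sandbox_id: &str) -> String {
    // Replace characters Docker rejects in container names with '-'.
    let sanitized: String = sandbox_name
        .chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || matches!(c, '_' | '.' | '-') {
                c
            } else {
                '-'
            }
        })
        .collect();

    let suffix = format!("-{sandbox_id}");
    // Room left for the name after reserving the prefix and the id suffix.
    let budget = MAX_CONTAINER_NAME_LEN.saturating_sub(PREFIX.len() + suffix.len());
    // `sanitized` is ASCII, so taking chars is the same as taking bytes.
    let truncated: String = sanitized.chars().take(budget).collect();

    format!("{PREFIX}{truncated}{suffix}")
}
```

Keeping the id intact is what makes the name unique; truncating from the end of the whole string would drop exactly the part that distinguishes sandboxes.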
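The subscribe-before-snapshot ordering can be illustrated with std channels. `WatchHub` and `SandboxEvent` are hypothetical stand-ins for the driver's broadcast and event types (the real code is presumably async); what carries over is the ordering argument in the `watch` comment.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

/// Hypothetical watch event: a sandbox id and its latest phase.
#[derive(Debug, Clone, PartialEq)]
pub struct SandboxEvent {
    pub id: String,
    pub phase: String,
}

/// Hypothetical stand-in for the driver's watch broadcast.
#[derive(Default)]
pub struct WatchHub {
    // Latest known phase per sandbox; BTreeMap keeps snapshots sorted by id.
    state: Mutex<BTreeMap<String, String>>,
    subscribers: Mutex<Vec<Sender<SandboxEvent>>>,
}

impl WatchHub {
    /// Record the event in `state` first, then fan it out to subscribers.
    pub fn publish(&self, event: SandboxEvent) {
        self.state
            .lock()
            .unwrap()
            .insert(event.id.clone(), event.phase.clone());
        // Drop subscribers whose receiver has gone away.
        self.subscribers
            .lock()
            .unwrap()
            .retain(|tx| tx.send(event.clone()).is_ok());
    }

    /// Subscribe before snapshotting. A concurrent event either reaches the new
    /// receiver, or it was fanned out before the subscription existed, in which
    /// case it was already written to `state` and appears in the snapshot. It may
    /// appear in both, so watchers should apply events as idempotent upserts.
    pub fn watch(&self) -> (Vec<SandboxEvent>, Receiver<SandboxEvent>) {
        let (tx, rx) = channel();
        self.subscribers.lock().unwrap().push(tx);
        let snapshot: Vec<SandboxEvent> = self
            .state
            .lock()
            .unwrap()
            .iter()
            .map(|(id, phase)| SandboxEvent { id: id.clone(), phase: phase.clone() })
            .collect();
        (snapshot, rx)
    }
}
```

Snapshotting first would open a window in which an event is neither in the snapshot nor delivered to the not-yet-created receiver.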
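A sketch of the capped exponential backoff for the Docker poll loop. It assumes a hypothetical 1 s base interval that doubles per consecutive failure; only the 30 s cap comes from the commit message.

```rust
use std::time::Duration;

/// Hypothetical normal poll interval; the PR only states the 30 s cap.
const POLL_INTERVAL: Duration = Duration::from_secs(1);
const MAX_BACKOFF: Duration = Duration::from_secs(30);

/// Delay before the next poll given the number of consecutive failures:
/// POLL_INTERVAL * 2^failures, capped at MAX_BACKOFF.
pub fn next_poll_delay(consecutive_failures: u32) -> Duration {
    // Clamp the exponent so the shift below cannot overflow a u32.
    let exp = consecutive_failures.min(16);
    POLL_INTERVAL.saturating_mul(1u32 << exp).min(MAX_BACKOFF)
}
```

The poll loop would reset its failure counter to zero after a successful poll, so a recovered daemon returns to the normal interval immediately, while a long outage logs at most one error every 30 s.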
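A sketch of the `--docker-tls-*` validation: fail at startup rather than silently ignore TLS material that a plaintext endpoint would never use. `DockerTlsArgs`, its `ca`/`cert`/`key` fields, and `validate_docker_tls` are hypothetical; the commit does not list the individual flag names.

```rust
use std::path::PathBuf;

/// Hypothetical parsed form of the `--docker-tls-*` options.
#[derive(Debug, Default)]
pub struct DockerTlsArgs {
    pub ca: Option<PathBuf>,
    pub cert: Option<PathBuf>,
    pub key: Option<PathBuf>,
}

impl DockerTlsArgs {
    /// True if the user passed any TLS option at all.
    fn any_set(&self) -> bool {
        self.ca.is_some() || self.cert.is_some() || self.key.is_some()
    }
}

/// Reject TLS options when the gRPC endpoint uses the plaintext http:// scheme.
pub fn validate_docker_tls(grpc_endpoint: &str, tls: &DockerTlsArgs) -> Result<(), String> {
    let plaintext = grpc_endpoint.to_ascii_lowercase().starts_with("http://");
    if plaintext && tls.any_set() {
        return Err(format!(
            "--docker-tls-* options were given but gRPC endpoint {grpc_endpoint} is plaintext; \
             use https:// or drop the TLS options"
        ));
    }
    Ok(())
}
```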
…nects

The docker compute driver mapped RUNNING containers to Ready=False with
reason DependenciesNotReady indefinitely, so sandboxes never reached the
Ready phase and ExecSandbox, which waits for readiness, only failed after
a 180s timeout.

Introduces a SupervisorReadiness trait that the driver polls on every
watch tick. The gateway's SupervisorSessionRegistry implements it via a
new is_connected(sandbox_id) method. When a ConnectSupervisor session is
live for a sandbox, the driver emits Ready=True with reason
SupervisorConnected; the condition falls back to DependenciesNotReady
if the supervisor disconnects.
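A minimal sketch of the readiness mapping described above. The trait name, `is_connected(sandbox_id)`, and the two reason strings come from the commit message; the exact signature, the `ReadyCondition` type, and the use of a `HashSet` as a stand-in for `SupervisorSessionRegistry` are assumptions.

```rust
use std::collections::HashSet;

/// Trait name from the commit message; the exact signature is an assumption.
pub trait SupervisorReadiness {
    /// True while a ConnectSupervisor session is live for `sandbox_id`.
    fn is_connected(&self, sandbox_id: &str) -> bool;
}

/// Test double standing in for SupervisorSessionRegistry:
/// the set of sandbox ids that currently have a live session.
impl SupervisorReadiness for HashSet<String> {
    fn is_connected(&self, sandbox_id: &str) -> bool {
        self.contains(sandbox_id)
    }
}

/// Hypothetical condition type mirroring the Ready status/reason pair.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ReadyCondition {
    pub status: bool,
    pub reason: &'static str,
}

/// Evaluated on every watch tick for a container Docker reports as running.
/// Other container states are out of scope for this sketch.
pub fn running_container_ready(
    sandbox_id: &str,
    readiness: &dyn SupervisorReadiness,
) -> ReadyCondition {
    if readiness.is_connected(sandbox_id) {
        ReadyCondition { status: true, reason: "SupervisorConnected" }
    } else {
        // Also the result again if the supervisor later disconnects.
        ReadyCondition { status: false, reason: "DependenciesNotReady" }
    }
}
```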

Also:
- Wires the registry through run_server/ServerState so the docker driver
  can be constructed before ServerState exists.
- Adds a host.openshell.internal / host.docker.internal SAN to the
  e2e-docker.sh generated mTLS cert so supervisor TLS handshakes succeed.
- Points the e2e harness at the community sandbox base image (which has
  the required 'sandbox' user) and preserves container logs on failure
  for post-mortem debugging.
- Passes mise run e2e:docker end to end.
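A sketch of why wiring the registry through `run_server` lets the driver exist before `ServerState`: build the registry first, give the driver a shared trait-object handle to it, then move both into the state. The struct fields, `register`, and `sandbox_ready` are hypothetical; only the type and function names come from the commit message.

```rust
use std::collections::HashSet;
use std::sync::{Arc, RwLock};

/// Trait name from the commit message; the exact signature is an assumption.
pub trait SupervisorReadiness: Send + Sync {
    fn is_connected(&self, sandbox_id: &str) -> bool;
}

/// Simplified registry: tracks sandbox ids with a live supervisor session.
#[derive(Default)]
pub struct SupervisorSessionRegistry {
    live: RwLock<HashSet<String>>,
}

impl SupervisorSessionRegistry {
    /// Hypothetical hook called when a ConnectSupervisor session opens.
    pub fn register(&self, sandbox_id: &str) {
        self.live.write().unwrap().insert(sandbox_id.to_owned());
    }
}

impl SupervisorReadiness for SupervisorSessionRegistry {
    fn is_connected(&self, sandbox_id: &str) -> bool {
        self.live.read().unwrap().contains(sandbox_id)
    }
}

/// Hypothetical driver: it needs only the readiness view, not all of ServerState.
pub struct DockerDriver {
    readiness: Arc<dyn SupervisorReadiness>,
}

impl DockerDriver {
    pub fn sandbox_ready(&self, sandbox_id: &str) -> bool {
        self.readiness.is_connected(sandbox_id)
    }
}

/// Heavily trimmed, hypothetical ServerState.
pub struct ServerState {
    pub sessions: Arc<SupervisorSessionRegistry>,
    pub driver: DockerDriver,
}

/// Registry first, then driver, then state: no reference cycle is needed
/// because the driver never holds ServerState.
pub fn run_server() -> ServerState {
    let sessions = Arc::new(SupervisorSessionRegistry::default());
    let readiness: Arc<dyn SupervisorReadiness> = sessions.clone();
    let driver = DockerDriver { readiness };
    ServerState { sessions, driver }
}
```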