feat(server,driver-vm,e2e): gateway-owned readiness + VM compute driver e2e #901

Open
drew wants to merge 5 commits into main from drew/vm-driver-install-hangs-on-startup

@drew drew commented Apr 21, 2026

Summary

Makes the VM compute driver end-to-end path work on top of the supervisor-initiated relay in #867, and moves the authoritative "sandbox is Ready" transition from each compute driver onto the gateway. The smoke test against openshell-gateway --drivers vm (mise run e2e:vm) goes from hanging at 180s to passing in ~10s.

Changes

feat(server): promote sandbox phase on supervisor session connect

  • New SupervisorSessionObserver trait. SupervisorSessionRegistry invokes the observer on register / remove_if_current outside the internal mutex.
  • ComputeRuntime::install_supervisor_observer wires a ComputeSessionObserver bridge; the runtime holds a Weak<SupervisorSessionRegistry> to break the Arc cycle between registry and observer.
  • New mark_sandbox_session_connected / mark_sandbox_session_disconnected flip phase and rewrite the Ready condition with reason=SupervisorConnected / SupervisorDisconnected. Terminal states (Deleting, Error) are preserved.
  • Backfill path in apply_sandbox_update_locked handles the register-before-store race: if a driver snapshot arrives and the registry already holds a live session for that sandbox, phase is promoted on the spot.
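
The promotion flow in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the trait methods, the `Registry` and `Runtime` types, the `Provisioning` phase, and the choice to fall back to `Provisioning` on disconnect are all assumptions made for the example, and the observer is implemented directly on the runtime instead of going through a separate bridge type. What it tries to show is the observer being invoked after the registry lock is released, the `Weak` back-reference that avoids an `Arc` cycle, terminal phases being left alone, and the register-before-store backfill.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, Weak};

// Phase names other than Ready / Deleting / Error are assumptions for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Phase {
    Provisioning,
    Ready,
    Deleting,
    Error,
}

impl Phase {
    fn is_terminal(self) -> bool {
        matches!(self, Phase::Deleting | Phase::Error)
    }
}

// Hypothetical observer shape; the real SupervisorSessionObserver may differ.
trait SessionObserver: Send + Sync {
    fn session_connected(&self, sandbox_id: &str);
    fn session_disconnected(&self, sandbox_id: &str);
}

#[derive(Default)]
struct Registry {
    sessions: Mutex<HashMap<String, u64>>, // sandbox id -> live session id
    observer: Mutex<Option<Arc<dyn SessionObserver>>>,
}

impl Registry {
    fn observer(&self) -> Option<Arc<dyn SessionObserver>> {
        self.observer.lock().unwrap().clone()
    }

    fn has_session(&self, sandbox_id: &str) -> bool {
        self.sessions.lock().unwrap().contains_key(sandbox_id)
    }

    fn register(&self, sandbox_id: &str, session_id: u64) {
        // The guard is a temporary, so the sessions lock is released
        // before the observer runs.
        self.sessions.lock().unwrap().insert(sandbox_id.to_owned(), session_id);
        if let Some(obs) = self.observer() {
            obs.session_connected(sandbox_id);
        }
    }

    fn remove_if_current(&self, sandbox_id: &str, session_id: u64) -> bool {
        let removed = {
            let mut sessions = self.sessions.lock().unwrap();
            if sessions.get(sandbox_id) == Some(&session_id) {
                sessions.remove(sandbox_id);
                true
            } else {
                false // a newer session replaced this one; leave it alone
            }
        };
        if removed {
            if let Some(obs) = self.observer() {
                obs.session_disconnected(sandbox_id);
            }
        }
        removed
    }
}

struct Runtime {
    registry: Weak<Registry>, // Weak: the registry already holds an Arc to us
    phases: Mutex<HashMap<String, Phase>>,
}

impl Runtime {
    fn phase(&self, id: &str) -> Option<Phase> {
        self.phases.lock().unwrap().get(id).copied()
    }

    // Driver snapshot path: promote on the spot if a session is already live.
    fn apply_snapshot(&self, id: &str, reported: Phase) {
        let live = self.registry.upgrade().map_or(false, |r| r.has_session(id));
        let phase = if live && !reported.is_terminal() { Phase::Ready } else { reported };
        self.phases.lock().unwrap().insert(id.to_owned(), phase);
    }

    fn set_connected(&self, id: &str, connected: bool) {
        if let Some(p) = self.phases.lock().unwrap().get_mut(id) {
            if !p.is_terminal() {
                *p = if connected { Phase::Ready } else { Phase::Provisioning };
            }
        }
    }
}

impl SessionObserver for Runtime {
    fn session_connected(&self, id: &str) {
        self.set_connected(id, true);
    }
    fn session_disconnected(&self, id: &str) {
        self.set_connected(id, false);
    }
}

fn wire() -> (Arc<Registry>, Arc<Runtime>) {
    let registry = Arc::new(Registry::default());
    let runtime = Arc::new(Runtime {
        registry: Arc::downgrade(&registry),
        phases: Mutex::new(HashMap::new()),
    });
    *registry.observer.lock().unwrap() = Some(runtime.clone());
    (registry, runtime)
}
```

Because the callback runs with no registry lock held, the observer is free to call back into `has_session` without deadlocking, which is what the backfill path relies on.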

refactor(driver-vm): drop log-grep readiness; always run gvproxy

  • Delete guest_ssh_ready() and ready_condition(). The driver no longer owns Ready; monitor_sandbox only surfaces Error for launcher-process failures.
  • Critical fix: runtime.rs now starts gvproxy unconditionally. With the SSH port forward removed in feat(server,sandbox): supervisor-initiated SSH connect and exec over gRPC-multiplexed relay #867, port_map was empty by default, which skipped gvproxy startup entirely — leaving the guest with no eth0 and no route to the host gateway. The guest supervisor's ConnectSupervisor stream needs gvproxy to reach host.containers.internal (rewritten to 192.168.127.1 inside the guest).
  • Remove dead VmContext::set_port_map; mark the libkrun FFI binding #[allow(dead_code)].

e2e(vm): run smoke against openshell-gateway with the VM compute driver

  • Rewrite e2e/rust/e2e-vm.sh for the split-binary flow (former openshell-vm K8s-in-a-VM binary is gone).
  • Pin --driver-dir target/debug so the gateway picks up the freshly cargo-built driver rather than a stale ~/.local/libexec/openshell/openshell-driver-vm from a prior install-vm.sh run.
  • Anchor per-run state under /tmp (macOS AF_UNIX SUN_LEN is 104 bytes; worktree paths routinely blow it).
  • On failure, preserve the state dir and dump the gateway log + every sandbox's rootfs-console.log inline for post-mortem.
  • Drop build:docker:gateway and vm:build dependencies from tasks/test.toml's e2e:vm.

Testing

  • mise run pre-commit passes (lint + format + license headers clean; clippy warnings unchanged from baseline)
  • Unit tests added/updated
    • openshell-server lib: 255 pass (+8 compute promotion tests, +4 registry observer tests)
    • openshell-driver-vm lib: 17 pass
    • openshell-server integration (supervisor_relay_integration): 6 pass
  • E2E tests added/updated: mise run e2e:vm passes in ~10s, stable across back-to-back runs

Checklist

@drew drew self-assigned this Apr 21, 2026
@drew drew requested a review from a team as a code owner April 21, 2026 05:24

copy-pr-bot Bot commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Base automatically changed from feat/supervisor-session-grpc-data to main April 21, 2026 15:38
@drew drew force-pushed the drew/vm-driver-install-hangs-on-startup branch 2 times, most recently from bed7fa5 to 0c901d4 April 22, 2026 15:35
drew added 5 commits April 22, 2026 08:35
The VM driver no longer owns the Ready transition — the gateway-side
SupervisorSessionObserver now promotes sandboxes to Ready when their
supervisor session connects. Remove guest_ssh_ready() (a brittle
grep over the serial console) and the ready_condition() helper.
monitor_sandbox still watches the launcher child process and emits
Error conditions on ProcessExited / ProcessPollFailed.

Also always start gvproxy, not just when port_map is non-empty. With
the supervisor-initiated relay migration in #867, the SSH port forward
was dropped; that left port_map empty in the default path, which in
turn skipped gvproxy startup, which left the guest with no eth0 and
no route to the host gateway. The guest supervisor's outbound
ConnectSupervisor stream needs gvproxy to reach
host.containers.internal (rewritten to 192.168.127.1 inside the guest),
so gvproxy is structurally required for any sandbox that talks to
the gateway.
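
To make the failure mode concrete, here is a small sketch of the decision before and after this change. The types and function names are hypothetical and do not mirror runtime.rs; the only point is that an empty port map used to mean "no gvproxy at all", whereas the network backend is now started regardless of whether any forwards are requested.

```rust
/// Hypothetical host -> guest port mapping: (host_port, guest_port).
type PortMap = Vec<(u16, u16)>;

#[derive(Debug, PartialEq)]
enum GuestNetwork {
    /// No gvproxy: the guest has no eth0 and no route to the host gateway.
    None,
    /// gvproxy provides the guest's network; `forwards` may be empty.
    Gvproxy { forwards: PortMap },
}

/// Old behaviour (assumed shape): networking was a side effect of port forwarding.
fn plan_network_conditional(port_map: &[(u16, u16)]) -> GuestNetwork {
    if port_map.is_empty() {
        GuestNetwork::None
    } else {
        GuestNetwork::Gvproxy { forwards: port_map.to_vec() }
    }
}

/// New behaviour (assumed shape): gvproxy always runs.
fn plan_network_unconditional(port_map: &[(u16, u16)]) -> GuestNetwork {
    GuestNetwork::Gvproxy { forwards: port_map.to_vec() }
}
```

With an empty port map, the conditional form yields `GuestNetwork::None`, which corresponds to the hang described above, while the unconditional form still yields a `Gvproxy` network with no forwards.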

Inline the gvproxy setup into an unconditional block that returns
(guard, api_sock, forwarded_port_map), dropping the mutable plumbing
the prior conditional form needed. Remove the now-dead
VmContext::set_port_map wrapper; mark its libkrun FFI binding
#[allow(dead_code)] so a future reintroduction doesn't need to touch
the symbol table.

Rewrite e2e/rust/e2e-vm.sh for the split-binary flow (openshell-gateway
+ openshell-driver-vm) now that the former openshell-vm K8s-in-a-VM
binary is gone. The new flow:

  1. Stage the embedded VM runtime (libkrun + gvproxy + base rootfs)
     via mise run vm:setup and mise run vm:rootfs -- --base, both
     idempotent and run only when artifacts are missing.
  2. Build openshell-gateway, openshell-driver-vm, and the openshell
     CLI from the current workspace with cargo.
  3. On macOS, codesign the driver with the Hypervisor.framework
     entitlement so libkrun can start the microVM.
  4. Start the gateway with --drivers vm --disable-tls
     --disable-gateway-auth --db-url sqlite::memory:, pinning
     --driver-dir target/debug so the gateway picks up the freshly
     built driver rather than ~/.local/libexec/openshell from a
     prior install-vm.sh run.
  5. Wait for 'Server listening', run the cluster-agnostic Rust smoke
     test against OPENSHELL_GATEWAY_ENDPOINT=http://127.0.0.1:<port>,
     then SIGTERM the gateway.
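
The "wait for 'Server listening'" step in the list above is a poll-with-deadline over the gateway log. The actual script is shell; the following Rust sketch only illustrates the same idea, and the function name, poll interval, and whole-file re-read strategy are choices made for this example.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::thread;
use std::time::{Duration, Instant};

/// Polls `log` until it contains `marker` or `timeout` elapses.
/// Returns Ok(true) if the marker appeared, Ok(false) on timeout.
/// A log file that does not exist yet is treated as "not ready".
fn wait_for_marker(log: &Path, marker: &str, timeout: Duration) -> io::Result<bool> {
    let deadline = Instant::now() + timeout;
    loop {
        match fs::read_to_string(log) {
            Ok(contents) if contents.contains(marker) => return Ok(true),
            Ok(_) => {}
            Err(e) if e.kind() == io::ErrorKind::NotFound => {}
            // Any other I/O error (including non-UTF-8 content) is surfaced.
            Err(e) => return Err(e),
        }
        if Instant::now() >= deadline {
            return Ok(false);
        }
        thread::sleep(Duration::from_millis(50));
    }
}
```

Re-reading the whole file on every poll is wasteful for large logs but keeps the sketch short; a gateway log at startup is small enough that this is unlikely to matter.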

State paths root under /tmp rather than target/ because the VM
driver's compute-driver.sock lives under --vm-driver-state-dir; with
AF_UNIX SUN_LEN = 104 bytes on macOS (108 on Linux), worktree paths
under target/ routinely blow the limit.
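
A quick way to see why the state dir moved: the socket path, plus its trailing NUL, has to fit in the `sun_path` buffer. The check below is a sketch for illustration; the constant and function names are made up here, and the example paths are hypothetical.

```rust
use std::path::Path;

// sun_path buffer sizes quoted in the commit message above.
const SUN_PATH_MACOS: usize = 104;
const SUN_PATH_LINUX: usize = 108;

/// True if `path` plus a trailing NUL byte fits in a sun_path buffer
/// of `capacity` bytes.
fn fits_sun_path(path: &Path, capacity: usize) -> bool {
    path.as_os_str().len() < capacity
}
```

A short path such as `/tmp/openshell-e2e/compute-driver.sock` passes comfortably on both platforms, while a socket nested under a deep worktree `target/` directory can easily exceed 103 usable bytes on macOS.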

On failure, the trap preserves the per-run state dir plus dumps the
gateway log and every sandbox's rootfs-console.log inline so CI
artifacts capture post-mortem data.

Drop the former --vm-port / --vm-name reuse path entirely — the new
gateway is cheap to start (a few seconds, no k3s bootstrap) and that
reuse flow mapped to openshell-vm's StatefulSet rollout, which no
longer exists. Drop the build:docker:gateway and vm:build task
dependencies from tasks/test.toml's e2e:vm for the same reason.

With the SSH port forward removed in #867 and no other host→guest port
mappings in play, everything that configured gvproxy's port-forwarder
is dead weight. gvproxy stays because the VM still needs its virtual
NIC, DHCP server, and default router for guest egress, and because
the sandbox supervisor's per-sandbox netns (veth + iptables, see
openshell-sandbox/src/sandbox/linux/netns.rs) needs a real kernel
network stack inside the guest to branch off of — libkrun's built-in
TSI socket impersonation would not satisfy those primitives.

What changes:

* Drop the `-listen` API socket. No one calls
  `/services/forwarder/expose` on it any more.
* Pass `-ssh-port -1`. gvproxy's default 2222 SSH forward binds
  a host-side TCP listener that would race concurrent sandboxes
  and surface a misleading 'sshd is reachable' endpoint.
  `-1` is gvproxy's documented switch for 'no SSH forward'; see
  getForwardsMap in containers/gvisor-tap-vsock cmd/gvproxy/main.go.
* Remove VmLaunchConfig::port_map and the CLI --vm-port flag.
* Remove krun_set_port_map from the libkrun FFI bindings.
* Remove helpers that only made sense when we had a port map to
  manage: plan_gvproxy_ports, parse_port_mapping, expose_port_map,
  gvproxy_expose, pick_gvproxy_ssh_port, kill_stale_gvproxy_by_port,
  kill_stale_gvproxy_by_port_map, kill_gvproxy_pid, is_process_named,
  and the GUEST_SSH_PORT constant.
* Remove the four port-mapping unit tests.
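
As a rough picture of the resulting gvproxy invocation, the sketch below builds an argument list with no `-listen` API socket and with the SSH forward disabled. Only `-listen` and `-ssh-port -1` come from this commit message; the use of `-listen-vfkit` for the data socket, the function names, and the socket URI are assumptions for illustration and may not match what the driver actually passes.

```rust
/// Builds a gvproxy argument list with no API socket and no SSH forward.
/// `data_socket_uri` is the unixgram data socket (hypothetical parameter).
fn gvproxy_args(data_socket_uri: &str) -> Vec<String> {
    vec![
        // Assumption: the data socket is passed via -listen-vfkit.
        "-listen-vfkit".to_owned(),
        data_socket_uri.to_owned(),
        // -1 disables the default 2222 SSH forward, so no host TCP listener.
        "-ssh-port".to_owned(),
        "-1".to_owned(),
    ]
}

/// True if the argument list contains `-ssh-port -1`.
fn ssh_forward_disabled(args: &[String]) -> bool {
    args.windows(2).any(|w| w[0] == "-ssh-port" && w[1] == "-1")
}

/// True if the argument list requests the `-listen` API socket.
fn has_api_socket(args: &[String]) -> bool {
    args.iter().any(|a| a == "-listen")
}
```

Keeping the argument construction in one pure function makes it easy to assert in a unit test that neither a `-listen` socket nor an SSH forward can reappear by accident.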

Verified: after `sandbox create -- echo hi`, `lsof` shows gvproxy
opens zero TCP listeners; only its qemu/vfkit unixgram data socket
remains. E2E smoke still passes in ~10s.
@drew drew force-pushed the drew/vm-driver-install-hangs-on-startup branch from 0c901d4 to 9c30881 April 22, 2026 15:35