fix(hermes): host-side chown PVC backing dirs after sync (#475) #481

Open. bussyjd wants to merge 2 commits into main from fix/hermes-pvc-perm-host-chown

@bussyjd bussyjd commented May 12, 2026

Summary

Closes #475.

The in-pod init-hermes-perms chown added in #446 (c066baa) silently no-ops on Linux k3d because the embedded k3d config sets KubeletInUserNamespace=true (internal/embed/k3d-config.yaml:31-37). With user-namespacing the pod's "root" maps to a host subuid that lacks chown authority over the host bind-mount path; the chown succeeds inside the namespace but has no effect on host-side ownership. The next init container, init-hermes-data, runs as UID 10000 and fails with mkdir /data/.hermes/home: Permission denied against a directory that local-path-provisioner created as 1000:1000.
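
To make the user-namespace point concrete, here is a minimal Go sketch (not code from this PR) of a heuristic that tells whether a process's "root" is the host's real root. It parses the kernel's `/proc/<pid>/uid_map` format, where the initial namespace shows the identity mapping `0 0 4294967295` and a namespaced process shows a subuid range instead. The function name `isInitialUserNS` is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// isInitialUserNS reports whether uidMap (the text of /proc/<pid>/uid_map)
// is the single identity line "0 0 4294967295" that the initial user
// namespace exposes. Anything else means UIDs are remapped. This is a
// heuristic for illustration, not a check the PR performs.
func isInitialUserNS(uidMap string) bool {
	lines := strings.Split(strings.TrimSpace(uidMap), "\n")
	if len(lines) != 1 {
		return false
	}
	f := strings.Fields(lines[0])
	return len(f) == 3 && f[0] == "0" && f[1] == "0" && f[2] == "4294967295"
}

func main() {
	data, err := os.ReadFile("/proc/self/uid_map")
	if err != nil {
		fmt.Println("uid_map not readable (non-Linux host?):", err)
		return
	}
	fmt.Println("initial user namespace:", isInitialUserNS(string(data)))
}
```

Run inside the init container, a check like this would be expected to report `false` on the affected stack, which is the situation in which the in-pod chown cannot change host-side ownership.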

This PR adds ensureHermesPVCOwnership, called from hermes.Sync after helmfile sync. It chowns the PVC backing dirs from outside the user namespace via docker exec into the k3d server container — running at the Docker daemon's real root, not the namespaced kubelet's. Reuses the existing fixRuntimeVolumeOwnership helper (already used for wallet keystores) plus a small kubectl wait for PVC binding and a conditional restart for stuck pods.
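
The host-side chown can be sketched as below. This is an illustration under stated assumptions, not the body of fixRuntimeVolumeOwnership: the helper name `hostChownArgs`, the container name, the in-container path, and the use of `-R` are all hypothetical. Only the `docker exec <container> chown <uid>:<gid> <path>` shape comes from the description above.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
)

// hostChownArgs builds the docker arguments for running chown inside the
// k3d server container, i.e. outside the kubelet's user namespace.
// The recursive flag is an assumption made for this sketch.
func hostChownArgs(container, path string, uid, gid int) []string {
	owner := strconv.Itoa(uid) + ":" + strconv.Itoa(gid)
	return []string{"exec", container, "chown", "-R", owner, path}
}

func main() {
	// Container name and path are placeholders, not values from the repo.
	args := hostChownArgs("k3d-example-server-0", "/example/hermes-data", 10000, 10000)
	cmd := exec.Command("docker", args...)
	// Print instead of running so the sketch has no side effects.
	fmt.Println(cmd.Args)
}
```

Building the argument list separately from executing it keeps the command shape unit-testable without a Docker daemon.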

Why not just patch local-path-provisioner?

We considered changing internal/embed/infrastructure/base/templates/local-path.yaml's setup script to chown to 10000:10000 instead of 1000:1000. That breaks OpenClaw, which runs as 1000:1000 (internal/openclaw/openclaw.go:801 already chowns to that). Two workloads with different runtime UIDs share the same StorageClass, so the chown has to live in each workload's deploy path, not in the provisioner.

Behavior

  • First sync after fresh bootstrap: PVCs go from "Pending → Bound" during helmfile sync; helper-pod sets ownership to 1000:1000; our chown immediately overwrites to 10000:10000; pod's init containers run cleanly on first try.
  • Subsequent syncs (e.g. obol model sync after obol model prefer): chown is idempotent. The conditional restart only fires if a pod is actually in Init:CrashLoopBackOff or Error, so obol model sync against a healthy stack does not gratuitously restart the agent.
  • Stuck-pod recovery: if a fresh sync races with kubelet (helper-pod runs, init-perms no-ops, init-data fails before our chown lands), we detect Init:CrashLoopBackOff via kubectl get pods … jsonpath=…state.waiting.reason and force a single pod delete. Kubelet recreates immediately rather than waiting up to ~5 min for exponential backoff.
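
The conditional restart described in the bullets above can be sketched as a small predicate over the jsonpath output. This is a hedged illustration: the function name `hasStuckInit` is hypothetical, and it assumes the query prints the init containers' waiting reasons as whitespace-separated tokens, which the PR text elides.

```go
package main

import (
	"fmt"
	"strings"
)

// hasStuckInit reports whether any waiting reason in jsonpathOut is
// CrashLoopBackOff. The input shape (whitespace-separated reasons) is an
// assumption of this sketch.
func hasStuckInit(jsonpathOut string) bool {
	for _, reason := range strings.Fields(jsonpathOut) {
		if reason == "CrashLoopBackOff" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasStuckInit("PodInitializing CrashLoopBackOff")) // true
	fmt.Println(hasStuckInit(""))                                 // false
}
```

An empty result (no waiting containers) maps to "not stuck", which is what keeps a routine sync from deleting a healthy pod.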

Live validation on spark2 (Linux ARM64, Ubuntu 24.04, NVIDIA GB10)

$ sudo chown -R 1000:1000 .workspace/data/hermes-obol-agent
$ ls -la .workspace/data/hermes-obol-agent
drwxr-xr-x 3 claude claude  ...  hermes-data
drwxrwx--- 2 claude claude  ...  remote-signer-keystores

$ obol kubectl -n hermes-obol-agent delete pod -l app.kubernetes.io/name=hermes
$ obol model sync
==> Reading model list from LiteLLM...
==> Syncing Hermes: hermes/obol-agent
  ✓ Running helmfile sync (2s)
  ✓ Hermes installed successfully!

$ ls -la .workspace/data/hermes-obol-agent
drwxr-xr-x 3  10000  10000 ...  hermes-data
drwxrwx--- 2  10000  10000 ...  remote-signer-keystores

$ obol kubectl -n hermes-obol-agent get pods
NAME                            READY   STATUS    RESTARTS   AGE
hermes-69f56c9f75-9dwh8         2/2     Running   0          28s

Before this PR the exact same teardown left hermes-* in Init:CrashLoopBackOff for ~16 minutes until manual sudo chown intervention.

Test plan

  • go test ./internal/hermes -count=1 — all existing tests pass, plus the new TestHermesPVCPaths pins the two host paths the helper chowns so a future rename of the PVCs or the namespace layout can't silently regress the fix.
  • go build ./... clean.
  • End-to-end on spark2 — see "Live validation" above. Pod reaches Running 2/2 in 28s with zero CrashLoopBackOff cycles, where the prior behavior was a 5-15 min stuck state.

Notes

  • Issue #475, "hermes init PVC perm-denied regression survives #446 (k3d local-path on Linux, both fresh and dev-mode)", also raised the option of wiring fsGroupChangePolicy: OnRootMismatch onto the pod spec. We did not take that path because fsGroup-style chowns are themselves performed by kubelet, which on this stack runs with KubeletInUserNamespace=true, so they are affected by the same namespacing as the in-pod chown. docker exec is the only path that bypasses it cleanly.
  • obol model sync is now a few-hundred-ms slower because of the kubectl wait calls (timeout 60s but typically finish in ~1s since the PVC is already Bound on subsequent syncs). If this matters to anyone, we can short-circuit the wait when both PVCs are already Bound. Left as a follow-up.
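
The follow-up mentioned in the last note could look roughly like this: read each PVC's `.status.phase` first and skip `kubectl wait` when everything is already Bound. This is a sketch of the idea only; `allBound` and the map-of-phases input are hypothetical and not part of this PR.

```go
package main

import "fmt"

// allBound reports whether every PVC in phases has phase "Bound".
// An empty map returns false so that "no information" never skips the wait.
func allBound(phases map[string]string) bool {
	for _, phase := range phases {
		if phase != "Bound" {
			return false
		}
	}
	return len(phases) > 0
}

func main() {
	phases := map[string]string{"hermes-data": "Bound"}
	if allBound(phases) {
		fmt.Println("skip kubectl wait")
	} else {
		fmt.Println("run kubectl wait")
	}
}
```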

Generated with Claude Code

The in-pod `init-hermes-perms` init container from #446 (c066baa) is
neutered on Linux k3d because the embedded k3d config sets
`KubeletInUserNamespace=true` (internal/embed/k3d-config.yaml). With
user-namespacing, the pod's "root" maps to a host subuid that lacks
chown authority over the host bind-mount path, so the in-pod
`chown -R 10000:10000 /data` silently no-ops. The next init container
(`init-hermes-data`) then fails with `mkdir /data/.hermes/home:
Permission denied` and the pod CrashLoopBackOffs.

local-path-provisioner's helper-pod sets the volume to 1000:1000
(internal/embed/infrastructure/base/templates/local-path.yaml), which
happens to suit OpenClaw but not Hermes (containerUID = 10000).

Fix: after `helmfile sync`, host-side chown the PVC backing dirs to
containerUID:containerGID by exec-ing into the k3d server container
via `docker exec`. That runs at the Docker daemon's real root and is
not subject to the user-namespacing that silently breaks the in-pod
attempt. The existing `fixRuntimeVolumeOwnership` helper already does
exactly this for wallet keystore paths; we wire it into the agent
deploy path via a new `ensureHermesPVCOwnership` that:

  1. Waits up to 60s for each PVC (`hermes-data`,
     `remote-signer-keystores`) to be Bound. local-path is
     WaitForFirstConsumer, so the host dir doesn't exist until the
     pod is scheduled.
  2. Chowns each backing dir.
  3. If a Hermes pod is currently stuck in Init:CrashLoopBackOff,
     deletes it so kubelet recreates immediately rather than after
     exponential backoff (~5 min worst case). Skips the delete when
     no pod is stuck so routine syncs (e.g. `obol model sync` after
     `obol model prefer`) do not gratuitously restart a healthy
     agent.

Called from `hermes.Sync` after `helmfile sync` succeeds, so every
Onboard / Setup / Sync call exercises it.
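
The three steps above can be sketched with the side effects injected as functions, so the ordering is testable without a cluster. This is an illustrative outline, not the actual ensureHermesPVCOwnership: the `steps` and `volume` types, `ensureOwnership`, and the example path are all hypothetical.

```go
package main

import "fmt"

// steps bundles the side effects (kubectl wait, docker exec chown,
// pod status query, pod delete) behind plain functions.
type steps struct {
	waitBound  func(pvc string) error
	chown      func(hostPath string) error
	stuck      func() (bool, error)
	deletePods func() error
}

// volume pairs a PVC name with its host backing directory.
type volume struct{ pvc, hostPath string }

// ensureOwnership waits for every PVC, chowns every backing dir, and
// deletes pods only when one is reported stuck.
func ensureOwnership(s steps, vols []volume) error {
	for _, v := range vols {
		if err := s.waitBound(v.pvc); err != nil {
			return fmt.Errorf("wait for PVC %s: %w", v.pvc, err)
		}
	}
	for _, v := range vols {
		if err := s.chown(v.hostPath); err != nil {
			return fmt.Errorf("chown %s: %w", v.hostPath, err)
		}
	}
	isStuck, err := s.stuck()
	if err != nil {
		return fmt.Errorf("check pod state: %w", err)
	}
	if !isStuck {
		return nil // healthy pod: leave it running
	}
	return s.deletePods()
}

func main() {
	var log []string
	s := steps{
		waitBound:  func(pvc string) error { log = append(log, "wait "+pvc); return nil },
		chown:      func(p string) error { log = append(log, "chown "+p); return nil },
		stuck:      func() (bool, error) { return false, nil },
		deletePods: func() error { log = append(log, "delete"); return nil },
	}
	err := ensureOwnership(s, []volume{{"hermes-data", "/example/hermes-data"}})
	fmt.Println(log, err)
}
```

With `stuck` returning false, the recorded log contains the wait and the chown but no delete, matching the "routine syncs do not restart a healthy agent" behavior.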

Validated on spark2 (Linux ARM64, Ubuntu 24.04, NVIDIA GB10) by
reverting the PVC dirs to 1000:1000, deleting the Hermes pod, and
running `obol model sync`. Before: pod stuck in Init:CrashLoopBackOff
with "Permission denied" in init-hermes-data logs. After: PVC dirs
flip to 10000:10000 within the sync, pod reaches `Running 2/2` in
~30s with zero CrashLoopBackOff cycles.

Test:
- `TestHermesPVCPaths` pins the two host paths the helper chowns, so
  renaming `hermes-data`/`remote-signer-keystores` or relocating the
  namespace prefix can't silently regress the fix.
- Full chown side-effect needs a live k3d cluster and is exercised
  by the spark2 validation above plus all existing integration
  flows; no unit-mocked k3d test added.

bussyjd force-pushed the fix/hermes-pvc-perm-host-chown branch from 8c10251 to 41653c4 on May 12, 2026 09:34

The first revision of ensureHermesPVCOwnership chowned BOTH PVCs in the
Hermes namespace — `hermes-data` AND `remote-signer-keystores` — to
containerUID:containerGID (10000:10000). That broke the remote-signer
pod on Linux k3d:

  remote-signer-7bc7c4759-nj8lx  0/1  CrashLoopBackOff
  {"level":"ERROR","message":"failed to load keystores",
   "error":"io error: Permission denied (os error 13)"}

The two PVCs have different UID/GID contracts:

  hermes-data              owned by the Hermes Deployment
                           runAsUser=10000, runAsGroup=10000, fsGroup=10000
                           ✓ chown to 10000:10000 is correct

  remote-signer-keystores  owned by the obol/remote-signer Helm release
                           runAsUser=65532, fsGroup=1000 (read-only mount)
                           ✗ chown to 10000:10000 makes /data/keystores
                             unreadable to the pod
                           ✓ the local-path-provisioner default of
                             1000:1000 already matches the fsGroup
                             contract, so leaving it untouched is the
                             safe behavior
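
The two contracts can be checked with the classic Unix owner/group/other rule. The sketch below is an illustration, not code from this PR: `canAccess` is hypothetical, the directory mode 0770 is taken from the `drwxrwx---` listing in the PR description, and it assumes the fsGroup (1000) appears among the remote-signer process's supplementary groups.

```go
package main

import "fmt"

// canAccess applies the Unix permission check for a non-root process:
// pick the owner, group, or other triplet and require every bit in want
// (for example 0o5 for read plus traverse on a directory).
func canAccess(mode uint32, ownerUID, ownerGID, uid int, groups []int, want uint32) bool {
	if uid == ownerUID {
		return (mode>>6)&want == want
	}
	for _, g := range groups {
		if g == ownerGID {
			return (mode>>3)&want == want
		}
	}
	return mode&want == want
}

func main() {
	const dirMode = 0o770 // drwxrwx--- from the listing
	signerGroups := []int{1000}

	// Provisioner default 1000:1000: group bits apply, access granted.
	fmt.Println(canAccess(dirMode, 1000, 1000, 65532, signerGroups, 0o5))
	// After chown to 10000:10000: "other" bits apply, access denied.
	fmt.Println(canAccess(dirMode, 10000, 10000, 65532, signerGroups, 0o5))
}
```

Under those assumptions the first call grants access and the second denies it, which is consistent with the "Permission denied (os error 13)" the remote-signer logged.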

Reproduced today on spark2 after the original fix for #475 landed: the
Hermes pod recovered correctly, but the remote-signer pod that had been
healthy on the prior boot now CrashLoopBackOff'd until the keystore
volume was manually chowned back to 1000:1000.

Changes:
- hermesPVCPaths: drop the remote-signer-keystores entry. Doc comment
  explains why the negative is intentional.
- ensureHermesPVCOwnership: drop remote-signer-keystores from the
  PVC-wait loop too — no reason to wait on a volume we no longer touch.
- TestHermesPVCPaths: tightened to assert the single Hermes path.
- TestHermesPVCPaths_ExcludesRemoteSignerKeystores: NEW negative
  guard so a future "re-include all PVCs in the namespace" refactor
  fails the test before it ships.