fix(hermes): host-side chown PVC backing dirs after sync (#475) by bussyjd · Pull Request #481 · ObolNetwork/obol-stack

bussyjd · 2026-05-12T08:14:46Z

Summary

Closes #475.

The in-pod init-hermes-perms chown added in #446 (c066baa) silently no-ops on Linux k3d because the embedded k3d config sets KubeletInUserNamespace=true (internal/embed/k3d-config.yaml:31-37). With user-namespacing the pod's "root" maps to a host subuid that lacks chown authority over the host bind-mount path; the chown succeeds inside the namespace but has no effect on host-side ownership. The next init container, init-hermes-data, runs as UID 10000 and fails with mkdir /data/.hermes/home: Permission denied against a directory that local-path-provisioner created as 1000:1000.

This PR adds ensureHermesPVCOwnership, called from hermes.Sync after helmfile sync. It chowns the PVC backing dirs from outside the user namespace via docker exec into the k3d server container — running at the Docker daemon's real root, not the namespaced kubelet's. Reuses the existing fixRuntimeVolumeOwnership helper (already used for wallet keystores) plus a small kubectl wait for PVC binding and a conditional restart for stuck pods.

Why not just patch local-path-provisioner?

We considered changing internal/embed/infrastructure/base/templates/local-path.yaml's setup script to chown to 10000:10000 instead of 1000:1000. That breaks OpenClaw, which runs as 1000:1000 (internal/openclaw/openclaw.go:801 already chowns to that). Two workloads with different runtime UIDs share the same StorageClass, so the chown has to live in each workload's deploy path, not in the provisioner.

Behavior

First sync after fresh bootstrap: PVCs go from "Pending → Bound" during helmfile sync; helper-pod sets ownership to 1000:1000; our chown immediately overwrites to 10000:10000; pod's init containers run cleanly on first try.
Subsequent syncs (e.g. obol model sync after obol model prefer): chown is idempotent. The conditional restart only fires if a pod is actually in Init:CrashLoopBackOff or Error — obol model sync against a healthy stack does not gratuitously restart the agent.
Stuck-pod recovery: if a fresh sync races with kubelet (helper-pod runs, init-perms no-ops, init-data fails before our chown lands), we detect Init:CrashLoopBackOff via kubectl get pods … jsonpath=…state.waiting.reason and force a single pod delete. Kubelet recreates immediately rather than waiting up to ~5 min for exponential backoff.

Live validation on spark2 (Linux ARM64, Ubuntu 24.04, NVIDIA GB10)

$ sudo chown -R 1000:1000 .workspace/data/hermes-obol-agent
$ ls -la .workspace/data/hermes-obol-agent
drwxr-xr-x 3 claude claude  ...  hermes-data
drwxrwx--- 2 claude claude  ...  remote-signer-keystores

$ obol kubectl -n hermes-obol-agent delete pod -l app.kubernetes.io/name=hermes
$ obol model sync
==> Reading model list from LiteLLM...
==> Syncing Hermes: hermes/obol-agent
  ✓ Running helmfile sync (2s)
  ✓ Hermes installed successfully!

$ ls -la .workspace/data/hermes-obol-agent
drwxr-xr-x 3  10000  10000 ...  hermes-data
drwxrwx--- 2  10000  10000 ...  remote-signer-keystores

$ obol kubectl -n hermes-obol-agent get pods
NAME                            READY   STATUS    RESTARTS   AGE
hermes-69f56c9f75-9dwh8         2/2     Running   0          28s

Before this PR the exact same teardown left hermes-* in Init:CrashLoopBackOff for ~16 minutes until manual sudo chown intervention.

Test plan

go test ./internal/hermes -count=1 — all existing tests pass, plus the new TestHermesPVCPaths pins the two host paths the helper chowns so a future rename of the PVCs or the namespace layout can't silently regress the fix.
go build ./... clean.
End-to-end on spark2 — see "Live validation" above. Pod reaches Running 2/2 in 28s with zero CrashLoopBackOff cycles, where the prior behavior was a 5-15 min stuck state.

Notes

Issue hermes init PVC perm-denied regression survives #446 (k3d local-path on Linux, both fresh and dev-mode) #475 also raised the option of wiring fsGroupChangePolicy: OnRootMismatch onto the pod spec. We did not take that path because fsGroup-style chowns are themselves performed by kubelet, which on this stack runs with KubeletInUserNamespace=true — they're affected by the same namespacing the in-pod chown is. docker exec is the only path that bypasses it cleanly.
obol model sync is now a few-hundred-ms slower because of the kubectl wait calls (timeout 60s but typically finish in ~1s since the PVC is already Bound on subsequent syncs). If this matters to anyone, we can short-circuit the wait when both PVCs are already Bound. Left as a follow-up.

Generated with Claude Code

The in-pod `init-hermes-perms` init container from #446 (c066baa) is neutered on Linux k3d because the embedded k3d config sets `KubeletInUserNamespace=true` (internal/embed/k3d-config.yaml). With user-namespacing, the pod's "root" maps to a host subuid that lacks chown authority over the host bind-mount path, so the in-pod `chown -R 10000:10000 /data` silently no-ops. The next init container (`init-hermes-data`) then fails with `mkdir /data/.hermes/home: Permission denied` and the pod CrashLoopBackOffs. local-path-provisioner's helper-pod sets the volume to 1000:1000 (internal/embed/infrastructure/base/templates/local-path.yaml), which happens to suit OpenClaw but not Hermes (containerUID = 10000). Fix: after `helmfile sync`, host-side chown the PVC backing dirs to containerUID:containerGID by exec-ing into the k3d server container via `docker exec`. That runs at the Docker daemon's real root and is not subject to the user-namespacing that silently breaks the in-pod attempt. The existing `fixRuntimeVolumeOwnership` helper already does exactly this for wallet keystore paths; we wire it into the agent deploy path via a new `ensureHermesPVCOwnership` that: 1. Waits up to 60s for each PVC (`hermes-data`, `remote-signer- keystores`) to be Bound — local-path is WaitForFirstConsumer so the host dir doesn't exist until the pod is scheduled. 2. Chowns each backing dir. 3. If a Hermes pod is currently stuck in Init:CrashLoopBackOff, deletes it so kubelet recreates immediately rather than after exponential backoff (~5 min worst case). Skips the delete when no pod is stuck so routine syncs (e.g. `obol model sync` after `obol model prefer`) do not gratuitously restart a healthy agent. Called from `hermes.Sync` after `helmfile sync` succeeds, so every Onboard / Setup / Sync call exercises it. Validated on spark2 (Linux ARM64, Ubuntu 24.04, NVIDIA GB10) by reverting the PVC dirs to 1000:1000, deleting the Hermes pod, and running `obol model sync`. Before: pod stuck in Init:CrashLoopBackOff with "Permission denied" in init-hermes-data logs. After: PVC dirs flip to 10000:10000 within the sync, pod reaches `Running 2/2` in ~30s with zero CrashLoopBackOff cycles. Test: - `TestHermesPVCPaths` pins the two host paths the helper chowns, so renaming `hermes-data`/`remote-signer-keystores` or relocating the namespace prefix can't silently regress the fix. - Full chown side-effect needs a live k3d cluster and is exercised by the spark2 validation above plus all existing integration flows; no unit-mocked k3d test added.

The first revision of ensureHermesPVCOwnership chowned BOTH PVCs in the Hermes namespace — `hermes-data` AND `remote-signer-keystores` — to containerUID:containerGID (10000:10000). That broke the remote-signer pod on Linux k3d: remote-signer-7bc7c4759-nj8lx 0/1 CrashLoopBackOff {"level":"ERROR","message":"failed to load keystores", "error":"io error: Permission denied (os error 13)"} The two PVCs have different UID/GID contracts: hermes-data owned by the Hermes Deployment runAsUser=10000, runAsGroup=10000, fsGroup=10000 ✓ chown to 10000:10000 is correct remote-signer-keystores owned by the obol/remote-signer Helm release runAsUser=65532, fsGroup=1000 (read-only mount) ✗ chown to 10000:10000 makes /data/keystores unreadable to the pod ✓ the local-path-provisioner default of 1000:1000 already matches the fsGroup contract, so leaving it untouched is the safe behavior Reproduced today on spark2 after the original PR #475 fix landed: the Hermes pod recovered correctly, but the remote-signer pod that had been healthy on the prior boot now CrashLoopBackOff'd until the keystore volume was manually chowned back to 1000:1000. Changes: - hermesPVCPaths: drop the remote-signer-keystores entry. Doc comment explains why the negative is intentional. - ensureHermesPVCOwnership: drop remote-signer-keystores from the PVC-wait loop too — no reason to wait on a volume we no longer touch. - TestHermesPVCPaths: tightened to assert the single Hermes path. - TestHermesPVCPaths_ExcludesRemoteSignerKeystores: NEW negative guard so a future "re-include all PVCs in the namespace" refactor fails the test before it ships.

bussyjd force-pushed the fix/hermes-pvc-perm-host-chown branch from 8c10251 to 41653c4 Compare May 12, 2026 09:34

bussyjd mentioned this pull request May 12, 2026

feat(stack): paid services survive stack down/up (sell-inference + sell-http resume + storefront fixes) #487

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(hermes): host-side chown PVC backing dirs after sync (#475)#481

fix(hermes): host-side chown PVC backing dirs after sync (#475)#481
bussyjd wants to merge 2 commits into
mainfrom
fix/hermes-pvc-perm-host-chown

bussyjd commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bussyjd commented May 12, 2026

Summary

Why not just patch local-path-provisioner?

Behavior

Live validation on spark2 (Linux ARM64, Ubuntu 24.04, NVIDIA GB10)

Test plan

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant