fix(hermes): host-side chown PVC backing dirs after sync (#475)#481
Open
bussyjd wants to merge 2 commits into
Open
fix(hermes): host-side chown PVC backing dirs after sync (#475)#481bussyjd wants to merge 2 commits into
bussyjd wants to merge 2 commits into
Conversation
The in-pod `init-hermes-perms` init container from #446 (c066baa) is neutered on Linux k3d because the embedded k3d config sets `KubeletInUserNamespace=true` (internal/embed/k3d-config.yaml). With user-namespacing, the pod's "root" maps to a host subuid that lacks chown authority over the host bind-mount path, so the in-pod `chown -R 10000:10000 /data` silently no-ops. The next init container (`init-hermes-data`) then fails with `mkdir /data/.hermes/home: Permission denied` and the pod CrashLoopBackOffs. local-path-provisioner's helper-pod sets the volume to 1000:1000 (internal/embed/infrastructure/base/templates/local-path.yaml), which happens to suit OpenClaw but not Hermes (containerUID = 10000). Fix: after `helmfile sync`, host-side chown the PVC backing dirs to containerUID:containerGID by exec-ing into the k3d server container via `docker exec`. That runs at the Docker daemon's real root and is not subject to the user-namespacing that silently breaks the in-pod attempt. The existing `fixRuntimeVolumeOwnership` helper already does exactly this for wallet keystore paths; we wire it into the agent deploy path via a new `ensureHermesPVCOwnership` that: 1. Waits up to 60s for each PVC (`hermes-data`, `remote-signer- keystores`) to be Bound — local-path is WaitForFirstConsumer so the host dir doesn't exist until the pod is scheduled. 2. Chowns each backing dir. 3. If a Hermes pod is currently stuck in Init:CrashLoopBackOff, deletes it so kubelet recreates immediately rather than after exponential backoff (~5 min worst case). Skips the delete when no pod is stuck so routine syncs (e.g. `obol model sync` after `obol model prefer`) do not gratuitously restart a healthy agent. Called from `hermes.Sync` after `helmfile sync` succeeds, so every Onboard / Setup / Sync call exercises it. Validated on spark2 (Linux ARM64, Ubuntu 24.04, NVIDIA GB10) by reverting the PVC dirs to 1000:1000, deleting the Hermes pod, and running `obol model sync`. Before: pod stuck in Init:CrashLoopBackOff with "Permission denied" in init-hermes-data logs. After: PVC dirs flip to 10000:10000 within the sync, pod reaches `Running 2/2` in ~30s with zero CrashLoopBackOff cycles. Test: - `TestHermesPVCPaths` pins the two host paths the helper chowns, so renaming `hermes-data`/`remote-signer-keystores` or relocating the namespace prefix can't silently regress the fix. - Full chown side-effect needs a live k3d cluster and is exercised by the spark2 validation above plus all existing integration flows; no unit-mocked k3d test added.
8c10251 to
41653c4
Compare
The first revision of ensureHermesPVCOwnership chowned BOTH PVCs in the
Hermes namespace — `hermes-data` AND `remote-signer-keystores` — to
containerUID:containerGID (10000:10000). That broke the remote-signer
pod on Linux k3d:
remote-signer-7bc7c4759-nj8lx 0/1 CrashLoopBackOff
{"level":"ERROR","message":"failed to load keystores",
"error":"io error: Permission denied (os error 13)"}
The two PVCs have different UID/GID contracts:
hermes-data owned by the Hermes Deployment
runAsUser=10000, runAsGroup=10000, fsGroup=10000
✓ chown to 10000:10000 is correct
remote-signer-keystores owned by the obol/remote-signer Helm release
runAsUser=65532, fsGroup=1000 (read-only mount)
✗ chown to 10000:10000 makes /data/keystores
unreadable to the pod
✓ the local-path-provisioner default of
1000:1000 already matches the fsGroup
contract, so leaving it untouched is the
safe behavior
Reproduced today on spark2 after the original PR #475 fix landed: the
Hermes pod recovered correctly, but the remote-signer pod that had been
healthy on the prior boot now CrashLoopBackOff'd until the keystore
volume was manually chowned back to 1000:1000.
Changes:
- hermesPVCPaths: drop the remote-signer-keystores entry. Doc comment
explains why the negative is intentional.
- ensureHermesPVCOwnership: drop remote-signer-keystores from the
PVC-wait loop too — no reason to wait on a volume we no longer touch.
- TestHermesPVCPaths: tightened to assert the single Hermes path.
- TestHermesPVCPaths_ExcludesRemoteSignerKeystores: NEW negative
guard so a future "re-include all PVCs in the namespace" refactor
fails the test before it ships.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #475.
The in-pod
init-hermes-permschown added in #446 (c066baa) silently no-ops on Linux k3d because the embedded k3d config setsKubeletInUserNamespace=true(internal/embed/k3d-config.yaml:31-37). With user-namespacing the pod's "root" maps to a host subuid that lacks chown authority over the host bind-mount path; the chown succeeds inside the namespace but has no effect on host-side ownership. The next init container,init-hermes-data, runs as UID 10000 and fails withmkdir /data/.hermes/home: Permission deniedagainst a directory that local-path-provisioner created as 1000:1000.This PR adds
ensureHermesPVCOwnership, called fromhermes.Syncafterhelmfile sync. It chowns the PVC backing dirs from outside the user namespace viadocker execinto the k3d server container — running at the Docker daemon's real root, not the namespaced kubelet's. Reuses the existingfixRuntimeVolumeOwnershiphelper (already used for wallet keystores) plus a smallkubectl waitfor PVC binding and a conditional restart for stuck pods.Why not just patch local-path-provisioner?
We considered changing
internal/embed/infrastructure/base/templates/local-path.yaml's setup script to chown to 10000:10000 instead of 1000:1000. That breaks OpenClaw, which runs as 1000:1000 (internal/openclaw/openclaw.go:801already chowns to that). Two workloads with different runtime UIDs share the same StorageClass, so the chown has to live in each workload's deploy path, not in the provisioner.Behavior
helmfile sync; helper-pod sets ownership to 1000:1000; our chown immediately overwrites to 10000:10000; pod's init containers run cleanly on first try.obol model syncafterobol model prefer): chown is idempotent. The conditional restart only fires if a pod is actually inInit:CrashLoopBackOfforError—obol model syncagainst a healthy stack does not gratuitously restart the agent.Init:CrashLoopBackOffviakubectl get pods … jsonpath=…state.waiting.reasonand force a single pod delete. Kubelet recreates immediately rather than waiting up to ~5 min for exponential backoff.Live validation on spark2 (Linux ARM64, Ubuntu 24.04, NVIDIA GB10)
Before this PR the exact same teardown left
hermes-*inInit:CrashLoopBackOfffor ~16 minutes until manualsudo chownintervention.Test plan
go test ./internal/hermes -count=1— all existing tests pass, plus the newTestHermesPVCPathspins the two host paths the helper chowns so a future rename of the PVCs or the namespace layout can't silently regress the fix.go build ./...clean.Running 2/2in 28s with zero CrashLoopBackOff cycles, where the prior behavior was a 5-15 min stuck state.Notes
fsGroupChangePolicy: OnRootMismatchonto the pod spec. We did not take that path becausefsGroup-style chowns are themselves performed by kubelet, which on this stack runs withKubeletInUserNamespace=true— they're affected by the same namespacing the in-pod chown is.docker execis the only path that bypasses it cleanly.obol model syncis now a few-hundred-ms slower because of thekubectl waitcalls (timeout 60s but typically finish in ~1s since the PVC is already Bound on subsequent syncs). If this matters to anyone, we can short-circuit the wait when both PVCs are alreadyBound. Left as a follow-up.Generated with Claude Code