Skip to content

openshell-sandbox tracing to stdout on main/dev corrupts stdout payloads when Landlock is unavailable #902

@nam685

Description

@nam685

Agent Diagnostic

  • On April 21, 2026, main at SHA 9ac725f00d8ceaa0969a7fce21b8b09e0fa99396 still wires both shorthand tracing layers in crates/openshell-sandbox/src/main.rs to std::io::stdout().
  • I reproduced the stdout corruption directly against ghcr.io/nvidia/openshell/cluster:dev on macOS 15.3 / Docker Desktop 29.2.0 using a single-container command that runs openshell-sandbox and captures its stdout as a tar stream.
  • In that repro, the captured file starts with the Landlock WARN line instead of tar header bytes, and tar -tf fails with Unrecognized archive format.
  • Inside Docker Desktop's Linux VM, /sys/kernel/security/landlock is absent, and /app is absent in the image, so best-effort Landlock emits the unavailability warning during sandbox startup.
  • I did not reproduce this with ghcr.io/nvidia/openshell/cluster:latest (openshell-sandbox 0.0.33) using the same direct-container command. So this report is scoped to current main and cluster:dev; I am not claiming the April 20, 2026 v0.0.33 release image is affected.

Description

openshell-sandbox on main writes shorthand tracing to stdout. On Linux hosts where Landlock is unavailable, startup emits a WARN-level detection message from landlock::apply. Any code path that expects stdout to contain only payload bytes gets corrupted by that log line.

openshell sandbox download is the most obvious user-facing symptom because it streams a tar archive on stdout, but the underlying contract violation is broader: any SSH-exec or direct process path that treats stdout as a clean payload channel can be broken by sandbox tracing.

The direct repro below demonstrates the bug without involving any gateway or downstream integration.

Reproduction Steps

From an OpenShell checkout on a machine where Docker runs on a Linux kernel without Landlock (Docker Desktop on macOS reproduced this for me):

mkdir -p /tmp/openshell-stdout-repro

docker run --rm \
  --cap-add NET_ADMIN \
  --cap-add SYS_ADMIN \
  -v "$PWD/crates/openshell-sandbox/data/sandbox-policy.rego:/work/sandbox-policy.rego:ro" \
  -v "$PWD/examples/sandbox-policy-quickstart/policy.yaml:/work/policy.yaml:ro" \
  --entrypoint /bin/sh \
  ghcr.io/nvidia/openshell/cluster:dev \
  -lc '
    groupadd -g 998 sandbox >/dev/null 2>&1 || true
    useradd -u 998 -g sandbox -M -s /usr/sbin/nologin sandbox >/dev/null 2>&1 || true
    mkdir -p /tmp/artifacts
    printf hello >/tmp/artifacts/summary.txt
    OPENSHELL_POLICY_RULES=/work/sandbox-policy.rego \
    OPENSHELL_POLICY_DATA=/work/policy.yaml \
    OPENSHELL_LOG_LEVEL=warn \
    /opt/openshell/bin/openshell-sandbox -- \
      /bin/sh -lc "tar cf - -C /tmp artifacts"
  ' > /tmp/openshell-stdout-repro/artifacts.tar

tar -tf /tmp/openshell-stdout-repro/artifacts.tar

Observed result:

tar: Error opening archive: Unrecognized archive format

Optional inspection makes the corruption obvious:

xxd -g 1 -l 128 /tmp/openshell-stdout-repro/artifacts.tar

The first bytes are ANSI-colored WARN openshell_sandbox::sandbox::linux::landlock ... Landlock filesystem sandbox is UNAVAILABLE ..., not a tar header.

Expected Behavior

Tracing should go to stderr, leaving stdout as a clean payload channel. The tar stream produced by sandbox download or direct tar cf - ... style commands should remain valid regardless of whether Landlock is available.

Root Cause

On main, crates/openshell-sandbox/src/main.rs still has two OcsfShorthandLayer::new(std::io::stdout()) call sites. When landlock::apply falls back in best_effort mode, the warning is emitted through that subscriber onto fd 1, interleaving with command stdout.

Proposed Fix

Redirect both shorthand tracing layers in crates/openshell-sandbox/src/main.rs from std::io::stdout() to std::io::stderr(), and update the fallback warning text from stdout-only logging to stderr-only logging.

Checklist

  • I searched for existing issues before filing
  • I verified current main still uses stdout at the affected call sites
  • I reproduced the corruption against ghcr.io/nvidia/openshell/cluster:dev
  • I confirmed the repro does not require ellarun or a live OpenShell gateway

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions