Replace hardcoded init.krun with generic virtual file overlay #673

Draft
mtjhrc wants to merge 13 commits into containers:main from mtjhrc:virtual-inodes-v1

Collaborator

@mtjhrc mtjhrc commented May 11, 2026

This PR replaces the hardcoded init.krun handling in the virtiofs passthrough backends with a generic virtual-files overlay (AugmentFs).

This introduces 2 new filesystem trait implementations:

  • AugmentFs<T>, a wrapper that intercepts FUSE operations for virtual inodes (synthetic read-only files and directories backed by static data). It also handles our custom ioctls.
  • NullFs, a minimal FileSystem impl with just an empty root directory, used when no host directory is needed.

init.krun is now registered as an ordinary virtual file from the API layer. As a bonus, you can even inject .krun_config.json as a virtual file.
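To illustrate the layering, here is a minimal sketch of a wrapper that serves static virtual files and delegates everything else to an inner backend. The one-method `FileSystem` trait, `read_root_file`, `add_virtual_file` and `null_root_with_init` are hypothetical simplifications for this example; the real FUSE `FileSystem` trait in libkrun has many more operations and works on inodes and handles.

```rust
use std::collections::HashMap;

/// Toy stand-in for the FUSE `FileSystem` trait (assumption: one operation only).
trait FileSystem {
    /// Return the contents of `name` in the root directory, if present.
    fn read_root_file(&self, name: &str) -> Option<Vec<u8>>;
}

/// Backend with an empty root directory: every lookup misses.
struct NullFs;

impl FileSystem for NullFs {
    fn read_root_file(&self, _name: &str) -> Option<Vec<u8>> {
        None
    }
}

/// Overlay that answers for virtual files and delegates the rest to `inner`.
struct AugmentFs<T: FileSystem> {
    inner: T,
    virtual_files: HashMap<String, &'static [u8]>,
}

impl<T: FileSystem> AugmentFs<T> {
    fn new(inner: T) -> Self {
        Self { inner, virtual_files: HashMap::new() }
    }

    /// Register a read-only file backed by static data (hypothetical helper).
    fn add_virtual_file(&mut self, name: &str, data: &'static [u8]) {
        self.virtual_files.insert(name.to_string(), data);
    }
}

impl<T: FileSystem> FileSystem for AugmentFs<T> {
    fn read_root_file(&self, name: &str) -> Option<Vec<u8>> {
        match self.virtual_files.get(name) {
            // Virtual file: served from memory, the inner backend is never asked.
            Some(data) => Some(data.to_vec()),
            // Anything else falls through to the wrapped filesystem.
            None => self.inner.read_root_file(name),
        }
    }
}

/// A root with no host directory at all, containing only a placeholder init.
fn null_root_with_init() -> AugmentFs<NullFs> {
    let mut fs = AugmentFs::new(NullFs);
    fs.add_virtual_file("init.krun", b"fake init");
    fs
}
```

Because the overlay is generic over the inner backend, the same wrapper could sit on top of a passthrough backend or on top of NullFs without either backend knowing about init.krun.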

krun_set_root_disk_remount() is reimplemented via NullFs + AugmentFs (see #551 (comment)).

The public API is still mostly compatible. There are minor differences: for example, init.krun disappears after it has been looked up once.

API-breaking changes, such as making krun_disable_implicit_init() and the other disable_implicit_* behaviours the default, will come in a follow-up PR.

The init binary is now in its own init-blob crate. The direction for #634 (2.0 API) is to invert the dependency: init-blob would depend on libkrun's overlay APIs to inject itself, rather than libkrun depending on a specific init.

This supersedes #593 by @ggoodman, which tackled the same problem of decoupling init from the fs backends. This PR takes that idea further by removing awareness of init from the filesystem layer entirely - it's just another virtual file. #593 also introduced InitPolicy startup validation - how that fits into the 2.0 API (#634) with different payload types is still an open question.

Known limitations / future work:

  • Virtual inodes don't appear in readdir (pre-existing — init.krun was also lookup-only)
  • The EXPORT_FD ioctl (GPU cross-domain shared memory) remains in passthrough for now
  • No DAX (setupmapping) for virtual files on macOS (pre-existing — init.krun never had DAX on macOS either)
  • DAX setupmapping/removemapping layering: the overlay creates mappings but the inner passthrough tears them down (works correctly but is architecturally messy)

mtjhrc added 13 commits May 11, 2026 17:48

Move the init binary build script and include_bytes!() from the
devices crate into a new init-blob crate. The passthrough modules
reference the binary as init_blob::INIT_BINARY instead of using
include_bytes! directly.

build.rs based on code from containers#593.
Co-authored-by: Geoffrey Goodman <geoff@goodman.dev>

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>

Replace the private next_inode AtomicU64 inside PassthroughFs with a
shared InodeAllocator that is passed in at construction. This lets
multiple layers (e.g. a future virtual-inode overlay) allocate from
the same counter without implicit coordination via reserved ranges.

PassthroughFs::new() and PassthroughFsRo::new() now take an
Arc<InodeAllocator> parameter. FsWorker::new() creates the allocator
and passes it through.

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
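A minimal sketch of the shared-counter idea described in the commit above. The type name `InodeAllocator` comes from the commit message; the `new(first)` and `alloc()` methods and the choice of `Ordering::Relaxed` are assumptions made for illustration, not the PR's actual code.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hands out unique inode numbers to every layer holding a reference.
struct InodeAllocator {
    next: AtomicU64,
}

impl InodeAllocator {
    /// `first` is the first inode number to hand out (hypothetical parameter).
    fn new(first: u64) -> Self {
        Self { next: AtomicU64::new(first) }
    }

    /// Atomically reserve the next inode number.
    fn alloc(&self) -> u64 {
        // fetch_add returns the previous value, so no two callers see the same number.
        self.next.fetch_add(1, Ordering::Relaxed)
    }
}

/// Two layers sharing one counter; neither needs a reserved range.
struct PassthroughLayer {
    inodes: Arc<InodeAllocator>,
}

struct OverlayLayer {
    inodes: Arc<InodeAllocator>,
}
```

Each layer clones the `Arc` at construction time, so inode numbers allocated by the passthrough and by the overlay come from one sequence and cannot collide.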
Introduce AugmentFs<T>, a generic overlay that wraps any FileSystem
implementation and intercepts FUSE operations for virtual inodes —
synthetic read-only files backed by static data. One-shot files
can only be looked up once.

The overlay uses the shared InodeAllocator to assign inode numbers,
so virtual and passthrough inodes never collide.

Remove all init.krun special-case code (init_inode, init_handle,
INIT_CSTR, init_payload) from both the Linux and macOS passthrough
implementations. The init.krun virtual file is now configured via
VirtualEntry in the krun API layer and handled generically by the
overlay.

FsDeviceConfig carries a Vec<VirtualEntry> and FsWorker wraps
AugmentFs<PassthroughFs> / AugmentFs<PassthroughFsRo>.

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
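The one-shot behaviour can be sketched as a name table whose one-shot entries are removed on first lookup. `VirtualEntry` is named in the commit message, but its fields here (`data`, `one_shot`) and the `VirtualTable` type are assumptions for this example; the real overlay also has to keep already-opened inodes alive, which this sketch ignores.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// A virtual file: static contents plus a one-shot flag (assumed layout).
struct VirtualEntry {
    data: &'static [u8],
    one_shot: bool,
}

/// Name table for the root directory of the overlay (hypothetical type).
struct VirtualTable {
    entries: Mutex<HashMap<String, VirtualEntry>>,
}

impl VirtualTable {
    fn new() -> Self {
        Self { entries: Mutex::new(HashMap::new()) }
    }

    fn add(&self, name: &str, data: &'static [u8], one_shot: bool) {
        self.entries
            .lock()
            .unwrap()
            .insert(name.to_string(), VirtualEntry { data, one_shot });
    }

    /// Look up `name`. One-shot entries are removed from the table,
    /// so a second lookup of the same name misses.
    fn lookup(&self, name: &str) -> Option<&'static [u8]> {
        let mut entries = self.entries.lock().unwrap();
        let one_shot = entries.get(name)?.one_shot;
        if one_shot {
            entries.remove(name).map(|e| e.data)
        } else {
            entries.get(name).map(|e| e.data)
        }
    }
}
```

With this shape, a file such as init.krun can be registered as one-shot and a file such as marker.txt as persistent; only the latter stays visible to later lookups.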
Add API to prevent the default init binary (/init.krun) from being
injected into the root filesystem. Follows the existing
krun_disable_implicit_{console,vsock} pattern.

Must be called before krun_set_root().

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>

Add C API to inject arbitrary virtual files into a virtiofs device.
The file appears in the root directory of the specified mount and is
backed entirely by host memory. Supports one-shot semantics (the file can only be
looked up once).

The data pointer follows the same lifetime contract as other krun
APIs: the caller must keep the memory valid until krun_start_enter()
returns.

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>

Add API to retrieve the built-in default init binary. Callers that
use krun_disable_implicit_init() can use this to obtain the init
binary and inject it themselves via krun_fs_add_overlay_file().

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>

NullFs implements the FileSystem trait with just an empty root
directory. It can be wrapped with AugmentFs to serve virtual
files without any host directory involvement.

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>

krun_set_root_disk_remount no longer creates a temporary empty host
directory. Instead it configures a NullFs-backed virtiofs device
(shared_dir: None) with init.krun overlaid via AugmentFs.

Fs::new() now accepts Option<String> for shared_dir — None selects
NullFs. FsDeviceConfig and FsServer gain the corresponding variants.

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
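The `Option<String>` convention from the commit above could look roughly like this. The `Backend` enum and `select_backend` function are hypothetical names for this sketch; the commit only states that `None` selects NullFs.

```rust
/// Which filesystem backend sits under the overlay (hypothetical enum).
#[derive(Debug, PartialEq)]
enum Backend {
    /// Host directory exposed through the passthrough backend.
    Passthrough { shared_dir: String },
    /// No host directory: empty root served by NullFs.
    Null,
}

/// Map the optional shared directory to a backend choice.
fn select_backend(shared_dir: Option<String>) -> Backend {
    match shared_dir {
        Some(dir) => Backend::Passthrough { shared_dir: dir },
        None => Backend::Null,
    }
}
```

Encoding "no host directory" as `None` avoids a sentinel path and means the temporary empty directory is never created in the first place.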
The temporary root directory hack is gone (replaced by NullFs), so
the ioctl that cleaned it up and the config flag that gated it are
no longer needed.

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>

The exit-code ioctl is a krun mechanism, not a filesystem operation.
Move it to the AugmentFs where it is handled before any delegation
to the inner filesystem.

The Linux passthrough retains only EXPORT_FD (which needs access to
passthrough-internal handle and export tables). The macOS passthrough
no longer implements ioctl at all (the trait default returns ENOSYS
for any cmd that reaches it).

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
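A sketch of the dispatch order described in the commit above: the overlay consumes the exit-code command itself and forwards everything else, while the inner backend either handles EXPORT_FD or falls back to a trait default that returns ENOSYS. The command numbers, the simplified `ioctl` signature, and the `MacPassthrough` / `LinuxPassthrough` stand-ins are all assumptions for illustration; the real values and signatures are not given in the PR text.

```rust
use std::sync::Mutex;

// Hypothetical command numbers, chosen only for this example.
const IOCTL_EXIT_CODE: u32 = 1;
const IOCTL_EXPORT_FD: u32 = 2;
// Linux value of ENOSYS.
const ENOSYS: i32 = 38;

/// Toy trait: the default ioctl rejects every command.
trait FileSystem {
    fn ioctl(&self, _cmd: u32, _arg: i32) -> Result<(), i32> {
        Err(ENOSYS)
    }
}

/// Stand-in for a backend that does not implement ioctl at all.
struct MacPassthrough;
impl FileSystem for MacPassthrough {}

/// Stand-in for a backend that keeps only EXPORT_FD.
struct LinuxPassthrough;
impl FileSystem for LinuxPassthrough {
    fn ioctl(&self, cmd: u32, _arg: i32) -> Result<(), i32> {
        match cmd {
            IOCTL_EXPORT_FD => Ok(()),
            _ => Err(ENOSYS),
        }
    }
}

/// Overlay that handles the exit-code command before delegating.
struct AugmentFs<T: FileSystem> {
    inner: T,
    exit_code: Mutex<Option<i32>>,
}

impl<T: FileSystem> AugmentFs<T> {
    fn new(inner: T) -> Self {
        Self { inner, exit_code: Mutex::new(None) }
    }

    fn exit_code(&self) -> Option<i32> {
        *self.exit_code.lock().unwrap()
    }
}

impl<T: FileSystem> FileSystem for AugmentFs<T> {
    fn ioctl(&self, cmd: u32, arg: i32) -> Result<(), i32> {
        if cmd == IOCTL_EXIT_CODE {
            // Handled here; the inner filesystem never sees it.
            *self.exit_code.lock().unwrap() = Some(arg);
            return Ok(());
        }
        self.inner.ioctl(cmd, arg)
    }
}
```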
Boot a VM with a pure NullFs root — no host directory at all. Every
file in the root (init.krun, guest-agent, .krun_config.json, test
data) is injected as a virtual overlay, and /dev, /proc, /sys are
virtual empty directories used as mount points.

The guest verifies:
  - One-shot files (init.krun, guest-agent, .krun_config.json) are
    gone after being consumed
  - Persistent files (marker.txt, testdata.bin) survive and are
    re-readable
  - Write access to virtual files is denied (EACCES)
  - stat reports correct sizes
  - Range reads at various offsets return correct data
  - Read past EOF returns zero bytes

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
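The range-read and read-past-EOF checks in the test above correspond to a clamped slice over the static data. A minimal sketch, assuming a hypothetical `read_virtual` helper (the overlay's actual read path is not shown in the PR text):

```rust
/// Return the bytes of `data` in `[offset, offset + size)`, clamped to the
/// end of the file. A read that starts at or past EOF yields an empty slice.
fn read_virtual(data: &[u8], offset: u64, size: u32) -> &[u8] {
    let len = data.len() as u64;
    // Clamp both ends to the file length; saturating_add guards against overflow.
    let start = offset.min(len) as usize;
    let end = offset.saturating_add(u64::from(size)).min(len) as usize;
    &data[start..end]
}
```

Clamping both bounds keeps the slice operation in range for any offset and size the guest sends, so a short read near EOF and a zero-byte read past EOF need no special cases.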
Boot from an ext4 block device via krun_set_root_disk_remount. The
virtiofs root uses NullFs with init.krun and virtual mount-point
directories overlaid. The guest verifies it pivoted to the block
device root successfully.

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>

@ggoodman
Contributor

No comments on the code but I really love the direction!

@jakecorrenti
Member

@mtjhrc do you want this to merge before #670 or after?
