fix: filter correct disk image during boot disk selection #110

Open
aditsharma55 wants to merge 1 commit into main from BREV-8794/fix-bootdisk-image-selection

Conversation

@aditsharma55
Contributor

Summary

  • Fix boot disk image selection that was picking k8s worker-node images (e.g. worker-node-v-1-33-ubuntu24.04-cuda12.8) instead of instance images (e.g. ubuntu24.04-cuda13.0) due to Nebius API pagination and non-deterministic ordering
  • Replace the old first-match image selection with a score-based system that evaluates all images across all pages and picks the highest-scored one, ensuring ubuntu24+cuda13 is always preferred over worker-node images for default deployments
  • Remove iptables-persistent / netfilter-persistent from cloud-init since the correct instance image (ubuntu24.04-cuda13.0) does not ship with netfilter-persistent, and the previous cloud-init commands were causing failures (sudo: netfilter-persistent: command not found)

What changed

Image selection — pagination fix:

The old code used Image().List(), which only returned the first page of results. With a small default page size, ubuntu24.04-cuda13.0 could be omitted entirely. It is replaced with Image().Filter(), which auto-paginates via the SDK iterator.
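To illustrate why a single-page call can miss the wanted image, here is a minimal sketch of following page tokens until every page has been read. It does not use the Nebius SDK: the `Image` type, the `listPage` function type, and `collectAll` are hypothetical stand-ins, and the real Image().Filter() iterator may work differently.

```go
package main

// Image is a hypothetical stand-in for an SDK image record.
type Image struct {
	Name string
}

// listPage is a hypothetical single-page call. It returns one page of
// images plus the token for the next page; an empty token means the
// listing is finished.
type listPage func(pageToken string) (images []Image, nextToken string, err error)

// collectAll keeps requesting pages until the next-page token is empty,
// so images that only appear on later pages are not lost.
func collectAll(fetch listPage) ([]Image, error) {
	var all []Image
	token := ""
	for {
		page, next, err := fetch(token)
		if err != nil {
			return nil, err
		}
		all = append(all, page...)
		if next == "" {
			return all, nil
		}
		token = next
	}
}
```

With a fetcher that returns the worker-node image on the first page and ubuntu24.04-cuda13.0 on the second, `collectAll` returns both, whereas reading only the first page would see the worker-node image alone.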

Image selection — score-based ranking:

The old if/else matching was order-dependent and first-match-wins. The new approach scores every non-ARM64 image using a tiered system.
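A score-based selection of this kind could look roughly like the sketch below. The tiers and weights are assumptions chosen for illustration, not the values used in this PR, and `scoreImage` and `pickBest` are hypothetical names. The point is that every candidate is evaluated and the result does not depend on the order the API returns them in.

```go
package main

import "strings"

// scoreImage assigns an illustrative score to an image name.
// The tiers and weights here are assumptions, not the PR's actual values.
// A negative score means the image is excluded.
func scoreImage(name string) int {
	n := strings.ToLower(name)
	if strings.Contains(n, "arm64") {
		return -1 // ARM64 images are skipped entirely
	}
	score := 0
	if !strings.HasPrefix(n, "worker-node") {
		score += 100 // instance images outrank k8s worker-node images
	}
	if strings.Contains(n, "ubuntu24") {
		score += 10
	}
	if strings.Contains(n, "cuda13") {
		score += 1
	}
	return score
}

// pickBest scans every name and returns the highest-scored one.
// Ties are broken by name so the result is deterministic regardless
// of input order. The boolean is false when no image qualifies.
func pickBest(names []string) (string, bool) {
	best, bestScore, found := "", -1, false
	for _, n := range names {
		s := scoreImage(n)
		if s < 0 {
			continue
		}
		if !found || s > bestScore || (s == bestScore && n < best) {
			best, bestScore, found = n, s, true
		}
	}
	return best, found
}
```

Under these assumed weights, ubuntu24.04-cuda13.0 outscores worker-node-v-1-33-ubuntu24.04-cuda12.8 whichever of the two is listed first.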

Cloud-init cleanup:

Removed the iptables-persistent package, the netfilter-persistent.service systemd ordering drop-in, and the netfilter-persistent save command. These were added for a previous image that shipped with netfilter-persistent pre-installed; the current image does not have it, and these commands caused cloud-init to fail.

Relevant Linear Ticket:

https://linear.app/nvidia/issue/BRE2-901/issue-when-deploying-nebius-h100

aditsharma55 requested a review from a team as a code owner on April 17, 2026 23:03
@drewmalin
Contributor

I'm not sure I understand the motivation behind scoring all instances -- could we do something simpler if we already know that we prefer a specific image (e.g. ubuntu24+cuda13)? Granted, the images may change over time, but it should be reasonable for us to pin to a known image rather than possibly dynamically upgrading if Nebius releases new images.

I agree that paginating the list is going to be necessary if we scan through them.

However, I think we should be much more direct. For example: if ubuntu24.04-cuda13.0 has been shown to be a stable, valid Nebius image, let's always request that image.
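For comparison, a minimal sketch of the pinned approach might look like this, assuming the image can be identified by an exact name match. `pinnedImage` and `findPinned` are hypothetical names, and how the name would be passed to the Nebius API is not shown.

```go
package main

import "fmt"

// pinnedImage is the known-good image name taken from this discussion.
const pinnedImage = "ubuntu24.04-cuda13.0"

// findPinned returns the image whose name matches want exactly.
// It fails loudly instead of falling back to a different image, so a
// missing pin surfaces as an error rather than a silent substitution.
func findPinned(names []string, want string) (string, error) {
	for _, n := range names {
		if n == want {
			return n, nil
		}
	}
	return "", fmt.Errorf("pinned image %q not found among %d images", want, len(names))
}
```

The trade-off is that a pin never upgrades on its own: if the pinned image is retired, deployments fail with an explicit error until the constant is updated.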
