cuda: add DS4_CUDA_MANAGED env var for full UMA pool access on Strix Halo#313
Open
kmc6042 wants to merge 1 commit into
Open
cuda: add DS4_CUDA_MANAGED env var for full UMA pool access on Strix Halo#313kmc6042 wants to merge 1 commit into
kmc6042 wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an opt-in CUDA managed-memory allocation path for GPU tensors and adjusts attention scaling constant computation.
Changes:
- Add
DS4_CUDA_MANAGEDenvironment toggle to allocate tensors viacudaMallocManaged(otherwisecudaMalloc). - Replace
rsqrtf(head_dim)with1.0f / sqrtf(head_dim)for attention scaling constants. - Add allocation failure logging that prints the CUDA error string.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+1306
to
+1312
| static int use_managed = -1; | ||
| if (use_managed == -1) { | ||
| use_managed = (getenv("DS4_CUDA_MANAGED") != NULL) ? 1 : 0; | ||
| if (use_managed) { | ||
| fprintf(stderr, "ds4: CUDA managed memory enabled (hipMallocManaged) — full UMA pool\n"); | ||
| } | ||
| } |
| if (use_managed == -1) { | ||
| use_managed = (getenv("DS4_CUDA_MANAGED") != NULL) ? 1 : 0; | ||
| if (use_managed) { | ||
| fprintf(stderr, "ds4: CUDA managed memory enabled (hipMallocManaged) — full UMA pool\n"); |
|
|
||
| cudaError_t err; | ||
| if (use_managed) { | ||
| /* hipMemAttachGlobal = 1: GPU-accessible across all streams. |
| /* hipMemAttachGlobal = 1: GPU-accessible across all streams. | ||
| * On UMA platforms (Strix Halo, Grace-Hopper) this allocates from | ||
| * the full unified pool, bypassing the BIOS VRAM carve-out. */ | ||
| err = cudaMallocManaged(&t->ptr, (size_t)bytes, 1); |
Comment on lines
1324
to
1329
| if (err != cudaSuccess) { | ||
| fprintf(stderr, "ds4: CUDA tensor alloc failed: %s\n", cudaGetErrorString(err)); | ||
| (void)cudaGetLastError(); | ||
| free(t); | ||
| return NULL; | ||
| } |
Add DS4_CUDA_MANAGED=1 environment variable that switches ds4_gpu_tensor_alloc from cudaMalloc (VRAM carve-out only) to cudaMallocManaged (full unified memory pool). This is critical for UMA platforms like AMD Strix Halo where the BIOS VRAM carve-out (e.g. 96 GB) is smaller than physical memory (128 GB). Without this, context buffers are limited to the BIOS carve-out, capping usable context at ~870K when the model weights occupy ~81 GB. With managed memory, the full 128 GB pool is available, enabling 1M-token context. The change is opt-in and zero-overhead when the env var is unset. Also fix two rsqrtf() calls to 1.0f/sqrtf() for ROCm compatibility. Tested on Strix Halo (Ryzen AI MAX+ 395, gfx1151): DS4_CUDA_MANAGED=1 ./ds4-server --ctx 1000000 → context buffers 17222.50 MiB, server starts successfully → 1M context chat completion: 8.71 t/s, correct output Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
e11e755 to
21a54de
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Add
DS4_CUDA_MANAGED=1environment variable that switchesds4_gpu_tensor_allocfromhipMalloc(VRAM carve-out only) tohipMallocManaged(full unified memory pool).Problem
On UMA platforms like AMD Strix Halo (Ryzen AI MAX+ 395, 128 GB), the BIOS VRAM carve-out is often capped at 96 GB. Since ds4 uses
hipMallocfor all tensor allocations, context buffers are limited to this carve-out, capping usable context at ~870K when model weights occupy ~81 GB.With managed memory, the full 128 GB unified pool is available, enabling the full 1M-token context that DeepSeek V4 Flash supports.
Change
ds4_gpu_tensor_alloc(): whenDS4_CUDA_MANAGEDis set, usecudaMallocManaged(→hipMallocManaged) instead ofcudaMalloc(→hipMalloc)hipMalloc(VRAM), only runtime buffers use managed memoryrsqrtf()→1.0f/sqrtf()calls for ROCm compatibilityTesting
Tested on Strix Halo (Ryzen AI MAX+ 395, gfx1151, 128 GB UMA, BIOS VRAM 96 GB):
Related
This should also help other UMA platforms (DGX Spark / GB10, Grace-Hopper) where the BIOS carve-out may be smaller than physical memory.