cuda: add DS4_CUDA_MANAGED env var for full UMA pool access on Strix Halo by kmc6042 · Pull Request #313 · antirez/ds4

kmc6042 · 2026-05-31T17:29:59Z

Summary

Add DS4_CUDA_MANAGED=1 environment variable that switches ds4_gpu_tensor_alloc from hipMalloc (VRAM carve-out only) to hipMallocManaged (full unified memory pool).

Problem

On UMA platforms like AMD Strix Halo (Ryzen AI MAX+ 395, 128 GB), the BIOS VRAM carve-out is often capped at 96 GB. Since ds4 uses hipMalloc for all tensor allocations, context buffers are limited to this carve-out, capping usable context at ~870K when model weights occupy ~81 GB.

With managed memory, the full 128 GB unified pool is available, enabling the full 1M-token context that DeepSeek V4 Flash supports.

Change

Single function change in ds4_gpu_tensor_alloc(): when DS4_CUDA_MANAGED is set, use cudaMallocManaged (→ hipMallocManaged) instead of cudaMalloc (→ hipMalloc)
Opt-in via env var, zero overhead when unset
Model weights remain on hipMalloc (VRAM), only runtime buffers use managed memory
Also fix two rsqrtf() → 1.0f/sqrtf() calls for ROCm compatibility

Testing

Tested on Strix Halo (Ryzen AI MAX+ 395, gfx1151, 128 GB UMA, BIOS VRAM 96 GB):

# Without the flag (before fix): OOM at 1M context
$ ./ds4-server --ctx 1000000
ds4: CUDA tensor alloc failed: out of memory
ds4: failed to create cuda session

# With the flag: 1M context works
$ DS4_CUDA_MANAGED=1 ./ds4-server --ctx 1000000
ds4: CUDA managed memory enabled (hipMallocManaged) — full UMA pool
ds4: context buffers 17222.50 MiB (ctx=1000000)
# → Server starts successfully
# → Chat completion: 8.71 t/s, correct output, finish=stop

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an opt-in CUDA managed-memory allocation path for GPU tensors and adjusts attention scaling constant computation.

Changes:

Add DS4_CUDA_MANAGED environment toggle to allocate tensors via cudaMallocManaged (otherwise cudaMalloc).
Replace rsqrtf(head_dim) with 1.0f / sqrtf(head_dim) for attention scaling constants.
Add allocation failure logging that prints the CUDA error string.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    static int use_managed = -1;
+    if (use_managed == -1) {
+        use_managed = (getenv("DS4_CUDA_MANAGED") != NULL) ? 1 : 0;
+        if (use_managed) {
+            fprintf(stderr, "ds4: CUDA managed memory enabled (hipMallocManaged) — full UMA pool\n");
+        }
+    }


+    if (use_managed == -1) {
+        use_managed = (getenv("DS4_CUDA_MANAGED") != NULL) ? 1 : 0;
+        if (use_managed) {
+            fprintf(stderr, "ds4: CUDA managed memory enabled (hipMallocManaged) — full UMA pool\n");


+
+    cudaError_t err;
+    if (use_managed) {
+        /* hipMemAttachGlobal = 1: GPU-accessible across all streams.


+        /* hipMemAttachGlobal = 1: GPU-accessible across all streams.
+         * On UMA platforms (Strix Halo, Grace-Hopper) this allocates from
+         * the full unified pool, bypassing the BIOS VRAM carve-out. */
+        err = cudaMallocManaged(&t->ptr, (size_t)bytes, 1);


+    if (err != cudaSuccess) {
+        fprintf(stderr, "ds4: CUDA tensor alloc failed: %s\n", cudaGetErrorString(err));
+        (void)cudaGetLastError();
        free(t);
        return NULL;
    }


Add DS4_CUDA_MANAGED=1 environment variable that switches ds4_gpu_tensor_alloc from cudaMalloc (VRAM carve-out only) to cudaMallocManaged (full unified memory pool). This is critical for UMA platforms like AMD Strix Halo where the BIOS VRAM carve-out (e.g. 96 GB) is smaller than physical memory (128 GB). Without this, context buffers are limited to the BIOS carve-out, capping usable context at ~870K when the model weights occupy ~81 GB. With managed memory, the full 128 GB pool is available, enabling 1M-token context. The change is opt-in and zero-overhead when the env var is unset. Also fix two rsqrtf() calls to 1.0f/sqrtf() for ROCm compatibility. Tested on Strix Halo (Ryzen AI MAX+ 395, gfx1151): DS4_CUDA_MANAGED=1 ./ds4-server --ctx 1000000 → context buffers 17222.50 MiB, server starts successfully → 1M context chat completion: 8.71 t/s, correct output Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 31, 2026 17:29

Copilot AI reviewed May 31, 2026

View reviewed changes

kmc6042 force-pushed the cuda-managed-memory branch from e11e755 to 21a54de Compare May 31, 2026 17:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cuda: add DS4_CUDA_MANAGED env var for full UMA pool access on Strix Halo#313

cuda: add DS4_CUDA_MANAGED env var for full UMA pool access on Strix Halo#313
kmc6042 wants to merge 1 commit into
antirez:rocmfrom
kmc6042:cuda-managed-memory

kmc6042 commented May 31, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kmc6042 commented May 31, 2026

Summary

Problem

Change

Testing

Related

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants