Skip to content

cuda: add DS4_CUDA_MANAGED env var for full UMA pool access on Strix Halo#313

Open
kmc6042 wants to merge 1 commit into
antirez:rocmfrom
kmc6042:cuda-managed-memory
Open

cuda: add DS4_CUDA_MANAGED env var for full UMA pool access on Strix Halo#313
kmc6042 wants to merge 1 commit into
antirez:rocmfrom
kmc6042:cuda-managed-memory

Conversation

@kmc6042
Copy link
Copy Markdown

@kmc6042 kmc6042 commented May 31, 2026

Summary

Add DS4_CUDA_MANAGED=1 environment variable that switches ds4_gpu_tensor_alloc from hipMalloc (VRAM carve-out only) to hipMallocManaged (full unified memory pool).

Problem

On UMA platforms like AMD Strix Halo (Ryzen AI MAX+ 395, 128 GB), the BIOS VRAM carve-out is often capped at 96 GB. Since ds4 uses hipMalloc for all tensor allocations, context buffers are limited to this carve-out, capping usable context at ~870K when model weights occupy ~81 GB.

With managed memory, the full 128 GB unified pool is available, enabling the full 1M-token context that DeepSeek V4 Flash supports.

Change

  • Single function change in ds4_gpu_tensor_alloc(): when DS4_CUDA_MANAGED is set, use cudaMallocManaged (→ hipMallocManaged) instead of cudaMalloc (→ hipMalloc)
  • Opt-in via env var, zero overhead when unset
  • Model weights remain on hipMalloc (VRAM), only runtime buffers use managed memory
  • Also fix two rsqrtf()1.0f/sqrtf() calls for ROCm compatibility

Testing

Tested on Strix Halo (Ryzen AI MAX+ 395, gfx1151, 128 GB UMA, BIOS VRAM 96 GB):

# Without the flag (before fix): OOM at 1M context
$ ./ds4-server --ctx 1000000
ds4: CUDA tensor alloc failed: out of memory
ds4: failed to create cuda session

# With the flag: 1M context works
$ DS4_CUDA_MANAGED=1 ./ds4-server --ctx 1000000
ds4: CUDA managed memory enabled (hipMallocManaged) — full UMA pool
ds4: context buffers 17222.50 MiB (ctx=1000000)
# → Server starts successfully
# → Chat completion: 8.71 t/s, correct output, finish=stop

Related

This should also help other UMA platforms (DGX Spark / GB10, Grace-Hopper) where the BIOS carve-out may be smaller than physical memory.

Copilot AI review requested due to automatic review settings May 31, 2026 17:29
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an opt-in CUDA managed-memory allocation path for GPU tensors and adjusts attention scaling constant computation.

Changes:

  • Add DS4_CUDA_MANAGED environment toggle to allocate tensors via cudaMallocManaged (otherwise cudaMalloc).
  • Replace rsqrtf(head_dim) with 1.0f / sqrtf(head_dim) for attention scaling constants.
  • Add allocation failure logging that prints the CUDA error string.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ds4_cuda.cu Outdated
Comment on lines +1306 to +1312
static int use_managed = -1;
if (use_managed == -1) {
use_managed = (getenv("DS4_CUDA_MANAGED") != NULL) ? 1 : 0;
if (use_managed) {
fprintf(stderr, "ds4: CUDA managed memory enabled (hipMallocManaged) — full UMA pool\n");
}
}
Comment thread ds4_cuda.cu Outdated
if (use_managed == -1) {
use_managed = (getenv("DS4_CUDA_MANAGED") != NULL) ? 1 : 0;
if (use_managed) {
fprintf(stderr, "ds4: CUDA managed memory enabled (hipMallocManaged) — full UMA pool\n");
Comment thread ds4_cuda.cu Outdated

cudaError_t err;
if (use_managed) {
/* hipMemAttachGlobal = 1: GPU-accessible across all streams.
Comment thread ds4_cuda.cu Outdated
/* hipMemAttachGlobal = 1: GPU-accessible across all streams.
* On UMA platforms (Strix Halo, Grace-Hopper) this allocates from
* the full unified pool, bypassing the BIOS VRAM carve-out. */
err = cudaMallocManaged(&t->ptr, (size_t)bytes, 1);
Comment thread ds4_cuda.cu Outdated
Comment on lines 1324 to 1329
if (err != cudaSuccess) {
fprintf(stderr, "ds4: CUDA tensor alloc failed: %s\n", cudaGetErrorString(err));
(void)cudaGetLastError();
free(t);
return NULL;
}
Add DS4_CUDA_MANAGED=1 environment variable that switches
ds4_gpu_tensor_alloc from cudaMalloc (VRAM carve-out only) to
cudaMallocManaged (full unified memory pool).

This is critical for UMA platforms like AMD Strix Halo where the
BIOS VRAM carve-out (e.g. 96 GB) is smaller than physical memory
(128 GB). Without this, context buffers are limited to the BIOS
carve-out, capping usable context at ~870K when the model weights
occupy ~81 GB. With managed memory, the full 128 GB pool is
available, enabling 1M-token context.

The change is opt-in and zero-overhead when the env var is unset.

Also fix two rsqrtf() calls to 1.0f/sqrtf() for ROCm
compatibility.

Tested on Strix Halo (Ryzen AI MAX+ 395, gfx1151):
  DS4_CUDA_MANAGED=1 ./ds4-server --ctx 1000000
  → context buffers 17222.50 MiB, server starts successfully
  → 1M context chat completion: 8.71 t/s, correct output

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@kmc6042 kmc6042 force-pushed the cuda-managed-memory branch from e11e755 to 21a54de Compare May 31, 2026 17:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants