
fix: filter single-file safetensors by assigned layers before push#83

Open
cjchanh wants to merge 1 commit into evilsocket:main from cjchanh:fix/single-file-layer-filter

Conversation


@cjchanh cjchanh commented Apr 14, 2026

Problem

When a Cake master distributes a single-file safetensors model, it pushes the entire file to each worker regardless of how many layers that worker is assigned. For Qwen2.5-7B-Instruct-4bit (a 4 GiB single file), an iPad worker with a 3 GiB jetsam budget receives the full 4 GiB, exceeds its memory budget, and crashes with an early eof error.

The indexed model path (when model.safetensors.index.json is present) already filters correctly via weight_map. The single-file fallback in sharding/mod.rs unconditionally adds model.safetensors to the push list.

Fix

For single-file models with assigned layers, the push path now:

  1. Reads only the safetensors header to enumerate tensor names
  2. Filters tensors by assigned layer prefixes (same starts_with logic as the indexed path)
  3. Calls extract_layer_tensors to build a minimal safetensors blob containing only the needed tensors
  4. Pushes the reduced blob instead of the full file

Backward compatible: if layers is empty (no specific assignment), the full file is still pushed. If no tensors match the assigned layers, the code falls back to a full push with a warning.
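The first two steps can be sketched in isolation. This is a minimal illustration assuming only the public safetensors on-disk layout (an 8-byte little-endian header length followed by a JSON header whose top-level keys are tensor names); the naive key scanner stands in for a real JSON parser, and all tensor names and layer prefixes below are hypothetical, not taken from the actual cake-core code:

```rust
/// Return the JSON header of a safetensors buffer as a &str
/// (step 1: read only the header, never the tensor data).
fn read_header(buf: &[u8]) -> Option<&str> {
    let len = u64::from_le_bytes(buf.get(..8)?.try_into().ok()?) as usize;
    std::str::from_utf8(buf.get(8..8 + len)?).ok()
}

/// Naively list top-level keys (tensor names), skipping `__metadata__`.
/// Assumes flat inner objects with no braces inside string values —
/// a sketch stand-in for real JSON parsing.
fn tensor_names(header: &str) -> Vec<String> {
    let mut names = Vec::new();
    let mut rest = header.trim_start_matches('{');
    loop {
        let Some(q1) = rest.find('"') else { break };
        let Some(q2) = rest[q1 + 1..].find('"') else { break };
        let key = &rest[q1 + 1..q1 + 1 + q2];
        if key != "__metadata__" {
            names.push(key.to_string());
        }
        // skip this key's (flat) value object
        let Some(open) = rest.find('{') else { break };
        let Some(close) = rest[open..].find('}') else { break };
        rest = &rest[open + close + 1..];
    }
    names
}

/// Step 2: keep only tensors whose name starts with an assigned layer
/// prefix, mirroring the starts_with logic of the indexed path.
fn filter_by_layers<'a>(names: &'a [String], prefixes: &[String]) -> Vec<&'a String> {
    names
        .iter()
        .filter(|n| prefixes.iter().any(|p| n.starts_with(p.as_str())))
        .collect()
}

fn main() {
    // Build a tiny safetensors-style buffer with three hypothetical tensors.
    let json = concat!(
        r#"{"model.layers.0.mlp.weight":{"dtype":"F32","shape":[2,2],"data_offsets":[0,16]},"#,
        r#""model.layers.1.mlp.weight":{"dtype":"F32","shape":[2,2],"data_offsets":[16,32]},"#,
        r#""model.embed_tokens.weight":{"dtype":"F32","shape":[2,2],"data_offsets":[32,48]}}"#
    );
    let mut file = (json.len() as u64).to_le_bytes().to_vec();
    file.extend_from_slice(json.as_bytes());
    file.extend_from_slice(&[0u8; 48]); // zeroed tensor data region

    let names = tensor_names(read_header(&file).expect("valid header"));
    let kept = filter_by_layers(&names, &["model.layers.0.".to_string()]);
    println!("{} of {} tensors kept: {:?}", kept.len(), names.len(), kept);
    // prints: 1 of 3 tensors kept: ["model.layers.0.mlp.weight"]
}
```

Reading only the 8-byte prefix plus the JSON header is what keeps master-side memory flat: the multi-GiB data region is never loaded just to decide what to send.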

Results

Tested with M5 Max master + iPad Air M3 worker, Qwen2.5-7B-Instruct-4bit:

| Metric    | Before             | After                            |
|-----------|--------------------|----------------------------------|
| Push size | 4 GiB (full model) | 250.1 MiB (52 tensors, 2 layers) |
| iPad RSS  | jetsam kill        | 1.4 GiB (under 3 GiB limit)      |
| Result    | crash (early eof)  | coherent output at 17.21 tok/s   |

Test plan

  • cargo test -p cake-core --lib — 641 tests pass (638 existing + 3 new)
  • cargo test -p cake-core --test unit — 235 tests pass
  • cargo clippy — zero new warnings
  • Integration: M5 master + iPad Air M3, 2 layers of 7B-4bit, verified 250.1 MiB push, 1.4 GiB RSS, correct inference
  • Extended inference: longer generation to verify sustained correctness across distributed layers

New unit tests

  • extract_layer_tensors_single_file_filters_correctly — 4 tensors in, request 2, verify only 2 in output with correct data bytes
  • extract_layer_tensors_single_file_all_layers — request all tensors, verify all present with correct total size
  • extract_layer_tensors_single_file_missing_tensor_errors — request nonexistent tensor, verify error
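The shape of these tests can be sketched without the real cake-core internals. The `Entry` struct, `build`, and `extract` below are hypothetical stand-ins for `extract_layer_tensors` and its inputs, assuming only the safetensors layout (length-prefixed JSON header, then packed data); they show the round-trip the filter and missing-tensor tests exercise:

```rust
/// Hypothetical tensor entry: name plus raw data bytes.
struct Entry {
    name: String,
    data: Vec<u8>,
}

/// Serialize entries into a safetensors-style buffer:
/// 8-byte LE header length, JSON header, then packed tensor data.
fn build(entries: &[Entry]) -> Vec<u8> {
    let mut header = String::from("{");
    let mut offset = 0usize;
    for (i, e) in entries.iter().enumerate() {
        if i > 0 {
            header.push(',');
        }
        header.push_str(&format!(
            r#""{}":{{"dtype":"U8","shape":[{}],"data_offsets":[{},{}]}}"#,
            e.name,
            e.data.len(),
            offset,
            offset + e.data.len()
        ));
        offset += e.data.len();
    }
    header.push('}');
    let mut out = (header.len() as u64).to_le_bytes().to_vec();
    out.extend_from_slice(header.as_bytes());
    for e in entries {
        out.extend_from_slice(&e.data);
    }
    out
}

/// Extract only the named tensors into a new minimal buffer, erroring
/// when a requested name is absent (mirrors the missing-tensor test).
fn extract(entries: &[Entry], wanted: &[&str]) -> Result<Vec<u8>, String> {
    let kept: Vec<Entry> = wanted
        .iter()
        .map(|w| {
            entries
                .iter()
                .find(|e| e.name == *w)
                .map(|e| Entry { name: e.name.clone(), data: e.data.clone() })
                .ok_or_else(|| format!("tensor not found: {w}"))
        })
        .collect::<Result<_, _>>()?;
    Ok(build(&kept))
}

fn main() {
    let entries = vec![
        Entry { name: "model.layers.0.w".into(), data: vec![1, 2] },
        Entry { name: "model.layers.1.w".into(), data: vec![3, 4] },
    ];
    let full = build(&entries);
    let reduced = extract(&entries, &["model.layers.0.w"]).unwrap();
    assert!(reduced.len() < full.len()); // reduced blob is strictly smaller
    assert!(extract(&entries, &["model.layers.9.w"]).is_err());
    println!("full {} bytes, reduced {} bytes", full.len(), reduced.len());
}
```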

Commit message

When a worker is assigned a subset of layers from a single-file
safetensors model, extract only the needed tensors instead of pushing
the entire file. For Qwen2.5-7B-4bit (4 GiB), a 2-layer iPad worker
now receives 250 MiB instead of 4 GiB — staying well under the 3 GiB
iOS jetsam limit.

The indexed model path already filtered correctly via weight_map.
This extends the same extraction to the single-file fallback by:
- Reading the safetensors header to enumerate tensor names
- Filtering by assigned layer prefixes
- Calling extract_layer_tensors to build a minimal blob
- Falling back to full push when layers is empty (backward compat)

Verified: M5 master + iPad Air M3 worker, 2 layers, 250.1 MiB push,
1.4 GiB RSS, coherent output at 17.21 tok/s.
