
fix: filter single-file safetensors by assigned layers before push#83

Open
cjchanh wants to merge 1 commit into evilsocket:main from cjchanh:fix/single-file-layer-filter

Conversation


@cjchanh cjchanh commented Apr 14, 2026

Problem

When a Cake master distributes a single-file safetensors model, it pushes the entire file to each worker regardless of how many layers that worker is assigned. For Qwen2.5-7B-Instruct-4bit (a 4 GiB single file), an iPad worker with a 3 GiB jetsam budget receives the full 4 GiB, exceeds its memory budget, and crashes with an early eof error.

The indexed model path (when model.safetensors.index.json is present) already filters correctly via weight_map. The single-file fallback in sharding/mod.rs unconditionally adds model.safetensors to the push list.

Fix

For single-file models with assigned layers, the push path now:

  1. Reads only the safetensors header to enumerate tensor names
  2. Filters tensors by assigned layer prefixes (same starts_with logic as the indexed path)
  3. Calls extract_layer_tensors to build a minimal safetensors blob containing only the needed tensors
  4. Pushes the reduced blob instead of the full file

Backward compatible: if layers is empty (no specific assignment), the full file is still pushed. If no tensors match the assigned layers, the code falls back to a full push with a warning.
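The first two steps can be sketched in isolation. This is a minimal illustration assuming only the public safetensors on-disk layout (an 8-byte little-endian header length followed by a JSON header whose top-level keys are tensor names); the naive key scanner stands in for a real JSON parser, and all tensor names and layer prefixes below are hypothetical, not taken from the actual cake-core code:

```rust
/// Return the JSON header of a safetensors buffer as a &str
/// (step 1: read only the header, never the tensor data).
fn read_header(buf: &[u8]) -> Option<&str> {
    let len = u64::from_le_bytes(buf.get(..8)?.try_into().ok()?) as usize;
    std::str::from_utf8(buf.get(8..8 + len)?).ok()
}

/// Naively list top-level keys (tensor names), skipping `__metadata__`.
/// Assumes flat inner objects with no braces inside string values —
/// a sketch stand-in for real JSON parsing.
fn tensor_names(header: &str) -> Vec<String> {
    let mut names = Vec::new();
    let mut rest = header.trim_start_matches('{');
    loop {
        let Some(q1) = rest.find('"') else { break };
        let Some(q2) = rest[q1 + 1..].find('"') else { break };
        let key = &rest[q1 + 1..q1 + 1 + q2];
        if key != "__metadata__" {
            names.push(key.to_string());
        }
        // skip this key's (flat) value object
        let Some(open) = rest.find('{') else { break };
        let Some(close) = rest[open..].find('}') else { break };
        rest = &rest[open + close + 1..];
    }
    names
}

/// Step 2: keep only tensors whose name starts with an assigned layer
/// prefix, mirroring the starts_with logic of the indexed path.
fn filter_by_layers<'a>(names: &'a [String], prefixes: &[String]) -> Vec<&'a String> {
    names
        .iter()
        .filter(|n| prefixes.iter().any(|p| n.starts_with(p.as_str())))
        .collect()
}

fn main() {
    // Build a tiny safetensors-style buffer with three hypothetical tensors.
    let json = concat!(
        r#"{"model.layers.0.mlp.weight":{"dtype":"F32","shape":[2,2],"data_offsets":[0,16]},"#,
        r#""model.layers.1.mlp.weight":{"dtype":"F32","shape":[2,2],"data_offsets":[16,32]},"#,
        r#""model.embed_tokens.weight":{"dtype":"F32","shape":[2,2],"data_offsets":[32,48]}}"#
    );
    let mut file = (json.len() as u64).to_le_bytes().to_vec();
    file.extend_from_slice(json.as_bytes());
    file.extend_from_slice(&[0u8; 48]); // zeroed tensor data region

    let names = tensor_names(read_header(&file).expect("valid header"));
    let kept = filter_by_layers(&names, &["model.layers.0.".to_string()]);
    println!("{} of {} tensors kept: {:?}", kept.len(), names.len(), kept);
    // prints: 1 of 3 tensors kept: ["model.layers.0.mlp.weight"]
}
```

Reading only the 8-byte prefix plus the JSON header is what keeps master-side memory flat: the multi-GiB data region is never loaded just to decide what to send.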

Results

Tested with M5 Max master + iPad Air M3 worker, Qwen2.5-7B-Instruct-4bit:

| Metric    | Before             | After                            |
|-----------|--------------------|----------------------------------|
| Push size | 4 GiB (full model) | 250.1 MiB (52 tensors, 2 layers) |
| iPad RSS  | jetsam kill        | 1.4 GiB (under 3 GiB limit)      |
| Result    | crash (early eof)  | coherent output at 17.21 tok/s   |

Test plan

  • cargo test -p cake-core --lib — 641 tests pass (638 existing + 3 new)
  • cargo test -p cake-core --test unit — 235 tests pass
  • cargo clippy — zero new warnings
  • Integration: M5 master + iPad Air M3, 2 layers of 7B-4bit, verified 250.1 MiB push, 1.4 GiB RSS, correct inference
  • Extended inference: longer generation to verify sustained correctness across distributed layers

New unit tests

  • extract_layer_tensors_single_file_filters_correctly — 4 tensors in, request 2, verify only 2 in output with correct data bytes
  • extract_layer_tensors_single_file_all_layers — request all tensors, verify all present with correct total size
  • extract_layer_tensors_single_file_missing_tensor_errors — request nonexistent tensor, verify error
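The shape of these tests can be sketched without the real cake-core internals. The `Entry` struct, `build`, and `extract` below are hypothetical stand-ins for `extract_layer_tensors` and its inputs, assuming only the safetensors layout (length-prefixed JSON header, then packed data); they show the round-trip the filter and missing-tensor tests exercise:

```rust
/// Hypothetical tensor entry: name plus raw data bytes.
struct Entry {
    name: String,
    data: Vec<u8>,
}

/// Serialize entries into a safetensors-style buffer:
/// 8-byte LE header length, JSON header, then packed tensor data.
fn build(entries: &[Entry]) -> Vec<u8> {
    let mut header = String::from("{");
    let mut offset = 0usize;
    for (i, e) in entries.iter().enumerate() {
        if i > 0 {
            header.push(',');
        }
        header.push_str(&format!(
            r#""{}":{{"dtype":"U8","shape":[{}],"data_offsets":[{},{}]}}"#,
            e.name,
            e.data.len(),
            offset,
            offset + e.data.len()
        ));
        offset += e.data.len();
    }
    header.push('}');
    let mut out = (header.len() as u64).to_le_bytes().to_vec();
    out.extend_from_slice(header.as_bytes());
    for e in entries {
        out.extend_from_slice(&e.data);
    }
    out
}

/// Extract only the named tensors into a new minimal buffer, erroring
/// when a requested name is absent (mirrors the missing-tensor test).
fn extract(entries: &[Entry], wanted: &[&str]) -> Result<Vec<u8>, String> {
    let kept: Vec<Entry> = wanted
        .iter()
        .map(|w| {
            entries
                .iter()
                .find(|e| e.name == *w)
                .map(|e| Entry { name: e.name.clone(), data: e.data.clone() })
                .ok_or_else(|| format!("tensor not found: {w}"))
        })
        .collect::<Result<_, _>>()?;
    Ok(build(&kept))
}

fn main() {
    let entries = vec![
        Entry { name: "model.layers.0.w".into(), data: vec![1, 2] },
        Entry { name: "model.layers.1.w".into(), data: vec![3, 4] },
    ];
    let full = build(&entries);
    let reduced = extract(&entries, &["model.layers.0.w"]).unwrap();
    assert!(reduced.len() < full.len()); // reduced blob is strictly smaller
    assert!(extract(&entries, &["model.layers.9.w"]).is_err());
    println!("full {} bytes, reduced {} bytes", full.len(), reduced.len());
}
```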

Commit message

When a worker is assigned a subset of layers from a single-file
safetensors model, extract only the needed tensors instead of pushing
the entire file. For Qwen2.5-7B-4bit (4 GiB), a 2-layer iPad worker
now receives 250 MiB instead of 4 GiB — staying well under the 3 GiB
iOS jetsam limit.

The indexed model path already filtered correctly via weight_map.
This extends the same extraction to the single-file fallback by:
- Reading the safetensors header to enumerate tensor names
- Filtering by assigned layer prefixes
- Calling extract_layer_tensors to build a minimal blob
- Falling back to full push when layers is empty (backward compat)

Verified: M5 master + iPad Air M3 worker, 2 layers, 250.1 MiB push,
1.4 GiB RSS, coherent output at 17.21 tok/s.
