Skip to content

Fix/gcs check hashes#821

Closed
postevanus-scale wants to merge 3 commits intomainfrom
fix/gcs-check-hashes
Closed

Fix/gcs check hashes#821
postevanus-scale wants to merge 3 commits intomainfrom
fix/gcs-check-hashes

Conversation

@postevanus-scale
Copy link
Copy Markdown
Collaborator

@postevanus-scale postevanus-scale commented May 6, 2026

Pull Request Summary

What is this PR changing? Why is this change being made? Any caveats you'd like to highlight? Link any relevant documents, links, or screenshots here if applicable.

Test Plan and Usage Guide

How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.

Greptile Summary

This PR fixes GCS checkpoint/weight downloads that were failing because the inference image lacks the google-crc32c library required by gcloud storage's default check_hashes=always mode. The fix appends gcloud config set storage/check_hashes if_fast_else_skip to the SDK install command in both load_model_weights_sub_commands_gcs and load_model_files_sub_commands_trt_llm, with corresponding test updates in all three affected test cases.

Confidence Score: 5/5

Safe to merge — minimal, focused fix with full test coverage and clear inline documentation.

No P0/P1 issues found. Both affected code paths are updated consistently, tests cover all three GCS cases, and the code comments clearly explain the rationale. The chosen if_fast_else_skip value is appropriate for environments without google-crc32c.

No files require special attention.

Important Files Changed

Filename Overview
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py Adds gcloud config set storage/check_hashes if_fast_else_skip to both GCS download paths; fix is well-targeted and the string concatenation is correct.
model-engine/tests/unit/domain/test_llm_use_cases.py All three GCS test cases updated to include the new check_hashes config command in the expected subcommand strings.

Sequence Diagram

sequenceDiagram
    participant Pod
    participant GCS

    Pod->>Pod: curl | tar → install gcloud SDK
    Pod->>Pod: gcloud config set disable_usage_reporting true
    Pod->>Pod: gcloud config set storage/check_hashes if_fast_else_skip
    Pod->>GCS: gcloud storage rsync / cp (checkpoint files)
    GCS-->>Pod: checkpoint files (hash check skipped if no CRC32c lib)
Loading

Reviews (1): Last reviewed commit: "docs(gcs): tighten check_hashes rational..." | Re-trigger Greptile

…s without google-crc32c

`gcloud storage rsync` and `gcloud storage cp` default to `check_hashes=always`,
which requires the `google-crc32c` library to compute CRC32C hashes of downloaded
files. The inference container that runs the `load_model_weights_sub_commands_gcs`
and `load_model_files_sub_commands_trt_llm` (gs://) chains does not have that lib
installed, so every download task fails with:

    ERROR: Task '...safetensors' failed: This copy was skipped since fast hash
    calculation tools are not installed.

The download still proceeds (TCP-level integrity is intact), but gcloud refuses
to mark each file as completed and the rsync exits non-zero — the inference pod
then falls through into vLLM/TRT startup with an empty `model_files` directory.

Setting `storage/check_hashes=if_fast_else_skip` lets gcloud skip the optional
integrity check when the fast hash lib isn't available, while preserving it
where it is. End-to-end verified on a GKE-deployed gpt-oss-20b vLLM endpoint:
the rsync now completes (~1.7 GiB/s in-region), vLLM loads the safetensors
shards, and the pod transitions to Ready. Reproduced and fixed without
modifying the inference image.

Existing `test_load_model_weights_sub_commands` (GCS branch) and
`test_load_model_files_sub_commands_trt_llm_gcs` updated to expect the new
config line.
@postevanus-scale postevanus-scale enabled auto-merge (squash) May 6, 2026 09:05
@postevanus-scale
Copy link
Copy Markdown
Collaborator Author

Closing this duplicate PR with #820

auto-merge was automatically disabled May 6, 2026 09:06

Pull request was closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant