fix(gcs): set storage/check_hashes=if_fast_else_skip so rsync/cp works without google-crc32c #820
Merged
postevanus-scale merged 3 commits into scaleapi:main from May 6, 2026
Conversation
…s without google-crc32c
`gcloud storage rsync` and `gcloud storage cp` default to `check_hashes=always`,
which requires the `google-crc32c` library to compute CRC32C hashes of downloaded
files. The inference container that runs the `load_model_weights_sub_commands_gcs`
and `load_model_files_sub_commands_trt_llm` (gs://) chains does not have that lib
installed, so every download task fails with:
ERROR: Task '...safetensors' failed: This copy was skipped since fast hash
calculation tools are not installed.
The download still proceeds (TCP-level integrity is intact), but gcloud refuses
to mark each file as completed and the rsync exits non-zero — the inference pod
then falls through into vLLM/TRT startup with an empty `model_files` directory.
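One way to confirm the failure mode before changing any config is to check whether the library is importable at all. The following is a minimal sketch, not part of this PR; `module_available` is a hypothetical helper, and it assumes the check is run with the same Python interpreter that gcloud uses inside the container (gcloud may bundle its own).

```python
import importlib.util


def module_available(module_name: str) -> bool:
    """Return True if the current interpreter can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # `google_crc32c` is the import name of the google-crc32c package.
    if module_available("google_crc32c"):
        print("google_crc32c found: hash checks can use it")
    else:
        # Per this PR, this is the situation in which downloads are rejected.
        print("google_crc32c missing: downloads hit the error quoted above")
```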
Setting `storage/check_hashes=if_fast_else_skip` lets gcloud skip the optional
integrity check when the fast hash lib isn't available, while preserving it
where it is. End-to-end verified on a GKE-deployed gpt-oss-20b vLLM endpoint:
the rsync now completes (~1.7 GiB/s in-region), vLLM loads the safetensors
shards, and the pod transitions to Ready. Reproduced and fixed without
modifying the inference image.
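The shape of the change can be sketched as follows. This is an illustration under assumptions, not the code in this PR: `build_gcs_download_subcommands` and its parameters are hypothetical names, the real bootstrap chain contains additional steps (such as installing the SDK) that are omitted here, and the `&&` join is assumed.

```python
import shlex
from typing import List

# The property and value this PR adds to the gcloud bootstrap chain.
CHECK_HASHES_CONFIG = "gcloud config set storage/check_hashes if_fast_else_skip"


def build_gcs_download_subcommands(source: str, destination: str) -> List[str]:
    """Hypothetical helper: render the config line followed by the rsync."""
    if not source.startswith("gs://"):
        raise ValueError(f"expected a gs:// URL, got {source!r}")
    return [
        # Must run before any transfer so rsync/cp picks up the setting.
        CHECK_HASHES_CONFIG,
        "gcloud storage rsync --recursive "
        f"{shlex.quote(source)} {shlex.quote(destination)}",
    ]


if __name__ == "__main__":
    commands = build_gcs_download_subcommands("gs://bucket/model", "model_files")
    # Chain with && so a failing step stops the rest of the chain.
    print(" && ".join(commands))
```

Ordering is the point of the sketch: the config line has to be rendered ahead of the transfer command, otherwise the transfer runs with the default setting.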
Existing `test_load_model_weights_sub_commands` (GCS branch) and
`test_load_model_files_sub_commands_trt_llm_gcs` updated to expect the new
config line.
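A test of this kind might look like the sketch below. It is not the repository's test code: `config_precedes_transfers` is a hypothetical helper, and the sample command list is a stand-in for whatever the real use case renders. It checks that the config line is present and that it comes before every `gcloud storage rsync` or `cp` command.

```python
from typing import Sequence

CONFIG_LINE = "gcloud config set storage/check_hashes if_fast_else_skip"
TRANSFER_PREFIXES = ("gcloud storage rsync", "gcloud storage cp")


def config_precedes_transfers(subcommands: Sequence[str]) -> bool:
    """Return True if CONFIG_LINE is present and precedes every transfer."""
    commands = list(subcommands)
    if CONFIG_LINE not in commands:
        return False
    config_index = commands.index(CONFIG_LINE)
    transfer_indexes = [
        i for i, cmd in enumerate(commands) if cmd.startswith(TRANSFER_PREFIXES)
    ]
    return all(config_index < i for i in transfer_indexes)


def test_config_line_is_rendered_before_rsync() -> None:
    # Stand-in for the rendered subcommands; the real list is longer.
    rendered = [
        CONFIG_LINE,
        "gcloud storage rsync --recursive gs://bucket/model model_files",
    ]
    assert config_precedes_transfers(rendered)
```

Asserting on order as well as presence catches a regression in which the config line is appended after the download command.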
lilyz-ai
approved these changes
May 5, 2026
dc8467f to
77bdf32
Summary
`gcloud storage rsync` (and `cp`) defaults to `check_hashes=always`, which requires the `google-crc32c` Python library to verify each downloaded file. That library is not installed in the inference containers used by the `gs://` code paths in `CreateLLMModelBundleV1UseCase`, so every download task fails with:
> ERROR: Task '...safetensors' failed: This copy was skipped since fast hash calculation tools are not installed.

The download itself succeeds (TCP integrity is intact), but `gcloud` refuses to mark each file as complete and the whole rsync exits non-zero. The inference pod then falls through to vLLM/TRT startup with an empty `model_files/` and crashes.
This adds one line to the existing gcloud bootstrap chain in both GCS code paths so the optional integrity check is skipped when the fast hash lib isn't available, while preserving it where it is.
Why `if_fast_else_skip`?
`gcloud storage` exposes four values for `storage/check_hashes`:
- `always` (default): fails when `google-crc32c` is missing
- `if_fast_else_skip`
- `if_fast_else_fail`
- `never`
`if_fast_else_skip` is the safe middle ground: it preserves verification where the lib happens to be present (e.g. richer images) and degrades gracefully where it isn't. Adding `google-crc32c` to the inference image was considered but is a much larger change for marginal benefit; GCS already validates byte-stream integrity at the TCP layer, and the safetensors / TRT loaders verify file structure on read.
Affected functions
- `CreateLLMModelBundleV1UseCase.load_model_weights_sub_commands_gcs`: used by the standard `gs://` weight-download path (S3/Azure variants are unaffected)
- `CreateLLMModelBundleV1UseCase.load_model_files_sub_commands_trt_llm` (gs:// branch): used by the TensorRT-LLM checkpoint download path
Verification
End-to-end verified on a GKE-deployed `gpt-oss-20b` vLLM endpoint that was stuck in `CrashLoopBackOff` with the symptom above. After applying the equivalent live patch to the rendered Deployment command:
- `gcloud storage rsync` completes (~1.7 GiB/s in-region from a Cloud Storage bucket in the same region)
- The pod goes from `0/2 CrashLoopBackOff` to `2/2 Running` and `/health` returns 200
No image rebuild was required; only the rendered command needed to change.
Tests
`test_load_model_weights_sub_commands` (GCS branch, both `trust_remote_code=False` and `True`) and `test_load_model_files_sub_commands_trt_llm_gcs` updated to assert the new config line is present in the rendered subcommands.
Test plan
Greptile Summary
- Adds `gcloud config set storage/check_hashes if_fast_else_skip` to the gcloud bootstrap chain in both GCS code paths (`load_model_weights_sub_commands_gcs` and the `gs://` branch of `load_model_files_sub_commands_trt_llm`), fixing a `CrashLoopBackOff` caused by the missing `google-crc32c` library in inference containers.
- The chosen value (`if_fast_else_skip`) preserves hash verification when the fast-hash library is present while degrading gracefully when it isn't, without requiring an image rebuild.
Confidence Score: 5/5
This PR is safe to merge: a minimal, targeted fix with no logic changes beyond the gcloud config addition.
The change is a single-line addition to two parallel GCS bootstrap chains, both of which are well-understood shell command strings. The chosen `if_fast_else_skip` value is the documented safe degradation path. Tests are updated in lock-step. No security concerns, no custom rule violations, and the fix is end-to-end verified per the PR description.
No files require special attention.
Important Files Changed
- Adds `gcloud config set storage/check_hashes if_fast_else_skip` to both GCS download bootstrap chains so rsync/cp proceeds without `google-crc32c`, fixing CrashLoopBackOff on GCS-backed pods.
- Updates `test_load_model_weights_sub_commands` (GCS branch, trust/no-trust variants) and `test_load_model_files_sub_commands_trt_llm_gcs` to assert the new config line is present in rendered subcommands.
Sequence Diagram
sequenceDiagram
    participant Pod as Inference Pod
    participant gcloud as gcloud CLI (on-the-fly install)
    participant GCS as Google Cloud Storage
    Pod->>gcloud: curl ... | tar -xz (install SDK)
    Pod->>gcloud: config set disable_usage_reporting true
    Pod->>gcloud: config set storage/check_hashes if_fast_else_skip
    Note over gcloud: hash check skipped if google-crc32c absent
    Pod->>gcloud: gcloud storage rsync / cp (model weights)
    gcloud->>GCS: download model files
    GCS-->>gcloud: file data (TCP integrity guaranteed)
    gcloud-->>Pod: exit 0 (no longer blocked on crc32c)
    Pod->>Pod: vLLM / TRT loader reads model_files/
Reviews (4): Last reviewed commit: "docs(gcs): tighten check_hashes rational..."