fix(gcs): set storage/check_hashes=if_fast_else_skip so rsync/cp works without google-crc32c #820

Merged: postevanus-scale merged 3 commits into scaleapi:main from postevanus-scale:fix/gcs-check-hashes on May 6, 2026

Conversation

postevanus-scale (Collaborator) commented May 5, 2026

Summary

gcloud storage rsync (and cp) defaults to check_hashes=if_fast_else_fail, which fails any download it cannot verify with a fast CRC32C implementation such as the google-crc32c Python library. That library is not installed in the inference containers used by the gs:// code paths in CreateLLMModelBundleV1UseCase, so every download task fails with:

ERROR: Task 'file://model_files/model-00002-of-00002.safetensors' failed:
  This copy was skipped since fast hash calculation tools are not installed.
  You can change this by running:
        $ /usr/bin/python3.12 -m pip install google-crc32c --upgrade --target /opt/google-cloud-sdk/lib/third_party
  You can also modify the "storage/check_hashes" config setting.

The download itself succeeds (TCP integrity is intact), but gcloud refuses to mark each file as complete and the whole rsync exits non-zero. The inference pod then falls through to vLLM/TRT startup with an empty model_files/ and crashes.

This adds one line to the existing gcloud bootstrap chain in both GCS code paths so the optional integrity check is skipped when the fast hash lib isn't available, while preserving it where it is.
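
As a rough illustration of where the new line sits, the sketch below builds a gcloud bootstrap chain as a list of shell subcommands. The helper name `gcs_bootstrap_subcommands`, the example bucket path, and the placeholder install step are hypothetical and do not reproduce the actual code in `llm_model_endpoint_use_cases.py`; only the two `gcloud config set` lines and the use of `gcloud storage rsync` are taken from this PR.

```python
from typing import List

# The config line this PR adds to both GCS code paths.
CHECK_HASHES_LINE = "gcloud config set storage/check_hashes if_fast_else_skip"


def gcs_bootstrap_subcommands(checkpoint_path: str, weights_folder: str) -> List[str]:
    """Hypothetical helper: shell subcommands that fetch model weights from GCS."""
    return [
        # Placeholder for the on-the-fly SDK install step (details omitted).
        "echo 'install gcloud SDK here'",
        "gcloud config set disable_usage_reporting true",
        # Must run before any download so the setting is in effect for rsync/cp.
        CHECK_HASHES_LINE,
        f"gcloud storage rsync --recursive {checkpoint_path} {weights_folder}",
    ]


subcommands = gcs_bootstrap_subcommands("gs://example-bucket/model", "model_files")
print("\n".join(subcommands))
```

The point of the ordering is that `gcloud config set` only affects commands that run after it, so the new line has to precede the `rsync` (or `cp`) step in the chain.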

Why if_fast_else_skip?

gcloud storage exposes four values for storage/check_hashes:

| Value | Behavior |
| --- | --- |
| if_fast_else_fail (default) | Verify when a fast hash lib is available; fail the download when it isn't |
| if_fast_else_skip | Verify when the fast hash lib is available; skip the check when it isn't |
| always | Always verify, regardless of performance cost |
| never | Never verify |
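
The four settings can be read as a small decision function. The sketch below is a simplified model, assuming the behavior described in the gcloud documentation for `storage/check_hashes`; it is not gcloud's implementation, and the names `CheckHashes` and `download_outcome` are chosen here for illustration.

```python
from enum import Enum


class CheckHashes(Enum):
    ALWAYS = "always"
    IF_FAST_ELSE_FAIL = "if_fast_else_fail"
    IF_FAST_ELSE_SKIP = "if_fast_else_skip"
    NEVER = "never"


def download_outcome(policy: CheckHashes, fast_crc32c_available: bool) -> str:
    """Simplified model of what a download does under each policy."""
    if policy is CheckHashes.NEVER:
        return "skip check"
    if fast_crc32c_available:
        # Every other policy verifies when the fast library is present.
        return "verify (fast)"
    if policy is CheckHashes.ALWAYS:
        return "verify (slow)"
    if policy is CheckHashes.IF_FAST_ELSE_SKIP:
        return "skip check"
    return "fail download"


# What each policy does in a container without a fast CRC32C library.
for policy in CheckHashes:
    outcome = download_outcome(policy, fast_crc32c_available=False)
    print(f"{policy.value:>18}: {outcome}")
```

Under this model, `if_fast_else_skip` differs from `never` only when the fast library is present, which is why it keeps verification on richer images.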

if_fast_else_skip is the safe middle ground: it preserves verification where the lib happens to be present (e.g. richer images) and degrades gracefully where it isn't. Adding google-crc32c to the inference image was considered but is a much larger change for marginal benefit; the download is already protected in transit by TCP checksums and TLS, and the safetensors / TRT loaders verify file structure on read.
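
For context on what is being skipped: the check compares a CRC32C checksum of the downloaded bytes against the value Cloud Storage records for the object, which it reports as base64 of the big-endian 4-byte checksum. The pure-Python sketch below is only illustrative. It is the kind of slow path that google-crc32c exists to avoid, and `verify_download` is a hypothetical name, not a gcloud function.

```python
import base64
import struct


def _make_table() -> list:
    """Build the lookup table for CRC32C (Castagnoli, reflected polynomial)."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """Table-driven CRC32C over a byte string."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def gcs_style_crc32c(data: bytes) -> str:
    """Encode the checksum as base64 of its big-endian 4-byte form."""
    return base64.b64encode(struct.pack(">I", crc32c(data))).decode("ascii")


def verify_download(data: bytes, expected_b64: str) -> bool:
    """Hypothetical check: does the local checksum match the recorded one?"""
    return gcs_style_crc32c(data) == expected_b64


payload = b"123456789"
recorded = gcs_style_crc32c(payload)  # stands in for the object's metadata
print(verify_download(payload, recorded))
print(verify_download(payload + b"x", recorded))
```

A byte-at-a-time loop like this is far too slow for multi-gigabyte safetensors shards, which is the reason a compiled CRC32C implementation matters for the check.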

Affected functions

  • CreateLLMModelBundleV1UseCase.load_model_weights_sub_commands_gcs — used by the standard gs:// weight-download path (S3/Azure variants are unaffected)
  • CreateLLMModelBundleV1UseCase.load_model_files_sub_commands_trt_llm (gs:// branch) — used by the TensorRT-LLM checkpoint download path

Verification

End-to-end verified on a GKE-deployed gpt-oss-20b vLLM endpoint that was stuck in CrashLoopBackOff with the symptom above. After applying the equivalent live patch to the rendered Deployment command:

  • gcloud storage rsync completes (~1.7 GiB/s in-region from a Cloud Storage bucket in the same region)
  • vLLM loads all three safetensors shards in ~2.3s
  • Pod transitions 0/2 CrashLoopBackOff → 2/2 Running and /health returns 200

No image rebuild was required; only the rendered command needed to change.

Tests

test_load_model_weights_sub_commands (GCS branch, both trust_remote_code=False and True) and test_load_model_files_sub_commands_trt_llm_gcs updated to assert the new config line is present in the rendered subcommands.
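
The updated assertions could look roughly like the sketch below. `render_gcs_subcommands` is a hypothetical stand-in for the real subcommand builders, and the test name is invented here; the actual tests in `test_llm_use_cases.py` may compare against full expected command lists instead.

```python
EXPECTED_CONFIG_LINE = "gcloud config set storage/check_hashes if_fast_else_skip"


def render_gcs_subcommands(checkpoint_path: str) -> list:
    """Hypothetical stand-in for the real GCS subcommand builders."""
    return [
        "gcloud config set disable_usage_reporting true",
        EXPECTED_CONFIG_LINE,
        f"gcloud storage rsync --recursive {checkpoint_path} model_files",
    ]


def test_check_hashes_config_precedes_download() -> None:
    subcommands = render_gcs_subcommands("gs://example-bucket/model")
    # The config line must be present...
    config_index = subcommands.index(EXPECTED_CONFIG_LINE)
    # ...and must come before the first gcloud storage transfer command.
    download_index = next(
        i for i, cmd in enumerate(subcommands) if cmd.startswith("gcloud storage ")
    )
    assert config_index < download_index


test_check_hashes_config_precedes_download()
```

Asserting the relative order, and not only presence, guards against a later refactor that moves the config line after the download.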

Test plan

  • CI green
  • Manual: redeploy a GCS-backed vLLM endpoint on GKE and verify the rsync completes and the pod reaches Ready

Greptile Summary

  • Appends gcloud config set storage/check_hashes if_fast_else_skip to the gcloud bootstrap chain in both GCS code paths (load_model_weights_sub_commands_gcs and the gs:// branch of load_model_files_sub_commands_trt_llm), fixing a CrashLoopBackOff caused by the missing google-crc32c library in inference containers.
  • The chosen value (if_fast_else_skip) preserves hash verification when the fast-hash library is present while degrading gracefully when it isn't, without requiring an image rebuild.
  • Unit tests are updated in lock-step to assert the new config line appears in rendered subcommands for all three affected test cases.

Confidence Score: 5/5

This PR is safe to merge — minimal, targeted fix with no logic changes beyond the gcloud config addition.

The change is a single-line addition to two parallel GCS bootstrap chains, both of which are well-understood shell command strings. The chosen if_fast_else_skip value is the documented safe degradation path. Tests are updated in lock-step. No security concerns, no custom rule violations, and the fix is end-to-end verified per the PR description.

No files require special attention.

Important Files Changed

Filename Overview
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py Adds gcloud config set storage/check_hashes if_fast_else_skip to both GCS download bootstrap chains so rsync/cp proceeds without google-crc32c, fixing CrashLoopBackOff on GCS-backed pods.
model-engine/tests/unit/domain/test_llm_use_cases.py Test expectations updated for both test_load_model_weights_sub_commands (GCS branch, trust/no-trust variants) and test_load_model_files_sub_commands_trt_llm_gcs to assert the new config line is present in rendered subcommands.

Sequence Diagram

sequenceDiagram
    participant Pod as Inference Pod
    participant gcloud as gcloud CLI (on-the-fly install)
    participant GCS as Google Cloud Storage

    Pod->>gcloud: curl ... | tar -xz (install SDK)
    Pod->>gcloud: config set disable_usage_reporting true
    Pod->>gcloud: config set storage/check_hashes if_fast_else_skip
    Note over gcloud: hash check skipped if google-crc32c absent
    Pod->>gcloud: gcloud storage rsync / cp (model weights)
    gcloud->>GCS: download model files
    GCS-->>gcloud: file data (TCP integrity guaranteed)
    gcloud-->>Pod: exit 0 (no longer blocked on crc32c)
    Pod->>Pod: vLLM / TRT loader reads model_files/

Reviews (4): Last reviewed commit: "docs(gcs): tighten check_hashes rational..."

…s without google-crc32c

`gcloud storage rsync` and `gcloud storage cp` default to
`check_hashes=if_fast_else_fail`, which requires a fast CRC32C implementation
such as the `google-crc32c` library to compute CRC32C hashes of downloaded
files. The inference container that runs the `load_model_weights_sub_commands_gcs`
and `load_model_files_sub_commands_trt_llm` (gs://) chains does not have that lib
installed, so every download task fails with:

    ERROR: Task '...safetensors' failed: This copy was skipped since fast hash
    calculation tools are not installed.

The download still proceeds (TCP-level integrity is intact), but gcloud refuses
to mark each file as completed and the rsync exits non-zero — the inference pod
then falls through into vLLM/TRT startup with an empty `model_files` directory.

Setting `storage/check_hashes=if_fast_else_skip` lets gcloud skip the optional
integrity check when the fast hash lib isn't available, while preserving it
where it is. End-to-end verified on a GKE-deployed gpt-oss-20b vLLM endpoint:
the rsync now completes (~1.7 GiB/s in-region), vLLM loads the safetensors
shards, and the pod transitions to Ready. Reproduced and fixed without
modifying the inference image.

Existing `test_load_model_weights_sub_commands` (GCS branch) and
`test_load_model_files_sub_commands_trt_llm_gcs` updated to expect the new
config line.

postevanus-scale enabled auto-merge (squash) May 5, 2026 16:54
postevanus-scale disabled auto-merge May 6, 2026 08:36
postevanus-scale merged commit 798747b into scaleapi:main May 6, 2026
8 checks passed