fix(gcs): set storage/check_hashes=if_fast_else_skip so rsync/cp works without google-crc32c by postevanus-scale · Pull Request #820 · scaleapi/llm-engine

postevanus-scale · 2026-05-05T15:25:00Z

Summary

With its default storage/check_hashes setting, gcloud storage rsync (and cp) requires the google-crc32c Python library to verify each downloaded file. That library is not installed in the inference containers used by the gs:// code paths in CreateLLMModelBundleV1UseCase, so every download task fails with:

ERROR: Task 'file://model_files/model-00002-of-00002.safetensors' failed:
  This copy was skipped since fast hash calculation tools are not installed.
  You can change this by running:
        $ /usr/bin/python3.12 -m pip install google-crc32c --upgrade --target /opt/google-cloud-sdk/lib/third_party
  You can also modify the "storage/check_hashes" config setting.

The download itself succeeds (TCP integrity is intact), but gcloud refuses to mark each file as complete and the whole rsync exits non-zero. The inference pod then falls through to vLLM/TRT startup with an empty model_files/ and crashes.

This adds one line to the existing gcloud bootstrap chain in both GCS code paths so the optional integrity check is skipped when the fast hash lib isn't available, while preserving it where it is.
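As a rough illustration of where the new line sits, the sketch below builds a subcommand list with the config line placed after the usage-reporting setting and before the transfer. The function name `build_gcs_download_subcommands`, its parameters, and the example bucket path are hypothetical; this is not the repository's actual implementation, and the SDK install steps are left out.

```python
import shlex
from typing import List

# The config line this PR adds, as quoted in the PR description.
CHECK_HASHES_LINE = "gcloud config set storage/check_hashes if_fast_else_skip"


def build_gcs_download_subcommands(bucket_path: str, dest_dir: str) -> List[str]:
    """Hypothetical builder (assumption): returns an ordered list of shell subcommands."""
    return [
        # Existing bootstrap steps (SDK install etc.) are omitted from this sketch.
        "gcloud config set disable_usage_reporting true",
        # New: must be set before any transfer command runs.
        CHECK_HASHES_LINE,
        f"gcloud storage rsync -r {shlex.quote(bucket_path)} {shlex.quote(dest_dir)}",
    ]


if __name__ == "__main__":
    for cmd in build_gcs_download_subcommands("gs://example-bucket/model", "model_files"):
        print(cmd)
```

The only point of the sketch is ordering: the setting has to be written to the gcloud configuration before the first rsync or cp is invoked.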

Why `if_fast_else_skip`?

gcloud storage exposes four values for storage/check_hashes:

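| Value | Behavior |
| --- | --- |
| `if_fast_else_fail` (default) | Verify only when a fast (compiled) hash implementation is available; otherwise fail the download |
| `if_fast_else_skip` | Verify when the fast hash lib is available; skip the check when it isn't |
| `always` | Always verify, regardless of the performance cost |
| `never` | Never verify |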

if_fast_else_skip is the safe middle ground: it preserves verification where the lib happens to be present (e.g. richer images) and degrades gracefully where it isn't. Adding google-crc32c to the inference image was considered but is a much larger change for marginal benefit: the transfer is still covered by transport-level integrity checks (TCP checksums, plus TLS, since gcloud talks to GCS over HTTPS), and the safetensors / TRT loaders verify file structure on read.
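To see which side of that trade-off a given image falls on, a quick probe like the one below can help. It only checks whether a `google_crc32c` module is importable by the interpreter running the script. That is an approximation: gcloud uses its own bundled interpreter and `third_party` path, and an importable module does not by itself prove the compiled extension is in use. The helper name `fast_crc32c_available` is made up for this sketch.

```python
import importlib.util


def fast_crc32c_available(module_name: str = "google_crc32c") -> bool:
    """Return True if the named top-level module can be found by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if fast_crc32c_available():
        print("google_crc32c found: hash verification may be available")
    else:
        print("google_crc32c not found: if_fast_else_skip would skip verification")
```

Running it with the same Python binary that the gcloud error message names (here `/usr/bin/python3.12`) gives the most relevant answer.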

Affected functions

CreateLLMModelBundleV1UseCase.load_model_weights_sub_commands_gcs — used by the standard gs:// weight-download path (S3/Azure variants are unaffected)
CreateLLMModelBundleV1UseCase.load_model_files_sub_commands_trt_llm (gs:// branch) — used by the TensorRT-LLM checkpoint download path

Verification

End-to-end verified on a GKE-deployed gpt-oss-20b vLLM endpoint that was stuck in CrashLoopBackOff with the symptom above. After applying the equivalent live patch to the rendered Deployment command:

gcloud storage rsync completes (~1.7 GiB/s from a Cloud Storage bucket in the same region)
vLLM loads all three safetensors shards in ~2.3s
Pod transitions 0/2 CrashLoopBackOff → 2/2 Running and /health returns 200

No image rebuild was required; only the rendered command needed to change.

Tests

test_load_model_weights_sub_commands (GCS branch, both trust_remote_code=False and True) and test_load_model_files_sub_commands_trt_llm_gcs updated to assert the new config line is present in the rendered subcommands.
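Beyond asserting presence, a test could also check ordering, since the setting only helps if it is applied before the transfer starts. The helper below is a minimal sketch under the assumption that the rendered subcommands are a list with one shell command per element; `config_precedes_transfer` is a hypothetical name and is not part of the repository's test suite.

```python
from typing import List

CONFIG_LINE = "gcloud config set storage/check_hashes if_fast_else_skip"
TRANSFER_PREFIXES = ("gcloud storage rsync", "gcloud storage cp")


def config_precedes_transfer(subcommands: List[str]) -> bool:
    """True if CONFIG_LINE is present and comes before every rsync/cp entry.

    With no transfer entries at all, presence of CONFIG_LINE is sufficient.
    """
    if CONFIG_LINE not in subcommands:
        return False
    config_idx = subcommands.index(CONFIG_LINE)
    transfer_idxs = [
        i for i, cmd in enumerate(subcommands) if cmd.startswith(TRANSFER_PREFIXES)
    ]
    return all(config_idx < i for i in transfer_idxs)


if __name__ == "__main__":
    rendered = [
        "gcloud config set disable_usage_reporting true",
        CONFIG_LINE,
        "gcloud storage rsync -r gs://example-bucket/model model_files",
    ]
    print(config_precedes_transfer(rendered))
```

If the real subcommands are joined into a single shell string instead, the same idea applies using substring positions rather than list indices.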

Test plan

CI green
Manual: redeploy a GCS-backed vLLM endpoint on GKE and verify the rsync completes and the pod reaches Ready

Greptile Summary

Appends gcloud config set storage/check_hashes if_fast_else_skip to the gcloud bootstrap chain in both GCS code paths (load_model_weights_sub_commands_gcs and the gs:// branch of load_model_files_sub_commands_trt_llm), fixing a CrashLoopBackOff caused by the missing google-crc32c library in inference containers.
The chosen value (if_fast_else_skip) preserves hash verification when the fast-hash library is present while degrading gracefully when it isn't, without requiring an image rebuild.
Unit tests are updated in lock-step to assert the new config line appears in rendered subcommands for all three affected test cases.

Confidence Score: 5/5

This PR is safe to merge — minimal, targeted fix with no logic changes beyond the gcloud config addition.

The change is a single-line addition to two parallel GCS bootstrap chains, both of which are well-understood shell command strings. The chosen if_fast_else_skip value is the documented safe degradation path. Tests are updated in lock-step. No security concerns, no custom rule violations, and the fix is end-to-end verified per the PR description.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py | Adds `gcloud config set storage/check_hashes if_fast_else_skip` to both GCS download bootstrap chains so rsync/cp proceeds without `google-crc32c`, fixing CrashLoopBackOff on GCS-backed pods. |
| model-engine/tests/unit/domain/test_llm_use_cases.py | Test expectations updated for both `test_load_model_weights_sub_commands` (GCS branch, trust/no-trust variants) and `test_load_model_files_sub_commands_trt_llm_gcs` to assert the new config line is present in rendered subcommands. |

Sequence Diagram

sequenceDiagram
    participant Pod as Inference Pod
    participant gcloud as gcloud CLI (on-the-fly install)
    participant GCS as Google Cloud Storage

    Pod->>gcloud: curl ... | tar -xz (install SDK)
    Pod->>gcloud: config set disable_usage_reporting true
    Pod->>gcloud: config set storage/check_hashes if_fast_else_skip
    Note over gcloud: hash check skipped if google-crc32c absent
    Pod->>gcloud: gcloud storage rsync / cp (model weights)
    gcloud->>GCS: download model files
    GCS-->>gcloud: file data (TCP integrity guaranteed)
    gcloud-->>Pod: exit 0 (no longer blocked on crc32c)
    Pod->>Pod: vLLM / TRT loader reads model_files/


…s without google-crc32c

With the default `storage/check_hashes` setting, `gcloud storage rsync` and `gcloud storage cp` require the `google-crc32c` library to compute CRC32C hashes of downloaded files. The inference container that runs the `load_model_weights_sub_commands_gcs` and `load_model_files_sub_commands_trt_llm` (gs://) chains does not have that lib installed, so every download task fails with:

ERROR: Task '...safetensors' failed: This copy was skipped since fast hash calculation tools are not installed.

The download still proceeds (TCP-level integrity is intact), but gcloud refuses to mark each file as completed and the rsync exits non-zero; the inference pod then falls through into vLLM/TRT startup with an empty `model_files` directory.

Setting `storage/check_hashes=if_fast_else_skip` lets gcloud skip the optional integrity check when the fast hash lib isn't available, while preserving it where it is.

End-to-end verified on a GKE-deployed gpt-oss-20b vLLM endpoint: the rsync now completes (~1.7 GiB/s in-region), vLLM loads the safetensors shards, and the pod transitions to Ready. Reproduced and fixed without modifying the inference image.

Existing `test_load_model_weights_sub_commands` (GCS branch) and `test_load_model_files_sub_commands_trt_llm_gcs` updated to expect the new config line.

postevanus-scale requested review from arniechops and lilyz-ai May 5, 2026 15:32

postevanus-scale enabled auto-merge (squash) May 5, 2026 16:54

lilyz-ai approved these changes May 5, 2026


postevanus-scale added 2 commits May 6, 2026 09:22

Merge branch 'main' into fix/gcs-check-hashes

43535d8

docs(gcs): tighten check_hashes rationale comments

77bdf32

postevanus-scale force-pushed the fix/gcs-check-hashes branch from dc8467f to 77bdf32 Compare May 6, 2026 08:33

postevanus-scale disabled auto-merge May 6, 2026 08:36

postevanus-scale merged commit 798747b into scaleapi:main May 6, 2026
8 checks passed

postevanus-scale mentioned this pull request May 6, 2026

Fix/gcs check hashes #821

Closed

postevanus-scale merged 3 commits into scaleapi:main from postevanus-scale:fix/gcs-check-hashes
