From 9d31a1093ba6f61ce0b416fd15ab01c31c143f51 Mon Sep 17 00:00:00 2001
From: Komh
Date: Sat, 2 May 2026 01:51:54 +0000
Subject: [PATCH 1/3] [ai] Running vLLM on CPU-only nodes for experimental LLM serving

---
 ...only_nodes_for_experimental_LLM_serving.md | 227 ++++++++++++++++++
 1 file changed, 227 insertions(+)
 create mode 100644 docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md

diff --git a/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md b/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md
new file mode 100644
index 00000000..5cc689fd
--- /dev/null
+++ b/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md
@@ -0,0 +1,227 @@
+---
+kind:
+  - How To
+products:
+  - Alauda Container Platform
+ProductsVersion:
+  - 4.1.0,4.2.x
+---
+## Overview
+
+vLLM is the de facto open-source inference engine for large language models, valued for its high throughput, low latency, and paged-attention memory layout. It is engineered around GPU acceleration, but the project upstream also publishes a CPU build that lets a developer prototype an OpenAI-compatible model endpoint without dedicated accelerator hardware. CPU mode is **not** a production target — token throughput is one to two orders of magnitude below a GPU run — but it is enough to wire up an end-to-end serving topology, exercise client tooling, and benchmark relative behaviour before committing to GPU-backed nodes.
+
+This note records a reproducible recipe for running vLLM on CPU worker nodes through standard Kubernetes primitives only: a Deployment, a PersistentVolumeClaim for the Hugging Face model cache, a Secret carrying the model-hub access token, a Service for in-cluster reachability, and an Ingress for external curl access. Production AI workloads on the platform should use the dedicated AI surface (KServe-based serving, GPU device plugins, the `hardware_accelerator` operators) — this article is intentionally narrow and intended for early evaluation work.
+
+## Resolution
+
+### Build a CPU-targeted vLLM image
+
+Upstream now publishes pre-built CPU images via `vllm-project/vllm` releases; pull one of those if it matches the target architecture. To build locally from source, clone the repository and use the CPU Dockerfile:
+
+```bash
+git clone https://github.com/vllm-project/vllm
+cd vllm
+docker buildx build \
+  --platform linux/amd64 \
+  -t registry.example.com/lab/vllm-cpu:latest \
+  -f docker/Dockerfile.cpu .
+docker push registry.example.com/lab/vllm-cpu:latest
+```
+
+Push the image to whichever registry the cluster pulls from. If the target namespace requires a pull secret, attach it to the workload's ServiceAccount with `imagePullSecrets`.
+
+### Provide persistent storage for the Hugging Face cache
+
+vLLM downloads the model weights on first start. Without persistent storage, every pod restart re-downloads several gigabytes; with a PVC mapped to `~/.cache/huggingface`, the cache survives restarts. Any RWO storage class works for a single replica:
+
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: vllm-hf-cache
+  namespace: ai-lab
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 50Gi
+```
+
+For node-local backing, an LVM-style local-storage CSI (the `Alauda Build of Local Storage` or `Alauda Build of TopoLVM` operator) gives the lowest-latency path — pin the PVC to a chosen worker via the relevant `volumeBindingMode: WaitForFirstConsumer` storage class.
+
+### Provide the Hugging Face access token
+
+Most Hugging Face models require accepting a license and authenticating via a personal access token. Hold the token in a Secret rather than in the Deployment spec:
+
+```bash
+kubectl -n ai-lab create secret generic hf-token \
+  --from-literal=HUGGING_FACE_HUB_TOKEN=hf_xxx_redacted_xxx
+```
+
+### Deploy vLLM
+
+The Deployment runs a single replica that mounts the PVC, sources the token from the Secret, and exposes the OpenAI-compatible HTTP API on port `8001`. CPU-only inference is memory-bound, so request enough RAM to hold the model weights plus the KV cache; for `Llama-3.2-1B-Instruct` allocate at least 8 GiB. Pin the pod to CPU worker nodes with a node selector if the cluster has heterogeneous capacity.
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: vllm-cpu
+  namespace: ai-lab
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: vllm-cpu
+  template:
+    metadata:
+      labels:
+        app: vllm-cpu
+    spec:
+      containers:
+        - name: vllm
+          image: registry.example.com/lab/vllm-cpu:latest
+          args:
+            - serve
+            - meta-llama/Llama-3.2-1B-Instruct
+            - --port=8001
+          ports:
+            - containerPort: 8001
+              name: http
+          envFrom:
+            - secretRef:
+                name: hf-token
+          resources:
+            requests:
+              cpu: "4"
+              memory: 8Gi
+            limits:
+              cpu: "8"
+              memory: 16Gi
+          volumeMounts:
+            - name: hf-cache
+              mountPath: /root/.cache/huggingface
+      volumes:
+        - name: hf-cache
+          persistentVolumeClaim:
+            claimName: vllm-hf-cache
+```
+
+### Expose the endpoint
+
+Inside the cluster, a ClusterIP Service is enough:
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: vllm-cpu
+  namespace: ai-lab
+spec:
+  selector:
+    app: vllm-cpu
+  ports:
+    - name: http
+      port: 80
+      targetPort: 8001
+```
+
+For external reachability, use a standard Ingress (or the platform's ALB Operator if a richer L7 surface is needed). The OpenAI-compatible API does not require sticky sessions, so the default round-robin behaviour is fine:
+
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: vllm-cpu
+  namespace: ai-lab
+spec:
+  rules:
+    - host: vllm.lab.example.com
+      http:
+        paths:
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name: vllm-cpu
+                port:
+                  number: 80
+```
+
+A standard Pod Security Standard `restricted` namespace is enough for the CPU build — there is no privileged-container requirement. If the target namespace enforces `baseline` or `restricted`, run vLLM as a non-root user inside the image; the default upstream Dockerfile already does this.
+
+### Sanity-check the endpoint
+
+Once the Deployment is `Available` and the model has finished downloading (watch `kubectl logs` for the `Application startup complete` line), drive a chat request through the Ingress:
+
+```bash
+curl -X POST http://vllm.lab.example.com/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "meta-llama/Llama-3.2-1B-Instruct",
+    "messages": [
+      {"role": "user", "content": "Hello! What is the capital of Massachusetts?"}
+    ]
+  }'
+```
+
+A successful response carries a `choices[0].message.content` field with a model-generated answer.
+
+### Benchmark the deployment
+
+For a relative read of throughput and latency, drive the endpoint with `guidellm`, the open-source benchmarking tool from the vLLM team:
+
+```bash
+pip install guidellm
+export HF_TOKEN=hf_xxx_redacted_xxx
+guidellm benchmark \
+  --target http://vllm.lab.example.com/v1 \
+  --model meta-llama/Llama-3.2-1B-Instruct \
+  --data "prompt_tokens=512,output_tokens=128" \
+  --rate-type sweep \
+  --max-seconds 240
+```
+
+The sweep mode starts at one in-flight request and ramps to saturation, which exposes the server's behaviour under load. The standard reported indicators are:
+
+- **Requests per second (RPS)** — completed inference requests per second; whole-system throughput.
+- **Time to first token (TTFT)** — wall time from request arrival to the first emitted token; chat-latency proxy.
+- **Inter-token latency (ITL)** — time between successive tokens; streaming-quality proxy.
+- **End-to-end latency** — total request duration; relevant for batch and offline calls.
+
+CPU-only numbers will sit far below GPU baselines — that is expected. Use the run as a control to validate the full pipeline (image, PVC, token plumbing, Service, Ingress) before re-running the same plan against GPU-backed nodes through the platform's KServe-based AI surface.
+
+## Diagnostic Steps
+
+If the pod loops in `CrashLoopBackOff`, the most common causes are out-of-memory (model weights exceed the container limit), Hugging Face authentication failure, or registry pull errors:
+
+```bash
+kubectl -n ai-lab describe pod -l app=vllm-cpu
+kubectl -n ai-lab logs deploy/vllm-cpu --tail=200
+```
+
+Check the previous container instance for OOM evidence (`OOMKilled` reason in the pod status). Raise the memory request and limit in steps of 4 GiB until the pod stabilises.
+
+If model download stalls, confirm the Secret was wired in correctly:
+
+```bash
+kubectl -n ai-lab exec deploy/vllm-cpu -- env | grep HUGGING_FACE
+```
+
+A missing or blank token causes `huggingface_hub` to fall back to anonymous access and download fails for gated models such as the Llama family.
+
+If the curl call returns 404 or times out, walk the chain bottom-up:
+
+```bash
+kubectl -n ai-lab port-forward svc/vllm-cpu 8001:80
+curl -s http://127.0.0.1:8001/v1/models
+```
+
+A successful local response confirms the Service and pod are healthy; a failed external curl after that points at the Ingress controller, DNS, or ingress class configuration.
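
When the port-forward check passes but the external request still fails, one detail worth checking is whether the Ingress names a class that a running controller actually watches. The manifest below is a minimal sketch, not part of the original article: `spec.ingressClassName` is a standard `networking.k8s.io/v1` field, while the value `example-class` is a hypothetical placeholder to be replaced with a name reported by `kubectl get ingressclass`.

```yaml
# Sketch only: the vllm-cpu Ingress from this article with an explicit class added.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-cpu
  namespace: ai-lab
spec:
  # Hypothetical class name; substitute one listed by `kubectl get ingressclass`.
  ingressClassName: example-class
  rules:
    - host: vllm.lab.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-cpu
                port:
                  number: 80
```

If the cluster marks one IngressClass as the default, the field can usually be left unset; setting it explicitly mainly helps when several controllers coexist.
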
+
+To verify the cache PVC is doing its job, compare startup time of a fresh pod against a restarted pod — the second start should skip the multi-gigabyte download:
+
+```bash
+kubectl -n ai-lab logs deploy/vllm-cpu --tail=20 | grep -E '(Loading model|Downloading|Application startup)'
+```

From 182f545a9867e76775c41fcf53426fb4b29787c0 Mon Sep 17 00:00:00 2001
From: Komh
Date: Sat, 2 May 2026 12:36:18 +0000
Subject: [PATCH 2/3] [ai] Running vLLM on CPU-only nodes for experimental LLM serving

---
 ...nning_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md b/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md
index 5cc689fd..9919e223 100644
--- a/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md
+++ b/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md
@@ -6,6 +6,8 @@ products:
 ProductsVersion:
   - 4.1.0,4.2.x
 ---
+
+# Running vLLM on CPU-only nodes for experimental LLM serving
 ## Overview
 
 vLLM is the de facto open-source inference engine for large language models, valued for its high throughput, low latency, and paged-attention memory layout. It is engineered around GPU acceleration, but the project upstream also publishes a CPU build that lets a developer prototype an OpenAI-compatible model endpoint without dedicated accelerator hardware. CPU mode is **not** a production target — token throughput is one to two orders of magnitude below a GPU run — but it is enough to wire up an end-to-end serving topology, exercise client tooling, and benchmark relative behaviour before committing to GPU-backed nodes.

From 5556054e8a265e1d6113980c55b40ca55b1b390a Mon Sep 17 00:00:00 2001
From: Komh
Date: Sun, 17 May 2026 03:28:18 +0000
Subject: [PATCH 3/3] [ai] Deploying a CPU-only vLLM inference workload on ACP

---
 ...CPU_only_vLLM_inference_workload_on_ACP.md | 192 +++++++++++++++
 ...only_nodes_for_experimental_LLM_serving.md | 229 ------------------
 2 files changed, 192 insertions(+), 229 deletions(-)
 create mode 100644 docs/en/solutions/Deploying_a_CPU_only_vLLM_inference_workload_on_ACP.md
 delete mode 100644 docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md

diff --git a/docs/en/solutions/Deploying_a_CPU_only_vLLM_inference_workload_on_ACP.md b/docs/en/solutions/Deploying_a_CPU_only_vLLM_inference_workload_on_ACP.md
new file mode 100644
index 00000000..e331b9e4
--- /dev/null
+++ b/docs/en/solutions/Deploying_a_CPU_only_vLLM_inference_workload_on_ACP.md
@@ -0,0 +1,192 @@
+---
+kind:
+  - How To
+products:
+  - Alauda Container Platform
+ProductsVersion:
+  - 4.1.0,4.2.x
+---
+
+# Deploying a CPU-only vLLM inference workload on ACP
+
+## Issue
+
+A CPU-only vLLM container can be brought up on Alauda Container Platform using
+only generic Kubernetes primitives. The workload is a user-built `vllm` image
+fronted by a `Deployment`, a `PersistentVolumeClaim` mounted at the model cache
+directory inside the container, a `Service` exposing the inference port, and
+external access via an `Ingress`; no vLLM-specific operator or custom resource
+is required on the cluster. This pattern is useful for experimentation
+and benchmarking on clusters that lack accelerator hardware, and is explicitly
+positioned as non-production: CPU inference is experimental, and the supported
+production path on ACP uses GPU-accelerated serving via the Alauda AI suite
+rather than a hand-rolled `Deployment`.
+
+## Root Cause
+
+Model files are downloaded by vLLM at runtime from an upstream model registry,
+which makes the model cache directory inside the container the dominant cost
+on pod restart. Without a persistent volume mounted at that path, the model is
+re-downloaded every time the pod restarts, which is slow and wastes bandwidth.
+The standard mitigation is to back the cache directory with a
+`PersistentVolumeClaim` so that downloaded artifacts survive pod restarts.
+
+## Resolution
+
+Run the vLLM image as a regular `Deployment` in any namespace, mount a PVC at
+the cache directory, expose port `8001` with a `Service`, and front the
+`Service` with a standard `networking.k8s.io/v1` `Ingress` for external access. The default in-cluster ingress controller on ACP is ALB; the same
+manifest shape works against any conformant Ingress controller.
+
+Provision the PVC from a `StorageClass` that suits the cluster's storage
+posture. On ACP, node-local persistent storage is provided by the
+`local-storage-operator` package (catalog channel `stable`, currentCSV
+`local-storage-operator.v4.3.1`); LVM-backed local volumes are typically
+fronted by a TopoLVM-provisioned default `StorageClass`, and for
+shared-access scenarios any default `StorageClass` is acceptable.
+
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: vllm-cache
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 50Gi
+```
+
+Inject the model-registry access token into the container by referencing a
+`Secret` through an environment variable named `HUGGING_FACE_HUB_TOKEN`. The
+binding uses the standard `valueFrom.secretKeyRef` form; no platform-specific
+shape is involved.
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: hf-token
+type: Opaque
+stringData:
+  token:
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: vllm-cpu
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: vllm-cpu
+  template:
+    metadata:
+      labels:
+        app: vllm-cpu
+    spec:
+      containers:
+        - name: vllm
+          image: /vllm-cpu:
+          ports:
+            - containerPort: 8001
+          env:
+            - name: HUGGING_FACE_HUB_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: hf-token
+                  key: token
+          volumeMounts:
+            - name: cache
+              mountPath: /root/.cache/huggingface
+      volumes:
+        - name: cache
+          persistentVolumeClaim:
+            claimName: vllm-cache
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: vllm-cpu
+spec:
+  selector:
+    app: vllm-cpu
+  ports:
+    - port: 8001
+      targetPort: 8001
+---
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: vllm-cpu
+spec:
+  rules:
+    - host:
+      http:
+        paths:
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name: vllm-cpu
+                port:
+                  number: 8001
+```
+
+If the pod requires elevated capabilities, label the namespace to relax Pod
+Security Admission enforcement where needed:
+
+```bash
+kubectl label ns \
+  pod-security.kubernetes.io/enforce=privileged \
+  pod-security.kubernetes.io/warn=privileged \
+  pod-security.kubernetes.io/audit=privileged \
+  --overwrite
+```
+
+For a supported production deployment of vLLM on ACP, install the
+`aml-operator` package (catalog channel `alpha`, currentCSV
+`aml-operator.v1.4.0`, install mode `AllNamespaces`) which provisions the
+Alauda AI suite — including `KServe` and an `InferenceService` /
+`ServingRuntime` API on top of `kserveless-operator` — and serves vLLM
+through a managed `InferenceService` resource rather than a hand-written
+`Deployment`.
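
For the hand-written `Deployment` in this section, a container-level fragment along the following lines could be merged under `spec.template.spec.containers[0]`. It is a sketch under stated assumptions rather than part of the patch: the `args` and `resources` values are carried over from the earlier revision of the article in PATCH 1/3, and the readiness probe assumes the image runs vLLM's OpenAI-compatible server, which serves a `/health` route; the probe timings are illustrative.

```yaml
# Fragment only: merge into the vllm container of the vllm-cpu Deployment.
args:                      # taken from the PATCH 1/3 revision; assumes the image entrypoint is the vllm CLI
  - serve
  - meta-llama/Llama-3.2-1B-Instruct
  - --port=8001
resources:                 # PATCH 1/3 sizing for a 1B-parameter model; adjust per model
  requests:
    cpu: "4"
    memory: 8Gi
  limits:
    cpu: "8"
    memory: 16Gi
readinessProbe:
  httpGet:
    path: /health          # assumed health route of the OpenAI-compatible server
    port: 8001
  initialDelaySeconds: 30  # illustrative timing
  periodSeconds: 10
  failureThreshold: 60     # pod stays NotReady for roughly ten minutes of failures while weights download
```

With a readiness probe in place, the `Service` only routes traffic once the server answers, so a request sent through the `Ingress` during the initial model download is not forwarded to a pod that is still starting.
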
+
+## Diagnostic Steps
+
+Confirm the pod is admitted and running, and that the cache volume is mounted
+at the expected path so that re-downloads do not occur across restarts:
+
+```bash
+kubectl get pod -l app=vllm-cpu
+kubectl describe pod -l app=vllm-cpu
+kubectl get pvc vllm-cache
+```
+
+Confirm the access token is reaching the container via the expected env-var
+shape (the value itself is sensitive — only check that the variable is
+present):
+
+```bash
+kubectl exec deploy/vllm-cpu -- printenv | grep -c '^HUGGING_FACE_HUB_TOKEN='
+```
+
+Confirm external reachability through the configured `Ingress` host or
+NodePort:
+
+```bash
+kubectl get ingress vllm-cpu
+kubectl get svc vllm-cpu
+```
+
+If a production-grade managed path is required instead of this experimental
+CPU shape, inspect the Alauda AI install and the available `ServingRuntime`
+entries on the cluster:
+
+```bash
+kubectl get packagemanifest aml-operator -n cpaas-system \
+  -o jsonpath='{.status.channels[?(@.name=="alpha")].currentCSV}'
+kubectl get servingruntime -A
+kubectl get clusterservingruntime
+```
diff --git a/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md b/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md
deleted file mode 100644
index 9919e223..00000000
--- a/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md
+++ /dev/null
@@ -1,229 +0,0 @@
----
-kind:
-  - How To
-products:
-  - Alauda Container Platform
-ProductsVersion:
-  - 4.1.0,4.2.x
----
-
-# Running vLLM on CPU-only nodes for experimental LLM serving
-## Overview
-
-vLLM is the de facto open-source inference engine for large language models, valued for its high throughput, low latency, and paged-attention memory layout. It is engineered around GPU acceleration, but the project upstream also publishes a CPU build that lets a developer prototype an OpenAI-compatible model endpoint without dedicated accelerator hardware. CPU mode is **not** a production target — token throughput is one to two orders of magnitude below a GPU run — but it is enough to wire up an end-to-end serving topology, exercise client tooling, and benchmark relative behaviour before committing to GPU-backed nodes.
-
-This note records a reproducible recipe for running vLLM on CPU worker nodes through standard Kubernetes primitives only: a Deployment, a PersistentVolumeClaim for the Hugging Face model cache, a Secret carrying the model-hub access token, a Service for in-cluster reachability, and an Ingress for external curl access. Production AI workloads on the platform should use the dedicated AI surface (KServe-based serving, GPU device plugins, the `hardware_accelerator` operators) — this article is intentionally narrow and intended for early evaluation work.
-
-## Resolution
-
-### Build a CPU-targeted vLLM image
-
-Upstream now publishes pre-built CPU images via `vllm-project/vllm` releases; pull one of those if it matches the target architecture. To build locally from source, clone the repository and use the CPU Dockerfile:
-
-```bash
-git clone https://github.com/vllm-project/vllm
-cd vllm
-docker buildx build \
-  --platform linux/amd64 \
-  -t registry.example.com/lab/vllm-cpu:latest \
-  -f docker/Dockerfile.cpu .
-docker push registry.example.com/lab/vllm-cpu:latest
-```
-
-Push the image to whichever registry the cluster pulls from. If the target namespace requires a pull secret, attach it to the workload's ServiceAccount with `imagePullSecrets`.
-
-### Provide persistent storage for the Hugging Face cache
-
-vLLM downloads the model weights on first start. Without persistent storage, every pod restart re-downloads several gigabytes; with a PVC mapped to `~/.cache/huggingface`, the cache survives restarts. Any RWO storage class works for a single replica:
-
-```yaml
-apiVersion: v1
-kind: PersistentVolumeClaim
-metadata:
-  name: vllm-hf-cache
-  namespace: ai-lab
-spec:
-  accessModes:
-    - ReadWriteOnce
-  resources:
-    requests:
-      storage: 50Gi
-```
-
-For node-local backing, an LVM-style local-storage CSI (the `Alauda Build of Local Storage` or `Alauda Build of TopoLVM` operator) gives the lowest-latency path — pin the PVC to a chosen worker via the relevant `volumeBindingMode: WaitForFirstConsumer` storage class.
-
-### Provide the Hugging Face access token
-
-Most Hugging Face models require accepting a license and authenticating via a personal access token. Hold the token in a Secret rather than in the Deployment spec:
-
-```bash
-kubectl -n ai-lab create secret generic hf-token \
-  --from-literal=HUGGING_FACE_HUB_TOKEN=hf_xxx_redacted_xxx
-```
-
-### Deploy vLLM
-
-The Deployment runs a single replica that mounts the PVC, sources the token from the Secret, and exposes the OpenAI-compatible HTTP API on port `8001`. CPU-only inference is memory-bound, so request enough RAM to hold the model weights plus the KV cache; for `Llama-3.2-1B-Instruct` allocate at least 8 GiB. Pin the pod to CPU worker nodes with a node selector if the cluster has heterogeneous capacity.
-
-```yaml
-apiVersion: apps/v1
-kind: Deployment
-metadata:
-  name: vllm-cpu
-  namespace: ai-lab
-spec:
-  replicas: 1
-  selector:
-    matchLabels:
-      app: vllm-cpu
-  template:
-    metadata:
-      labels:
-        app: vllm-cpu
-    spec:
-      containers:
-        - name: vllm
-          image: registry.example.com/lab/vllm-cpu:latest
-          args:
-            - serve
-            - meta-llama/Llama-3.2-1B-Instruct
-            - --port=8001
-          ports:
-            - containerPort: 8001
-              name: http
-          envFrom:
-            - secretRef:
-                name: hf-token
-          resources:
-            requests:
-              cpu: "4"
-              memory: 8Gi
-            limits:
-              cpu: "8"
-              memory: 16Gi
-          volumeMounts:
-            - name: hf-cache
-              mountPath: /root/.cache/huggingface
-      volumes:
-        - name: hf-cache
-          persistentVolumeClaim:
-            claimName: vllm-hf-cache
-```
-
-### Expose the endpoint
-
-Inside the cluster, a ClusterIP Service is enough:
-
-```yaml
-apiVersion: v1
-kind: Service
-metadata:
-  name: vllm-cpu
-  namespace: ai-lab
-spec:
-  selector:
-    app: vllm-cpu
-  ports:
-    - name: http
-      port: 80
-      targetPort: 8001
-```
-
-For external reachability, use a standard Ingress (or the platform's ALB Operator if a richer L7 surface is needed). The OpenAI-compatible API does not require sticky sessions, so the default round-robin behaviour is fine:
-
-```yaml
-apiVersion: networking.k8s.io/v1
-kind: Ingress
-metadata:
-  name: vllm-cpu
-  namespace: ai-lab
-spec:
-  rules:
-    - host: vllm.lab.example.com
-      http:
-        paths:
-          - path: /
-            pathType: Prefix
-            backend:
-              service:
-                name: vllm-cpu
-                port:
-                  number: 80
-```
-
-A standard Pod Security Standard `restricted` namespace is enough for the CPU build — there is no privileged-container requirement. If the target namespace enforces `baseline` or `restricted`, run vLLM as a non-root user inside the image; the default upstream Dockerfile already does this.
-
-### Sanity-check the endpoint
-
-Once the Deployment is `Available` and the model has finished downloading (watch `kubectl logs` for the `Application startup complete` line), drive a chat request through the Ingress:
-
-```bash
-curl -X POST http://vllm.lab.example.com/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "meta-llama/Llama-3.2-1B-Instruct",
-    "messages": [
-      {"role": "user", "content": "Hello! What is the capital of Massachusetts?"}
-    ]
-  }'
-```
-
-A successful response carries a `choices[0].message.content` field with a model-generated answer.
-
-### Benchmark the deployment
-
-For a relative read of throughput and latency, drive the endpoint with `guidellm`, the open-source benchmarking tool from the vLLM team:
-
-```bash
-pip install guidellm
-export HF_TOKEN=hf_xxx_redacted_xxx
-guidellm benchmark \
-  --target http://vllm.lab.example.com/v1 \
-  --model meta-llama/Llama-3.2-1B-Instruct \
-  --data "prompt_tokens=512,output_tokens=128" \
-  --rate-type sweep \
-  --max-seconds 240
-```
-
-The sweep mode starts at one in-flight request and ramps to saturation, which exposes the server's behaviour under load. The standard reported indicators are:
-
-- **Requests per second (RPS)** — completed inference requests per second; whole-system throughput.
-- **Time to first token (TTFT)** — wall time from request arrival to the first emitted token; chat-latency proxy.
-- **Inter-token latency (ITL)** — time between successive tokens; streaming-quality proxy.
-- **End-to-end latency** — total request duration; relevant for batch and offline calls.
-
-CPU-only numbers will sit far below GPU baselines — that is expected. Use the run as a control to validate the full pipeline (image, PVC, token plumbing, Service, Ingress) before re-running the same plan against GPU-backed nodes through the platform's KServe-based AI surface.
-
-## Diagnostic Steps
-
-If the pod loops in `CrashLoopBackOff`, the most common causes are out-of-memory (model weights exceed the container limit), Hugging Face authentication failure, or registry pull errors:
-
-```bash
-kubectl -n ai-lab describe pod -l app=vllm-cpu
-kubectl -n ai-lab logs deploy/vllm-cpu --tail=200
-```
-
-Check the previous container instance for OOM evidence (`OOMKilled` reason in the pod status). Raise the memory request and limit in steps of 4 GiB until the pod stabilises.
-
-If model download stalls, confirm the Secret was wired in correctly:
-
-```bash
-kubectl -n ai-lab exec deploy/vllm-cpu -- env | grep HUGGING_FACE
-```
-
-A missing or blank token causes `huggingface_hub` to fall back to anonymous access and download fails for gated models such as the Llama family.
-
-If the curl call returns 404 or times out, walk the chain bottom-up:
-
-```bash
-kubectl -n ai-lab port-forward svc/vllm-cpu 8001:80
-curl -s http://127.0.0.1:8001/v1/models
-```
-
-A successful local response confirms the Service and pod are healthy; a failed external curl after that points at the Ingress controller, DNS, or ingress class configuration.
-
-To verify the cache PVC is doing its job, compare startup time of a fresh pod against a restarted pod — the second start should skip the multi-gigabyte download:
-
-```bash
-kubectl -n ai-lab logs deploy/vllm-cpu --tail=20 | grep -E '(Loading model|Downloading|Application startup)'
-```