From 9d31a1093ba6f61ce0b416fd15ab01c31c143f51 Mon Sep 17 00:00:00 2001
From: Komh <mail@guojing.io>
Date: Sat, 2 May 2026 01:51:54 +0000
Subject: [PATCH 1/3] [ai] Running vLLM on CPU-only nodes for experimental LLM
 serving

---
 ...only_nodes_for_experimental_LLM_serving.md | 227 ++++++++++++++++++
 1 file changed, 227 insertions(+)
 create mode 100644 docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md

diff --git a/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md b/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md
new file mode 100644
index 00000000..5cc689fd
--- /dev/null
+++ b/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md
@@ -0,0 +1,227 @@
+---
+kind:
+   - How To
+products:
+   - Alauda Container Platform
+ProductsVersion:
+   - 4.1.0,4.2.x
+---
+## Overview
+
+vLLM is the de facto open-source inference engine for large language models, valued for its high throughput, low latency, and paged-attention memory layout. It is engineered around GPU acceleration, but the project upstream also publishes a CPU build that lets a developer prototype an OpenAI-compatible model endpoint without dedicated accelerator hardware. CPU mode is **not** a production target — token throughput is one to two orders of magnitude below a GPU run — but it is enough to wire up an end-to-end serving topology, exercise client tooling, and benchmark relative behaviour before committing to GPU-backed nodes.
+
+This note records a reproducible recipe for running vLLM on CPU worker nodes through standard Kubernetes primitives only: a Deployment, a PersistentVolumeClaim for the Hugging Face model cache, a Secret carrying the model-hub access token, a Service for in-cluster reachability, and an Ingress for external curl access. Production AI workloads on the platform should use the dedicated AI surface (KServe-based serving, GPU device plugins, the `hardware_accelerator` operators) — this article is intentionally narrow and intended for early evaluation work.
+
+## Resolution
+
+### Build a CPU-targeted vLLM image
+
+Upstream now publishes pre-built CPU images via `vllm-project/vllm` releases; pull one of those if it matches the target architecture. To build locally from source, clone the repository and use the CPU Dockerfile:
+
+```bash
+git clone https://github.com/vllm-project/vllm
+cd vllm
+docker buildx build \
+  --platform linux/amd64 \
+  -t registry.example.com/lab/vllm-cpu:latest \
+  -f docker/Dockerfile.cpu .
+docker push registry.example.com/lab/vllm-cpu:latest
+```
+
+Push the image to whichever registry the cluster pulls from. If the target namespace requires a pull secret, attach it to the workload's ServiceAccount with `imagePullSecrets`.
+
+### Provide persistent storage for the Hugging Face cache
+
+vLLM downloads the model weights on first start. Without persistent storage, every pod restart re-downloads several gigabytes; with a PVC mapped to `~/.cache/huggingface`, the cache survives restarts. Any RWO storage class works for a single replica:
+
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: vllm-hf-cache
+  namespace: ai-lab
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 50Gi
+```
+
+For node-local backing, an LVM-style local-storage CSI (the `Alauda Build of Local Storage` or `Alauda Build of TopoLVM` operator) gives the lowest-latency path — pin the PVC to a chosen worker via the relevant `volumeBindingMode: WaitForFirstConsumer` storage class.
+
+### Provide the Hugging Face access token
+
+Most Hugging Face models require accepting a license and authenticating via a personal access token. Hold the token in a Secret rather than in the Deployment spec:
+
+```bash
+kubectl -n ai-lab create secret generic hf-token \
+  --from-literal=HUGGING_FACE_HUB_TOKEN=hf_xxx_redacted_xxx
+```
+
+### Deploy vLLM
+
+The Deployment runs a single replica that mounts the PVC, sources the token from the Secret, and exposes the OpenAI-compatible HTTP API on port `8001`. CPU-only inference is memory-bound, so request enough RAM to hold the model weights plus the KV cache; for `Llama-3.2-1B-Instruct` allocate at least 8 GiB. Pin the pod to CPU worker nodes with a node selector if the cluster has heterogeneous capacity.
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: vllm-cpu
+  namespace: ai-lab
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: vllm-cpu
+  template:
+    metadata:
+      labels:
+        app: vllm-cpu
+    spec:
+      containers:
+        - name: vllm
+          image: registry.example.com/lab/vllm-cpu:latest
+          args:
+            - serve
+            - meta-llama/Llama-3.2-1B-Instruct
+            - --port=8001
+          ports:
+            - containerPort: 8001
+              name: http
+          envFrom:
+            - secretRef:
+                name: hf-token
+          resources:
+            requests:
+              cpu: "4"
+              memory: 8Gi
+            limits:
+              cpu: "8"
+              memory: 16Gi
+          volumeMounts:
+            - name: hf-cache
+              mountPath: /root/.cache/huggingface
+      volumes:
+        - name: hf-cache
+          persistentVolumeClaim:
+            claimName: vllm-hf-cache
+```
+
+### Expose the endpoint
+
+Inside the cluster, a ClusterIP Service is enough:
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: vllm-cpu
+  namespace: ai-lab
+spec:
+  selector:
+    app: vllm-cpu
+  ports:
+    - name: http
+      port: 80
+      targetPort: 8001
+```
+
+For external reachability, use a standard Ingress (or the platform's ALB Operator if a richer L7 surface is needed). The OpenAI-compatible API does not require sticky sessions, so the default round-robin behaviour is fine:
+
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: vllm-cpu
+  namespace: ai-lab
+spec:
+  rules:
+    - host: vllm.lab.example.com
+      http:
+        paths:
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name: vllm-cpu
+                port:
+                  number: 80
+```
+
+A standard Pod Security Standard `restricted` namespace is enough for the CPU build — there is no privileged-container requirement. If the target namespace enforces `baseline` or `restricted`, run vLLM as a non-root user inside the image; the default upstream Dockerfile already does this.
+
+### Sanity-check the endpoint
+
+Once the Deployment is `Available` and the model has finished downloading (watch `kubectl logs` for the `Application startup complete` line), drive a chat request through the Ingress:
+
+```bash
+curl -X POST http://vllm.lab.example.com/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "meta-llama/Llama-3.2-1B-Instruct",
+    "messages": [
+      {"role": "user", "content": "Hello! What is the capital of Massachusetts?"}
+    ]
+  }'
+```
+
+A successful response carries a `choices[0].message.content` field with a model-generated answer.
+
+### Benchmark the deployment
+
+For a relative read of throughput and latency, drive the endpoint with `guidellm`, the open-source benchmarking tool from the vLLM team:
+
+```bash
+pip install guidellm
+export HF_TOKEN=hf_xxx_redacted_xxx
+guidellm benchmark \
+  --target http://vllm.lab.example.com/v1 \
+  --model meta-llama/Llama-3.2-1B-Instruct \
+  --data "prompt_tokens=512,output_tokens=128" \
+  --rate-type sweep \
+  --max-seconds 240
+```
+
+The sweep mode starts at one in-flight request and ramps to saturation, which exposes the server's behaviour under load. The standard reported indicators are:
+
+- **Requests per second (RPS)** — completed inference requests per second; whole-system throughput.
+- **Time to first token (TTFT)** — wall time from request arrival to the first emitted token; chat-latency proxy.
+- **Inter-token latency (ITL)** — time between successive tokens; streaming-quality proxy.
+- **End-to-end latency** — total request duration; relevant for batch and offline calls.
+
+CPU-only numbers will sit far below GPU baselines — that is expected. Use the run as a control to validate the full pipeline (image, PVC, token plumbing, Service, Ingress) before re-running the same plan against GPU-backed nodes through the platform's KServe-based AI surface.
+
+## Diagnostic Steps
+
+If the pod loops in `CrashLoopBackOff`, the most common causes are out-of-memory (model weights exceed the container limit), Hugging Face authentication failure, or registry pull errors:
+
+```bash
+kubectl -n ai-lab describe pod -l app=vllm-cpu
+kubectl -n ai-lab logs deploy/vllm-cpu --tail=200
+```
+
+Check the previous container instance for OOM evidence (`OOMKilled` reason in the pod status). Raise the memory request and limit in steps of 4 GiB until the pod stabilises.
+
+If model download stalls, confirm the Secret was wired in correctly:
+
+```bash
+kubectl -n ai-lab exec deploy/vllm-cpu -- env | grep HUGGING_FACE
+```
+
+A missing or blank token causes `huggingface_hub` to fall back to anonymous access and download fails for gated models such as the Llama family.
+
+If the curl call returns 404 or times out, walk the chain bottom-up:
+
+```bash
+kubectl -n ai-lab port-forward svc/vllm-cpu 8001:80
+curl -s http://127.0.0.1:8001/v1/models
+```
+
+A successful local response confirms the Service and pod are healthy; a failed external curl after that points at the Ingress controller, DNS, or ingress class configuration.
+
+To verify the cache PVC is doing its job, compare startup time of a fresh pod against a restarted pod — the second start should skip the multi-gigabyte download:
+
+```bash
+kubectl -n ai-lab logs deploy/vllm-cpu --tail=20 | grep -E '(Loading model|Downloading|Application startup)'
+```

From 182f545a9867e76775c41fcf53426fb4b29787c0 Mon Sep 17 00:00:00 2001
From: Komh <mail@guojing.io>
Date: Sat, 2 May 2026 12:36:18 +0000
Subject: [PATCH 2/3] [ai] Running vLLM on CPU-only nodes for experimental LLM
 serving

---
 ...nning_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md b/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md
index 5cc689fd..9919e223 100644
--- a/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md
+++ b/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md
@@ -6,6 +6,8 @@ products:
 ProductsVersion:
    - 4.1.0,4.2.x
 ---
+
+# Running vLLM on CPU-only nodes for experimental LLM serving
 ## Overview
 
 vLLM is the de facto open-source inference engine for large language models, valued for its high throughput, low latency, and paged-attention memory layout. It is engineered around GPU acceleration, but the project upstream also publishes a CPU build that lets a developer prototype an OpenAI-compatible model endpoint without dedicated accelerator hardware. CPU mode is **not** a production target — token throughput is one to two orders of magnitude below a GPU run — but it is enough to wire up an end-to-end serving topology, exercise client tooling, and benchmark relative behaviour before committing to GPU-backed nodes.

From 5556054e8a265e1d6113980c55b40ca55b1b390a Mon Sep 17 00:00:00 2001
From: Komh <mail@guojing.io>
Date: Sun, 17 May 2026 03:28:18 +0000
Subject: [PATCH 3/3] [ai] Deploying a CPU-only vLLM inference workload on ACP

---
 ...CPU_only_vLLM_inference_workload_on_ACP.md | 192 +++++++++++++++
 ...only_nodes_for_experimental_LLM_serving.md | 229 ------------------
 2 files changed, 192 insertions(+), 229 deletions(-)
 create mode 100644 docs/en/solutions/Deploying_a_CPU_only_vLLM_inference_workload_on_ACP.md
 delete mode 100644 docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md
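
Note on the new article: its Diagnostic Steps list the `Ingress` and `Service` objects but never send a request to the endpoint. The Python sketch below shows one way to smoke-test it. It assumes the OpenAI-compatible `/v1/chat/completions` route and the `choices[0].message.content` response field described in the article this patch removes. The host and model name are the example values from that article, and the helper names `build_chat_request` and `extract_answer` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST for the OpenAI-compatible chat route."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_answer(response_body: str) -> str:
    """Return choices[0].message.content from a chat-completions response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


# Example values only: substitute the real Ingress host and the served model.
req = build_chat_request(
    "http://vllm.lab.example.com",
    "meta-llama/Llama-3.2-1B-Instruct",
    "Hello! What is the capital of Massachusetts?",
)
print(req.get_method(), req.full_url)
```

To send the request, pass `req` to `urllib.request.urlopen` with a generous timeout (CPU inference is slow) and feed the decoded body to `extract_answer`. Building and sending are kept separate so the payload can be inspected without a live endpoint.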

diff --git a/docs/en/solutions/Deploying_a_CPU_only_vLLM_inference_workload_on_ACP.md b/docs/en/solutions/Deploying_a_CPU_only_vLLM_inference_workload_on_ACP.md
new file mode 100644
index 00000000..e331b9e4
--- /dev/null
+++ b/docs/en/solutions/Deploying_a_CPU_only_vLLM_inference_workload_on_ACP.md
@@ -0,0 +1,192 @@
+---
+kind:
+   - How To
+products:
+   - Alauda Container Platform
+ProductsVersion:
+   - 4.1.0,4.2.x
+---
+
+# Deploying a CPU-only vLLM inference workload on ACP
+
+## Issue
+
+A CPU-only vLLM container can be brought up on Alauda Container Platform using
+only generic Kubernetes primitives. The workload is a user-built `vllm` image
+fronted by a `Deployment`, a `PersistentVolumeClaim` mounted at the model cache
+directory inside the container, a `Service` exposing the inference port, and
+external access via an `Ingress`; no vLLM-specific operator or custom resource
+is required on the cluster. This pattern is useful for experimentation
+and benchmarking on clusters that lack accelerator hardware, and is explicitly
+positioned as non-production: CPU inference is experimental, and the supported
+production path on ACP uses GPU-accelerated serving via the Alauda AI suite
+rather than a hand-rolled `Deployment`.
+
+## Root Cause
+
+vLLM downloads model files at runtime from an upstream model registry, so
+repopulating the model cache directory inside the container is the dominant
+cost of a pod restart. Without a persistent volume mounted at that path, the
+model is re-downloaded on every restart, which is slow and wastes bandwidth.
+The standard mitigation is to back the cache directory with a
+`PersistentVolumeClaim` so that downloaded artifacts survive pod restarts.
+
+## Resolution
+
+Run the vLLM image as a regular `Deployment` in any namespace, mount a PVC at
+the cache directory, expose port `8001` with a `Service`, and front the
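+`Service` with a standard `networking.k8s.io/v1` `Ingress` for external access. The manifests below assume the container starts the server on port `8001` (for example `vllm serve <model> --port=8001`), since vLLM listens on `8000` by default.
+The default in-cluster ingress controller on ACP is ALB; the same manifest shape works against any conformant Ingress controller.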
+
+Provision the PVC from a `StorageClass` that suits the cluster's storage
+posture. On ACP, node-local persistent storage is provided by the
+`local-storage-operator` package (catalog channel `stable`, currentCSV
+`local-storage-operator.v4.3.1`); LVM-backed local volumes are typically
+fronted by a TopoLVM-provisioned default `StorageClass`, and for
+shared-access scenarios any default `StorageClass` is acceptable.
+
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: vllm-cache
+spec:
+  accessModes:
+  - ReadWriteOnce
+  resources:
+    requests:
+      storage: 50Gi
+```
+
+Inject the model-registry access token into the container by referencing a
+`Secret` through an environment variable named `HUGGING_FACE_HUB_TOKEN`. The
+binding uses the standard `valueFrom.secretKeyRef` form; no platform-specific
+shape is involved.
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: hf-token
+type: Opaque
+stringData:
+  token: <model-registry-access-token>
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: vllm-cpu
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: vllm-cpu
+  template:
+    metadata:
+      labels:
+        app: vllm-cpu
+    spec:
+      containers:
+      - name: vllm
+        image: <registry>/vllm-cpu:<tag>
+        ports:
+        - containerPort: 8001
+        env:
+        - name: HUGGING_FACE_HUB_TOKEN
+          valueFrom:
+            secretKeyRef:
+              name: hf-token
+              key: token
+        volumeMounts:
+        - name: cache
+          mountPath: /root/.cache/huggingface
+      volumes:
+      - name: cache
+        persistentVolumeClaim:
+          claimName: vllm-cache
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: vllm-cpu
+spec:
+  selector:
+    app: vllm-cpu
+  ports:
+  - port: 8001
+    targetPort: 8001
+---
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: vllm-cpu
+spec:
+  rules:
+  - host: <external-host>
+    http:
+      paths:
+      - path: /
+        pathType: Prefix
+        backend:
+          service:
+            name: vllm-cpu
+            port:
+              number: 8001
+```
+
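+If admission rejects the pod (for example, the manifest above sets no `securityContext`, which the `restricted` level requires), label the namespace to relax Pod
+Security Admission enforcement. The example uses `privileged`; prefer the least permissive level that admits the pod, such as `baseline`: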
+
+```bash
+kubectl label ns <namespace> \
+  pod-security.kubernetes.io/enforce=privileged \
+  pod-security.kubernetes.io/warn=privileged \
+  pod-security.kubernetes.io/audit=privileged \
+  --overwrite
+```
+
+For a supported production deployment of vLLM on ACP, install the
+`aml-operator` package (catalog channel `alpha`, currentCSV
+`aml-operator.v1.4.0`, install mode `AllNamespaces`), which provisions the
+Alauda AI suite — including `KServe` and an `InferenceService` /
+`ServingRuntime` API on top of `kserveless-operator` — and serves vLLM
+through a managed `InferenceService` resource rather than a hand-written
+`Deployment`.
+
+## Diagnostic Steps
+
+Confirm the pod is admitted and running, and that the cache volume is mounted
+at the expected path so that re-downloads do not occur across restarts:
+
+```bash
+kubectl get pod -l app=vllm-cpu
+kubectl describe pod -l app=vllm-cpu
+kubectl get pvc vllm-cache
+```
+
+Confirm the access token is reaching the container via the expected env-var
+shape (the value itself is sensitive — only check that the variable is
+present):
+
+```bash
+kubectl exec deploy/vllm-cpu -- printenv | grep -c '^HUGGING_FACE_HUB_TOKEN='
+```
+
+Confirm that the `Ingress` has been assigned an address for the configured host
+and that the `Service` exposes port `8001`:
+
+```bash
+kubectl get ingress vllm-cpu
+kubectl get svc vllm-cpu
+```
+
+If a production-grade managed path is required instead of this experimental
+CPU shape, inspect the Alauda AI install and the available `ServingRuntime`
+entries on the cluster:
+
+```bash
+kubectl get packagemanifest aml-operator -n cpaas-system \
+  -o jsonpath='{.status.channels[?(@.name=="alpha")].currentCSV}'
+kubectl get servingruntime -A
+kubectl get clusterservingruntime
+```
diff --git a/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md b/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md
deleted file mode 100644
index 9919e223..00000000
--- a/docs/en/solutions/Running_vLLM_on_CPU_only_nodes_for_experimental_LLM_serving.md
+++ /dev/null
@@ -1,229 +0,0 @@
----
-kind:
-   - How To
-products:
-   - Alauda Container Platform
-ProductsVersion:
-   - 4.1.0,4.2.x
----
-
-# Running vLLM on CPU-only nodes for experimental LLM serving
-## Overview
-
-vLLM is the de facto open-source inference engine for large language models, valued for its high throughput, low latency, and paged-attention memory layout. It is engineered around GPU acceleration, but the project upstream also publishes a CPU build that lets a developer prototype an OpenAI-compatible model endpoint without dedicated accelerator hardware. CPU mode is **not** a production target — token throughput is one to two orders of magnitude below a GPU run — but it is enough to wire up an end-to-end serving topology, exercise client tooling, and benchmark relative behaviour before committing to GPU-backed nodes.
-
-This note records a reproducible recipe for running vLLM on CPU worker nodes through standard Kubernetes primitives only: a Deployment, a PersistentVolumeClaim for the Hugging Face model cache, a Secret carrying the model-hub access token, a Service for in-cluster reachability, and an Ingress for external curl access. Production AI workloads on the platform should use the dedicated AI surface (KServe-based serving, GPU device plugins, the `hardware_accelerator` operators) — this article is intentionally narrow and intended for early evaluation work.
-
-## Resolution
-
-### Build a CPU-targeted vLLM image
-
-Upstream now publishes pre-built CPU images via `vllm-project/vllm` releases; pull one of those if it matches the target architecture. To build locally from source, clone the repository and use the CPU Dockerfile:
-
-```bash
-git clone https://github.com/vllm-project/vllm
-cd vllm
-docker buildx build \
-  --platform linux/amd64 \
-  -t registry.example.com/lab/vllm-cpu:latest \
-  -f docker/Dockerfile.cpu .
-docker push registry.example.com/lab/vllm-cpu:latest
-```
-
-Push the image to whichever registry the cluster pulls from. If the target namespace requires a pull secret, attach it to the workload's ServiceAccount with `imagePullSecrets`.
-
-### Provide persistent storage for the Hugging Face cache
-
-vLLM downloads the model weights on first start. Without persistent storage, every pod restart re-downloads several gigabytes; with a PVC mapped to `~/.cache/huggingface`, the cache survives restarts. Any RWO storage class works for a single replica:
-
-```yaml
-apiVersion: v1
-kind: PersistentVolumeClaim
-metadata:
-  name: vllm-hf-cache
-  namespace: ai-lab
-spec:
-  accessModes:
-    - ReadWriteOnce
-  resources:
-    requests:
-      storage: 50Gi
-```
-
-For node-local backing, an LVM-style local-storage CSI (the `Alauda Build of Local Storage` or `Alauda Build of TopoLVM` operator) gives the lowest-latency path — pin the PVC to a chosen worker via the relevant `volumeBindingMode: WaitForFirstConsumer` storage class.
-
-### Provide the Hugging Face access token
-
-Most Hugging Face models require accepting a license and authenticating via a personal access token. Hold the token in a Secret rather than in the Deployment spec:
-
-```bash
-kubectl -n ai-lab create secret generic hf-token \
-  --from-literal=HUGGING_FACE_HUB_TOKEN=hf_xxx_redacted_xxx
-```
-
-### Deploy vLLM
-
-The Deployment runs a single replica that mounts the PVC, sources the token from the Secret, and exposes the OpenAI-compatible HTTP API on port `8001`. CPU-only inference is memory-bound, so request enough RAM to hold the model weights plus the KV cache; for `Llama-3.2-1B-Instruct` allocate at least 8 GiB. Pin the pod to CPU worker nodes with a node selector if the cluster has heterogeneous capacity.
-
-```yaml
-apiVersion: apps/v1
-kind: Deployment
-metadata:
-  name: vllm-cpu
-  namespace: ai-lab
-spec:
-  replicas: 1
-  selector:
-    matchLabels:
-      app: vllm-cpu
-  template:
-    metadata:
-      labels:
-        app: vllm-cpu
-    spec:
-      containers:
-        - name: vllm
-          image: registry.example.com/lab/vllm-cpu:latest
-          args:
-            - serve
-            - meta-llama/Llama-3.2-1B-Instruct
-            - --port=8001
-          ports:
-            - containerPort: 8001
-              name: http
-          envFrom:
-            - secretRef:
-                name: hf-token
-          resources:
-            requests:
-              cpu: "4"
-              memory: 8Gi
-            limits:
-              cpu: "8"
-              memory: 16Gi
-          volumeMounts:
-            - name: hf-cache
-              mountPath: /root/.cache/huggingface
-      volumes:
-        - name: hf-cache
-          persistentVolumeClaim:
-            claimName: vllm-hf-cache
-```
-
-### Expose the endpoint
-
-Inside the cluster, a ClusterIP Service is enough:
-
-```yaml
-apiVersion: v1
-kind: Service
-metadata:
-  name: vllm-cpu
-  namespace: ai-lab
-spec:
-  selector:
-    app: vllm-cpu
-  ports:
-    - name: http
-      port: 80
-      targetPort: 8001
-```
-
-For external reachability, use a standard Ingress (or the platform's ALB Operator if a richer L7 surface is needed). The OpenAI-compatible API does not require sticky sessions, so the default round-robin behaviour is fine:
-
-```yaml
-apiVersion: networking.k8s.io/v1
-kind: Ingress
-metadata:
-  name: vllm-cpu
-  namespace: ai-lab
-spec:
-  rules:
-    - host: vllm.lab.example.com
-      http:
-        paths:
-          - path: /
-            pathType: Prefix
-            backend:
-              service:
-                name: vllm-cpu
-                port:
-                  number: 80
-```
-
-A standard Pod Security Standard `restricted` namespace is enough for the CPU build — there is no privileged-container requirement. If the target namespace enforces `baseline` or `restricted`, run vLLM as a non-root user inside the image; the default upstream Dockerfile already does this.
-
-### Sanity-check the endpoint
-
-Once the Deployment is `Available` and the model has finished downloading (watch `kubectl logs` for the `Application startup complete` line), drive a chat request through the Ingress:
-
-```bash
-curl -X POST http://vllm.lab.example.com/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "meta-llama/Llama-3.2-1B-Instruct",
-    "messages": [
-      {"role": "user", "content": "Hello! What is the capital of Massachusetts?"}
-    ]
-  }'
-```
-
-A successful response carries a `choices[0].message.content` field with a model-generated answer.
-
-### Benchmark the deployment
-
-For a relative read of throughput and latency, drive the endpoint with `guidellm`, the open-source benchmarking tool from the vLLM team:
-
-```bash
-pip install guidellm
-export HF_TOKEN=hf_xxx_redacted_xxx
-guidellm benchmark \
-  --target http://vllm.lab.example.com/v1 \
-  --model meta-llama/Llama-3.2-1B-Instruct \
-  --data "prompt_tokens=512,output_tokens=128" \
-  --rate-type sweep \
-  --max-seconds 240
-```
-
-The sweep mode starts at one in-flight request and ramps to saturation, which exposes the server's behaviour under load. The standard reported indicators are:
-
-- **Requests per second (RPS)** — completed inference requests per second; whole-system throughput.
-- **Time to first token (TTFT)** — wall time from request arrival to the first emitted token; chat-latency proxy.
-- **Inter-token latency (ITL)** — time between successive tokens; streaming-quality proxy.
-- **End-to-end latency** — total request duration; relevant for batch and offline calls.
-
-CPU-only numbers will sit far below GPU baselines — that is expected. Use the run as a control to validate the full pipeline (image, PVC, token plumbing, Service, Ingress) before re-running the same plan against GPU-backed nodes through the platform's KServe-based AI surface.
-
-## Diagnostic Steps
-
-If the pod loops in `CrashLoopBackOff`, the most common causes are out-of-memory (model weights exceed the container limit), Hugging Face authentication failure, or registry pull errors:
-
-```bash
-kubectl -n ai-lab describe pod -l app=vllm-cpu
-kubectl -n ai-lab logs deploy/vllm-cpu --tail=200
-```
-
-Check the previous container instance for OOM evidence (`OOMKilled` reason in the pod status). Raise the memory request and limit in steps of 4 GiB until the pod stabilises.
-
-If model download stalls, confirm the Secret was wired in correctly:
-
-```bash
-kubectl -n ai-lab exec deploy/vllm-cpu -- env | grep HUGGING_FACE
-```
-
-A missing or blank token causes `huggingface_hub` to fall back to anonymous access and download fails for gated models such as the Llama family.
-
-If the curl call returns 404 or times out, walk the chain bottom-up:
-
-```bash
-kubectl -n ai-lab port-forward svc/vllm-cpu 8001:80
-curl -s http://127.0.0.1:8001/v1/models
-```
-
-A successful local response confirms the Service and pod are healthy; a failed external curl after that points at the Ingress controller, DNS, or ingress class configuration.
-
-To verify the cache PVC is doing its job, compare startup time of a fresh pod against a restarted pod — the second start should skip the multi-gigabyte download:
-
-```bash
-kubectl -n ai-lab logs deploy/vllm-cpu --tail=20 | grep -E '(Loading model|Downloading|Application startup)'
-```