From c2c0667e842688f0da990c459895e46bb169abf5 Mon Sep 17 00:00:00 2001
From: Komh <mail@guojing.io>
Date: Fri, 24 Apr 2026 23:17:41 +0000
Subject: [PATCH 1/3] [observability] Collecting Worker Node CPU, Memory,
 Interrupt, and Network Metrics via DaemonSet

---
 ...rrupt_and_Network_Metrics_via_DaemonSet.md | 193 ++++++++++++++++++
 1 file changed, 193 insertions(+)
 create mode 100644 docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md

diff --git a/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md b/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md
new file mode 100644
index 00000000..0d793f2a
--- /dev/null
+++ b/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md
@@ -0,0 +1,193 @@
+---
+kind:
+   - How To
+products:
+   - Alauda Container Platform
+ProductsVersion:
+   - 4.1.0,4.2.x
+---
+## Issue
+
+Worker nodes are showing CPU load spikes, memory pressure, or interrupt storms and the platform-native monitoring stack does not capture the sub-second kernel-level signals needed to diagnose the root cause. Typical scenarios:
+
+- A workload triggers a hot path inside the kernel (softirq flood, contended spinlock) that Prometheus scrape intervals of 30–60 seconds cannot resolve.
+- Host-level tooling output (`sar`, `pidstat`, `/proc/interrupts`, `/proc/softirqs`) needs to be collected simultaneously on every node over the window the problem reproduces, then consolidated into a tarball for offline analysis.
+- A network-level investigation (HAProxy, conntrack, qdisc stats, per-interface counters) needs the same timed snapshots.
+
+Running these tools ad-hoc inside a single debug shell on one node does not scale to a fleet-wide incident. The solution is a DaemonSet that launches a privileged pod on every targeted node, runs the collection scripts under a shared ConfigMap, and leaves behind an extractable archive per node.
+
+## Resolution
+
+### Step 1 — namespace and privileges
+
+Create a dedicated namespace for the collection pods and give its default service account the privileges required to run the collector as root on the host with `hostPID`, `hostNetwork`, and `hostIPC`. This grant is scoped to the diagnostic namespace only.
+
+```bash
+kubectl create namespace metrics-debug
+```
+
+Attach whatever privileged PodSecurity / Pod Security Admission configuration your cluster uses to this namespace so the collector pods can run as root with host namespaces. The exact label depends on the policy enforced — for a cluster using the upstream PSA labels:
+
+```bash
+kubectl label namespace metrics-debug \
+  pod-security.kubernetes.io/enforce=privileged \
+  pod-security.kubernetes.io/audit=privileged \
+  pod-security.kubernetes.io/warn=privileged
+```
+
+### Step 2 — ConfigMap with the collector scripts
+
+The first ConfigMap carries two scripts: an install script that pulls in the required packages inside the container, and a collector that starts the snapshot loops in the background, writes the outputs to `/metrics`, and tars the result at the end.
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: metrics-scripts
+  namespace: metrics-debug
+data:
+  install-requirements.sh: |
+    #!/bin/bash
+    dnf install -y procps-ng perf psmisc hostname iproute sysstat iotop
+  collect-metrics.sh: |
+    #!/bin/bash
+    INTERVAL=1
+    DURATION=${DURATION:-inf}
+    mkdir -p /metrics && rm -rf /metrics/*
+    uname -n > /metrics/hostname.txt
+    pidstat -ruwh ${INTERVAL} > /metrics/pidstat.txt &
+    sar -A ${INTERVAL} > /metrics/sar.txt &
+    bash -c "while true; do date; free -m; sleep ${INTERVAL}; done" > /metrics/free.txt &
+    bash -c "while true; do date; cat /proc/softirqs; sleep ${INTERVAL}; done" > /metrics/softirqs.txt &
+    bash -c "while true; do date; cat /proc/interrupts; sleep ${INTERVAL}; done" > /metrics/interrupts.txt &
+    echo "collection started; running for ${DURATION}"
+    sleep "${DURATION}"
+    pkill -P $$ || true
+    sync
+    tar -czf /metrics.tar.gz /metrics
+    echo "done"
+```
+
+Apply it:
+
+```bash
+kubectl -n metrics-debug apply -f metrics-scripts-configmap.yaml
+```
+
+Tune `INTERVAL` (sample period in seconds) and `DURATION` (total collection window, or `inf` for until the pod is deleted) to the investigation window. One-second sampling is fine for short bursts; raise it for multi-hour captures or the archive balloons.
+
+### Step 3 — DaemonSet that mounts the scripts and runs them on every worker
+
+```yaml
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: metrics-daemonset
+  namespace: metrics-debug
+  labels:
+    app: metrics-daemonset
+spec:
+  selector:
+    matchLabels:
+      app: metrics-daemonset
+  template:
+    metadata:
+      labels:
+        app: metrics-daemonset
+    spec:
+      nodeSelector:
+        node-role.kubernetes.io/worker: ""
+      hostPID: true
+      hostIPC: true
+      hostNetwork: true
+      containers:
+        - name: metrics-daemonset
+          image: fedora:latest
+          command:
+            - /bin/bash
+            - -c
+            - bash /entrypoint/install-requirements.sh && bash /entrypoint/collect-metrics.sh && sleep infinity
+          securityContext:
+            runAsUser: 0
+            runAsGroup: 0
+            privileged: true
+          volumeMounts:
+            - name: entrypoint
+              mountPath: /entrypoint
+      volumes:
+        - name: entrypoint
+          configMap:
+            name: metrics-scripts
+```
+
+Remove the `nodeSelector` stanza to run on every node instead of only workers; change the label key to target a specific node pool.
+
+```bash
+kubectl -n metrics-debug apply -f metrics-daemonset.yaml
+kubectl -n metrics-debug get pod -l app=metrics-daemonset -o wide
+kubectl -n metrics-debug logs -l app=metrics-daemonset --prefix --timestamps --tail=50
+```
+
+Once the collector scripts print `done` (or you have decided the current window is enough and are terminating a `DURATION=inf` run), copy the per-node archives to a local directory, using both pod name and node name in the destination filename so it is unambiguous which tarball came from where:
+
+```bash
+mkdir -p metrics-out
+for pod in $(kubectl -n metrics-debug get pod -l app=metrics-daemonset -o name); do
+  node=$(kubectl -n metrics-debug get "$pod" -o jsonpath='{.spec.nodeName}')
+  name=${pod##*/}
+  kubectl -n metrics-debug cp "${name}:/metrics.tar.gz" "metrics-out/metrics.${name}.${node}.tar.gz"
+done
+tar -czf metrics.tar.gz metrics-out
+```
+
+### Step 4 — network-focused companion DaemonSet
+
+For network latency, conntrack leaks, or packet-drop investigations, ship a second ConfigMap with a `monitor.sh` that collects `ss`, `ip`, `tc`, `ethtool`, `conntrack`, `/proc/net/*`, and `/proc/softirqs` at a configurable delay and iteration count, plus a matching DaemonSet. The structure mirrors the general-metrics one; the container also needs `NET_ADMIN` capability to run `conntrack -L` and `tc -s`:
+
+```yaml
+securityContext:
+  capabilities:
+    add: ["NET_ADMIN"]
+```
+
+When the workload under investigation is a specific service on the host — for example an ingress controller — the network collector can also enter that process's network namespace with `nsenter -n -t <pid>` to capture per-service socket state and conntrack entries, by resolving the PID inside the collector script at runtime.
+
+Tear everything down when the capture is complete so the cluster stops carrying privileged pods:
+
+```bash
+kubectl delete namespace metrics-debug
+```
+
+## Diagnostic Steps
+
+Confirm the DaemonSet actually scheduled on every targeted node — a node that was cordoned or had a matching taint without the DaemonSet tolerating it silently drops out:
+
+```bash
+kubectl -n metrics-debug get pod -l app=metrics-daemonset \
+  -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase
+kubectl get node -o name | wc -l
+```
+
+The pod count should match the node count (or node-selector count). Mismatches usually mean either a taint without a matching toleration, or insufficient privileges (container fails `CrashLoopBackOff` during the privileged `dnf install` step).
+
+Check progress on a single node by tailing its collector output while it runs:
+
+```bash
+kubectl -n metrics-debug exec -it metrics-daemonset-<random> -- \
+  bash -c 'ls -la /metrics && head /metrics/sar.txt'
+```
+
+Verify the tarball is well-formed before ending the window:
+
+```bash
+tar -tf metrics-out/metrics.<pod>.<node>.tar.gz | head
+```
+
+For long-running captures, watch the archive grow on the pod filesystem — a non-growing size usually means one of the background samplers has died and the container is writing to a stale file descriptor:
+
+```bash
+kubectl -n metrics-debug exec -it metrics-daemonset-<random> -- \
+  bash -c 'du -sh /metrics; wc -l /metrics/*.txt'
+```
+
+If the container itself is being killed by the kubelet for running out of memory during collection (the `sar` / `pidstat` log files grow indefinitely), lower the `INTERVAL`, shorten `DURATION`, or add `resources.limits.memory` on the DaemonSet Pod spec and use log rotation inside the collector.

From 75e410b5e96985f759b70047acd0523af36e6857 Mon Sep 17 00:00:00 2001
From: Komh <mail@guojing.io>
Date: Sat, 2 May 2026 12:51:13 +0000
Subject: [PATCH 2/3] [observability] Collecting Worker Node CPU, Memory,
 Interrupt, and Network Metrics via DaemonSet

---
 ...de_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md b/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md
index 0d793f2a..30aac716 100644
--- a/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md
+++ b/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md
@@ -6,6 +6,8 @@ products:
 ProductsVersion:
    - 4.1.0,4.2.x
 ---
+
+# Collecting Worker Node CPU, Memory, Interrupt, and Network Metrics via DaemonSet
 ## Issue
 
 Worker nodes are showing CPU load spikes, memory pressure, or interrupt storms and the platform-native monitoring stack does not capture the sub-second kernel-level signals needed to diagnose the root cause. Typical scenarios:

From c67414aca48fe51e985d3c0baeacf44fb3f59003 Mon Sep 17 00:00:00 2001
From: Komh <mail@guojing.io>
Date: Sun, 17 May 2026 03:28:34 +0000
Subject: [PATCH 3/3] [observability] Collecting host-level performance and
 network telemetry via privileged DaemonSets on ACP

---
 ...rrupt_and_Network_Metrics_via_DaemonSet.md | 195 ------------------
 ...emetry_via_privileged_DaemonSets_on_ACP.md |  95 +++++++++
 2 files changed, 95 insertions(+), 195 deletions(-)
 delete mode 100644 docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md
 create mode 100644 docs/en/solutions/Collecting_host_level_performance_and_network_telemetry_via_privileged_DaemonSets_on_ACP.md

diff --git a/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md b/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md
deleted file mode 100644
index 30aac716..00000000
--- a/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md
+++ /dev/null
@@ -1,195 +0,0 @@
----
-kind:
-   - How To
-products:
-   - Alauda Container Platform
-ProductsVersion:
-   - 4.1.0,4.2.x
----
-
-# Collecting Worker Node CPU, Memory, Interrupt, and Network Metrics via DaemonSet
-## Issue
-
-Worker nodes are showing CPU load spikes, memory pressure, or interrupt storms and the platform-native monitoring stack does not capture the sub-second kernel-level signals needed to diagnose the root cause. Typical scenarios:
-
-- A workload triggers a hot path inside the kernel (softirq flood, contended spinlock) that Prometheus scrape intervals of 30–60 seconds cannot resolve.
-- Host-level tooling output (`sar`, `pidstat`, `/proc/interrupts`, `/proc/softirqs`) needs to be collected simultaneously on every node over the window the problem reproduces, then consolidated into a tarball for offline analysis.
-- A network-level investigation (HAProxy, conntrack, qdisc stats, per-interface counters) needs the same timed snapshots.
-
-Running these tools ad-hoc inside a single debug shell on one node does not scale to a fleet-wide incident. The solution is a DaemonSet that launches a privileged pod on every targeted node, runs the collection scripts under a shared ConfigMap, and leaves behind an extractable archive per node.
-
-## Resolution
-
-### Step 1 — namespace and privileges
-
-Create a dedicated namespace for the collection pods and give its default service account the privileges required to run the collector as root on the host with `hostPID`, `hostNetwork`, and `hostIPC`. This grant is scoped to the diagnostic namespace only.
-
-```bash
-kubectl create namespace metrics-debug
-```
-
-Attach whatever privileged PodSecurity / Pod Security Admission configuration your cluster uses to this namespace so the collector pods can run as root with host namespaces. The exact label depends on the policy enforced — for a cluster using the upstream PSA labels:
-
-```bash
-kubectl label namespace metrics-debug \
-  pod-security.kubernetes.io/enforce=privileged \
-  pod-security.kubernetes.io/audit=privileged \
-  pod-security.kubernetes.io/warn=privileged
-```
-
-### Step 2 — ConfigMap with the collector scripts
-
-The first ConfigMap carries two scripts: an install script that pulls in the required packages inside the container, and a collector that starts the snapshot loops in the background, writes the outputs to `/metrics`, and tars the result at the end.
-
-```yaml
-apiVersion: v1
-kind: ConfigMap
-metadata:
-  name: metrics-scripts
-  namespace: metrics-debug
-data:
-  install-requirements.sh: |
-    #!/bin/bash
-    dnf install -y procps-ng perf psmisc hostname iproute sysstat iotop
-  collect-metrics.sh: |
-    #!/bin/bash
-    INTERVAL=1
-    DURATION=${DURATION:-inf}
-    mkdir -p /metrics && rm -rf /metrics/*
-    uname -n > /metrics/hostname.txt
-    pidstat -ruwh ${INTERVAL} > /metrics/pidstat.txt &
-    sar -A ${INTERVAL} > /metrics/sar.txt &
-    bash -c "while true; do date; free -m; sleep ${INTERVAL}; done" > /metrics/free.txt &
-    bash -c "while true; do date; cat /proc/softirqs; sleep ${INTERVAL}; done" > /metrics/softirqs.txt &
-    bash -c "while true; do date; cat /proc/interrupts; sleep ${INTERVAL}; done" > /metrics/interrupts.txt &
-    echo "collection started; running for ${DURATION}"
-    sleep "${DURATION}"
-    pkill -P $$ || true
-    sync
-    tar -czf /metrics.tar.gz /metrics
-    echo "done"
-```
-
-Apply it:
-
-```bash
-kubectl -n metrics-debug apply -f metrics-scripts-configmap.yaml
-```
-
-Tune `INTERVAL` (sample period in seconds) and `DURATION` (total collection window, or `inf` for until the pod is deleted) to the investigation window. One-second sampling is fine for short bursts; raise it for multi-hour captures or the archive balloons.
-
-### Step 3 — DaemonSet that mounts the scripts and runs them on every worker
-
-```yaml
-apiVersion: apps/v1
-kind: DaemonSet
-metadata:
-  name: metrics-daemonset
-  namespace: metrics-debug
-  labels:
-    app: metrics-daemonset
-spec:
-  selector:
-    matchLabels:
-      app: metrics-daemonset
-  template:
-    metadata:
-      labels:
-        app: metrics-daemonset
-    spec:
-      nodeSelector:
-        node-role.kubernetes.io/worker: ""
-      hostPID: true
-      hostIPC: true
-      hostNetwork: true
-      containers:
-        - name: metrics-daemonset
-          image: fedora:latest
-          command:
-            - /bin/bash
-            - -c
-            - bash /entrypoint/install-requirements.sh && bash /entrypoint/collect-metrics.sh && sleep infinity
-          securityContext:
-            runAsUser: 0
-            runAsGroup: 0
-            privileged: true
-          volumeMounts:
-            - name: entrypoint
-              mountPath: /entrypoint
-      volumes:
-        - name: entrypoint
-          configMap:
-            name: metrics-scripts
-```
-
-Remove the `nodeSelector` stanza to run on every node instead of only workers; change the label key to target a specific node pool.
-
-```bash
-kubectl -n metrics-debug apply -f metrics-daemonset.yaml
-kubectl -n metrics-debug get pod -l app=metrics-daemonset -o wide
-kubectl -n metrics-debug logs -l app=metrics-daemonset --prefix --timestamps --tail=50
-```
-
-Once the collector scripts print `done` (or you have decided the current window is enough and are terminating a `DURATION=inf` run), copy the per-node archives to a local directory, using both pod name and node name in the destination filename so it is unambiguous which tarball came from where:
-
-```bash
-mkdir -p metrics-out
-for pod in $(kubectl -n metrics-debug get pod -l app=metrics-daemonset -o name); do
-  node=$(kubectl -n metrics-debug get "$pod" -o jsonpath='{.spec.nodeName}')
-  name=${pod##*/}
-  kubectl -n metrics-debug cp "${name}:/metrics.tar.gz" "metrics-out/metrics.${name}.${node}.tar.gz"
-done
-tar -czf metrics.tar.gz metrics-out
-```
-
-### Step 4 — network-focused companion DaemonSet
-
-For network latency, conntrack leaks, or packet-drop investigations, ship a second ConfigMap with a `monitor.sh` that collects `ss`, `ip`, `tc`, `ethtool`, `conntrack`, `/proc/net/*`, and `/proc/softirqs` at a configurable delay and iteration count, plus a matching DaemonSet. The structure mirrors the general-metrics one; the container also needs `NET_ADMIN` capability to run `conntrack -L` and `tc -s`:
-
-```yaml
-securityContext:
-  capabilities:
-    add: ["NET_ADMIN"]
-```
-
-When the workload under investigation is a specific service on the host — for example an ingress controller — the network collector can also enter that process's network namespace with `nsenter -n -t <pid>` to capture per-service socket state and conntrack entries, by resolving the PID inside the collector script at runtime.
-
-Tear everything down when the capture is complete so the cluster stops carrying privileged pods:
-
-```bash
-kubectl delete namespace metrics-debug
-```
-
-## Diagnostic Steps
-
-Confirm the DaemonSet actually scheduled on every targeted node — a node that was cordoned or had a matching taint without the DaemonSet tolerating it silently drops out:
-
-```bash
-kubectl -n metrics-debug get pod -l app=metrics-daemonset \
-  -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase
-kubectl get node -o name | wc -l
-```
-
-The pod count should match the node count (or node-selector count). Mismatches usually mean either a taint without a matching toleration, or insufficient privileges (container fails `CrashLoopBackOff` during the privileged `dnf install` step).
-
-Check progress on a single node by tailing its collector output while it runs:
-
-```bash
-kubectl -n metrics-debug exec -it metrics-daemonset-<random> -- \
-  bash -c 'ls -la /metrics && head /metrics/sar.txt'
-```
-
-Verify the tarball is well-formed before ending the window:
-
-```bash
-tar -tf metrics-out/metrics.<pod>.<node>.tar.gz | head
-```
-
-For long-running captures, watch the archive grow on the pod filesystem — a non-growing size usually means one of the background samplers has died and the container is writing to a stale file descriptor:
-
-```bash
-kubectl -n metrics-debug exec -it metrics-daemonset-<random> -- \
-  bash -c 'du -sh /metrics; wc -l /metrics/*.txt'
-```
-
-If the container itself is being killed by the kubelet for running out of memory during collection (the `sar` / `pidstat` log files grow indefinitely), lower the `INTERVAL`, shorten `DURATION`, or add `resources.limits.memory` on the DaemonSet Pod spec and use log rotation inside the collector.
diff --git a/docs/en/solutions/Collecting_host_level_performance_and_network_telemetry_via_privileged_DaemonSets_on_ACP.md b/docs/en/solutions/Collecting_host_level_performance_and_network_telemetry_via_privileged_DaemonSets_on_ACP.md
new file mode 100644
index 00000000..f7d85f1e
--- /dev/null
+++ b/docs/en/solutions/Collecting_host_level_performance_and_network_telemetry_via_privileged_DaemonSets_on_ACP.md
@@ -0,0 +1,95 @@
+---
+kind:
+   - How To
+products:
+   - Alauda Container Platform
+ProductsVersion:
+   - 4.1.0,4.2.x
+---
+
+# Collecting host-level performance and network telemetry via privileged DaemonSets on ACP
+
+## Issue
+
+On Alauda Container Platform clusters where the first-party monitoring stack is not yet deployed, operators still need raw, per-node kernel telemetry (CPU, memory, IRQ/softirq, process tables, block I/O, and full network state) to diagnose worker-node performance problems. The standard delivery vehicle is a DaemonSet that runs one privileged pod per node with `hostPID: true`, `hostIPC: true`, and `hostNetwork: true`, and a container `securityContext` of `runAsUser: 0` and `privileged: true`. The pod mounts a ConfigMap carrying two scripts (`install-requirements.sh`, `collect-metrics.sh`) at `/entrypoint`, then runs `bash /entrypoint/install-requirements.sh && bash /entrypoint/collect-metrics.sh && sleep infinity`.
+
+ACP first-party monitoring is delivered by the `prometheus-operator` PackageManifest (displayName `Alauda build of Prometheus`), with the higher-level descriptors `Alauda Container Platform Monitoring for Prometheus` and `Alauda Container Platform Monitoring for VictoriaMetrics`. On a freshly installed cluster without that operator (no `prometheuses.monitoring.coreos.com` CRD and no node-exporter pod present), the DaemonSet pattern below is the only available host-level telemetry path.
+
+## Root Cause
+
+Inside a `hostPID: true` privileged pod, processes on the underlying node are visible to in-container tooling: `ps`, `pidstat -ruwh`, and `top` enumerate node-wide processes, and `/proc/interrupts` plus `/proc/softirqs` reflect the host kernel's per-CPU IRQ/softirq counters rather than the container's PID-1 subview. This behavior was exercised on this cluster, which runs Linux kernel 5.15 with containerd 2.2.1 as the node runtime. That direct host-namespace access is what lets a single DaemonSet pod per node capture genuine node-level signal without an in-cluster monitoring agent.
+
+ACP does not ship a vendor-specific pod-security-constraints CRD chain; pod admission is governed exclusively by Kubernetes PodSecurity Admission (PSA), and the cluster-wide default policy is configured at `warn` level against the `baseline` profile. A privileged hostPID/hostIPC/hostNetwork pod is therefore admitted with a `PodSecurity baseline:latest` warning and runs without any cluster-side security-policy grant on the target namespace.
+
+## Resolution
+
+Use a namespace with no PSA `enforce` label (the cluster default of warn-only `baseline` admits the workload). The pod image must be reachable from the cluster: either a base image (e.g. `alpine` or `ubuntu`) mirrored into the internal registry, or a tool image such as `registry.alauda.cn:60080/3rdparty/kubectl:v4.3.1` extended with the diagnostic toolchain. Target the DaemonSet at the node group of interest. On this ACP cluster topology, worker nodes carry no `node-role.kubernetes.io/worker` label (only the control-plane node carries `node-role.kubernetes.io/*` role labels), so either omit the `nodeSelector` entirely to land on every node, or restrict scheduling to non-control-plane nodes with `affinity.nodeAffinity` and a `DoesNotExist` requirement on the `node-role.kubernetes.io/control-plane` key.
+
+DaemonSet skeleton (substitute `<image>` with an internally pullable base image that provides `bash`, `tar`, and a package manager; the collection scripts themselves live in the `collector-scripts` ConfigMap described after the manifest):
+
+```yaml
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: metrics-daemonset
+  namespace: metrics-debug
+spec:
+  selector:
+    matchLabels:
+      app: metrics-daemonset
+  template:
+    metadata:
+      labels:
+        app: metrics-daemonset
+    spec:
+      hostPID: true
+      hostIPC: true
+      hostNetwork: true
+      affinity:
+        nodeAffinity:
+          requiredDuringSchedulingIgnoredDuringExecution:
+            nodeSelectorTerms:
+            - matchExpressions:
+              - key: node-role.kubernetes.io/control-plane
+                operator: DoesNotExist
+      containers:
+      - name: collector
+        image: <image>
+        securityContext:
+          runAsUser: 0
+          privileged: true
+        command: ["/bin/bash", "-c"]
+        args:
+        - "bash /entrypoint/install-requirements.sh && bash /entrypoint/collect-metrics.sh && sleep infinity"
+        volumeMounts:
+        - name: entrypoint
+          mountPath: /entrypoint
+      volumes:
+      - name: entrypoint
+        configMap:
+          name: collector-scripts
+          defaultMode: 0755
+```
+
+The `collect-metrics.sh` script (stored in the `collector-scripts` ConfigMap) starts five mandatory host-level performance streams in parallel, each as a background process writing into `/metrics/`: `pidstat -ruwh ${INTERVAL}` to `pidstat.txt`; `sar -A ${INTERVAL}` to `sar.txt`; a `while` loop emitting `date; free -m; sleep ${INTERVAL}` to `free.txt`; and the same loop pattern dumping `/proc/softirqs` to `softirqs.txt` and `/proc/interrupts` to `interrupts.txt`. Two optional streams, `ps aux | sort -nrk 3,3 | head -n 20` and `iotop -Pobt`, can run alongside them. After `${DURATION}` the script kills the child processes, calls `sync`, and then runs `tar -czf /metrics.tar.gz /metrics`.
+
+For network-layer capture, deploy a second DaemonSet (`monitor-daemonset`) sharing the same privileged hostPID/hostIPC/hostNetwork shape but running a `monitor.sh` script from its own ConfigMap. Every `${DELAY}` seconds, for `${ITERATIONS}` iterations, the script collects `ss`, `nstat`, `netstat -s`, `ip address/route/neigh`, `tc -s qdisc`, `cat /proc/interrupts`, `/proc/net/softnet_stat`, `/proc/vmstat`, `ps -alfe`, `mpstat`, `top -c -b -n1`, `numastat`, `cat /proc/softirqs`, `cat /proc/net/sockstat`, `/proc/net/dev`, `ethtool -S <dev>`, and per-interface `/sys/.../statistics/*` into `${HOSTNAME}-network_stats_${now}/`. Independently, `conntrack -L -n` runs in a background loop. The final archive is `/network-metrics.tar.gz`.
+
+## Diagnostic Steps
+
+Once a collection run finishes, extract the bundle from each pod and verify the archive:
+
+```bash
+kubectl cp -n metrics-debug <pod>:/metrics.tar.gz metrics.<pod>.<node>.tar.gz
+tar -tf metrics.<pod>.<node>.tar.gz
+```
+
+The bundle contents — `metrics/pidstat.txt`, `metrics/sar.txt`, `metrics/interrupts.txt`, `metrics/softirqs.txt`, `metrics/free.txt`, optional `metrics/ps.txt`, `metrics/iotop.txt` — are independent of any in-cluster monitoring stack and represent raw kernel telemetry directly off the node.
+
+For the network DaemonSet, retrieve `/network-metrics.tar.gz` the same way:
+
+```bash
+kubectl cp -n metrics-debug <pod>:/network-metrics.tar.gz network-metrics.<pod>.<node>.tar.gz
+```
+
+Inspect the per-iteration subdirectories for `ss`, `nstat`, `ethtool -S`, conntrack snapshots, and per-interface counters to correlate kernel-level network state with the host-OS performance streams from the first DaemonSet.