From c2c0667e842688f0da990c459895e46bb169abf5 Mon Sep 17 00:00:00 2001
From: Komh
Date: Fri, 24 Apr 2026 23:17:41 +0000
Subject: [PATCH 1/3] [observability] Collecting Worker Node CPU, Memory, Interrupt, and Network Metrics via DaemonSet

---
 ...rrupt_and_Network_Metrics_via_DaemonSet.md | 193 ++++++++++++++++++
 1 file changed, 193 insertions(+)
 create mode 100644 docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md

diff --git a/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md b/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md
new file mode 100644
index 00000000..0d793f2a
--- /dev/null
+++ b/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md
@@ -0,0 +1,193 @@
+---
+kind:
+  - How To
+products:
+  - Alauda Container Platform
+ProductsVersion:
+  - 4.1.0,4.2.x
+---
+## Issue
+
+Worker nodes are showing CPU load spikes, memory pressure, or interrupt storms and the platform-native monitoring stack does not capture the sub-second kernel-level signals needed to diagnose the root cause. Typical scenarios:
+
+- A workload triggers a hot path inside the kernel (softirq flood, contended spinlock) that Prometheus scrape intervals of 30–60 seconds cannot resolve.
+- Host-level tooling output (`sar`, `pidstat`, `/proc/interrupts`, `/proc/softirqs`) needs to be collected simultaneously on every node over the window the problem reproduces, then consolidated into a tarball for offline analysis.
+- A network-level investigation (HAProxy, conntrack, qdisc stats, per-interface counters) needs the same timed snapshots.
+
+Running these tools ad-hoc inside a single debug shell on one node does not scale to a fleet-wide incident.
The solution is a DaemonSet that launches a privileged pod on every targeted node, runs the collection scripts under a shared ConfigMap, and leaves behind an extractable archive per node.
+
+## Resolution
+
+### Step 1: namespace and privileges
+
+Create a dedicated namespace for the collection pods and give its default service account the privileges required to run the collector as root on the host with `hostPID`, `hostNetwork`, and `hostIPC`. This grant is scoped to the diagnostic namespace only.
+
+```bash
+kubectl create namespace metrics-debug
+```
+
+Attach whatever privileged PodSecurity / Pod Security Admission configuration your cluster uses to this namespace so the collector pods can run as root with host namespaces. The exact label depends on the policy enforced; for a cluster using the upstream PSA labels:
+
+```bash
+kubectl label namespace metrics-debug \
+  pod-security.kubernetes.io/enforce=privileged \
+  pod-security.kubernetes.io/audit=privileged \
+  pod-security.kubernetes.io/warn=privileged
+```
+
+### Step 2: ConfigMap with the collector scripts
+
+The first ConfigMap carries two scripts: an install script that pulls in the required packages inside the container, and a collector that starts the snapshot loops in the background, writes the outputs to `/metrics`, and tars the result at the end.
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: metrics-scripts
+  namespace: metrics-debug
+data:
+  install-requirements.sh: |
+    #!/bin/bash
+    dnf install -y procps-ng perf psmisc hostname iproute sysstat iotop
+  collect-metrics.sh: |
+    #!/bin/bash
+    INTERVAL=1
+    DURATION=${DURATION:-inf}
+    mkdir -p /metrics && rm -rf /metrics/*
+    uname -n > /metrics/hostname.txt
+    pidstat -ruwh ${INTERVAL} > /metrics/pidstat.txt &
+    sar -A ${INTERVAL} > /metrics/sar.txt &
+    bash -c "while true; do date; free -m; sleep ${INTERVAL}; done" > /metrics/free.txt &
+    bash -c "while true; do date; cat /proc/softirqs; sleep ${INTERVAL}; done" > /metrics/softirqs.txt &
+    bash -c "while true; do date; cat /proc/interrupts; sleep ${INTERVAL}; done" > /metrics/interrupts.txt &
+    echo "collection started; running for ${DURATION}"
+    sleep "${DURATION}"
+    pkill -P $$ || true
+    sync
+    tar -czf /metrics.tar.gz /metrics
+    echo "done"
+```
+
+Apply it:
+
+```bash
+kubectl -n metrics-debug apply -f metrics-scripts-configmap.yaml
+```
+
+Tune `INTERVAL` (sample period in seconds) and `DURATION` (total collection window, or `inf` to run until the pod is deleted) to the investigation window. One-second sampling is fine for short bursts; raise it for multi-hour captures or the archive balloons.
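
As a small illustration of the `${VAR:-default}` expansion the collector relies on, the sketch below shows how a default and an override resolve. `show_settings` is a hypothetical helper, not part of the collector, and it assumes `INTERVAL` is also read from the environment, whereas the script above hard-codes `INTERVAL=1`.

```shell
#!/bin/bash
# Hypothetical helper: report the values a collector would use if both
# settings were taken from the environment with fallbacks.
show_settings() {
  local interval="${INTERVAL:-1}"     # assumption: INTERVAL made overridable
  local duration="${DURATION:-inf}"   # same fallback as collect-metrics.sh
  echo "interval=${interval} duration=${duration}"
}

# With nothing set in the environment, the fallbacks apply.
show_settings

# Per-call overrides, comparable to setting env vars on the container.
INTERVAL=5 DURATION=600 show_settings
```

With the script as written, only `DURATION` can be changed without editing the ConfigMap; changing `INTERVAL` means editing the script body.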
+
+### Step 3: DaemonSet that mounts the scripts and runs them on every worker
+
+```yaml
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: metrics-daemonset
+  namespace: metrics-debug
+  labels:
+    app: metrics-daemonset
+spec:
+  selector:
+    matchLabels:
+      app: metrics-daemonset
+  template:
+    metadata:
+      labels:
+        app: metrics-daemonset
+    spec:
+      nodeSelector:
+        node-role.kubernetes.io/worker: ""
+      hostPID: true
+      hostIPC: true
+      hostNetwork: true
+      containers:
+        - name: metrics-daemonset
+          image: fedora:latest
+          command:
+            - /bin/bash
+            - -c
+            - bash /entrypoint/install-requirements.sh && bash /entrypoint/collect-metrics.sh && sleep infinity
+          securityContext:
+            runAsUser: 0
+            runAsGroup: 0
+            privileged: true
+          volumeMounts:
+            - name: entrypoint
+              mountPath: /entrypoint
+      volumes:
+        - name: entrypoint
+          configMap:
+            name: metrics-scripts
+```
+
+Remove the `nodeSelector` stanza to run on every node instead of only workers; change the label key to target a specific node pool.
+
+```bash
+kubectl -n metrics-debug apply -f metrics-daemonset.yaml
+kubectl -n metrics-debug get pod -l app=metrics-daemonset -o wide
+kubectl -n metrics-debug logs -l app=metrics-daemonset --prefix --timestamps --tail=50
+```
+
+Once the collector scripts print `done` (or you have decided the current window is enough and are terminating a `DURATION=inf` run), copy the per-node archives to a local directory, using both pod name and node name in the destination filename so it is unambiguous which tarball came from where:
+
+```bash
+mkdir -p metrics-out
+for pod in $(kubectl -n metrics-debug get pod -l app=metrics-daemonset -o name); do
+  node=$(kubectl -n metrics-debug get "$pod" -o jsonpath='{.spec.nodeName}')
+  name=${pod##*/}
+  kubectl -n metrics-debug cp "${name}:/metrics.tar.gz" "metrics-out/metrics.${name}.${node}.tar.gz"
+done
+tar -czf metrics.tar.gz metrics-out
+```
+
+### Step 4: network-focused companion DaemonSet
+
+For network latency, conntrack leaks, or packet-drop
investigations, ship a second ConfigMap with a `monitor.sh` that collects `ss`, `ip`, `tc`, `ethtool`, `conntrack`, `/proc/net/*`, and `/proc/softirqs` at a configurable delay and iteration count, plus a matching DaemonSet. The structure mirrors the general-metrics one; the container also needs `NET_ADMIN` capability to run `conntrack -L` and `tc -s`:
+
+```yaml
+securityContext:
+  capabilities:
+    add: ["NET_ADMIN"]
+```
+
+When the workload under investigation is a specific service on the host (for example, an ingress controller), the network collector can also enter that process's network namespace with `nsenter -n -t ` to capture per-service socket state and conntrack entries, by resolving the PID inside the collector script at runtime.
+
+Tear everything down when the capture is complete so the cluster stops carrying privileged pods:
+
+```bash
+kubectl delete namespace metrics-debug
+```
+
+## Diagnostic Steps
+
+Confirm the DaemonSet actually scheduled on every targeted node. A node carrying a taint that the DaemonSet does not tolerate silently drops out; cordoning alone does not exclude a node, because DaemonSet pods automatically tolerate the `node.kubernetes.io/unschedulable` taint:
+
+```bash
+kubectl -n metrics-debug get pod -l app=metrics-daemonset \
+  -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase
+kubectl get node -o name | wc -l
+```
+
+The pod count should match the node count (or node-selector count). Mismatches usually mean either a taint without a matching toleration, or insufficient privileges (the container enters `CrashLoopBackOff` during the `dnf install` step).
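
To see which nodes are missing a collector pod rather than only comparing counts, a minimal sketch follows. `missing_nodes` is a hypothetical helper, and the sample node names stand in for real `kubectl` output.

```shell
#!/bin/bash
# Hypothetical helper: print every node in the first space-separated list
# that does not appear in the second one.
missing_nodes() {
  local targeted="$1" covered="$2" node
  for node in $targeted; do
    case " $covered " in
      *" $node "*) ;;            # a collector pod is running on this node
      *) echo "$node" ;;         # no collector pod found for this node
    esac
  done
}

# Sample data standing in for kubectl output.
targeted="worker-a worker-b worker-c"
covered="worker-c worker-a"
missing_nodes "$targeted" "$covered"
```

On a live cluster the two lists could come from `kubectl get node -o jsonpath='{.items[*].metadata.name}'` and `kubectl -n metrics-debug get pod -l app=metrics-daemonset -o jsonpath='{.items[*].spec.nodeName}'`, both of which return space-separated names.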
+
+Check progress on a single node by tailing its collector output while it runs:
+
+```bash
+kubectl -n metrics-debug exec -it metrics-daemonset- -- \
+  bash -c 'ls -la /metrics && head /metrics/sar.txt'
+```
+
+Verify the tarball is well-formed before ending the window:
+
+```bash
+tar -tf metrics-out/metrics...tar.gz | head
+```
+
+For long-running captures, watch the archive grow on the pod filesystem; a non-growing size usually means one of the background samplers has died and the container is writing to a stale file descriptor:
+
+```bash
+kubectl -n metrics-debug exec -it metrics-daemonset- -- \
+  bash -c 'du -sh /metrics; wc -l /metrics/*.txt'
+```
+
+If the container itself is being killed by the kubelet for running out of memory during collection (the `sar` / `pidstat` log files grow indefinitely), raise `INTERVAL`, shorten `DURATION`, or add `resources.limits.memory` on the DaemonSet Pod spec and use log rotation inside the collector.

From 75e410b5e96985f759b70047acd0523af36e6857 Mon Sep 17 00:00:00 2001
From: Komh
Date: Sat, 2 May 2026 12:51:13 +0000
Subject: [PATCH 2/3] [observability] Collecting Worker Node CPU, Memory, Interrupt, and Network Metrics via DaemonSet

---
 ...de_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md b/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md
index 0d793f2a..30aac716 100644
--- a/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md
+++ b/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md
@@ -6,6 +6,8 @@ products:
 ProductsVersion:
   - 4.1.0,4.2.x
 ---
+
+# Collecting Worker Node CPU, Memory, Interrupt, and Network Metrics via DaemonSet
 ## Issue
 
 Worker nodes are showing CPU load spikes, memory pressure, or interrupt storms
and the platform-native monitoring stack does not capture the sub-second kernel-level signals needed to diagnose the root cause. Typical scenarios:

From c67414aca48fe51e985d3c0baeacf44fb3f59003 Mon Sep 17 00:00:00 2001
From: Komh
Date: Sun, 17 May 2026 03:28:34 +0000
Subject: [PATCH 3/3] [observability] Collecting host-level performance and network telemetry via privileged DaemonSets on ACP

---
 ...rrupt_and_Network_Metrics_via_DaemonSet.md | 195 ------------------
 ...emetry_via_privileged_DaemonSets_on_ACP.md | 95 +++++++++
 2 files changed, 95 insertions(+), 195 deletions(-)
 delete mode 100644 docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md
 create mode 100644 docs/en/solutions/Collecting_host_level_performance_and_network_telemetry_via_privileged_DaemonSets_on_ACP.md

diff --git a/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md b/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md
deleted file mode 100644
index 30aac716..00000000
--- a/docs/en/solutions/Collecting_Worker_Node_CPU_Memory_Interrupt_and_Network_Metrics_via_DaemonSet.md
+++ /dev/null
@@ -1,195 +0,0 @@
----
-kind:
-  - How To
-products:
-  - Alauda Container Platform
-ProductsVersion:
-  - 4.1.0,4.2.x
----
-
-# Collecting Worker Node CPU, Memory, Interrupt, and Network Metrics via DaemonSet
-## Issue
-
-Worker nodes are showing CPU load spikes, memory pressure, or interrupt storms and the platform-native monitoring stack does not capture the sub-second kernel-level signals needed to diagnose the root cause. Typical scenarios:
-
-- A workload triggers a hot path inside the kernel (softirq flood, contended spinlock) that Prometheus scrape intervals of 30–60 seconds cannot resolve.
-- Host-level tooling output (`sar`, `pidstat`, `/proc/interrupts`, `/proc/softirqs`) needs to be collected simultaneously on every node over the window the problem reproduces, then consolidated into a tarball for offline analysis.
-- A network-level investigation (HAProxy, conntrack, qdisc stats, per-interface counters) needs the same timed snapshots.
-
-Running these tools ad-hoc inside a single debug shell on one node does not scale to a fleet-wide incident. The solution is a DaemonSet that launches a privileged pod on every targeted node, runs the collection scripts under a shared ConfigMap, and leaves behind an extractable archive per node.
-
-## Resolution
-
-### Step 1: namespace and privileges
-
-Create a dedicated namespace for the collection pods and give its default service account the privileges required to run the collector as root on the host with `hostPID`, `hostNetwork`, and `hostIPC`. This grant is scoped to the diagnostic namespace only.
-
-```bash
-kubectl create namespace metrics-debug
-```
-
-Attach whatever privileged PodSecurity / Pod Security Admission configuration your cluster uses to this namespace so the collector pods can run as root with host namespaces. The exact label depends on the policy enforced; for a cluster using the upstream PSA labels:
-
-```bash
-kubectl label namespace metrics-debug \
-  pod-security.kubernetes.io/enforce=privileged \
-  pod-security.kubernetes.io/audit=privileged \
-  pod-security.kubernetes.io/warn=privileged
-```
-
-### Step 2: ConfigMap with the collector scripts
-
-The first ConfigMap carries two scripts: an install script that pulls in the required packages inside the container, and a collector that starts the snapshot loops in the background, writes the outputs to `/metrics`, and tars the result at the end.
-
-```yaml
-apiVersion: v1
-kind: ConfigMap
-metadata:
-  name: metrics-scripts
-  namespace: metrics-debug
-data:
-  install-requirements.sh: |
-    #!/bin/bash
-    dnf install -y procps-ng perf psmisc hostname iproute sysstat iotop
-  collect-metrics.sh: |
-    #!/bin/bash
-    INTERVAL=1
-    DURATION=${DURATION:-inf}
-    mkdir -p /metrics && rm -rf /metrics/*
-    uname -n > /metrics/hostname.txt
-    pidstat -ruwh ${INTERVAL} > /metrics/pidstat.txt &
-    sar -A ${INTERVAL} > /metrics/sar.txt &
-    bash -c "while true; do date; free -m; sleep ${INTERVAL}; done" > /metrics/free.txt &
-    bash -c "while true; do date; cat /proc/softirqs; sleep ${INTERVAL}; done" > /metrics/softirqs.txt &
-    bash -c "while true; do date; cat /proc/interrupts; sleep ${INTERVAL}; done" > /metrics/interrupts.txt &
-    echo "collection started; running for ${DURATION}"
-    sleep "${DURATION}"
-    pkill -P $$ || true
-    sync
-    tar -czf /metrics.tar.gz /metrics
-    echo "done"
-```
-
-Apply it:
-
-```bash
-kubectl -n metrics-debug apply -f metrics-scripts-configmap.yaml
-```
-
-Tune `INTERVAL` (sample period in seconds) and `DURATION` (total collection window, or `inf` to run until the pod is deleted) to the investigation window. One-second sampling is fine for short bursts; raise it for multi-hour captures or the archive balloons.
-
-### Step 3: DaemonSet that mounts the scripts and runs them on every worker
-
-```yaml
-apiVersion: apps/v1
-kind: DaemonSet
-metadata:
-  name: metrics-daemonset
-  namespace: metrics-debug
-  labels:
-    app: metrics-daemonset
-spec:
-  selector:
-    matchLabels:
-      app: metrics-daemonset
-  template:
-    metadata:
-      labels:
-        app: metrics-daemonset
-    spec:
-      nodeSelector:
-        node-role.kubernetes.io/worker: ""
-      hostPID: true
-      hostIPC: true
-      hostNetwork: true
-      containers:
-        - name: metrics-daemonset
-          image: fedora:latest
-          command:
-            - /bin/bash
-            - -c
-            - bash /entrypoint/install-requirements.sh && bash /entrypoint/collect-metrics.sh && sleep infinity
-          securityContext:
-            runAsUser: 0
-            runAsGroup: 0
-            privileged: true
-          volumeMounts:
-            - name: entrypoint
-              mountPath: /entrypoint
-      volumes:
-        - name: entrypoint
-          configMap:
-            name: metrics-scripts
-```
-
-Remove the `nodeSelector` stanza to run on every node instead of only workers; change the label key to target a specific node pool.
-
-```bash
-kubectl -n metrics-debug apply -f metrics-daemonset.yaml
-kubectl -n metrics-debug get pod -l app=metrics-daemonset -o wide
-kubectl -n metrics-debug logs -l app=metrics-daemonset --prefix --timestamps --tail=50
-```
-
-Once the collector scripts print `done` (or you have decided the current window is enough and are terminating a `DURATION=inf` run), copy the per-node archives to a local directory, using both pod name and node name in the destination filename so it is unambiguous which tarball came from where:
-
-```bash
-mkdir -p metrics-out
-for pod in $(kubectl -n metrics-debug get pod -l app=metrics-daemonset -o name); do
-  node=$(kubectl -n metrics-debug get "$pod" -o jsonpath='{.spec.nodeName}')
-  name=${pod##*/}
-  kubectl -n metrics-debug cp "${name}:/metrics.tar.gz" "metrics-out/metrics.${name}.${node}.tar.gz"
-done
-tar -czf metrics.tar.gz metrics-out
-```
-
-### Step 4: network-focused companion DaemonSet
-
-For network latency, conntrack leaks, or packet-drop
investigations, ship a second ConfigMap with a `monitor.sh` that collects `ss`, `ip`, `tc`, `ethtool`, `conntrack`, `/proc/net/*`, and `/proc/softirqs` at a configurable delay and iteration count, plus a matching DaemonSet. The structure mirrors the general-metrics one; the container also needs `NET_ADMIN` capability to run `conntrack -L` and `tc -s`:
-
-```yaml
-securityContext:
-  capabilities:
-    add: ["NET_ADMIN"]
-```
-
-When the workload under investigation is a specific service on the host (for example, an ingress controller), the network collector can also enter that process's network namespace with `nsenter -n -t ` to capture per-service socket state and conntrack entries, by resolving the PID inside the collector script at runtime.
-
-Tear everything down when the capture is complete so the cluster stops carrying privileged pods:
-
-```bash
-kubectl delete namespace metrics-debug
-```
-
-## Diagnostic Steps
-
-Confirm the DaemonSet actually scheduled on every targeted node. A node carrying a taint that the DaemonSet does not tolerate silently drops out; cordoning alone does not exclude a node, because DaemonSet pods automatically tolerate the `node.kubernetes.io/unschedulable` taint:
-
-```bash
-kubectl -n metrics-debug get pod -l app=metrics-daemonset \
-  -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase
-kubectl get node -o name | wc -l
-```
-
-The pod count should match the node count (or node-selector count). Mismatches usually mean either a taint without a matching toleration, or insufficient privileges (the container enters `CrashLoopBackOff` during the `dnf install` step).
-
-Check progress on a single node by tailing its collector output while it runs:
-
-```bash
-kubectl -n metrics-debug exec -it metrics-daemonset- -- \
-  bash -c 'ls -la /metrics && head /metrics/sar.txt'
-```
-
-Verify the tarball is well-formed before ending the window:
-
-```bash
-tar -tf metrics-out/metrics...tar.gz | head
-```
-
-For long-running captures, watch the archive grow on the pod filesystem; a non-growing size usually means one of the background samplers has died and the container is writing to a stale file descriptor:
-
-```bash
-kubectl -n metrics-debug exec -it metrics-daemonset- -- \
-  bash -c 'du -sh /metrics; wc -l /metrics/*.txt'
-```
-
-If the container itself is being killed by the kubelet for running out of memory during collection (the `sar` / `pidstat` log files grow indefinitely), raise `INTERVAL`, shorten `DURATION`, or add `resources.limits.memory` on the DaemonSet Pod spec and use log rotation inside the collector.
diff --git a/docs/en/solutions/Collecting_host_level_performance_and_network_telemetry_via_privileged_DaemonSets_on_ACP.md b/docs/en/solutions/Collecting_host_level_performance_and_network_telemetry_via_privileged_DaemonSets_on_ACP.md
new file mode 100644
index 00000000..f7d85f1e
--- /dev/null
+++ b/docs/en/solutions/Collecting_host_level_performance_and_network_telemetry_via_privileged_DaemonSets_on_ACP.md
@@ -0,0 +1,95 @@
+---
+kind:
+  - How To
+products:
+  - Alauda Container Platform
+ProductsVersion:
+  - 4.1.0,4.2.x
+---
+
+# Collecting host-level performance and network telemetry via privileged DaemonSets on ACP
+
+## Issue
+
+On Alauda Container Platform clusters where the first-party monitoring stack is not yet deployed, operators still need raw, per-node kernel telemetry (CPU, memory, IRQ/softirq, process tables, block I/O, and full network state) to diagnose worker-node performance problems.
The standard delivery vehicle is a per-node DaemonSet running a privileged container with `hostPID: true`, `hostIPC: true`, `hostNetwork: true`, and `securityContext.runAsUser: 0` plus `privileged: true`, mounting a ConfigMap that carries two scripts (`install-requirements.sh`, `collect-metrics.sh`) at `/entrypoint`, then running `bash /entrypoint/install-requirements.sh && bash /entrypoint/collect-metrics.sh && sleep infinity`.
+
+ACP first-party monitoring is delivered by the `prometheus-operator` PackageManifest (displayName `Alauda build of Prometheus`), with the higher-level descriptors `Alauda Container Platform Monitoring for Prometheus` and `Alauda Container Platform Monitoring for VictoriaMetrics`. On a freshly-installed cluster without that operator (no `prometheuses.monitoring.coreos.com` CRD and no node-exporter pod present), the DaemonSet pattern below is the only available host-level telemetry path.
+
+## Root Cause
+
+Inside a `hostPID: true` privileged pod, processes on the underlying node are visible to in-container tooling: `ps`, `pidstat -ruwh`, and `top` enumerate node-wide processes, and `/proc/interrupts` plus `/proc/softirqs` reflect the host kernel's per-CPU IRQ/softirq counters rather than the container's PID-1 subview. This behavior was verified on this cluster, which runs Linux kernel 5.15 with containerd 2.2.1 as the node runtime. That direct host-namespace access is what makes a single DaemonSet pod sufficient to capture genuine node-level signal without an in-cluster monitoring agent.
+
+ACP does not ship a vendor-specific pod-security-constraints CRD chain; pod admission is governed exclusively by Kubernetes PodSecurity Admission (PSA), and the cluster-wide default policy is configured at `warn` level against the `baseline` profile. A privileged hostPID/hostIPC/hostNetwork pod is therefore admitted with a `PodSecurity baseline:latest` warning and runs without any cluster-side security-policy grant on the target namespace.
+
+## Resolution
+
+Use a namespace with no PSA `enforce` label (the cluster default of warn-only `baseline` admits the workload). The pod image must be reachable from the cluster: either a base image (e.g. `alpine` or `ubuntu`) mirrored into the internal registry, or a tool image such as `registry.alauda.cn:60080/3rdparty/kubectl:v4.3.1` extended with the diagnostic toolchain. The DaemonSet target set is the node group of interest; on this ACP cluster topology, worker nodes carry no `node-role.kubernetes.io/worker` label (only the control-plane node carries `node-role.kubernetes.io/*` role labels), so omit the nodeSelector entirely to land on every node, or scope to non-control-plane nodes with `affinity.nodeAffinity` using a `node-role.kubernetes.io/control-plane DoesNotExist` requirement.
+
+DaemonSet skeleton (substitute `` with an internally-pullable base image carrying `bash`, `tar`, and a package manager; the scripts described below, delivered through the ConfigMap, carry the actual collection logic):
+
+```yaml
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: metrics-daemonset
+  namespace: metrics-debug
+spec:
+  selector:
+    matchLabels:
+      app: metrics-daemonset
+  template:
+    metadata:
+      labels:
+        app: metrics-daemonset
+    spec:
+      hostPID: true
+      hostIPC: true
+      hostNetwork: true
+      affinity:
+        nodeAffinity:
+          requiredDuringSchedulingIgnoredDuringExecution:
+            nodeSelectorTerms:
+              - matchExpressions:
+                  - key: node-role.kubernetes.io/control-plane
+                    operator: DoesNotExist
+      containers:
+        - name: collector
+          image:
+          securityContext:
+            runAsUser: 0
+            privileged: true
+          command: ["/bin/bash", "-c"]
+          args:
+            - "bash /entrypoint/install-requirements.sh && bash /entrypoint/collect-metrics.sh && sleep infinity"
+          volumeMounts:
+            - name: entrypoint
+              mountPath: /entrypoint
+      volumes:
+        - name: entrypoint
+          configMap:
+            name: collector-scripts
+            defaultMode: 0755
+```
+
+The `collect-metrics.sh` script collects five mandatory host-level performance streams in parallel, each as a background
process writing into `/metrics/`: `pidstat -ruwh ${INTERVAL}` to `pidstat.txt`, `sar -A ${INTERVAL}` to `sar.txt`, a `while` loop emitting `date; free -m; sleep ${INTERVAL}` to `free.txt`, the same loop pattern dumping `/proc/softirqs` to `softirqs.txt` and `/proc/interrupts` to `interrupts.txt`, plus two optional streams `ps aux | sort -nrk 3,3 | head -n 20` and `iotop -Pobt`. After `${DURATION}` the script kills the children, calls `sync`, then `tar -czf /metrics.tar.gz /metrics`.
+
+For network-layer capture, deploy a second DaemonSet (`monitor-daemonset`) sharing the same privileged hostPID/hostIPC/hostNetwork shape but running a `monitor.sh` script that, every `${DELAY}` seconds for `${ITERATIONS}` iterations, collects `ss`, `nstat`, `netstat -s`, `ip address/route/neigh`, `tc -s qdisc`, `cat /proc/interrupts`, `/proc/net/softnet_stat`, `/proc/vmstat`, `ps -alfe`, `mpstat`, `top -c -b -n1`, `numastat`, `cat /proc/softirqs`, `cat /proc/net/sockstat`, `/proc/net/dev`, `ethtool -S `, and per-interface `/sys/.../statistics/*` into `${HOSTNAME}-network_stats_${now}/`. Independently, `conntrack -L -n` runs in a background loop. The final archive is `/network-metrics.tar.gz`.
+
+## Diagnostic Steps
+
+Once a collection run finishes, extract the bundle from each pod and verify the archive:
+
+```bash
+kubectl cp -n metrics-debug :/metrics.tar.gz metrics...tar.gz
+tar -tf metrics...tar.gz
+```
+
+The bundle contents (`metrics/pidstat.txt`, `metrics/sar.txt`, `metrics/interrupts.txt`, `metrics/softirqs.txt`, `metrics/free.txt`, and the optional `metrics/ps.txt` and `metrics/iotop.txt`) are independent of any in-cluster monitoring stack and represent raw kernel telemetry directly off the node.
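
As one way to read the extracted `softirqs.txt`, the sketch below sums the per-CPU counters of one softirq class in each snapshot and prints the increase between consecutive snapshots. `softirq_deltas` is a hypothetical helper, and the inline sample only imitates the shape the collector loop writes (a `date` line, the CPU header, then one row per class); the numbers are made up.

```shell
#!/bin/bash
# Hypothetical helper: read softirqs snapshots on stdin and print, for the
# class named in $1, the growth of the all-CPU total between snapshots.
softirq_deltas() {
  local class="$1"
  awk -v key="${class}:" '
    $1 == key {
      total = 0
      for (i = 2; i <= NF; i++) total += $i   # sum the per-CPU columns
      if (seen) print total - prev            # skip the first snapshot
      prev = total
      seen = 1
    }
  '
}

# Made-up two-CPU sample with three snapshots.
softirq_deltas NET_RX <<'EOF'
Mon Jan  1 00:00:00 UTC 2024
                    CPU0       CPU1
      NET_TX:          5          7
      NET_RX:        100        200
Mon Jan  1 00:00:01 UTC 2024
                    CPU0       CPU1
      NET_TX:          6          7
      NET_RX:        150        260
Mon Jan  1 00:00:02 UTC 2024
                    CPU0       CPU1
      NET_TX:          6          9
      NET_RX:        150        300
EOF
```

For the sample above the totals are 300, 410, and 450, so the helper prints 110 and then 40. Against a real bundle it could be fed with `softirq_deltas NET_RX < metrics/softirqs.txt`.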
+
+For the network DaemonSet, retrieve `/network-metrics.tar.gz` the same way:
+
+```bash
+kubectl cp -n metrics-debug :/network-metrics.tar.gz network-metrics...tar.gz
+```
+
+Inspect the per-iteration subdirectories for `ss`, `nstat`, `ethtool -S`, conntrack snapshots, and per-interface counters to correlate kernel-level network state with the host-OS performance streams from the first DaemonSet.
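
When scanning the per-interface counters, a quick filter for interfaces that report drops can help narrow the search. The sketch below assumes a snapshot in the standard `/proc/net/dev` layout is supplied on stdin; `iface_drops` is a hypothetical helper, and the file name used for that snapshot inside each iteration directory is not specified here.

```shell
#!/bin/bash
# Hypothetical helper: read a /proc/net/dev snapshot on stdin and list the
# interfaces whose receive or transmit drop counter is non-zero.
iface_drops() {
  awk '
    /\|/ { next }              # skip the two header lines
    {
      sub(/:/, " ")            # tolerate "eth0:123" with no space after the colon
      # after the substitution: $1 = interface, $5 = rx drop, $13 = tx drop
      if ($5 > 0 || $13 > 0) print $1, "rx_drop=" $5, "tx_drop=" $13
    }
  '
}

# Made-up sample in the /proc/net/dev layout.
iface_drops <<'EOF'
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0:  500000    4000    0   12    0     0          0         0   300000    2500    0    3    0     0       0          0
EOF
```

The counters are cumulative since boot, so comparing the same interface across iterations is what shows whether drops occurred during the capture window.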