From 59629c31c37816ff637143ec312fd1ef30deb9e0 Mon Sep 17 00:00:00 2001
From: Komh
Date: Sat, 2 May 2026 16:47:04 +0000
Subject: [PATCH 1/2] [observability] Troubleshooting the NodeClockNotSynchronising alert

---
 ...ing_the_NodeClockNotSynchronising_alert.md | 107 ++++++++++++++++++
 1 file changed, 107 insertions(+)
 create mode 100644 docs/en/solutions/Troubleshooting_the_NodeClockNotSynchronising_alert.md

diff --git a/docs/en/solutions/Troubleshooting_the_NodeClockNotSynchronising_alert.md b/docs/en/solutions/Troubleshooting_the_NodeClockNotSynchronising_alert.md
new file mode 100644
index 000000000..cf047a9db
--- /dev/null
+++ b/docs/en/solutions/Troubleshooting_the_NodeClockNotSynchronising_alert.md
@@ -0,0 +1,107 @@
+---
+kind:
+  - Troubleshooting
+products:
+  - Alauda Container Platform
+ProductsVersion:
+  - 4.1.0,4.2.x
+---
+
+# Troubleshooting the NodeClockNotSynchronising alert
+## Issue
+
+The Prometheus monitoring stack fires a `NodeClockNotSynchronising` alert against one or more cluster nodes. The default alert expression is:
+
+```text
+alert: NodeClockNotSynchronising
+expr: min_over_time(node_timex_sync_status[5m]) == 0
+for: 10m
+labels: { severity: warning }
+annotations:
+  summary: "Clock not synchronising."
+  message: "Clock on {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host."
+```
+
+The alert means `node_exporter` has observed the kernel `adjtimex` timex sync status to be zero for longer than the `for:` window. Pods inherit the host clock, so any node that is no longer stepping or slewing its clock to an upstream reference will drift relative to the rest of the cluster — with eventual consequences for etcd leader election, certificate validation and log correlation.
+
+## Root Cause
+
+`node_exporter` exposes the kernel timex status as the `node_timex_sync_status` metric. 
When the host NTP client (chrony on almost every ACP node) is not actively disciplining the system clock, the kernel reports unsynchronised and the gauge goes to zero. Typical causes:
+
+- The chrony service on the node is stopped or crash-looping.
+- All configured NTP sources are unreachable (blocked UDP/123, DNS resolution failure, firewalled out from the NTP pool).
+- The configured sources have a worse stratum than `local stratum` and chrony therefore refuses to discipline from them.
+- The offset has grown so large that chrony refuses to step without `makestep` being set, leaving it stuck.
+
+## Resolution
+
+The alert is fixed by getting chrony back into a synchronised state on the affected node — pod clocks will follow automatically.
+
+1. Ensure the chrony service is running and has at least one reachable upstream source. From a shell on the node (use `kubectl debug node/ --image= --profile=sysadmin -- bash` when you do not have SSH; ACP's cluster PSA rejects `chroot /host`, so the debug pod's `/host` bind-mount is the supported path for reading host files):
+
+   ```bash
+   systemctl status chronyd
+   chronyc tracking
+   chronyc sources -v
+   chronyc sourcestats -v
+   ```
+
+   `Reference ID` of `00000000` or an empty source list means the daemon has no upstream it can talk to.
+
+2. If the issue is network reachability, verify that outbound UDP/123 to the configured NTP servers is open from the node. A short packet capture during a poll interval confirms whether replies come back:
+
+   ```bash
+   tcpdump -n -i any port 123 -vvv -w /tmp/chrony.pcap
+   ```
+
+3. If the issue is configuration (wrong server hostnames, stratum too high, no `makestep` at boot), update `/etc/chrony.conf` and push it consistently to every affected node through your node configuration channel rather than editing one host manually. A declarative NTP config belongs with the rest of node OS configuration so that replacement nodes inherit it.
+
+4. 
For persistent chrony debugging, enable the tracking and measurement logs in `chrony.conf` and then inspect `/var/log/chrony/`:
+
+   ```text
+   logdir /var/log/chrony
+   log tracking measurements statistics
+   ```
+
+Once chrony reports a non-zero stratum and a small offset, `node_timex_sync_status` returns to `1` on the next scrape and the alert clears after the `for:` window.
+
+## Diagnostic Steps
+
+Confirm the cluster actually sees the alert firing, and which instances are affected, by querying Alertmanager and Prometheus directly. Port-forward Alertmanager / Prometheus from the monitoring namespace and query their HTTP APIs:
+
+```bash
+kubectl -n cpaas-system port-forward svc/alertmanager-main 9093:9093 &
+curl -sk 'http://localhost:9093/api/v2/alerts' | jq '.[].labels'
+```
+
+```bash
+kubectl -n cpaas-system port-forward svc/prometheus-k8s 9090:9090 &
+curl -sk 'http://localhost:9090/api/v1/query?query=node_timex_sync_status' | jq
+curl -sk 'http://localhost:9090/api/v1/query?query=node_timex_offset_seconds' | jq
+curl -sk 'http://localhost:9090/api/v1/query?query=node_timex_maxerror_seconds' | jq
+```
+
+Combine the three to reproduce the alert expression locally and identify noisy nodes:
+
+```text
+min_over_time(node_timex_sync_status[5m]) == 0
+  and node_timex_maxerror_seconds >= 16
+  and (
+    (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0)
+    or
+    (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
+  )
+```
+
+On each reported node, collect the full chrony state for support:
+
+```bash
+journalctl -u chronyd --since "1 hour ago"
+chronyc -N sources -a
+chronyc activity -v
+chronyc ntpdata
+chronyc clients
+cat /etc/chrony.conf
+```
+
+If `chronyc ntpdata` reports `501 Not authorized`, the `local stratum` line in `chrony.conf` is higher than the stratum of the upstream server, and chrony is refusing to accept it — lower the local stratum below the upstream value. 
NTP servers specified by hostname must also resolve via the node's `/etc/resolv.conf`; `chronyc activity` showing sources with unknown addresses signals a DNS problem rather than an NTP problem.

From e06b64e6a346b69194cb15e586e9be4aed68fc0f Mon Sep 17 00:00:00 2001
From: Komh
Date: Sun, 17 May 2026 02:40:17 +0000
Subject: [PATCH 2/2] [observability] Diagnose NodeClockNotSynchronising on ACP Ubuntu nodes

---
 ...ockNotSynchronising_on_ACP_Ubuntu_nodes.md | 65 +++++++++++
 ...ing_the_NodeClockNotSynchronising_alert.md | 107 ------------------
 2 files changed, 65 insertions(+), 107 deletions(-)
 create mode 100644 docs/en/solutions/Diagnose_NodeClockNotSynchronising_on_ACP_Ubuntu_nodes.md
 delete mode 100644 docs/en/solutions/Troubleshooting_the_NodeClockNotSynchronising_alert.md

diff --git a/docs/en/solutions/Diagnose_NodeClockNotSynchronising_on_ACP_Ubuntu_nodes.md b/docs/en/solutions/Diagnose_NodeClockNotSynchronising_on_ACP_Ubuntu_nodes.md
new file mode 100644
index 000000000..592bea58e
--- /dev/null
+++ b/docs/en/solutions/Diagnose_NodeClockNotSynchronising_on_ACP_Ubuntu_nodes.md
@@ -0,0 +1,65 @@
+---
+kind:
+  - Troubleshooting
+products:
+  - Alauda Container Platform
+ProductsVersion:
+  - 4.1.0,4.2.x
+---
+
+# Diagnose NodeClockNotSynchronising on ACP Ubuntu nodes
+
+## Issue
+
+On an ACP cluster running the `prometheus` ModulePlugin (mainChart `ait/chart-kube-prometheus`, v4.3.x), the `NodeClockNotSynchronising` alert follows the standard upstream kube-prometheus form: a PrometheusRule that triggers on a sustained `node_timex_sync_status == 0` reading from node-exporter, gated by a multi-minute `for` window at warning severity. The rule is not pre-shipped by the ACP chart itself; when the alert is observed on a cluster it has either been authored by the operator on top of the installed PrometheusRule CRD or carried in via the upstream rule set. 
Operators verifying installation state should not expect `kubectl get prometheusrule -A` to list a chart-shipped rule with this exact name.
+
+Because the underlying metric describes the host kernel's own view of the clock, the alert is not pointing at a Kubernetes object. The diagnosis target is the host-level NTP daemon: on Ubuntu 22.04 nodes that is chrony 4.2, managed by the systemd unit `chrony.service` with configuration in `/etc/chrony/chrony.conf`. Pod clocks track the host clock (the wall clock, `CLOCK_REALTIME`, is not namespaced per container), so resolving chrony on the affected node also resolves the drift seen from pods on that node.
+
+## Root Cause
+
+`node_timex_sync_status` is produced by node-exporter's `timex` collector. The collector is a thin wrapper around the kernel `adjtimex(2)` syscall and emits `0` whenever the kernel sets the `STA_UNSYNC` status bit on the `timex` struct, i.e. when the kernel itself no longer trusts its clock as synchronised. Two related metrics from the same collector, `node_timex_maxerror_seconds` and `node_timex_offset_seconds`, quantify how far the kernel believes the clock can be off and the most recent offset estimate, and they remain useful for forensic work even after the boolean sync flag flips back to `1`.
+
+A `STA_UNSYNC` reading therefore means one of: chronyd has no reachable, selected upstream source; chronyd is running but its samples are being discarded as out-of-spec; or chronyd is not running on the node at all. The alert does not distinguish between these; that distinction is recovered by the diagnostic steps below.
+
+## Resolution
+
+Drive recovery from the node itself, against chrony. The standard diagnostic set from the chrony client covers the three failure modes above. 
`chronyc sources -v` lists every configured source with its current state and last sample; `chronyc sourcestats -v` adds regression statistics over the recent sample window; `chronyc tracking` shows the currently selected reference, the estimated offset and skew, and the last update time; `chronyc activity` reports counts of online, offline, burst-mode, and unknown-address sources; and `chronyc ntpdata ` exposes per-server protocol details. Read together, these show whether chronyd has at least one reachable and selected source, and if not, at which layer it failed:
+
+```bash
+chronyc sources -v
+chronyc sourcestats -v
+chronyc tracking
+chronyc activity
+chronyc ntpdata
+```
+
+When `chronyc activity` reports a non-zero count under "sources with unknown address", chronyd has not yet resolved one or more configured source hostnames. This counter is independent of the online / offline counts and reflects the state of chronyd's async resolver, which in turn depends on the node's DNS configuration (on Ubuntu nodes, `/etc/resolv.conf`). Treat any non-zero unknown-address count as a DNS-side issue on the node, not as an NTP-side issue.
+
+If the active diagnostic snapshot shows healthy sources yet the clock still drifts, enable continuous logging so the next drift event can be reconstructed after the fact. Adding the following two lines to `/etc/chrony/chrony.conf` and restarting `chrony.service` causes chronyd to write per-measurement, per-tracking-update, and per-source statistics logs under `/var/log/chrony/` as plain-text rows of whitespace-separated columns, which retain offset, skew, and source-selection history beyond what the live `chronyc` snapshots show:
+
+```text
+logdir /var/log/chrony
+log tracking measurements statistics
+```
+
+## Diagnostic Steps
+
+Confirm first that the kernel itself reports the clock as unsynchronised, because that is the actual condition the alert is reacting to. 
The metric is produced from `adjtimex(2)` so its value is independent of whether Prometheus is currently scraping; a value of `0` corresponds to the `STA_UNSYNC` bit being set. Reference syntax for the in-cluster query (substitute the installed prometheus pod name and node label):
+
+```bash
+kubectl exec -n cpaas-system -c prometheus -- \
+  promtool query instant http://localhost:9090 \
+  'node_timex_sync_status{instance=""}'
+```
+
+Next, inspect chrony directly on the affected node. Because cluster admission does not grant debug pods `privileged` mode, recipes that `chroot /host` to run host binaries from a debug pod will fail with `Operation not permitted`; on this fleet, read host state through the bind-mounted `/host/proc/...`, `/host/etc/...` paths instead of chrooting. Where chronyc execution is required, run the subcommands from the Resolution section on the host shell (the bare `chronyc ` form, which talks to the local cmdmon Unix socket), starting with:
+
+```bash
+chronyc sources -v
+chronyc tracking
+chronyc activity
+```
+
+Interpret `chronyc activity` carefully: an online count of zero with a non-zero unknown-address count means chronyd cannot even name-resolve its pool / server entries, and the next thing to verify is `/etc/resolv.conf` on the node and reachability of the listed nameservers. An online count of zero with all sources offline instead means name resolution worked but chronyd has been told its sources are offline (for example by the `offline` source option or `chronyc offline`) and is not polling them. Whether an online source actually answers is shown by the `Reach` column of `chronyc sources -v`; a reach of `0` is a network / firewall question, not a chrony question.
+
+If sources are reachable and selected but the clock still walks outside the expected envelope, persist the data needed to investigate after recovery. With `logdir /var/log/chrony` and `log tracking measurements statistics` in `chrony.conf`, the files `tracking.log`, `measurements.log`, and `statistics.log` accumulate whitespace-separated rows that can be correlated against the alert firing window once the next drift event occurs.
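The correlation against the alert firing window described above can be sketched as a small shell helper. This is a minimal sketch and not part of the patch: `tracking_window` is a hypothetical function name, and it assumes the default `tracking.log` layout, in which the first two fields are the UTC date and time, the third is the selected source, and the seventh is the offset in seconds. Check the banner line in your own log before relying on the field numbers.

```shell
# Hypothetical helper: print "date time source offset" for tracking.log rows
# whose timestamp falls inside [from, to]. Bounds are "YYYY-MM-DD HH:MM:SS" in UTC.
# Assumption: default tracking.log columns (1=date, 2=time, 3=source, 7=offset).
tracking_window() {
  local from="$1" to="$2" logfile="${3:-/var/log/chrony/tracking.log}"
  awk -v from="$from" -v to="$to" '
    # Skip banner and separator lines; data rows start with a YYYY-MM-DD date.
    $1 !~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ { next }
    {
      ts = $1 " " $2
      # Timestamps in this form sort lexically, so string comparison suffices.
      if (ts >= from && ts <= to) print ts, $3, $7
    }
  ' "$logfile"
}
```

For example, `tracking_window '2026-05-17 02:00:00' '2026-05-17 03:00:00'` prints the rows logged during that hour. The bounds are compared as UTC strings, matching the log timestamps, so convert the alert's firing time to UTC first.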
diff --git a/docs/en/solutions/Troubleshooting_the_NodeClockNotSynchronising_alert.md b/docs/en/solutions/Troubleshooting_the_NodeClockNotSynchronising_alert.md
deleted file mode 100644
index cf047a9db..000000000
--- a/docs/en/solutions/Troubleshooting_the_NodeClockNotSynchronising_alert.md
+++ /dev/null
@@ -1,107 +0,0 @@
----
-kind:
-  - Troubleshooting
-products:
-  - Alauda Container Platform
-ProductsVersion:
-  - 4.1.0,4.2.x
----
-
-# Troubleshooting the NodeClockNotSynchronising alert
-## Issue
-
-The Prometheus monitoring stack fires a `NodeClockNotSynchronising` alert against one or more cluster nodes. The default alert expression is:
-
-```text
-alert: NodeClockNotSynchronising
-expr: min_over_time(node_timex_sync_status[5m]) == 0
-for: 10m
-labels: { severity: warning }
-annotations:
-  summary: "Clock not synchronising."
-  message: "Clock on {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host."
-```
-
-The alert means `node_exporter` has observed the kernel `adjtimex` timex sync status to be zero for longer than the `for:` window. Pods inherit the host clock, so any node that is no longer stepping or slewing its clock to an upstream reference will drift relative to the rest of the cluster — with eventual consequences for etcd leader election, certificate validation and log correlation.
-
-## Root Cause
-
-`node_exporter` exposes the kernel timex status as the `node_timex_sync_status` metric. When the host NTP client (chrony on almost every ACP node) is not actively disciplining the system clock, the kernel reports unsynchronised and the gauge goes to zero. Typical causes:
-
-- The chrony service on the node is stopped or crash-looping.
-- All configured NTP sources are unreachable (blocked UDP/123, DNS resolution failure, firewalled out from the NTP pool).
-- The configured sources have a worse stratum than `local stratum` and chrony therefore refuses to discipline from them. 
-- The offset has grown so large that chrony refuses to step without `makestep` being set, leaving it stuck.
-
-## Resolution
-
-The alert is fixed by getting chrony back into a synchronised state on the affected node — pod clocks will follow automatically.
-
-1. Ensure the chrony service is running and has at least one reachable upstream source. From a shell on the node (use `kubectl debug node/ --image= --profile=sysadmin -- bash` when you do not have SSH; ACP's cluster PSA rejects `chroot /host`, so the debug pod's `/host` bind-mount is the supported path for reading host files):
-
-   ```bash
-   systemctl status chronyd
-   chronyc tracking
-   chronyc sources -v
-   chronyc sourcestats -v
-   ```
-
-   `Reference ID` of `00000000` or an empty source list means the daemon has no upstream it can talk to.
-
-2. If the issue is network reachability, verify that outbound UDP/123 to the configured NTP servers is open from the node. A short packet capture during a poll interval confirms whether replies come back:
-
-   ```bash
-   tcpdump -n -i any port 123 -vvv -w /tmp/chrony.pcap
-   ```
-
-3. If the issue is configuration (wrong server hostnames, stratum too high, no `makestep` at boot), update `/etc/chrony.conf` and push it consistently to every affected node through your node configuration channel rather than editing one host manually. A declarative NTP config belongs with the rest of node OS configuration so that replacement nodes inherit it.
-
-4. For persistent chrony debugging, enable the tracking and measurement logs in `chrony.conf` and then inspect `/var/log/chrony/`:
-
-   ```text
-   logdir /var/log/chrony
-   log tracking measurements statistics
-   ```
-
-Once chrony reports a non-zero stratum and a small offset, `node_timex_sync_status` returns to `1` on the next scrape and the alert clears after the `for:` window.
-
-## Diagnostic Steps
-
-Confirm the cluster actually sees the alert firing, and which instances are affected, by querying Alertmanager and Prometheus directly. 
Port-forward Alertmanager / Prometheus from the monitoring namespace and query their HTTP APIs:
-
-```bash
-kubectl -n cpaas-system port-forward svc/alertmanager-main 9093:9093 &
-curl -sk 'http://localhost:9093/api/v2/alerts' | jq '.[].labels'
-```
-
-```bash
-kubectl -n cpaas-system port-forward svc/prometheus-k8s 9090:9090 &
-curl -sk 'http://localhost:9090/api/v1/query?query=node_timex_sync_status' | jq
-curl -sk 'http://localhost:9090/api/v1/query?query=node_timex_offset_seconds' | jq
-curl -sk 'http://localhost:9090/api/v1/query?query=node_timex_maxerror_seconds' | jq
-```
-
-Combine the three to reproduce the alert expression locally and identify noisy nodes:
-
-```text
-min_over_time(node_timex_sync_status[5m]) == 0
-  and node_timex_maxerror_seconds >= 16
-  and (
-    (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0)
-    or
-    (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
-  )
-```
-
-On each reported node, collect the full chrony state for support:
-
-```bash
-journalctl -u chronyd --since "1 hour ago"
-chronyc -N sources -a
-chronyc activity -v
-chronyc ntpdata
-chronyc clients
-cat /etc/chrony.conf
-```
-
-If `chronyc ntpdata` reports `501 Not authorized`, the `local stratum` line in `chrony.conf` is higher than the stratum of the upstream server, and chrony is refusing to accept it — lower the local stratum below the upstream value. NTP servers specified by hostname must also resolve via the node's `/etc/resolv.conf`; `chronyc activity` showing sources with unknown addresses signals a DNS problem rather than an NTP problem.