[observability] Diagnose NodeClockNotSynchronising on ACP Ubuntu nodes#485

Open
jing2uo wants to merge 2 commits into main from kb/2026-02/troubleshooting-the-nodeclocknotsynchron

@jing2uo (Collaborator) commented Apr 24, 2026

Adds a new ACP KB article.

@coderabbitai Bot (Contributor) commented Apr 24, 2026

Walkthrough

This PR adds a new troubleshooting guide documenting how to diagnose and resolve the NodeClockNotSynchronising alert on ACP Ubuntu nodes. The guide explains the upstream Prometheus alert metric, provides diagnostic commands using chrony and promtool, and recommends configuration changes for persistent timing data logging.

Changes

NodeClockNotSynchronising Troubleshooting Documentation

| Layer / File(s) | Summary |
| --- | --- |
| NodeClockNotSynchronising diagnostic guide: `docs/en/solutions/Diagnose_NodeClockNotSynchronising_on_ACP_Ubuntu_nodes.md` | New document explaining the alert meaning from node-exporter's `node_timex_sync_status` metric, concrete chrony/chronyc diagnostic commands for DNS, connectivity, and service status issues, promtool steps to validate kernel `STA_UNSYNC` state, and chrony configuration changes to enable CSV logging for post-recovery analysis. |

Estimated Code Review Effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Poem

🐰 A guide was born, so clear and bright,
For clocks that dance but lose the light,
With chronyc spells and promtool's sight,
We sync the nodes—all will be right! ✨⏰

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and specifically describes the main change: adding a troubleshooting guide for the NodeClockNotSynchronising alert on ACP Ubuntu nodes, which matches the documentation file added. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@jing2uo force-pushed the kb/2026-02/troubleshooting-the-nodeclocknotsynchron branch from ae9cefb to 59629c3 on May 2, 2026 16:47
@jing2uo changed the title from "[observability] Troubleshooting the NodeClockNotSynchronising alert" to "[observability] Diagnose NodeClockNotSynchronising on ACP Ubuntu nodes" on May 17, 2026
@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/en/solutions/Diagnose_NodeClockNotSynchronising_on_ACP_Ubuntu_nodes.md`:
- Line 47: Change the sentence that currently says the metric is “independent of
whether Prometheus is currently scraping” to clearly separate metric semantics
from query freshness: state that the metric value itself is produced by
adjtimex(2) and that a value of 0 corresponds to the STA_UNSYNC bit, but that
using promtool query instant (or any Prometheus query) will only return a sample
if the target has been scraped recently — results can be stale or missing even
though the kernel state exists. Update the wording around the
adjtimex(2)/STA_UNSYNC mention and add a short clause about scrape freshness and
staleness behavior for promtool query instant.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1ee6d99a-6a01-4b3c-9ce4-dcdba85eafed

📥 Commits

Reviewing files that changed from the base of the PR and between 181f78f and e06b64e.

📒 Files selected for processing (1)
  • docs/en/solutions/Diagnose_NodeClockNotSynchronising_on_ACP_Ubuntu_nodes.md


## Diagnostic Steps

Confirm first that the kernel itself reports the clock as unsynchronised — that is the actual condition the alert is reacting to. The metric is produced from `adjtimex(2)` so its value is independent of whether Prometheus is currently scraping; a value of `0` corresponds to the `STA_UNSYNC` bit being set. Reference syntax for the in-cluster query (substitute the installed prometheus pod name and node label):

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify scrape-dependence of the queried value.

At Line 47, saying the value is “independent of whether Prometheus is currently scraping” is misleading for operators using promtool query instant; the returned sample still depends on successful recent scrapes (or can be stale/missing). Please reword to separate metric semantics from query freshness.

✏️ Proposed wording
-Confirm first that the kernel itself reports the clock as unsynchronised — that is the actual condition the alert is reacting to. The metric is produced from `adjtimex(2)` so its value is independent of whether Prometheus is currently scraping; a value of `0` corresponds to the `STA_UNSYNC` bit being set. Reference syntax for the in-cluster query (substitute the installed prometheus pod name and node label):
+Confirm first that the kernel itself reports the clock as unsynchronised — that is the actual condition the alert is reacting to. The metric is produced from `adjtimex(2)` and a value of `0` corresponds to the `STA_UNSYNC` bit being set; however, the value returned by `promtool query instant` reflects the most recent successfully scraped sample. Reference syntax for the in-cluster query (substitute the installed prometheus pod name and node label):
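
To make the freshness point concrete, here is a minimal Python sketch of how a caller might tell "kernel reports unsynchronised" apart from "no recent sample". It is an illustration, not part of the article under review. It assumes two instant queries have already been run against the Prometheus HTTP API and parsed from JSON: one for `node_timex_sync_status{...}` and one for `timestamp(node_timex_sync_status{...})`. The second query is needed because an instant query stamps each returned sample with the evaluation time rather than the scrape time. The function names, the `node-1:9100` instance label, and the 120-second threshold are all hypothetical.

```python
from typing import Any, Dict

# Hypothetical threshold: treat samples older than this as stale.
MAX_AGE_SECONDS = 120.0


def classify_sync_sample(
    value_response: Dict[str, Any],
    timestamp_response: Dict[str, Any],
    now: float,
    max_age_seconds: float = MAX_AGE_SECONDS,
) -> str:
    """Return 'missing', 'stale', 'unsynchronised', or 'synchronised'.

    value_response:     parsed JSON for node_timex_sync_status{...}
    timestamp_response: parsed JSON for timestamp(node_timex_sync_status{...})
    """
    values = value_response.get("data", {}).get("result", [])
    stamps = timestamp_response.get("data", {}).get("result", [])
    if not values or not stamps:
        # No series came back: the target may be down or never scraped.
        return "missing"
    sample_value = float(values[0]["value"][1])
    scraped_at = float(stamps[0]["value"][1])
    if now - scraped_at > max_age_seconds:
        # A value exists, but it does not describe the current kernel state.
        return "stale"
    return "unsynchronised" if sample_value == 0 else "synchronised"


def instant_vector(eval_time: float, value: str) -> Dict[str, Any]:
    """Build a response shaped like /api/v1/query output (illustrative data only)."""
    return {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"instance": "node-1:9100"}, "value": [eval_time, value]}
            ],
        },
    }


# Example with made-up data: value 0, last scraped 15 seconds before "now".
now = 1_700_000_000.0
print(classify_sync_sample(instant_vector(now, "0"), instant_vector(now, str(now - 15)), now))
```

Only the last branch says anything about the kernel clock; the `missing` and `stale` outcomes point at the scrape path instead, which is the distinction the proposed wording tries to preserve.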
