Skip to content

*: insufficient round change metric#4512

Merged
KaloyanTanev merged 7 commits into
mainfrom
kalo/insufficient-round-change
May 7, 2026
Merged

*: insufficient round change metric#4512
KaloyanTanev merged 7 commits into
mainfrom
kalo/insufficient-round-change

Conversation

@KaloyanTanev
Copy link
Copy Markdown
Collaborator

If a proposed value from leader is too close to the threshold of round expiration, we can see some peers receiving it on time while others don't. It'd be nice to track such discrepancies.

Here I'm adding a metric that tracks "insufficient round change messages" which in the majority of the scenarios means other peers did not agree on a round change. The case when they don't agree is if they are already doing another round. I'm splitting the metric in 2 - received "consensus decided" message or not. If a node receives "consensus decided", even if it wasn't part of the consensus (i.e.: stuck on another round), it still counts the consensus as successful and propagates the data decided upon to the subscribers (= the VC). If it doesn't receive it though, then the QBFT fails after some timeouts (= the VC never receives the data)

category: feature
ticket: #4478

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds observability for QBFT instances that time out in rounds >1 due to receiving fewer than quorum round-change messages (the scenario described in #4478, often caused by late leader payload propagation and peer set segregation). It introduces a new Prometheus counter with an outcome label to distinguish between nodes that later recover via MsgDecided vs instances that ultimately time out.

Changes:

  • Added core_consensus_insufficient_round_changes_total{protocol,duty,timer,outcome} and incremented it from the QBFT instance loop on the “recovered via decided” and “full timeout” paths.
  • Added a health check that warns when this counter increases over the scrape window.
  • Added unit tests covering the detection helper, metric outcome behavior, and the new health check.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
p2p/sender.go Minor formatting-only change (blank line) near deferred metric observation.
docs/metrics.md Documents the new core_consensus_insufficient_round_changes_total metric and labels.
core/consensus/qbft/qbft.go Tracks “insufficient round changes” during round timeouts and emits the new metric with `outcome=decided
core/consensus/qbft/qbft_internal_test.go Adds tests for isInsufficientRoundChanges and for the two metric outcomes (decided vs timeout).
core/consensus/metrics/metrics.go Registers the new counter vec and exposes IncInsufficientRoundChanges on the metrics interface/implementation.
core/consensus/metrics/metrics_test.go Adds test for the new metric counter emission.
app/health/select.go Simplifies sumLabels to always sum all series in a metric family (no label filtering).
app/health/checks.go Adds insufficient_round_changes health check based on counter increase.
app/health/checks_internal_test.go Adds coverage for the new health check behavior across decided/timeout outcomes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread core/consensus/metrics/metrics_test.go
Comment thread core/consensus/qbft/qbft.go Outdated
Comment thread core/consensus/metrics/metrics.go
@codecov
Copy link
Copy Markdown

codecov Bot commented May 7, 2026

Codecov Report

❌ Patch coverage is 55.88235% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.09%. Comparing base (b35fc00) to head (6e61d3b).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
core/consensus/qbft/qbft.go 35.00% 13 Missing ⚠️
app/health/checks.go 75.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4512      +/-   ##
==========================================
- Coverage   57.11%   57.09%   -0.02%     
==========================================
  Files         245      245              
  Lines       32920    32965      +45     
==========================================
+ Hits        18801    18822      +21     
- Misses      11752    11770      +18     
- Partials     2367     2373       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Comment thread core/consensus/qbft/qbft.go Outdated
Comment thread core/consensus/qbft/qbft_internal_test.go Outdated
Comment thread app/health/select.go Outdated
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented May 7, 2026

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.

@KaloyanTanev KaloyanTanev merged commit 2b8f0f4 into main May 7, 2026
14 of 15 checks passed
@KaloyanTanev KaloyanTanev deleted the kalo/insufficient-round-change branch May 7, 2026 18:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants