feat(metrics): add block transaction count and SR set change monitoring #6624

Open
warku123 wants to merge 14 commits into tronprotocol:develop from warku123:metric_modify_clean

Conversation

warku123 commented on Apr 1, 2026

What does this PR do?

Add two new metrics enhancements for better network monitoring:

  1. Block Transaction Count Histogram

    • Replace tron:block_empty_total counter with tron:block_transaction_count histogram
    • Track transaction count distribution per block with buckets: [0, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10000]
    • Empty blocks can be monitored via the le=0.0 bucket (see the example query after this list)
  2. SR Set Change Monitoring

    • Add tron:sr_set_change counter to track SR membership changes
    • Detect SR additions and removals at each maintenance time interval
    • Labels: action (add/remove), witness (SR address)
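
For illustration (the query itself is an assumption, not part of this PR): a Prometheus histogram is exported as <name>_bucket time series with le labels, so the empty-block rate could be derived with something along the lines of sum(rate(tron:block_transaction_count_bucket{le="0.0"}[5m])).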

Changes:

  • Add overloaded init() method in MetricsHistogram for custom buckets (see the sketch after this list)
  • Add BLOCK_TRANSACTION_COUNT histogram metric
  • Add SR_SET_CHANGE counter with SR_ADD/SR_REMOVE labels
  • Implement SR set change detection logic in BlockChainMetricManager
  • Update PrometheusApiServiceTest with comprehensive test coverage
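
A minimal sketch of what the custom-bucket overload could look like, assuming the Prometheus simpleclient Histogram builder; the class shape and the container map are illustrative assumptions, not the actual java-tron code:

import io.prometheus.client.Histogram;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class MetricsHistogramSketch {
  private static final Map<String, Histogram> container = new ConcurrentHashMap<>();

  // Existing-style init: registers a histogram with the default buckets.
  static void init(String key, String help, String... labels) {
    container.put(key, Histogram.build()
        .name(key).help(help).labelNames(labels).register());
  }

  // Overload: identical registration, but with caller-supplied bucket boundaries,
  // e.g. new double[] {0, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10000}.
  static void init(String key, String help, double[] buckets, String... labels) {
    container.put(key, Histogram.build()
        .name(key).help(help).buckets(buckets).labelNames(labels).register());
  }
}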

Why are these changes required?

  • Transaction histogram provides richer insights than simple counter for network analysis while maintaining empty block detection capability
  • SR set change monitoring enables tracking of network consensus participant changes for governance analysis

Compared to #6602, this PR provides a cleaner implementation focused solely on metrics enhancement without additional complexity.

This PR has been tested by:

  • Unit Tests (updated PrometheusApiServiceTest)
  • Manual Testing

Follow up

N/A

Extra details

…monitoring

Replace the dedicated tron:block_empty_total counter with a more comprehensive
tron:block_transaction_count histogram that tracks the distribution of
transaction counts per block.

Changes:
- Add overloaded init() method in MetricsHistogram to support custom buckets
- Add BLOCK_TRANSACTION_COUNT histogram with buckets [0, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10000]
- Record transaction count for all blocks (including empty blocks with txCount=0)
- Empty blocks can be queried via bucket le=0.0
- Remove unused BLOCK_EMPTY counter

This provides richer insights for network analysis while still supporting
empty block monitoring via histogram bucket queries.

Closes tronprotocol#6590
warku123 changed the title from "feat(metrics): add block transaction count histogram for empty block monitoring" to "feat(metrics): add block transaction count and SR set change monitoring" on Apr 1, 2026
warku123 added a commit to warku123/java-tron that referenced this pull request Apr 2, 2026
Replace duplicated miner string literal with MINER_LABEL constant
to fix SonarQube code smell.

Relates tronprotocol#6624
warku123 added a commit to warku123/java-tron that referenced this pull request Apr 2, 2026
- Replace duplicated miner string literal with MINER_LABEL constant
  in PrometheusApiServiceTest to fix SonarQube code smell
- Add //NOSONAR comment for new BLOCK_TRANSACTION_COUNT histogram
  label name in MetricsHistogram

Relates tronprotocol#6624
warku123 force-pushed the metric_modify_clean branch from 4433c52 to 93d0083 on April 2, 2026 03:44
warku123 added a commit to warku123/java-tron that referenced this pull request Apr 2, 2026
Extract repeated 'miner' string literal into MINER_LABEL constant
in MetricsHistogram to follow DRY principle and fix SonarQube warning.

Also update PrometheusApiServiceTest to use MINER_LABEL constant
for test assertions.

Relates tronprotocol#6624
warku123 force-pushed the metric_modify_clean branch from 93d0083 to 6315ffc on April 2, 2026 03:54
Comment thread: common/src/main/java/org/tron/common/prometheus/MetricKeys.java
init(MetricKeys.Counter.TXS, "tron txs info .", "type", "detail");
init(MetricKeys.Counter.MINER, "tron miner info .", "miner", "type");
init(MetricKeys.Counter.BLOCK_FORK, "tron block fork info .", "type");
init(MetricKeys.Counter.SR_SET_CHANGE, "tron sr set change .", "action", "witness");

Collaborator

The witness label uses addresses. Although the Active SR pool is relatively limited (27 slots), as SRs rotate in and out over time, label values will accumulate monotonically — they only grow, never shrink. This is a known Prometheus anti-pattern.

Practical impact assessment:

  • TRON's SR candidate pool is not infinite, and rotation frequency is low, so it won't explode in the short term

  • However, as a long-running node metric, months or years of operation could still accumulate a significant number of time series

Possible approaches:

  • If the address dimension is truly needed, keep the current design, but document this characteristic

  • Or remove the witness label entirely, keeping only action (add/remove), and output address details via logs instead. Follow the principle that Prometheus should be used for alerting and awareness, while logs are for precise investigation and root-cause analysis.

Author

Thanks for raising the high-cardinality concern — it's a valid Prometheus best practice to watch for.

However, I believe the witness label is acceptable here given TRON's specific constraints:

  1. Bounded candidate pool: The SR candidate pool is finite. Even over years of operation, the total number of distinct addresses that have ever held an SR slot has a practical ceiling, far from the unbounded growth typical of high-cardinality anti-patterns.
  2. Low rotation frequency: SR set changes only occur at maintenance intervals (every 6 hours), so new label values are introduced very infrequently.
  3. Diagnostic value: The primary purpose of this metric is to identify which SR was added or removed. Dropping the witness label would reduce the metric to just "something changed", requiring operators to cross-reference logs and losing the at-a-glance visibility in Grafana dashboards that makes this metric useful in the first place.

In summary, the worst-case time series count is roughly candidate_pool_size × 2 (add/remove), which stays well within Prometheus's comfortable operating range.

That said, I agree it's worth documenting this characteristic. Could you suggest what form you'd prefer — a code comment at the metric registration site, a note in the PR description, or something else?

Collaborator

I'd prefer a code comment at the metric registration site.

Author

@halibobo1205 Added a comment at the SR_SET_CHANGE constant definition in MetricKeys.java (commit 96bf315) explaining the cardinality rationale:

witness label: bounded cardinality -- SR candidate pool is finite, rotation is infrequent (at most once per maintenance interval); kept for at-a-glance SR identification in dashboards rather than requiring log cross-referencing.

- Extract MINER_LABEL to MetricLabels.Histogram.MINER as shared constant
- Use int instead of double for block count in tests
- Store witness addresses as base58 directly, removing unnecessary hex round-trip

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
warku123 requested a review from halibobo1205 on April 9, 2026 02:46

bladehan1 (Collaborator) left a comment

Review: 2 findings on lastActiveWitnesses / lastNextMaintenanceTime state management.

private long failProcessBlockNum = 0;
@Setter
private String failProcessBlockReason = "";
private final Set<String> lastActiveWitnesses = ConcurrentHashMap.newKeySet();

Collaborator

[P1] Two issues with lastActiveWitnesses and lastNextMaintenanceTime

1) False thread-safety guarantee from ConcurrentHashMap.newKeySet()

applyBlock() is only called from TronNetDelegate.processBlock(), which holds the dbManager lock — a single-threaded path. More importantly, the compound operations in recordSrSetChange() (read → diff → clear → addAll) are not atomic even with a concurrent set. If concurrent access were ever needed, this implementation would silently produce incorrect diffs.

Suggest replacing with a plain HashSet:

// Only accessed from block processing thread
private final Set<String> lastActiveWitnesses = new HashSet<>();

2) In-memory state is not rolled back during fork switching

lastActiveWitnesses and lastNextMaintenanceTime are pure in-memory state that does not participate in the revoke/undo mechanism.

During Manager.switchFork():

  1. eraseBlock() rolls back the old chain — DB state (nextMaintenanceTime, activeWitnesses) is reverted via revokingStore.fastPop()
  2. Manager.applyBlock() applies new-chain blocks — DB state is rebuilt
  3. But this internal applyBlock does not go through TronNetDelegate.processBlock(), so metricsService.applyBlock() is never called and the in-memory state is not updated

Scenario — fork across a maintenance boundary:

  • Old chain passed maintenance: lastNextMaintenanceTime = T1, lastActiveWitnesses = {SR_old}
  • Fork switches back to a pre-maintenance chain: DB now has nextMaintenanceTime = T0 < T1
  • On the next block, metrics sees nextMaintenanceTime(T0) != lastNextMaintenanceTime(T1) → triggers SR change detection
  • But it compares lastActiveWitnesses (stale, from old chain) against the current store (new chain) → phantom SR changes or missed real changes

Suggest adding a rollback guard:

} else if (nextMaintenanceTime != lastNextMaintenanceTime) {
    if (nextMaintenanceTime < lastNextMaintenanceTime) {
        // Fork rollback detected — reinitialize instead of diffing
        lastActiveWitnesses.clear();
        lastActiveWitnesses.addAll(...current witnesses...);
        lastNextMaintenanceTime = nextMaintenanceTime;
        return;
    }
    // normal maintenance transition — diff and record
    ...
}

Collaborator

On second thought, adding a rollback guard here over-complicates the monitoring layer. The core issue is that SR set change detection is business logic monitoring — it depends on chain state transitions (maintenance boundaries, fork switching), which makes it fundamentally unsuitable for an in-process metric that cannot participate in the revoke mechanism.

A simpler and more robust approach would be to move SR set change detection off-chain: an external monitor reads the /metrics endpoint or listens to block events, compares consecutive maintenance periods' active witness lists, and emits its own alerts. This completely decouples monitoring from chain state management and avoids the fork/rollback problem entirely.

For the histogram (BLOCK_TRANSACTION_COUNT), in-chain observation is fine — it's a stateless per-block metric with no accumulated state to go stale. The issue is specific to SR_SET_CHANGE because it maintains cross-block state (lastActiveWitnesses, lastNextMaintenanceTime) that the chain's undo mechanism cannot reach.

Author

Thanks for the thorough analysis! Both points are valid — I've addressed them in the latest commit (7fdb3c5):

1) ConcurrentHashMap.newKeySet() → HashSet

Agreed. Since applyBlock() is only called from TronNetDelegate.processBlock() under synchronized (blockLock), using a concurrent set adds no real safety and gives a misleading impression. Replaced with a plain HashSet with a comment clarifying the single-threaded access pattern.

2) Fork rollback guard

Added a check for nextMaintenanceTime < lastNextMaintenanceTime — since maintenance time only increases monotonically under normal conditions, a decrease is a reliable signal that a fork rollback has occurred. In that case, we reinitialize the in-memory state from the current DB instead of diffing against stale data:

if (nextMaintenanceTime < lastNextMaintenanceTime) {
    // Fork rollback detected — reinitialize instead of diffing
    lastActiveWitnesses.clear();
    lastActiveWitnesses.addAll(currentWitnesses);
} else {
    recordSrSetChange(currentWitnesses);
}
lastNextMaintenanceTime = nextMaintenanceTime;

Regarding your second comment about moving SR set change detection off-chain entirely — I understand the concern that cross-block state cannot participate in the revoke mechanism, but I believe the rollback guard above addresses the root issue with minimal complexity. Removing the in-process metric would lose the at-a-glance visibility in Grafana dashboards that makes it useful for operators, and pushing the diff logic to an external monitor adds operational overhead (deploying/maintaining a separate component) for an edge case that is already handled. I'd prefer to keep it in-chain with this guard in place.

- Replace ConcurrentHashMap.newKeySet() with HashSet since applyBlock()
  is only called from a single-threaded path (synchronized on blockLock)
- Add fork rollback detection: when nextMaintenanceTime < lastNextMaintenanceTime,
  reinitialize in-memory state instead of diffing, preventing phantom SR changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
warku123 added a commit to warku123/tron-docker that referenced this pull request Apr 14, 2026
Add two new panel sections to the java-tron-server Grafana dashboard
to support metrics from tronprotocol/java-tron#6624:

- Block Metrics: empty block rate, per-SR empty block count,
  avg transactions per block, transaction count distribution
- SR Set Change: event table and cumulative totals by action

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
// Fork rollback detected — reinitialize instead of diffing
lastActiveWitnesses.clear();
lastActiveWitnesses.addAll(currentWitnesses);
} else {

Collaborator

If a fork rollback does occur, wouldn't the erroneous SR change events already have been written to the metrics?

Author

@bladehan1 @lvs0075 You are absolutely right: the rollback guard can only re-calibrate the in-memory baseline for subsequent blocks; it cannot retract Prometheus counters that have already been incremented. This is an inherent limitation because in-process metrics do not participate in the chain's revoking mechanism.

That said, this scenario — a fork rollback that crosses a maintenance boundary — is extremely rare in practice because TRON's maintenance interval is 6 hours and deep reorganizations of that magnitude are uncommon on the mainnet.

From an observability perspective, the recorded add/remove events are not necessarily "false positives" either: they reflect real SR set transitions that occurred on the abandoned chain at the time. When a fork rollback does cross a maintenance boundary, the chain genuinely experienced a different SR set during that (later invalidated) period. The metric captures that transient state change, even though the chain eventually switched to another fork.

In other words, SR_SET_CHANGE behaves like an event log of witnessed transitions rather than a guaranteed final-state snapshot. For operators, seeing an add followed by a remove (or vice versa) after a rollback is an indicator that a reorganization crossed the maintenance boundary.

This is a semantic trade-off. If the project prefers metrics to strictly mirror only the final canonical chain state, the only robust solution is to move SR change detection off-chain as @bladehan1 suggested. Otherwise, we can accept this limitation and document it clearly. Please let me know which direction the team prefers.

MetricLabels.Counter.TXS_SUCCESS, MetricLabels.Counter.TXS_SUCCESS);
}
// Record transaction count distribution for all blocks (including empty blocks)
Metrics.histogramObserve(MetricKeys.Histogram.BLOCK_TRANSACTION_COUNT, txCount,

Collaborator

[SHOULD] If the primary goal is empty block monitoring, a simple Counter or Gauge would suffice. A Histogram is better suited when transaction count distribution analysis is actually needed — it can always be upgraded later.

Author

Actually, we discussed this question in issue #6590 and ultimately decided to implement a Histogram instead of a simple Counter.

MetricLabels.Counter.TXS_SUCCESS, MetricLabels.Counter.TXS_SUCCESS);
}
// Record transaction count distribution for all blocks (including empty blocks)
Metrics.histogramObserve(MetricKeys.Histogram.BLOCK_TRANSACTION_COUNT, txCount,

Collaborator

The SR set change detection logic (reading WitnessScheduleStore, stream mapping, encode58Check encoding, HashSet diff) runs on every block regardless of whether metrics are enabled. The Metrics.enabled() guard inside counterInc()/histogramObserve() only prevents the final write — the upstream computation still executes unconditionally.

Suggest wrapping the entire SR detection block and the histogram observation (including the encode58Check call) with a Metrics.enabled() check to avoid unnecessary computation when metrics are disabled.

Author

Good catch. Wrapped the entire block (histogram observation + SR set change detection) with a Metrics.enabled() guard in commit 7791944. This avoids the DB reads, encode58Check encoding, stream mapping, and HashSet diff on every block when Prometheus is disabled.

Note that the existing metrics in the same method (e.g., MINER_LATENCY histogram + encode58Check at L151-152) have the same issue, but I'll leave those untouched to keep this PR scoped to the new metrics only.
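
For reference, a sketch of the shape of that guard (txCount, Metrics.enabled(), and histogramObserve come from the snippets quoted in this thread; the witness-address expression and the recordSrSetChange call site are assumptions):

// Wrapping the call site skips argument evaluation (DB reads, encode58Check,
// stream mapping, HashSet diff) entirely, not just the final Prometheus write.
if (Metrics.enabled()) {
  Metrics.histogramObserve(MetricKeys.Histogram.BLOCK_TRANSACTION_COUNT, txCount,
      StringUtil.encode58Check(witnessAddress));  // witnessAddress: assumed local byte[]
  recordSrSetChange(currentWitnesses);  // SR set diff runs only when metrics are on
}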

Wrap histogram observation and SR set change detection with
Metrics.enabled() guard to avoid unnecessary computation (DB reads,
encode58Check, stream mapping, HashSet diff) on every block when
metrics are not enabled.
warku123 requested a review from Little-Peony on April 27, 2026 03:03
The SR set change detection used to live in BlockChainMetricManager.applyBlock
and inferred SR set transitions from nextMaintenanceTime changes plus a
diff against a cached HashSet. That approach was indirect (re-derived
state already known at the source) and required a fork-rollback guard.

Hoist the recording into the consensus layer where the transition is
explicit: MaintenanceManager.doMaintenance already computes currentWits
and newWits when the active witness set changes, so it can call
SrSetChangeMetric.record(currentWits, newWits) directly. The new helper
lives in the common module so both consensus and framework can depend on
it without inverting the existing module hierarchy.

Tests: SrSetChangeMetricTest exercises SR_ADD via the real
doMaintenance() flow, plus the no-vote / metrics-disabled / SR_REMOVE
branches. PrometheusApiServiceTest drops the obsolete SR assertions.
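
A sketch of what the hoisted helper could look like (the record(currentWits, newWits) signature is from the commit message above; the ByteString element type, the SR_ADD/SR_REMOVE label constants, and the diff details are assumptions based on this thread):

import com.google.protobuf.ByteString;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public final class SrSetChangeMetric {

  private SrSetChangeMetric() {
  }

  public static void record(List<ByteString> currentWits, List<ByteString> newWits) {
    if (!Metrics.enabled()) {
      return;  // skip the set diff and base58 encoding entirely when metrics are off
    }
    Set<ByteString> before = new HashSet<>(currentWits);
    Set<ByteString> after = new HashSet<>(newWits);
    for (ByteString wit : after) {
      if (!before.contains(wit)) {  // present only in the new set: an added SR
        Metrics.counterInc(MetricKeys.Counter.SR_SET_CHANGE, 1,
            MetricLabels.Counter.SR_ADD, StringUtil.encode58Check(wit.toByteArray()));
      }
    }
    for (ByteString wit : before) {
      if (!after.contains(wit)) {  // present only in the old set: a removed SR
        Metrics.counterInc(MetricKeys.Counter.SR_SET_CHANGE, 1,
            MetricLabels.Counter.SR_REMOVE, StringUtil.encode58Check(wit.toByteArray()));
      }
    }
  }
}
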
Metrics.counterInc(MetricKeys.Counter.TXS, txCount,
MetricLabels.Counter.TXS_SUCCESS, MetricLabels.Counter.TXS_SUCCESS);
}
if (Metrics.enabled()) {

Contributor

I think this if (Metrics.enabled()) { is no longer needed; or should it be moved to the first line of this function?

Author

Thanks for the review. Two reasons I'd like to keep this guard where it is:

  1. Not redundant — the histogram label StringUtil.encode58Check(address) runs a sha256 checksum, and Java evaluates it before histogramObserve is even entered. The internal Metrics.enabled() check inside observe() only short-circuits the Prometheus call, not the argument evaluation. Wrapping the call site is what actually saves the per-block sha256 work when metrics are disabled.

  2. Can't move to the first line — the earlier part of applyBlock maintains witnessInfo / dupWitnessBlockNum, which are read by getDupWitness() to populate BlockChainInfo. That's non-metric state and must run regardless of Metrics.enabled(). An early return at the top of the function would skip the duplicate-witness bookkeeping.

So the guard is intentionally local to the histogram call.

Sunny6889 requested a review from halibobo1205 on April 27, 2026 08:07
warku123 requested a review from Sunny6889 on April 27, 2026 08:51


Development

Successfully merging this pull request may close these issues.

[Feature] Add Prometheus metrics for empty blocks and SR set changes

8 participants