Skip to content

ZOOKEEPER-5036: Add Timeline Metrics Provider for external metrics collection systems#2371

Open
himanshumaurya09876 wants to merge 2 commits intoapache:masterfrom
himanshumaurya09876:zookeeper-5036
Open

ZOOKEEPER-5036: Add Timeline Metrics Provider for external metrics collection systems#2371
himanshumaurya09876 wants to merge 2 commits intoapache:masterfrom
himanshumaurya09876:zookeeper-5036

Conversation

@himanshumaurya09876
Copy link
Copy Markdown

Added a new Timeline Metrics Provider that enables ZooKeeper to export metrics to external Timeline-based metrics collection systems such as Apache Ambari Metrics Collector and YARN Timeline Service.
Validated it using Ambari Metrics, I was able to receive Zookeeper Metrics in Ambari and visualise it on Grafana dashboards.
Jira :- ZOOKEEPER-5036

@himanshumaurya09876
Copy link
Copy Markdown
Author

/build

@himanshumaurya09876
Copy link
Copy Markdown
Author

Hi @anmolnar @PDavid
Kindly review and merge this PR.

Copy link
Copy Markdown
Contributor

@PDavid PDavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks, looks really nice already. 👍

I really like the clean separation of concerns: Provider/Sink/Snapshot are well-decomposed. 💪

Can you please add unit tests for the new classes?

Also, in order to be able to load the metrics provider when running from the source clone, we have to add it to the classpath. See:

<parent>
<groupId>org.apache.zookeeper</groupId>
<artifactId>zookeeper-metrics-providers</artifactId>
<version>3.9.3</version>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this match with other projects to refer to 3.10.0-SNAPSHOT?

Suggested change
<version>3.9.3</version>
<version>3.10.0-SNAPSHOT</version>

This also seems to cause Maven compilation problem.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @PDavid
I will make these changes

/**
* Returns all counter metrics in this snapshot.
*
* @return an unmodifiable view of the counters map
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This claims to return "an unmodifiable view" but return the raw HashMap directly. Should be wrapped with Collections.unmodifiableMap() or remove this claim here?

The same applies to getGauges(), getSummaries().

Comment on lines +393 to +395
* <p>This context reuses existing metric implementations from zookeeper-server
* (SimpleCounter, AvgMinMaxCounter, etc.) to ensure consistent behavior with
* other metrics providers.</p>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that you reused the existing metric classes (SimpleCounter, AvgMinMaxCounter, etc.) 👍

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thankyou @PDavid

Comment on lines +157 to +160
LOG.warn("Timeline sink class not found: {}. Timeline metrics will be disabled. "
+ "To enable Timeline metrics, ensure the sink implementation JAR is available on the classpath. "
+ "ZooKeeper will continue to operate normally without Timeline metrics.", sinkClassName);
this.sink = null;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I see when the sink class isn't found, the provider logs a warning but continues. It can be like that but PrometheusMetricsProvider throws an Exception when misconfigured and ZooKeeper will fail to start. Would it maybe make sense to be consistent in behavior? What do you all think?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point about consistency. I deliberately chose the "log warning and continue" approach for Timeline metrics because:

  1. Different Architecture: Prometheus requires an embedded HTTP server (Jetty) to function - if that fails, the whole feature is broken. Timeline just needs a scheduler and pushes metrics out, so even without a sink, ZooKeeper's core functionality isn't affected.
  2. External Dependency Model: Timeline's sink is intentionally designed to be loaded from external JARs (not compile-time dependencies). This allows ZooKeeper to ship without any Timeline-specific dependencies, and users add the sink JAR only when needed. If we fail hard on missing sink, ZooKeeper won't start even though the user might not care about Timeline metrics at all.
  3. Graceful Degradation: The current behavior allows a deployment to work even if:
    • The sink JAR isn't deployed yet
    • There's a temporary classpath issue
    • Timeline metrics are configured but not currently needed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks, I completely see your points. And I really like that ZK not need to depend on Timeline-specific dependencies. 👍

I was just thinking that if I explicitly enable the timeline metrics provider but forgot to add the sink to the classpath then it would be harder for me to realize that the metrics are not sent and I have a problem when there is only a warn in the log compared to when ZK fails to start. 🤔

But I can see that having graceful degradation can be more useful.

@himanshumaurya09876 himanshumaurya09876 force-pushed the zookeeper-5036 branch 2 times, most recently from 9b8cb87 to cf9fb2b Compare April 19, 2026 11:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants