feat(server): add Prometheus metrics infrastructure and gRPC/HTTP request metrics#920

Open
sjenning wants to merge 2 commits into NVIDIA:main from sjenning:feat/server-prometheus-metrics

Conversation

Contributor

@sjenning sjenning commented Apr 22, 2026

Summary

Add Prometheus metrics support to the openshell-server with a dedicated /metrics endpoint on a configurable port, and instrument gRPC and HTTP request handlers with request count and duration histogram metrics.

Related Issue

Starts work on #909

Changes

  • Add metrics and metrics-exporter-prometheus workspace dependencies
  • Add metrics_bind_address config field and --metrics-port / OPENSHELL_METRICS_PORT CLI flag
  • Stand up a dedicated Axum metrics server with a /metrics Prometheus scrape endpoint
  • Instrument the gRPC/HTTP multiplex service with openshell_grpc_requests_total, openshell_grpc_request_duration_seconds, openshell_http_requests_total, and openshell_http_request_duration_seconds metrics
  • Normalize HTTP paths and extract gRPC method names to keep metric cardinality bounded
  • Add port conflict validation between --port, --health-port, and --metrics-port
  • Update Helm chart (service, statefulset, values) to expose the metrics port (default 9090)

Testing

  • mise run pre-commit passes
  • Unit tests added for grpc_method_from_path, grpc_status_from_response, and normalize_http_path
  • E2E tests added/updated (if applicable)
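The tested helpers named above aren't shown in this excerpt; below is a minimal std-only sketch of what they might look like. The signatures, label strings, and path scheme are assumptions for illustration, not the PR's actual code (the PR names the third helper `grpc_status_from_response`; the variant here takes the raw trailer value instead).

```rust
/// Hypothetical sketch: collapse request paths into a small fixed set of
/// label values so Prometheus series cardinality stays bounded.
fn normalize_http_path(path: &str) -> &'static str {
    match path {
        "/" => "/",
        "/healthz" => "/healthz",
        p if p.starts_with("/v1/") => "/v1/*",
        _ => "other",
    }
}

/// Hypothetical sketch: a gRPC request path has the shape
/// "/package.Service/Method"; take the last segment as the `method` label.
fn grpc_method_from_path(path: &str) -> Option<&str> {
    path.rsplit_once('/')
        .map(|(_, method)| method)
        .filter(|m| !m.is_empty())
}

/// Hypothetical sketch: map a `grpc-status` trailer value to a `code`
/// metric label; the exact label strings here are invented.
fn grpc_code_label(grpc_status: Option<&str>) -> &'static str {
    match grpc_status {
        Some("0") => "OK",
        Some("1") => "Cancelled",
        Some("4") => "DeadlineExceeded",
        Some("16") => "Unauthenticated",
        Some(_) => "Other",
        None => "Unknown", // no trailer observed; don't guess a real code
    }
}

fn main() {
    assert_eq!(normalize_http_path("/v1/sessions/abc123"), "/v1/*");
    assert_eq!(grpc_method_from_path("/openshell.v1.Shell/Exec"), Some("Exec"));
    assert_eq!(grpc_code_label(Some("0")), "OK");
    println!("ok");
}
```

The key property unit tests would pin down is that every possible input maps into a finite label set, so a client probing random URLs cannot inflate the number of time series.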

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

…uest metrics

Add metrics exposition via a dedicated server port (--metrics-port,
default disabled) following the same optional-listener pattern as the
health port. The metrics crate facade records counters and histograms
in the MultiplexedService hot path, and a PrometheusHandle renders them
at GET /metrics on the dedicated port.

Metrics added:
- openshell_grpc_requests_total (counter: method, code)
- openshell_grpc_request_duration_seconds (histogram: method, code)
- openshell_http_requests_total (counter: path, status)
- openshell_http_request_duration_seconds (histogram: path, status)

Expose the Prometheus metrics endpoint in the Helm chart by adding
service.metricsPort (default 9090) to values, the --metrics-port arg
and container port to the statefulset, and a metrics service port.
Set metricsPort to 0 to disable.
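As a rough illustration of the chart change described above, the values fragment might look like the following. Only `service.metricsPort`, its default of 9090, and the 0-to-disable behavior come from the PR text; the surrounding layout is a guess.

```yaml
service:
  # Port the dedicated Prometheus /metrics listener is exposed on.
  # Set to 0 to disable the metrics listener entirely.
  metricsPort: 9090
```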
@sjenning sjenning requested a review from a team as a code owner April 22, 2026 19:11

copy-pr-bot Bot commented Apr 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment on lines +257 to +262

```rust
if args.health_port != 0 && args.health_port == args.metrics_port {
    return Err(miette::miette!(
        "--health-port and --metrics-port must be different (both set to {})",
        args.health_port
    ));
}
```
Collaborator


Could the health and metrics port be the same? This way we have fewer addresses to reason about.

Contributor Author

@sjenning sjenning Apr 22, 2026


It definitely could be. My thinking was that if we wanted to TLS protect the metrics endpoint in the future, it might be nice if it were separate.

Plus we just named the health port ... health 😅 vs "assorted unauth'ed stuff" port.

I'm also not a fan of adding a bunch of ports but it leads to a straightforward and flexible implementation here.

Collaborator

@TaylorMutch TaylorMutch Apr 22, 2026


> My thinking was that if we wanted to TLS protect the metrics endpoint in the future, it might be nice if it were separate.

I think if that's true, likely we would be adding TLS to both endpoints right? My take is also that health and metrics handlers are usually grouped together. I would be fine doing that grouping here just to keep the surface area smaller here; if we later find that it's necessary to have a separate bind for some operational reason we can add it. WDYT?

Comment on lines +214 to +215

```rust
counter!("openshell_grpc_requests_total", "method" => method.clone(), "code" => code.clone()).increment(1);
histogram!("openshell_grpc_request_duration_seconds", "method" => method, "code" => code).record(elapsed);
```
Collaborator


I'm new to the Rust implementation of Prometheus metrics, but the Go implementation has a nice naming convention that matches the Prometheus data model (name, namespace, subsystem).

I suppose https://docs.rs/metrics/latest/metrics/ indicates there's no naming convention we need to stick to, and the Prometheus metrics exporter handles the export.

So this is just food for thought as we add more instrumentation in the future.

Contributor Author


Now that you mention this, we might want to use openshell_server_* here to distinguish it from metrics from other openshell components (supervisor, etc) 🤔

@TaylorMutch TaylorMutch self-assigned this Apr 22, 2026