feat(server): add Prometheus metrics infrastructure and gRPC/HTTP request metrics#920

Open
sjenning wants to merge 2 commits into NVIDIA:main from sjenning:feat/server-prometheus-metrics

Conversation

Contributor

@sjenning sjenning commented Apr 22, 2026

Summary

Add Prometheus metrics support to the openshell-server with a dedicated /metrics endpoint on a configurable port, and instrument gRPC and HTTP request handlers with request count and duration histogram metrics.

Related Issue

Starts work on #909

Changes

  • Add metrics and metrics-exporter-prometheus workspace dependencies
  • Add metrics_bind_address config field and --metrics-port / OPENSHELL_METRICS_PORT CLI flag
  • Stand up a dedicated Axum metrics server with a /metrics Prometheus scrape endpoint
  • Instrument the gRPC/HTTP multiplex service with openshell_grpc_requests_total, openshell_grpc_request_duration_seconds, openshell_http_requests_total, and openshell_http_request_duration_seconds metrics
  • Normalize HTTP paths and extract gRPC method names to keep metric cardinality bounded
  • Add port conflict validation between --port, --health-port, and --metrics-port
  • Update Helm chart (service, statefulset, values) to expose the metrics port (default 9090)

Testing

  • mise run pre-commit passes
  • Unit tests added for grpc_method_from_path, grpc_status_from_response, and normalize_http_path
  • E2E tests added/updated (if applicable)
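The tested helpers named above aren't shown in this excerpt; below is a minimal std-only sketch of what they might look like. The signatures, label strings, and path scheme are assumptions for illustration, not the PR's actual code (the PR names the third helper `grpc_status_from_response`; the variant here takes the raw trailer value instead).

```rust
/// Hypothetical sketch: collapse request paths into a small fixed set of
/// label values so Prometheus series cardinality stays bounded.
fn normalize_http_path(path: &str) -> &'static str {
    match path {
        "/" => "/",
        "/healthz" => "/healthz",
        p if p.starts_with("/v1/") => "/v1/*",
        _ => "other",
    }
}

/// Hypothetical sketch: a gRPC request path has the shape
/// "/package.Service/Method"; take the last segment as the `method` label.
fn grpc_method_from_path(path: &str) -> Option<&str> {
    path.rsplit_once('/')
        .map(|(_, method)| method)
        .filter(|m| !m.is_empty())
}

/// Hypothetical sketch: map a `grpc-status` trailer value to a `code`
/// metric label; the exact label strings here are invented.
fn grpc_code_label(grpc_status: Option<&str>) -> &'static str {
    match grpc_status {
        Some("0") => "OK",
        Some("1") => "Cancelled",
        Some("4") => "DeadlineExceeded",
        Some("16") => "Unauthenticated",
        Some(_) => "Other",
        None => "Unknown", // no trailer observed; don't guess a real code
    }
}

fn main() {
    assert_eq!(normalize_http_path("/v1/sessions/abc123"), "/v1/*");
    assert_eq!(grpc_method_from_path("/openshell.v1.Shell/Exec"), Some("Exec"));
    assert_eq!(grpc_code_label(Some("0")), "OK");
    println!("ok");
}
```

The key property unit tests would pin down is that every possible input maps into a finite label set, so a client probing random URLs cannot inflate the number of time series.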

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

…uest metrics

Add metrics exposition via a dedicated server port (--metrics-port,
default disabled) following the same optional-listener pattern as the
health port. The metrics crate facade records counters and histograms
in the MultiplexedService hot path, and a PrometheusHandle renders them
at GET /metrics on the dedicated port.

Metrics added:
- openshell_grpc_requests_total (counter: method, code)
- openshell_grpc_request_duration_seconds (histogram: method, code)
- openshell_http_requests_total (counter: path, status)
- openshell_http_request_duration_seconds (histogram: path, status)

Expose the Prometheus metrics endpoint in the Helm chart by adding
service.metricsPort (default 9090) to values, the --metrics-port arg
and container port to the statefulset, and a metrics service port.
Set metricsPort to 0 to disable.
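As a rough illustration of the chart change described above, the values fragment might look like the following. Only `service.metricsPort`, its default of 9090, and the 0-to-disable behavior come from the PR text; the surrounding layout is a guess.

```yaml
service:
  # Port the dedicated Prometheus /metrics listener is exposed on.
  # Set to 0 to disable the metrics listener entirely.
  metricsPort: 9090
```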
@sjenning sjenning requested a review from a team as a code owner April 22, 2026 19:11

copy-pr-bot Bot commented Apr 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment on lines +257 to +262

```rust
if args.health_port != 0 && args.health_port == args.metrics_port {
    return Err(miette::miette!(
        "--health-port and --metrics-port must be different (both set to {})",
        args.health_port
    ));
}
```
Collaborator


Could the health and metrics port be the same? This way we have fewer addresses to reason about.

Contributor Author

@sjenning sjenning Apr 22, 2026


It definitely could be. My thinking was that if we wanted to TLS protect the metrics endpoint in the future, it might be nice if it were separate.

Plus we just named the health port ... health 😅 vs "assorted unauth'ed stuff" port.

I'm also not a fan of adding a bunch of ports but it leads to a straightforward and flexible implementation here.

Collaborator

@TaylorMutch TaylorMutch Apr 22, 2026


> My thinking was that if we wanted to TLS protect the metrics endpoint in the future, it might be nice if it were separate.

I think if that's true, likely we would be adding TLS to both endpoints right? My take is also that health and metrics handlers are usually grouped together. I would be fine doing that grouping here just to keep the surface area smaller here; if we later find that it's necessary to have a separate bind for some operational reason we can add it. WDYT?

Comment on lines +214 to +215

```rust
counter!("openshell_grpc_requests_total", "method" => method.clone(), "code" => code.clone()).increment(1);
histogram!("openshell_grpc_request_duration_seconds", "method" => method, "code" => code).record(elapsed);
```
Collaborator


I'm new to the Rust implementation of Prometheus metrics, but the Go implementation has a nice naming convention that matches the Prometheus data model (name, namespace, subsystem).

I suppose https://docs.rs/metrics/latest/metrics/ indicates there's no naming convention we need to stick to, and the Prometheus metrics exporter handles the export.

So this is just food for thought as we add more instrumentation in the future.

Contributor Author


Now that you mention this, we might want to use openshell_server_* here to distinguish it from metrics from other openshell components (supervisor, etc) 🤔

@TaylorMutch TaylorMutch self-assigned this Apr 22, 2026