feat(server): add Prometheus metrics infrastructure and gRPC/HTTP request metrics #920
sjenning wants to merge 2 commits into NVIDIA:main
Conversation
Add metrics exposition via a dedicated server port (--metrics-port, default disabled), following the same optional-listener pattern as the health port. The metrics crate facade records counters and histograms in the MultiplexedService hot path, and a PrometheusHandle renders them at GET /metrics on the dedicated port.

Metrics added:
- openshell_grpc_requests_total (counter: method, code)
- openshell_grpc_request_duration_seconds (histogram: method, code)
- openshell_http_requests_total (counter: path, status)
- openshell_http_request_duration_seconds (histogram: path, status)

Expose the Prometheus metrics endpoint in the Helm chart by adding service.metricsPort (default 9090) to values, the --metrics-port arg and container port to the statefulset, and a metrics service port. Set metricsPort to 0 to disable.
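The "optional listener, disabled by default" behavior described above can be sketched as follows. This is a minimal stdlib-only illustration; `metrics_bind_address` is a hypothetical helper name chosen for this example, not necessarily what the PR uses:

```rust
use std::net::{Ipv4Addr, SocketAddr};

/// Hypothetical helper illustrating the optional-listener pattern:
/// a metrics port of 0 means the listener is disabled, matching the
/// health-port behavior described above.
fn metrics_bind_address(port: u16) -> Option<SocketAddr> {
    // `bool::then` yields Some(addr) only when the port is non-zero.
    (port != 0).then(|| SocketAddr::from((Ipv4Addr::UNSPECIFIED, port)))
}
```

With this shape, the server simply skips spawning the metrics listener when the option is `None`.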
```rust
if args.health_port != 0 && args.health_port == args.metrics_port {
    return Err(miette::miette!(
        "--health-port and --metrics-port must be different (both set to {})",
        args.health_port
    ));
}
```
Could the health and metrics port be the same? This way we have fewer addresses to reason about.
It definitely could be. My thinking was that if we wanted to TLS protect the metrics endpoint in the future, it might be nice if it were separate.
Plus we just named the health port ... health 😅 vs "assorted unauth'ed stuff" port.
I'm also not a fan of adding a bunch of ports but it leads to a straightforward and flexible implementation here.
> My thinking was that if we wanted to TLS protect the metrics endpoint in the future, it might be nice if it were separate.
I think if that's true, likely we would be adding TLS to both endpoints right? My take is also that health and metrics handlers are usually grouped together. I would be fine doing that grouping here just to keep the surface area smaller here; if we later find that it's necessary to have a separate bind for some operational reason we can add it. WDYT?
```rust
counter!("openshell_grpc_requests_total", "method" => method.clone(), "code" => code.clone()).increment(1);
histogram!("openshell_grpc_request_duration_seconds", "method" => method, "code" => code).record(elapsed);
```
I'm new to the Rust implementation for prometheus metrics, but in the Go implementation there is a nice naming convention which matches the data model for Prometheus metrics (name, namespace, subsystem).
I suppose https://docs.rs/metrics/latest/metrics/ indicates there's no naming convention we need to stick to, and the prometheus metrics exporter handles the export.
So I suppose this is just food for thought for later as we implement instrumentations in the future.
Now that you mention this, we might want to use openshell_server_* here to distinguish it from metrics from other openshell components (supervisor, etc) 🤔
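If the openshell_server_* prefix is adopted, one way to keep names consistent (loosely mirroring the Go namespace/subsystem convention mentioned above) is a small name-building helper. This is an illustrative sketch only, not code from the PR; the constant and function names are invented for the example:

```rust
/// Hypothetical component namespace for server-side metrics.
const METRIC_NAMESPACE: &str = "openshell_server";

/// Build a full metric name from subsystem and name parts, e.g.
/// ("grpc", "requests_total") -> "openshell_server_grpc_requests_total".
fn metric_name(subsystem: &str, name: &str) -> String {
    format!("{METRIC_NAMESPACE}_{subsystem}_{name}")
}
```

Centralizing the prefix this way would make a later rename (server vs. supervisor vs. other components) a one-line change.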
Summary
Add Prometheus metrics support to the openshell-server with a dedicated /metrics endpoint on a configurable port, and instrument gRPC and HTTP request handlers with request count and duration histogram metrics.

Related Issue
Starts work on #909
Changes
- Add metrics and metrics-exporter-prometheus workspace dependencies
- Add metrics_bind_address config field and --metrics-port / OPENSHELL_METRICS_PORT CLI flag
- Add /metrics Prometheus scrape endpoint
- Add openshell_grpc_requests_total, openshell_grpc_request_duration_seconds, openshell_http_requests_total, and openshell_http_request_duration_seconds metrics
- Validate that --port, --health-port, and --metrics-port are distinct

Testing
- mise run pre-commit passes
- Unit tests for grpc_method_from_path, grpc_status_from_response, and normalize_http_path

Checklist
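For reference, two of the label-extraction helpers named in the testing notes might look roughly like this. This is a hedged sketch under the assumption that the "method" label comes from the gRPC path and that numeric path segments are collapsed to keep the "path" label low-cardinality; the actual implementations in the PR may differ (grpc_status_from_response is omitted here since it would need response types):

```rust
/// Extract the method label from a gRPC path like "/pkg.Service/Method".
/// Falls back to the full path if there is no '/' separator.
fn grpc_method_from_path(path: &str) -> &str {
    path.rsplit('/').next().unwrap_or(path)
}

/// Collapse purely numeric path segments (e.g. resource IDs) to ":id"
/// so the "path" label on HTTP metrics stays bounded.
fn normalize_http_path(path: &str) -> String {
    path.split('/')
        .map(|seg| {
            if !seg.is_empty() && seg.chars().all(|c| c.is_ascii_digit()) {
                ":id"
            } else {
                seg
            }
        })
        .collect::<Vec<_>>()
        .join("/")
}
```

Keeping label cardinality bounded matters because each distinct label combination creates a separate Prometheus time series.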