[OTEL] Add OpenTelemetry observability support #285
royischoss wants to merge 36 commits into mlrun:development
Conversation
…R_SPACE and MLRUN_MODEL_ENDPOINT_MONITORING__STORE_PREFIXES__MONITORING_APPLICATION plus removes MLRUN_MODEL_ENDPOINT_MONITORING__ENDPOINT_STORE_CONNECTION
# Conflicts: # charts/mlrun-ce/Chart.yaml # charts/mlrun-ce/README.md # charts/mlrun-ce/requirements.lock # charts/mlrun-ce/values.yaml # tests/kind-test.sh
…ion accordingly. add request and limit for crdReadinessJob and namespaceLabelJob
# Conflicts: # charts/mlrun-ce/Chart.yaml # charts/mlrun-ce/README.md # charts/mlrun-ce/requirements.lock
…, change naming for otel metrics using metadata.name fieldRef
… empty templates, kubectl image

- Move the hardcoded OTel collector pipeline config into values.yaml under opentelemetry.collector.config, so users can override receivers, processors, and exporters without forking the chart. The Prometheus endpoint uses the short DNS name (prometheus-operated:9090), removing namespace interpolation from the helper.
- Add opentelemetry.kubectlImage to values.yaml (defaults to bitnami/kubectl:latest) and reference it in both crd-readiness-job.yaml and namespace-label.yaml instead of a hardcoded tag.
- Fix namespace-label.yaml: replace indent with nindent for correct YAML formatting; change restartPolicy: Never to OnFailure so the job retries on transient failures.
- Delete the empty collector.yaml and instrumentation.yaml template files that generated no resources and were misleading. Move their documentation comment into crd-readiness-job.yaml, where the actual CR creation happens.
- Replace the 50-line hardcoded collector manifest in _helpers.tpl with toYaml .Values.opentelemetry.collector.config | nindent 4.
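Under those changes, the relevant values.yaml section might look roughly like the sketch below. Only the opentelemetry.collector.config key path, the kubectlImage default, and the short Prometheus DNS name come from this PR; the pipeline contents (receivers, processors, exporter names) are illustrative defaults, not the chart's actual config.

```yaml
opentelemetry:
  kubectlImage: bitnami/kubectl:latest   # referenced by crd-readiness-job.yaml and namespace-label.yaml
  collector:
    config:                              # rendered via: toYaml ... | nindent 4
      receivers:
        otlp:
          protocols:
            grpc: {}
            http: {}
      processors:
        batch: {}
      exporters:
        otlphttp:
          # short DNS name, no namespace interpolation in the helper
          endpoint: http://prometheus-operated:9090/api/v1/otlp
      service:
        pipelines:
          metrics:
            receivers: [otlp]
            processors: [batch]
            exporters: [otlphttp]
```

Because the whole block lives under a single values key, users can override any part of the pipeline with a plain `-f my-values.yaml` instead of forking the chart.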
…ilience, package.sh

Issue 2: Bounded retry in CR installer (High)
The `until kubectl apply` loop in otel-cr-installer.yaml had no exit condition. On permanent failure (crashed operator, image pull error) it would spin silently until the 300s hook timeout killed it with a cryptic deadline error. Added a max_retries=30 counter (30 × 5s = 150s max) with a clear exit 1 and a pointer to the operator logs on exhaustion. Applied to both the Collector and Instrumentation CR apply loops.

Issue 3: Skip rollout restart on upgrade (Medium)
The otel-cr-installer hook runs on both post-install and post-upgrade. Previously it always restarted all pods labeled mlrun.io/otel=true, causing unnecessary Jupyter and Nuclio churn on every helm upgrade even when OTel was already running. Added an init-container presence check: if pods already have the opentelemetry init container injected, the restart is skipped. Pods are only restarted on fresh installs (where they started before the webhook was ready).

Issue 4: Document PYTHONPATH workaround (Medium)
The mlrun.api.extraEnvKeyValue.PYTHONPATH entry exists to prevent OTel's $(PYTHONPATH) expansion from resolving to an empty string. The ideal fix is adding instrumentation.opentelemetry.io/inject-python: "false" as a pod annotation to opt the MLRun API out of injection, but the upstream mlrun chart template hardcodes pod annotations with no podAnnotations values key. Improved the comment to document this constraint clearly so the next reader doesn't re-investigate.

Issue 5: Document Prometheus OTLP receiver (Medium)
enableFeatures: [otlp-write-receiver] and --web.enable-otlp-receiver are always set on the Prometheus sub-chart regardless of opentelemetry.collector.enabled; there is no Helm-native way to conditionally configure sub-chart values. Added a comment explaining the always-on behavior and why the increased attack surface is negligible (the OTLP endpoint shares the already-unauthenticated port 9090 with the Prometheus query API).
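The bounded-retry pattern from Issue 2 can be sketched as below. Names are illustrative, not the exact otel-cr-installer.yaml contents; the KUBECTL/RETRY_DELAY indirection exists only so the sketch runs without a cluster.

```shell
#!/usr/bin/env bash
# Sketch of the bounded-retry loop (Issue 2): cap attempts, fail loudly with
# a pointer to the operator logs instead of spinning until the hook timeout.
set -u

apply_with_retry() {
  local file=$1 max_retries=30 retries=0
  until ${KUBECTL:-kubectl} apply -f "$file"; do
    retries=$((retries + 1))
    if [ "$retries" -ge "$max_retries" ]; then
      echo "ERROR: applying $file failed after $max_retries attempts;" \
           "check the opentelemetry-operator logs" >&2
      return 1
    fi
    sleep "${RETRY_DELAY:-5}"
  done
  echo "applied $file after $retries retries"
}
```

With max_retries=30 and a 5s sleep this gives up after 150s, comfortably inside the 300s hook timeout, so the operator never has to kill the job with a deadline error.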
Issue 6: RBAC resources leak on uninstall (Medium)
All 5 RBAC resources (ClusterRole + ClusterRoleBinding for otel-crd-reader; ServiceAccount + Role + RoleBinding for otel-cr-creator) were annotated as Helm hooks with only the before-hook-creation delete policy. helm uninstall doesn't run hooks, so these cluster-scoped resources were never deleted; confirmed on the test cluster, where an old my-mlrun-otel-crd-reader from 11 days prior was still present. Removed all hook annotations; the resources are now regular Helm-managed objects and are deleted on uninstall. namespace-label.yaml moved from pre-install,pre-upgrade to post-install,post-upgrade (weight -10) so the regular RBAC resources exist before the labeling job runs.

Issue 7: package.sh version hardcoded + CRD slimming broken (Medium)
Two bugs in tests/package.sh:
1. The OTel operator .tgz filename was hardcoded as opentelemetry-operator-0.78.1.tgz in two places, so version bumps would silently skip the schema patch. Changed to read the version dynamically from requirements.yaml via Python.
2. The CRD slimming step replaced conf/crds/ templates with empty stubs (crds.create: false). This broke when we switched to Option B (crds.create: true): the stubs rendered nothing, so CRDs were never created on a fresh packaged-chart install. Removed the slimming step entirely. The full CRD YAML (~1.6 MB) compresses to ~160 KB gzipped in the Helm release Secret, well within the 3 MB Kubernetes limit.

Bash heredoc-in-until fix (discovered during cluster validation)
The Instrumentation CR apply originally used `until cat <<'EOF' | kubectl apply -f -; do`, a heredoc inside an until condition. This is unreliable: the heredoc content is consumed on the first evaluation, and subsequent retries pipe empty input to kubectl. Changed to write to /tmp/instrumentation-cr.yaml first (the same pattern as the Collector CR), which is unconditionally safe across bash versions.
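The temp-file pattern that replaces the heredoc-in-until can be sketched as below. The fake `apply` function stands in for `kubectl apply -f` (failing twice) so the sketch runs anywhere, and the CR body is a minimal illustrative stub, not the chart's real Instrumentation manifest.

```shell
#!/usr/bin/env bash
# Sketch: write the CR manifest to a temp file once, then retry applying that
# file. Every retry re-reads the full file, so no attempt ever sees empty input.
set -u

cr_file=$(mktemp)
cat > "$cr_file" <<'EOF'
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: example
EOF

attempts=0
apply() {
  attempts=$((attempts + 1))
  # every retry must see the full manifest, never empty input
  [ -s "$1" ] || { echo "BUG: empty manifest on attempt $attempts" >&2; exit 2; }
  [ "$attempts" -ge 3 ]   # stand-in for kubectl: succeed on the third try
}

until apply "$cr_file"; do
  sleep 0   # the real hook sleeps 5s between attempts
done
rm -f "$cr_file"
echo "applied on attempt $attempts"
```

Because the manifest lives in a file rather than in the `until` condition's stdin, the pattern behaves identically across bash versions.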
Tests
Added 7 new test cases to tests/helm-template-test.sh that catch all of the above regressions locally before a cluster install:
- RBAC: no helm.sh/hook annotations on any resource (catches the hook leak)
- RBAC: no before-hook-creation delete policy (same)
- Namespace-label: uses post-install,post-upgrade, not pre-install (catches the SA missing at hook time)
- CR installer: has max_retries, retries=0, and exit 1 (catches the infinite loop)
- CR installer: uses collector-cr.yaml and instrumentation-cr.yaml temp files (catches heredoc-in-until)
- CR installer: has the init-container check and skip message (catches unnecessary upgrade restarts)
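The checks above presumably follow a grep-over-rendered-output shape; a minimal sketch of that style is below. The real script greps `helm template` output, so the small rendered-manifest stand-in here, the helper name, and the labels are all assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch of a helm-template-test.sh style check: assert that forbidden
# patterns are absent from the rendered RBAC manifests (the hook-leak regressions).
set -u

# Stand-in for: rendered=$(helm template . --show-only templates/opentelemetry/rbac.yaml)
rendered=$(cat <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: my-mlrun-otel-crd-reader
EOF
)

failures=0
check_absent() {  # $1: pattern that must NOT appear, $2: test label
  if grep -q "$1" <<<"$rendered"; then
    echo "FAIL: $2"
    failures=$((failures + 1))
  else
    echo "PASS: $2"
  fi
}

check_absent 'helm.sh/hook'         'RBAC carries no hook annotations'
check_absent 'before-hook-creation' 'RBAC carries no before-hook-creation policy'
```

Running this against `helm template` output catches the regression on a laptop, with no cluster involved.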
    sleep 5
    done

    # Restart pods labeled mlrun.io/otel=true so they go through the webhook
Will this not increase installation time?
It will, because of the post-install auto-instrumentation injection. As discussed, I will add a comment documenting which pods get restarted and noting that enabling OTel increases installation time.
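The upgrade-time mitigation for this (the init-container presence check from Issue 3) can be sketched as below. The jsonpath, label, and messages are illustrative, not the exact hook script.

```shell
#!/usr/bin/env bash
# Sketch: skip the rollout restart when pods already carry the OTel init
# container, so only fresh installs pay the restart cost.
set -u

# Returns success when the pods' init containers already include the OTel one.
# In the real hook, $1 would come from something like:
#   kubectl get pods -l mlrun.io/otel=true \
#     -o jsonpath='{.items[*].spec.initContainers[*].name}'
otel_already_injected() {
  case "$1" in
    *opentelemetry*) return 0 ;;
    *)               return 1 ;;
  esac
}

if otel_already_injected "opentelemetry-auto-instrumentation-python istio-init"; then
  action="skip-restart"
else
  action="rollout-restart"
fi
echo "$action"
```

On upgrades the webhook was already running when the pods started, so the init container is present and the check short-circuits the restart; only fresh installs fall through to the restart branch.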
| - "" | ||
| resources: | ||
| - pods | ||
| - namespaces |
Why do we need `namespaces` here?
    # The PYTHONPATH workaround is the accepted solution unless the upstream chart adds
    # podAnnotations support. Keep this in sync with the mlrun image's internal PYTHONPATH.
    extraEnvKeyValue:
      PYTHONPATH: "/mlrun/server/py:/mlrun/server/py/schemas/proto"
If I understand correctly, this overwrites the value from the image we build in MLRun. Please add a link to the Dockerfile in MLRun, since this line will need to be updated whenever that changes.
    # on each CRD before applying CRs, so there is no race condition on fresh install.
    # CRDs are automatically kept in sync with the operator chart version on every upgrade.
    crds:
      create: true
If `enabled` is false, why create the CRDs? Or does the OTel chart skip creating the CRDs when it is not enabled?
The sub-chart is disabled when `opentelemetry-operator.enabled: false`, so this is a no-op.
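For context, the dependency entry behind that behavior presumably looks something like the sketch below. The version matches this PR; the repository URL and the exact condition key are assumptions.

```yaml
# charts/mlrun-ce/Chart.yaml (hypothetical excerpt)
dependencies:
  - name: opentelemetry-operator
    version: 0.78.1
    repository: https://open-telemetry.github.io/opentelemetry-helm-charts
    condition: opentelemetry-operator.enabled   # sub-chart off => crds.create is never evaluated
```

When the condition is false, Helm skips rendering the entire sub-chart, including its CRD templates, which is why `crds.create: true` has no effect on a disabled install.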
Adds OTel-based observability to MLRun CE with automatic Python instrumentation, deployment-mode metrics collection, and Prometheus integration.
https://iguazio.atlassian.net/browse/CEML-685
What's implemented
OTel operator sub-chart
- opentelemetry-operator v0.78.1 added as an optional dependency
- crds.create: true — the sub-chart manages OTel CRDs directly; the full CRD YAML (~1.6 MB) compresses to ~160 KB gzipped in the Helm release Secret, well within the 3 MB Kubernetes API limit
- Injection is scoped to namespaces labeled opentelemetry.io/inject=enabled, avoiding injection into unrelated namespaces

templates/opentelemetry/
namespace-label.yaml — post-install,post-upgrade hook (weight -10) that labels and annotates the release namespace:
otel-cr-installer.yaml — post-install,post-upgrade hook (weight 10) that:
rbac.yaml — regular Helm-managed resources (no hook annotations): ServiceAccount, ClusterRole, ClusterRoleBinding, Role, RoleBinding for the installer job. Being regular resources means they are cleaned up on helm uninstall.
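Putting those pieces together, the namespace-label hook presumably reduces to something like the sketch below. The hook annotations, restartPolicy, image key, and label come from this PR; the rest (names, command shape) is illustrative and most scaffolding is trimmed.

```yaml
# Hypothetical shape of namespace-label.yaml (sketch, not the chart's template)
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-namespace-label
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-weight": "-10"
spec:
  template:
    spec:
      restartPolicy: OnFailure   # retries on transient failures (was Never)
      containers:
        - name: label
          image: {{ .Values.opentelemetry.kubectlImage }}
          command:
            - kubectl
            - label
            - namespace
            - {{ .Release.Namespace }}
            - opentelemetry.io/inject=enabled
            - --overwrite
```

Running at weight -10 places it before the otel-cr-installer hook (weight 10), and as a post-install hook it runs after the regular Helm-managed RBAC resources exist.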
Metrics pipeline: push model (OTLP to Prometheus)
Instrumentation CR (Python auto-instrumentation)
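The Instrumentation CR for Python auto-instrumentation presumably follows the standard opentelemetry.io/v1alpha1 shape sketched below. The metadata name and exporter endpoint are placeholders, not the chart's actual values.

```yaml
# Hypothetical Instrumentation CR (sketch; applied by otel-cr-installer.yaml)
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: mlrun-instrumentation    # placeholder name
spec:
  exporter:
    # placeholder; the chart's collector service name and port may differ
    endpoint: http://otel-collector:4318
  propagators:
    - tracecontext
    - baggage
  python:
    env:
      - name: OTEL_EXPORTER_OTLP_PROTOCOL
        value: http/protobuf
```

Pods in the labeled namespace (and opted in via the webhook) get the Python auto-instrumentation init container injected, which is exactly what the otel-cr-installer restart logic waits for.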
Nuclio function pods
MLRun API — PYTHONPATH workaround
tests/package.sh
Admin / non-admin split
Generated with Claude Code