From 9a10495e6f1e4670cfc83b2b96a6e2d34b90e59e Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Tue, 21 Apr 2026 01:11:28 +0000 Subject: [PATCH 01/27] chore: bump model-engine image tag to latest main (2e9d0078) Co-Authored-By: Claude Sonnet 4.6 --- charts/model-engine/values_sample.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/charts/model-engine/values_sample.yaml b/charts/model-engine/values_sample.yaml index 301fca89..eee15c54 100644 --- a/charts/model-engine/values_sample.yaml +++ b/charts/model-engine/values_sample.yaml @@ -24,7 +24,7 @@ celery_broker_type_redis: null # - ALL # tag [required] is the LLM Engine docker image tag -tag: e360bfb1d21d9d4e7b7fcb6b29ca752095b4d0f4 +tag: 2e9d00786419ef44ec5c9d3305d8d6451d6aabfb # context is a user-specified deployment tag. Can be used to context: production image: From 4332cada6856afc237551959a9596bfa4a116fe2 Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Wed, 29 Apr 2026 00:07:23 +0000 Subject: [PATCH 02/27] docs: add SEV1 post-mortem for model-engine 500s incident (MLI-6574) Documents the Apr 24-25 outage caused by deploying an ORM model referencing a missing DB column, including timeline, root cause, impact, and follow-up action items. 
Co-Authored-By: Claude Sonnet 4.6 --- .../post-mortem-mli-6574-model-engine-500s.md | 78 +++++++++++++++++++ 1 file changed, 78 insertions(+) create mode 100644 docs/internal/post-mortem-mli-6574-model-engine-500s.md diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md new file mode 100644 index 00000000..659333c9 --- /dev/null +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -0,0 +1,78 @@ +# SEV1 Post-Mortem: model-engine Deployment Caused 500s on All Endpoint Operations + +**Incident ID:** MLI-6574 +**Severity:** SEV1 +**Date:** Apr 24–25, 2026 +**Duration:** ~58 min (23:50Z deployment → 00:48Z rollback) +**Customer-facing impact:** ~47 min of 500s (23:50Z – 00:37Z PagerDuty alert; resolved 00:48Z) +**Status:** Resolved + +--- + +## Summary + +A `kubectl set image` at 23:50Z on Apr 24 deployed a model-engine image that referenced a new ORM field (`endpoints.temporal_task_queue`) which did not exist in the production database. Every `list_model_endpoints` and `get_model_endpoint` call returned 500, blocking all endpoint deployments and operations for all users. Five patch attempts over ~34 minutes failed to address the root cause before a rollback to the previous image resolved the incident. At least one user missed a project delivery deadline. 
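The failure mode is easy to reproduce in miniature. The sketch below uses stdlib SQLite in place of PostgreSQL and a raw query in place of SQLAlchemy's generated `SELECT`; the table and column names follow the incident, everything else is illustrative:

```python
import sqlite3

# Simulate schema drift: the live table lacks a column the application expects.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE endpoints (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO endpoints (name) VALUES ('my-endpoint')")

# An ORM names every declared column in the SELECT it generates -- including
# one that no migration ever added -- so every such query fails, not just
# queries that touch the new feature.
try:
    conn.execute("SELECT id, name, temporal_task_queue FROM endpoints")
    err_msg = None
except sqlite3.OperationalError as exc:
    err_msg = str(exc)

print(err_msg)  # no such column: temporal_task_queue
```

This is why the blast radius was universal: the error occurs at column resolution, before any row is read.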
+ +--- + +## Timeline + +| Time (UTC) | Event | +|---|---| +| Feb 27, 05:54Z | Stable image `f395ffa6…` deployed; ran cleanly for ~57 days | +| **Apr 24, 23:50:25Z** | **`kubectl set image` → `04729cef…` (rev 293); ORM references missing `temporal_task_queue` column; all endpoint list/get → 500** | +| 23:55:43Z | `-internal` tag rolled (rev 294); same bug, same 500s | +| Apr 25, 00:09:32Z | `-patch` (rev 296) — does not fix missing column | +| 00:15:29Z | `-patch2` (rev 297) — does not fix missing column | +| 00:20:27Z | `-patch3` (rev 298) — does not fix missing column | +| 00:24:11Z | `-patch4` (rev 299) — does not fix missing column | +| **00:37Z** | **PagerDuty fires; Envoy 5xx ratio = 0.052** | +| **00:48Z** | **Rollback to `f395ffa6…` (rev 300); errors clear; alert auto-resolves** | +| 01:22Z | 0 errors in last 2 min; incident closed | + +--- + +## Root Cause + +The ORM model for the `endpoints` table was updated to declare a new column `temporal_task_queue` (added as part of the temporal endpoint type feature, MLI-6425). The corresponding Alembic database migration **was never applied to production** before the new image was rolled out. + +When model-engine started, SQLAlchemy attempted to reference `endpoints.temporal_task_queue` in every endpoint query. Because the column did not exist in the live DB, PostgreSQL returned an error on every `list_model_endpoints` and `get_model_endpoint` call, causing universal 500s. + +The five patch images deployed during the incident did not address this — they lacked the migration and the ORM field continued to reference the non-existent column. + +**Contributing factors:** +- No migration-before-rollout enforcement: the deployment pipeline does not block a rollout if pending Alembic migrations exist. +- No startup schema validation: model-engine does not fail fast on ORM/DB schema drift; errors only surfaced at query time. +- No rollback SLA: 5 patch attempts were made over ~34 minutes before rollback was chosen. 
The correct fix (rollback) was not prioritized early enough. +- Delayed alerting: PagerDuty did not fire until 00:37Z, ~47 min after the bad deployment. + +--- + +## Impact + +| Dimension | Detail | +|---|---| +| Duration of 500s | ~47 min (23:50Z – 00:37Z alert; resolved 00:48Z) | +| Affected operations | All endpoint list, get, create, update, delete | +| Affected users | All model-engine users | +| Known downstream impact | At least one user missed a project delivery deadline | + +--- + +## Action Items + +| # | Action | Owner | Priority | +|---|---|---|---| +| 1 | **Enforce migration-first deployment**: CI/CD pipeline must verify all pending Alembic migrations are applied before new image goes live (or gate rollout on a migration job completing). | Infra/model-engine | High | +| 2 | **Add migration drift detection**: Startup health check that fails fast if ORM schema diverges from live DB schema, preventing silent 500s. | model-engine | High | +| 3 | **Define rollback SLA**: If ≥2 patch attempts fail within 15 minutes, initiate rollback immediately. Document this in the on-call runbook. | On-call/Infra | Medium | +| 4 | **Improve alerting latency**: Envoy 5xx alert threshold and evaluation window should catch this class of incident in <10 min, not 47 min. | Infra | Medium | +| 5 | **Proactive incident communication**: Users mid-deployment should receive Slack/status-page notification during active SEV1s. | On-call/Eng | Medium | + +--- + +## Lessons Learned + +- **Schema changes must be deployed atomically with or ahead of the code that depends on them.** A two-phase deploy (migration first, code second) or a migration-gating CI step would have prevented this entirely. +- **Fail fast on startup beats silently failing on every request.** A startup check that validates ORM↔DB schema alignment would have contained the blast radius to a failed rollout rather than a live outage. 
+- **Five patches in 34 minutes is a sign to stop and rollback.** When the root cause is unclear, reverting to a known-good state is faster and safer than iterating forward. From de0c35402dcdf77706b5ea6b843ef5de73740662 Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Wed, 29 Apr 2026 00:27:40 +0000 Subject: [PATCH 03/27] docs: expand post-mortem with concrete action item plans (MLI-6574) --- .../post-mortem-mli-6574-model-engine-500s.md | 210 +++++++++++++++--- 1 file changed, 184 insertions(+), 26 deletions(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index 659333c9..02830987 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -4,14 +4,14 @@ **Severity:** SEV1 **Date:** Apr 24–25, 2026 **Duration:** ~58 min (23:50Z deployment → 00:48Z rollback) -**Customer-facing impact:** ~47 min of 500s (23:50Z – 00:37Z PagerDuty alert; resolved 00:48Z) -**Status:** Resolved +**Customer-facing 500s:** ~47 min +**Status:** Resolved; follow-up actions tracked below --- ## Summary -A `kubectl set image` at 23:50Z on Apr 24 deployed a model-engine image that referenced a new ORM field (`endpoints.temporal_task_queue`) which did not exist in the production database. Every `list_model_endpoints` and `get_model_endpoint` call returned 500, blocking all endpoint deployments and operations for all users. Five patch attempts over ~34 minutes failed to address the root cause before a rollback to the previous image resolved the incident. At least one user missed a project delivery deadline. +At 23:50Z on Apr 24, a direct `kubectl set image` pushed a model-engine image (`04729cef…`) that declared a new ORM column `endpoints.temporal_task_queue` with no corresponding Alembic migration applied to production. Every `list_model_endpoints` and `get_model_endpoint` call immediately returned 500. 
Five patch images over ~34 minutes failed to resolve the issue before a rollback to `f395ffa6…` cleared errors at 00:48Z. At least one user missed a project delivery deadline. --- @@ -22,10 +22,10 @@ A `kubectl set image` at 23:50Z on Apr 24 deployed a model-engine image that ref | Feb 27, 05:54Z | Stable image `f395ffa6…` deployed; ran cleanly for ~57 days | | **Apr 24, 23:50:25Z** | **`kubectl set image` → `04729cef…` (rev 293); ORM references missing `temporal_task_queue` column; all endpoint list/get → 500** | | 23:55:43Z | `-internal` tag rolled (rev 294); same bug, same 500s | -| Apr 25, 00:09:32Z | `-patch` (rev 296) — does not fix missing column | -| 00:15:29Z | `-patch2` (rev 297) — does not fix missing column | -| 00:20:27Z | `-patch3` (rev 298) — does not fix missing column | -| 00:24:11Z | `-patch4` (rev 299) — does not fix missing column | +| Apr 25, 00:09:32Z | `-patch` (rev 296) pushed; does not include migration; 500s continue | +| 00:15:29Z | `-patch2` (rev 297); same | +| 00:20:27Z | `-patch3` (rev 298); same | +| 00:24:11Z | `-patch4` (rev 299); same | | **00:37Z** | **PagerDuty fires; Envoy 5xx ratio = 0.052** | | **00:48Z** | **Rollback to `f395ffa6…` (rev 300); errors clear; alert auto-resolves** | | 01:22Z | 0 errors in last 2 min; incident closed | @@ -34,17 +34,28 @@ A `kubectl set image` at 23:50Z on Apr 24 deployed a model-engine image that ref ## Root Cause -The ORM model for the `endpoints` table was updated to declare a new column `temporal_task_queue` (added as part of the temporal endpoint type feature, MLI-6425). The corresponding Alembic database migration **was never applied to production** before the new image was rolled out. +The `temporal_task_queue` column was added to the `Endpoint` ORM model (`hosted_model_inference.py`) as part of the temporal endpoint type feature (MLI-6425), but **no Alembic migration was written for it**. The column was referenced in every endpoint query via SQLAlchemy's column loading. 
When PostgreSQL returned `column "temporal_task_queue" does not exist`, SQLAlchemy surfaced a 500 on every call. -When model-engine started, SQLAlchemy attempted to reference `endpoints.temporal_task_queue` in every endpoint query. Because the column did not exist in the live DB, PostgreSQL returned an error on every `list_model_endpoints` and `get_model_endpoint` call, causing universal 500s. +### Why the existing migration gate didn't protect us -The five patch images deployed during the incident did not address this — they lacked the migration and the ORM field continued to reference the non-existent column. +The Helm chart already enforces migration-first ordering: -**Contributing factors:** -- No migration-before-rollout enforcement: the deployment pipeline does not block a rollout if pending Alembic migrations exist. -- No startup schema validation: model-engine does not fail fast on ORM/DB schema drift; errors only surfaced at query time. -- No rollback SLA: 5 patch attempts were made over ~34 minutes before rollback was chosen. The correct fix (rollback) was not prioritized early enough. -- Delayed alerting: PagerDuty did not fire until 00:37Z, ~47 min after the bad deployment. +```yaml +# charts/model-engine/templates/database_migration_job.yaml +annotations: + "helm.sh/hook": pre-install,pre-upgrade + "helm.sh/hook-weight": "-1" # runs before any pod rollout +``` + +This pre-upgrade hook runs `alembic upgrade head` before the gateway deployment starts. **The incident bypassed this entirely because the deployment used `kubectl set image`, not `helm upgrade`.** `kubectl set image` updates the pod spec directly, skipping all Helm hooks. + +### Contributing factors + +1. **`kubectl set image` used for production image updates** — bypasses the migration pre-hook built into the Helm chart. +2. **No ORM/DB schema validation at startup** — the app started successfully and served traffic with an invalid schema; errors only appeared at query time. 
The `/readyz` probe passes as long as the process is up, regardless of DB state. +3. **No migration written for the new column** — the ORM model was updated without a paired migration file. +4. **No rollback SLA** — 5 patch attempts over 34 minutes before rollback was chosen. +5. **Delayed alerting** — PagerDuty did not fire until 00:37Z, 47 min into the incident. --- @@ -52,7 +63,7 @@ The five patch images deployed during the incident did not address this — they | Dimension | Detail | |---|---| -| Duration of 500s | ~47 min (23:50Z – 00:37Z alert; resolved 00:48Z) | +| Customer-facing 500s | ~47 min (23:50Z → 00:37Z alert; resolved 00:48Z) | | Affected operations | All endpoint list, get, create, update, delete | | Affected users | All model-engine users | | Known downstream impact | At least one user missed a project delivery deadline | @@ -61,18 +72,165 @@ The five patch images deployed during the incident did not address this — they ## Action Items -| # | Action | Owner | Priority | -|---|---|---|---| -| 1 | **Enforce migration-first deployment**: CI/CD pipeline must verify all pending Alembic migrations are applied before new image goes live (or gate rollout on a migration job completing). | Infra/model-engine | High | -| 2 | **Add migration drift detection**: Startup health check that fails fast if ORM schema diverges from live DB schema, preventing silent 500s. | model-engine | High | -| 3 | **Define rollback SLA**: If ≥2 patch attempts fail within 15 minutes, initiate rollback immediately. Document this in the on-call runbook. | On-call/Infra | Medium | -| 4 | **Improve alerting latency**: Envoy 5xx alert threshold and evaluation window should catch this class of incident in <10 min, not 47 min. | Infra | Medium | -| 5 | **Proactive incident communication**: Users mid-deployment should receive Slack/status-page notification during active SEV1s. 
| On-call/Eng | Medium | +### P0 — Require Alembic migration before ORM column is added + +**What:** Add a CI check that detects when ORM models have columns not covered by any migration. Block merge if a new `Column(...)` in `hosted_model_inference.py` or `model.py` has no corresponding `add_column` in `alembic/versions/`. + +**How:** + +Option A (preferred) — use `alembic check` in CI: +```bash +# In CI, in the job that already spins up a Postgres container for integration tests +alembic check # exits non-zero if autogenerate detects pending schema changes +``` + +Option B — static linter (simpler, imperfect): parse `alembic/versions/` for `add_column("endpoints", …)` and cross-reference against columns declared in the ORM model. Fails the lint job on mismatch. + +**Files to change:** CI config, `Makefile` (add `make check-migrations` target). + +**Owner:** model-engine on-call +**Effort:** ~1 day + +--- + +### P0 — Enforce `helm upgrade` for production image rollouts; ban `kubectl set image` + +**What:** The `kubectl set image` pattern bypasses all Helm pre-upgrade hooks, including the migration job. All production image updates must go through `helm upgrade` so the migration job runs first. + +**How:** + +Add a deploy wrapper script used by both CI and on-call: +```bash +#!/bin/bash +# scripts/deploy.sh +set -euo pipefail +TAG=${1:?usage: deploy.sh } +helm upgrade model-engine charts/model-engine \ + --set tag="$TAG" \ + --wait \ # block until all hooks and pods are healthy + --timeout 10m \ + --atomic # auto-rollback if migration job or pod readiness fails +``` + +`--atomic` is the key flag: if the migration job exits non-zero or the pod fails its readiness probe within the timeout, Helm automatically rolls back to the previous release with no manual intervention needed. 
+ +If emergency direct `kubectl set image` is ever required (e.g., Helm state is corrupted), require running the migration job manually first: +```bash +kubectl create job db-migration-manual-$(date +%s) \ + --from=job/$(kubectl get jobs -l app=model-engine-database-migration --sort-by=.metadata.creationTimestamp -o name | tail -1) +kubectl wait --for=condition=complete job/db-migration-manual-... --timeout=600s +# only then: kubectl set image ... +``` + +**Files to change:** `scripts/deploy.sh` (new), `docs/internal/` runbook, CI pipeline. + +**Owner:** model-engine on-call +**Effort:** ~0.5 day + +--- + +### P1 — Add startup schema validation so bad images fail fast instead of serving 500s + +**What:** Add an `@app.on_event("startup")` check in `api/app.py` that compares ORM column declarations against actual DB columns. If any ORM column is missing from the DB, raise an exception before the process begins serving traffic. Kubernetes will keep old pods running and never route traffic to the bad pod. + +**How:** + +Add to `model-engine/model_engine_server/db/base.py`: +```python +from sqlalchemy import inspect as sa_inspect +from model_engine_server.db.models.hosted_model_inference import Base + +def validate_schema_or_raise(engine) -> None: + inspector = sa_inspect(engine) + for table in Base.metadata.sorted_tables: + schema = table.schema + db_cols = {c["name"] for c in inspector.get_columns(table.name, schema=schema)} + orm_cols = {c.name for c in table.columns} + missing = orm_cols - db_cols + if missing: + raise RuntimeError( + f"Schema drift detected: table {schema}.{table.name} " + f"is missing columns {missing}. " + f"Run 'alembic upgrade head' before deploying this image." 
+ ) +``` + +Wire into startup in `api/app.py` (alongside the existing `load_redis` startup event): +```python +from model_engine_server.db.base import get_db_manager, validate_schema_or_raise + +@app.on_event("startup") +def validate_db_schema(): + db = get_db_manager() + with db.session_sync() as session: + validate_schema_or_raise(session.get_bind()) +``` + +The existing readiness probe on `/readyz` (checked every 2s, `failureThreshold: 30` → ~60s window) will prevent Kubernetes from routing traffic to the pod while startup events are failing. If startup raises, the process exits and Kubernetes keeps the old replica set running. + +**Files to change:** +- `model-engine/model_engine_server/db/base.py` — add `validate_schema_or_raise()` +- `model-engine/model_engine_server/api/app.py` — add startup event + +**Owner:** model-engine on-call +**Effort:** ~1 day (including unit tests with a mock inspector) + +--- + +### P1 — Write the missing `temporal_task_queue` migration before re-deploying MLI-6425 + +**What:** PR #815 (`lilyz-ai/temporal-endpoint-type`) adds `temporal_task_queue` to the `Endpoint` ORM. A migration must be written and merged — and applied to production — before this image is deployed again. + +**How:** +```bash +cd model-engine/model_engine_server/db/migrations +alembic revision --autogenerate -m "add_temporal_task_queue_column" +# review generated file, confirm it contains: +# op.add_column("endpoints", sa.Column("temporal_task_queue", sa.Text(), nullable=True), schema="hosted_model_inference") +alembic upgrade head # validate locally against test DB +``` + +**Files to change:** new file in `model-engine/model_engine_server/db/migrations/alembic/versions/`. 
+ +**Owner:** MLI-6425 author +**Effort:** ~1 hour + +--- + +### P2 — Define rollback SLA in the on-call runbook + +**What:** Codify a clear decision rule: if the active incident is not resolved within **15 minutes of the first 5xx spike**, or after **2 failed forward-patch attempts**, initiate rollback immediately. + +``` +Rollback trigger (whichever comes first): + - 2 forward-patch attempts failed, OR + - 15 minutes elapsed since first 5xx spike + → helm rollback model-engine +``` + +**Files to change:** `docs/internal/smoke-tests.md` or new `docs/internal/oncall-runbook.md`. + +**Owner:** on-call lead +**Effort:** ~0.5 day + +--- + +### P2 — Reduce 5xx alert latency from ~47 min to <5 min + +**What:** The PagerDuty alert fired 47 minutes after the bad deploy. The Envoy 5xx monitor evaluation window needs to be tightened to catch a step-function spike within 5 minutes. + +**How:** In Datadog, find the monitor for `envoy.cluster.upstream_rq_5xx` (or the Envoy 5xx ratio metric). Set the evaluation window to 2–3 minutes with a threshold of ≥1% error rate sustained for 1 minute. + +**Files to change:** Datadog monitor config in Terracode-ML `datadog/` directory for the ml-serving account. + +**Owner:** on-call/infra +**Effort:** ~1 hour --- ## Lessons Learned -- **Schema changes must be deployed atomically with or ahead of the code that depends on them.** A two-phase deploy (migration first, code second) or a migration-gating CI step would have prevented this entirely. -- **Fail fast on startup beats silently failing on every request.** A startup check that validates ORM↔DB schema alignment would have contained the blast radius to a failed rollout rather than a live outage. -- **Five patches in 34 minutes is a sign to stop and rollback.** When the root cause is unclear, reverting to a known-good state is faster and safer than iterating forward. +1. 
**Migration hooks only protect you if you use Helm.** The project had the right guardrail (pre-upgrade hook at weight -1); it was bypassed by using `kubectl set image` directly. The operational pattern matters as much as the technical control. +2. **Fail loudly at startup, not silently on every request.** A startup check that raises an exception is strictly better than an app that boots clean and then 500s on every call. The readiness probe would have contained the blast radius to a failed rollout. +3. **Every ORM column needs a migration.** The ORM model is a lie until the migration runs; treat them as inseparable — the migration file is part of the same diff as the column declaration. +4. **After two failed patches, go backward.** The cost of 5 failed patches was 34 extra minutes of outage. When the root cause is unclear, rollback is always faster than forward. From 413ea988dea0a219b7fce4c259004b102ceff910 Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Wed, 29 Apr 2026 02:04:11 +0000 Subject: [PATCH 04/27] docs: add kubectl ban enforcement options and image testing workflow (MLI-6574) --- .../post-mortem-mli-6574-model-engine-500s.md | 25 ++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index 02830987..20130558 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -114,7 +114,26 @@ helm upgrade model-engine charts/model-engine \ `--atomic` is the key flag: if the migration job exits non-zero or the pod fails its readiness probe within the timeout, Helm automatically rolls back to the previous release with no manual intervention needed. -If emergency direct `kubectl set image` is ever required (e.g., Helm state is corrupted), require running the migration job manually first: +**Enforcing the ban — options by strength:** + +1. 
**Kubernetes RBAC (simplest):** Remove `patch`/`update` on `deployments` from developer roles. Only the CI service account (used by `helm upgrade`) retains that permission. Developers keep read and exec access; they just can't mutate deployment specs directly. + +2. **GitOps with ArgoCD (most robust):** ArgoCD continuously reconciles cluster state against git. A manual `kubectl set image` is detected as drift and reverted within ~3 minutes. Changes must go through a git commit + PR — there is no persistent path outside the pipeline. + +3. **Admission controller (OPA/Kyverno):** Write a policy that rejects any `PATCH` on a deployment's image unless it originates from the CI service account or carries a specific Helm annotation. More surgical than RBAC. + +RBAC is the fastest win today; GitOps is the right long-term answer. + +**Testing a new image without `kubectl set image`:** + +- **Staging:** run `scripts/deploy.sh ` against the staging context — same Helm path as production, with migrations enforced. +- **Standalone pod (no deployment mutation needed):** to verify an image starts cleanly on the live cluster without touching the `Deployment` resource: + ```bash + kubectl run test-pod --image= --rm -it -- bash + ``` + This doesn't require `deployments/patch`, so RBAC restrictions don't apply. Use it to manually verify DB connectivity, run the schema validator, or inspect logs before cutting a Helm release. + +If emergency direct `kubectl set image` is ever required (e.g., Helm state is corrupted), run the migration job manually first: ```bash kubectl create job db-migration-manual-$(date +%s) \ --from=job/$(kubectl get jobs -l app=model-engine-database-migration --sort-by=.metadata.creationTimestamp -o name | tail -1) @@ -122,10 +141,10 @@ kubectl wait --for=condition=complete job/db-migration-manual-... --timeout=600s # only then: kubectl set image ... ``` -**Files to change:** `scripts/deploy.sh` (new), `docs/internal/` runbook, CI pipeline. 
+**Files to change:** `scripts/deploy.sh` (new), `docs/internal/` runbook, CI pipeline, RBAC role manifests. **Owner:** model-engine on-call -**Effort:** ~0.5 day +**Effort:** ~1 day --- From ab7cee48642e89602c9840212a5b7c6463ad5a1a Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Sun, 3 May 2026 23:42:42 +0000 Subject: [PATCH 05/27] docs(post-mortem): clarify terms, fix action items, replace P1/P2 with rollout strategy - Add ORM column and Alembic migration definitions inline - Explain why every endpoint query failed (SQLAlchemy SELECT includes all ORM columns) - Remove incorrect downstream impact entry - Clarify P0 migration CI check doesn't prevent kubectl bypass (scoped note added) - Reframe kubectl ban as preventative (RBAC blocks at API server) not reactive (--atomic) - Replace P1/P2 items with a Staging vs Production rollout strategy section Co-Authored-By: Claude Sonnet 4.6 --- .../post-mortem-mli-6574-model-engine-500s.md | 176 +++++------------- 1 file changed, 47 insertions(+), 129 deletions(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index 20130558..d140cefd 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -11,7 +11,7 @@ ## Summary -At 23:50Z on Apr 24, a direct `kubectl set image` pushed a model-engine image (`04729cef…`) that declared a new ORM column `endpoints.temporal_task_queue` with no corresponding Alembic migration applied to production. Every `list_model_endpoints` and `get_model_endpoint` call immediately returned 500. Five patch images over ~34 minutes failed to resolve the issue before a rollback to `f395ffa6…` cleared errors at 00:48Z. At least one user missed a project delivery deadline. 
+At 23:50Z on Apr 24, a direct `kubectl set image` pushed a model-engine image (`04729cef…`) that declared a new ORM column `endpoints.temporal_task_queue` with no corresponding Alembic migration applied to production. Every `list_model_endpoints` and `get_model_endpoint` call immediately returned 500. Five patch images over ~34 minutes failed to resolve the issue before a rollback to `f395ffa6…` cleared errors at 00:48Z. --- @@ -34,7 +34,11 @@ At 23:50Z on Apr 24, a direct `kubectl set image` pushed a model-engine image (` ## Root Cause -The `temporal_task_queue` column was added to the `Endpoint` ORM model (`hosted_model_inference.py`) as part of the temporal endpoint type feature (MLI-6425), but **no Alembic migration was written for it**. The column was referenced in every endpoint query via SQLAlchemy's column loading. When PostgreSQL returned `column "temporal_task_queue" does not exist`, SQLAlchemy surfaced a 500 on every call. +**Background terms:** +- **ORM column** — a Python class attribute like `temporal_task_queue = Column(Text)` on the `Endpoint` model. SQLAlchemy (our ORM) translates these attributes into SQL; every declared column is included in the `SELECT` it generates when loading `Endpoint` objects. +- **Alembic migration** — a versioned SQL script (e.g. `ALTER TABLE endpoints ADD COLUMN temporal_task_queue TEXT`) that updates the live database schema to match what the application code expects. Without it, the DB is missing the column the ORM declares. + +The `temporal_task_queue` column was added to the `Endpoint` ORM model (`hosted_model_inference.py`) as part of the temporal endpoint type feature (MLI-6425), but **no Alembic migration was written for it**. Because SQLAlchemy includes every ORM-declared column in its generated `SELECT` statement, all endpoint operations — list, get, create, update, delete — issued a query referencing `temporal_task_queue`. 
PostgreSQL returned `column "temporal_task_queue" does not exist` on every call, causing a 500 across the board. ### Why the existing migration gate didn't protect us @@ -66,25 +70,24 @@ This pre-upgrade hook runs `alembic upgrade head` before the gateway deployment | Customer-facing 500s | ~47 min (23:50Z → 00:37Z alert; resolved 00:48Z) | | Affected operations | All endpoint list, get, create, update, delete | | Affected users | All model-engine users | -| Known downstream impact | At least one user missed a project delivery deadline | --- ## Action Items -### P0 — Require Alembic migration before ORM column is added +### P0 — Add CI check: every new ORM column requires a paired Alembic migration -**What:** Add a CI check that detects when ORM models have columns not covered by any migration. Block merge if a new `Column(...)` in `hosted_model_inference.py` or `model.py` has no corresponding `add_column` in `alembic/versions/`. +**What:** Add a CI check that blocks merge when the ORM model declares a column that has no corresponding migration. This prevents the "no migration written" contributing factor, though it does not by itself prevent `kubectl set image` from bypassing migration execution (that is addressed in the next item). **How:** Option A (preferred) — use `alembic check` in CI: ```bash -# In CI, in the job that already spins up a Postgres container for integration tests +# In the CI job that already spins up a Postgres container for integration tests alembic check # exits non-zero if autogenerate detects pending schema changes ``` -Option B — static linter (simpler, imperfect): parse `alembic/versions/` for `add_column("endpoints", …)` and cross-reference against columns declared in the ORM model. Fails the lint job on mismatch. +Option B — static linter (simpler, imperfect): cross-reference `add_column("endpoints", …)` calls in `alembic/versions/` against columns declared in the ORM model. Fail the lint job on mismatch. 
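Option B can start as small as a regex cross-reference. A rough sketch, with inline strings standing in for the real model and migration files (the patterns are assumptions about how columns are declared and would need tuning against the actual codebase):

```python
import re

# Hypothetical stand-ins for hosted_model_inference.py and alembic/versions/*.py.
ORM_SOURCE = '''
class Endpoint(Base):
    __tablename__ = "endpoints"
    name = Column(Text)
    temporal_task_queue = Column(Text)
'''

MIGRATION_SOURCES = [
    'op.add_column("endpoints", sa.Column("name", sa.Text()))',
]

def declared_columns(orm_src):
    # Naive: any `attr = Column(...)` assignment counts as a declared column.
    return set(re.findall(r"^\s*(\w+)\s*=\s*Column\(", orm_src, re.MULTILINE))

def migrated_columns(migration_srcs):
    cols = set()
    for src in migration_srcs:
        cols |= set(re.findall(r'add_column\(\s*"endpoints",\s*sa\.Column\(\s*"(\w+)"', src))
    return cols

missing = declared_columns(ORM_SOURCE) - migrated_columns(MIGRATION_SOURCES)
print(sorted(missing))  # ['temporal_task_queue']
```

Note the known imperfection: columns created by older migrations via `create_table` or raw SQL rather than `add_column` would surface as false positives, which is why `alembic check` against a real database remains the preferred option.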
**Files to change:** CI config, `Makefile` (add `make check-migrations` target). @@ -93,13 +96,29 @@ Option B — static linter (simpler, imperfect): parse `alembic/versions/` for ` --- -### P0 — Enforce `helm upgrade` for production image rollouts; ban `kubectl set image` +### P0 — Prevent `kubectl set image` in production (preventative, not reactive) -**What:** The `kubectl set image` pattern bypasses all Helm pre-upgrade hooks, including the migration job. All production image updates must go through `helm upgrade` so the migration job runs first. +**What:** `kubectl set image` updates pods directly, bypassing all Helm pre-upgrade hooks including the migration job. The fix must be preventative: `kubectl set image` on the production deployment should be blocked or immediately reverted — not just rolled back after a failure is detected. -**How:** +**Preferred approach — Kubernetes RBAC:** + +Remove `patch`/`update` permissions on `Deployment` resources from developer roles in the production namespace. Only the CI service account retains those permissions. A `kubectl set image` attempt from a developer will be rejected by the API server immediately — no deployment occurs, no rollback needed. + +```yaml +# Remove from developer ClusterRole / Role in prod namespace: +# - apiGroups: ["apps"] +# resources: ["deployments"] +# verbs: ["patch", "update"] # <-- remove these +# verbs: ["get", "list", "watch"] # <-- keep read-only +``` -Add a deploy wrapper script used by both CI and on-call: +**Alternative — GitOps (ArgoCD):** + +ArgoCD continuously reconciles cluster state against git. A manual `kubectl set image` is detected as drift and reverted within ~3 minutes. This is more robust than RBAC alone because it also catches changes made via other paths (e.g. direct `kubectl edit`). 
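For reference, the GitOps option could look roughly like the following Argo CD `Application`; the repo URL and destination namespace are placeholders, not our actual config:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: model-engine
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/llm-engine   # placeholder
    path: charts/model-engine
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: model-engine   # placeholder
  syncPolicy:
    automated:
      prune: false
      selfHeal: true   # reverts manual drift such as `kubectl set image`
```

With `selfHeal: true`, any out-of-band mutation of the Deployment is reconciled back to what git declares on the next sync cycle.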
+
+**Sanctioned production deploy path:**
+
+All production image updates go through a deploy script that runs `helm upgrade` with the migration pre-hook:
 ```bash
 #!/bin/bash
 # scripts/deploy.sh
@@ -107,143 +126,42 @@ set -euo pipefail
 TAG=${1:?usage: deploy.sh <tag>}
 helm upgrade model-engine charts/model-engine \
   --set tag="$TAG" \
-  --wait \ # block until all hooks and pods are healthy
+  --wait \
   --timeout 10m \
-  --atomic # auto-rollback if migration job or pod readiness fails
+  --atomic
 ```
 
-`--atomic` is the key flag: if the migration job exits non-zero or the pod fails its readiness probe within the timeout, Helm automatically rolls back to the previous release with no manual intervention needed.
-
-**Enforcing the ban — options by strength:**
-
-1. **Kubernetes RBAC (simplest):** Remove `patch`/`update` on `deployments` from developer roles. Only the CI service account (used by `helm upgrade`) retains that permission. Developers keep read and exec access; they just can't mutate deployment specs directly.
-
-2. **GitOps with ArgoCD (most robust):** ArgoCD continuously reconciles cluster state against git. A manual `kubectl set image` is detected as drift and reverted within ~3 minutes. Changes must go through a git commit + PR — there is no persistent path outside the pipeline.
+**Testing a new image before deploying to production:**
 
-3. **Admission controller (OPA/Kyverno):** Write a policy that rejects any `PATCH` on a deployment's image unless it originates from the CI service account or carries a specific Helm annotation. More surgical than RBAC.
-
-RBAC is the fastest win today; GitOps is the right long-term answer.
-
-**Testing a new image without `kubectl set image`:**
-
-- **Staging:** run `scripts/deploy.sh <tag>` against the staging context — same Helm path as production, with migrations enforced.
-- **Standalone pod (no deployment mutation needed):** to verify an image starts cleanly on the live cluster without touching the `Deployment` resource:
-  ```bash
-  kubectl run test-pod --image=<image> --rm -it -- bash
-  ```
-  This doesn't require `deployments/patch`, so RBAC restrictions don't apply. Use it to manually verify DB connectivity, run the schema validator, or inspect logs before cutting a Helm release.
-
-If emergency direct `kubectl set image` is ever required (e.g., Helm state is corrupted), run the migration job manually first:
+Use staging (see Rollout Strategy below) or run a standalone pod against the production cluster without mutating the `Deployment`:
 ```bash
-kubectl create job db-migration-manual-$(date +%s) \
-  --from=job/$(kubectl get jobs -l app=model-engine-database-migration --sort-by=.metadata.creationTimestamp -o name | tail -1)
-kubectl wait --for=condition=complete job/db-migration-manual-... --timeout=600s
-# only then: kubectl set image ...
+kubectl run test-pod --image=<image> --rm -it -- bash
 ```
+This requires only `pods/create`, which developer roles retain.
 
-**Files to change:** `scripts/deploy.sh` (new), `docs/internal/` runbook, CI pipeline, RBAC role manifests.
+**Files to change:** RBAC role manifests (production namespace), `scripts/deploy.sh` (new), CI pipeline config.
 
 **Owner:** model-engine on-call
 **Effort:** ~1 day
 
 ---
 
-### P1 — Add startup schema validation so bad images fail fast instead of serving 500s
-
-**What:** Add an `@app.on_event("startup")` check in `api/app.py` that compares ORM column declarations against actual DB columns. If any ORM column is missing from the DB, raise an exception before the process begins serving traffic. Kubernetes will keep old pods running and never route traffic to the bad pod.
- -**How:** - -Add to `model-engine/model_engine_server/db/base.py`: -```python -from sqlalchemy import inspect as sa_inspect -from model_engine_server.db.models.hosted_model_inference import Base - -def validate_schema_or_raise(engine) -> None: - inspector = sa_inspect(engine) - for table in Base.metadata.sorted_tables: - schema = table.schema - db_cols = {c["name"] for c in inspector.get_columns(table.name, schema=schema)} - orm_cols = {c.name for c in table.columns} - missing = orm_cols - db_cols - if missing: - raise RuntimeError( - f"Schema drift detected: table {schema}.{table.name} " - f"is missing columns {missing}. " - f"Run 'alembic upgrade head' before deploying this image." - ) -``` - -Wire into startup in `api/app.py` (alongside the existing `load_redis` startup event): -```python -from model_engine_server.db.base import get_db_manager, validate_schema_or_raise - -@app.on_event("startup") -def validate_db_schema(): - db = get_db_manager() - with db.session_sync() as session: - validate_schema_or_raise(session.get_bind()) -``` - -The existing readiness probe on `/readyz` (checked every 2s, `failureThreshold: 30` → ~60s window) will prevent Kubernetes from routing traffic to the pod while startup events are failing. If startup raises, the process exits and Kubernetes keeps the old replica set running. - -**Files to change:** -- `model-engine/model_engine_server/db/base.py` — add `validate_schema_or_raise()` -- `model-engine/model_engine_server/api/app.py` — add startup event - -**Owner:** model-engine on-call -**Effort:** ~1 day (including unit tests with a mock inspector) - ---- - -### P1 — Write the missing `temporal_task_queue` migration before re-deploying MLI-6425 - -**What:** PR #815 (`lilyz-ai/temporal-endpoint-type`) adds `temporal_task_queue` to the `Endpoint` ORM. A migration must be written and merged — and applied to production — before this image is deployed again. 
- -**How:** -```bash -cd model-engine/model_engine_server/db/migrations -alembic revision --autogenerate -m "add_temporal_task_queue_column" -# review generated file, confirm it contains: -# op.add_column("endpoints", sa.Column("temporal_task_queue", sa.Text(), nullable=True), schema="hosted_model_inference") -alembic upgrade head # validate locally against test DB -``` - -**Files to change:** new file in `model-engine/model_engine_server/db/migrations/alembic/versions/`. - -**Owner:** MLI-6425 author -**Effort:** ~1 hour - ---- +### Rollout Strategy -### P2 — Define rollback SLA in the on-call runbook +**Staging:** No restrictions. Developers can use `kubectl set image`, `helm upgrade`, or any other mechanism to test images against the staging cluster. Staging exists specifically for iteration before production. -**What:** Codify a clear decision rule: if the active incident is not resolved within **15 minutes of the first 5xx spike**, or after **2 failed forward-patch attempts**, initiate rollback immediately. +**Production:** All image updates must go through `helm upgrade` via `scripts/deploy.sh`. `kubectl set image` is blocked at the RBAC level — developer roles do not have `patch`/`update` on `Deployment` resources in the production namespace. The `helm upgrade` path runs the migration pre-hook automatically before any pod rolls over. +The deploy path in both environments: ``` -Rollback trigger (whichever comes first): - - 2 forward-patch attempts failed, OR - - 15 minutes elapsed since first 5xx spike - → helm rollback model-engine +staging: any method (kubectl set image, helm upgrade, etc.) + ↓ + validate: smoke tests pass, no 500s + ↓ +production: scripts/deploy.sh (helm upgrade, migration-first, RBAC-enforced) ``` -**Files to change:** `docs/internal/smoke-tests.md` or new `docs/internal/oncall-runbook.md`. 
- -**Owner:** on-call lead -**Effort:** ~0.5 day - ---- - -### P2 — Reduce 5xx alert latency from ~47 min to <5 min - -**What:** The PagerDuty alert fired 47 minutes after the bad deploy. The Envoy 5xx monitor evaluation window needs to be tightened to catch a step-function spike within 5 minutes. - -**How:** In Datadog, find the monitor for `envoy.cluster.upstream_rq_5xx` (or the Envoy 5xx ratio metric). Set the evaluation window to 2–3 minutes with a threshold of ≥1% error rate sustained for 1 minute. - -**Files to change:** Datadog monitor config in Terracode-ML `datadog/` directory for the ml-serving account. - -**Owner:** on-call/infra -**Effort:** ~1 hour +Any change that requires a DB schema update (new ORM column) must have its Alembic migration merged and applied to production before the new image is deployed. --- From 97fa12351a655d3d66169d2642ee959dc5ff3053 Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Mon, 4 May 2026 00:16:13 +0000 Subject: [PATCH 06/27] docs(post-mortem): streamline action items and expand rollout strategy - Remove redundant CI migration check (Helm pre-upgrade hook already enforces this) - Drop ArgoCD alternative; keep RBAC as the single preventative approach - Expand RBAC section with specific files to change (Terracode-ML Role, scripts/deploy.sh, CircleCI config) - Move image testing guidance into Rollout Strategy - Add staging environment gap (no values_staging.yaml exists) with concrete plan Co-Authored-By: Claude Sonnet 4.6 --- .../post-mortem-mli-6574-model-engine-500s.md | 89 +++++++++---------- 1 file changed, 44 insertions(+), 45 deletions(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index d140cefd..194fd651 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -75,71 +75,58 @@ This pre-upgrade hook runs `alembic upgrade head` before the gateway deployment ## 
Action Items -### P0 — Add CI check: every new ORM column requires a paired Alembic migration - -**What:** Add a CI check that blocks merge when the ORM model declares a column that has no corresponding migration. This prevents the "no migration written" contributing factor, though it does not by itself prevent `kubectl set image` from bypassing migration execution (that is addressed in the next item). - -**How:** - -Option A (preferred) — use `alembic check` in CI: -```bash -# In the CI job that already spins up a Postgres container for integration tests -alembic check # exits non-zero if autogenerate detects pending schema changes -``` +### P0 — Prevent `kubectl set image` in production (preventative, not reactive) -Option B — static linter (simpler, imperfect): cross-reference `add_column("endpoints", …)` calls in `alembic/versions/` against columns declared in the ORM model. Fail the lint job on mismatch. +**What:** `kubectl set image` updates pods directly, bypassing all Helm pre-upgrade hooks including the migration job. The fix must be preventative: a `kubectl set image` attempt against the production deployment is rejected by the API server before it takes effect. -**Files to change:** CI config, `Makefile` (add `make check-migrations` target). +**How — Kubernetes RBAC:** -**Owner:** model-engine on-call -**Effort:** ~1 day +Developer kubeconfig contexts for the production namespace/cluster must not carry `patch`/`update` on `Deployment` resources. Only the CI service account (the one that runs `helm upgrade`) retains those verbs. When a developer runs `kubectl set image`, the API server returns 403 immediately — no pods are updated, no rollback is needed. ---- +Two files need to change: -### P0 — Prevent `kubectl set image` in production (preventative, not reactive) +**1. Terracode-ML — developer Role in the production namespace** -**What:** `kubectl set image` updates pods directly, bypassing all Helm pre-upgrade hooks including the migration job. 
The fix must be preventative: `kubectl set image` on the production deployment should be blocked or immediately reverted — not just rolled back after a failure is detected. - -**Preferred approach — Kubernetes RBAC:** - -Remove `patch`/`update` permissions on `Deployment` resources from developer roles in the production namespace. Only the CI service account retains those permissions. A `kubectl set image` attempt from a developer will be rejected by the API server immediately — no deployment occurs, no rollback needed. +Find the existing developer `Role` or `ClusterRole` that grants access to the production `model-engine` namespace (likely in `scaleapi-ml-serving/` or similar). Remove `patch` and `update` from the `deployments` rule: ```yaml -# Remove from developer ClusterRole / Role in prod namespace: -# - apiGroups: ["apps"] -# resources: ["deployments"] -# verbs: ["patch", "update"] # <-- remove these -# verbs: ["get", "list", "watch"] # <-- keep read-only +# Before (grants kubectl set image): +- apiGroups: ["apps"] + resources: ["deployments"] + verbs: ["get", "list", "watch", "patch", "update"] + +# After (blocks kubectl set image): +- apiGroups: ["apps"] + resources: ["deployments"] + verbs: ["get", "list", "watch"] ``` -**Alternative — GitOps (ArgoCD):** +Developers retain read access (`get`, `list`, `watch`) and exec access (`pods/exec`), so debugging and log inspection are unaffected. -ArgoCD continuously reconciles cluster state against git. A manual `kubectl set image` is detected as drift and reverted within ~3 minutes. This is more robust than RBAC alone because it also catches changes made via other paths (e.g. direct `kubectl edit`). +**2. llm-engine — `scripts/deploy.sh` (new file)** -**Sanctioned production deploy path:** +This becomes the only sanctioned path for production image updates. 
It runs `helm upgrade`, which triggers the migration pre-hook before any pod rolls:
-All production image updates go through a deploy script that runs `helm upgrade` with the migration pre-hook:
 ```bash
 #!/bin/bash
 # scripts/deploy.sh
 set -euo pipefail
 TAG=${1:?usage: deploy.sh <tag>}
+VALUES=${2:-charts/model-engine/values_sample.yaml}
 helm upgrade model-engine charts/model-engine \
+  --values "$VALUES" \
   --set tag="$TAG" \
   --wait \
   --timeout 10m \
   --atomic
 ```
 
-**Testing a new image before deploying to production:**
-
-Use staging (see Rollout Strategy below) or run a standalone pod against the production cluster without mutating the `Deployment`:
-```bash
-kubectl run test-pod --image=<image> --rm -it -- bash
-```
-This requires only `pods/create`, which developer roles retain.
+`--atomic` ensures that if the migration job or any pod fails readiness within the timeout, Helm rolls back to the previous release automatically.
 
-**Files to change:** RBAC role manifests (production namespace), `scripts/deploy.sh` (new), CI pipeline config.
+**Files to change:**
+- **Terracode-ML:** developer `Role`/`ClusterRole` manifest in the production namespace — remove `patch`/`update` on `apps/deployments`
+- **llm-engine:** `scripts/deploy.sh` (new)
+- **llm-engine `.circleci/config.yml`:** replace any `kubectl set image` calls in CI deploy jobs with `scripts/deploy.sh <tag>`
 
 **Owner:** model-engine on-call
 **Effort:** ~1 day
@@ -148,20 +135,32 @@ This requires only `pods/create`, which developer roles retain.
 
 ### Rollout Strategy
 
-**Staging:** No restrictions. Developers can use `kubectl set image`, `helm upgrade`, or any other mechanism to test images against the staging cluster. Staging exists specifically for iteration before production.
+There is currently no dedicated staging environment for model-engine in this repo — `values_circleci.yaml` is ephemeral (spun up and torn down per CI run) and `values_sample.yaml` targets production. 
A stable staging environment should be added:
+
+**Staging (new):**
+- Add `values_staging.yaml` mirroring `values_sample.yaml` but pointing at the staging cluster and a staging DB.
+- No RBAC restrictions in staging: developers can use `kubectl set image`, `helm upgrade`, or any other mechanism freely.
+- Use staging to validate that a new image starts cleanly, migrations run, and endpoint smoke tests pass before touching production.
+
+**Testing a new image on staging before promoting to production:**
+```bash
+# Any of these are fine in staging:
+kubectl set image deployment/model-engine-gateway gateway=<image>  # quick iteration
+scripts/deploy.sh <tag> charts/model-engine/values_staging.yaml    # full helm path
+```
 
-**Production:** All image updates must go through `helm upgrade` via `scripts/deploy.sh`. `kubectl set image` is blocked at the RBAC level — developer roles do not have `patch`/`update` on `Deployment` resources in the production namespace. The `helm upgrade` path runs the migration pre-hook automatically before any pod rolls over.
+**Production:**
+All image updates must go through `scripts/deploy.sh` with the production values file. `kubectl set image` is blocked at the RBAC level — developer roles do not have `patch`/`update` on `Deployment` resources in the production namespace.
 
-The deploy path in both environments:
 ```
-staging: any method (kubectl set image, helm upgrade, etc.)
+staging: any method — iterate freely
   ↓
-  validate: smoke tests pass, no 500s
+  validate: migrations apply cleanly, smoke tests pass, no 500s
   ↓
-production: scripts/deploy.sh (helm upgrade, migration-first, RBAC-enforced)
+production: scripts/deploy.sh (helm upgrade, migration-first, RBAC-enforced)
 ```
 
-Any change that requires a DB schema update (new ORM column) must have its Alembic migration merged and applied to production before the new image is deployed.
+Any change that requires a DB schema update (new ORM column) must have its Alembic migration merged and applied to production before the new image is deployed via `scripts/deploy.sh`.
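The paired migration that this rule demands has a predictable shape. A hypothetical sketch — the real file comes from `alembic revision --autogenerate`, and the revision ids below are placeholders:

```python
"""Illustrative Alembic revision pairing a new ORM column (sketch only).

The revision ids are placeholders; the add_column target matches the
expected migration described earlier in this post-mortem.
"""
import sqlalchemy as sa
from alembic import op

revision = "0000000000ab"       # placeholder revision id
down_revision = "0000000000aa"  # placeholder parent revision

def upgrade() -> None:
    # nullable=True: existing rows need no backfill, and the previous image
    # (which never reads this column) stays compatible during the rollout.
    op.add_column(
        "endpoints",
        sa.Column("temporal_task_queue", sa.Text(), nullable=True),
        schema="hosted_model_inference",
    )

def downgrade() -> None:
    op.drop_column(
        "endpoints", "temporal_task_queue", schema="hosted_model_inference"
    )
```

Keeping new columns nullable is what makes "migrate first, deploy second" safe in both directions: the old image ignores the column, and the new image finds it present.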
--- From cb1b3a3726cfa84357738b5a4411328436073189 Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Mon, 4 May 2026 00:29:12 +0000 Subject: [PATCH 07/27] docs(post-mortem): fix RBAC location, drop scripts/deploy.sh, expand rollout strategy - RBAC: note no developer Role exists in Terracode-ML scaleapi-ml-serving/; add inspection steps to find the correct EKS access entry before making changes - Deploy path: replace proposed scripts/deploy.sh with existing just deploy prod (model-engine-internal justfile already does helm upgrade --atomic --timeout 120m0s) - Rollout strategy: add env table (training/launch/prod/circleci), note launch cluster may already serve as staging with ~1h effort vs new cluster, clarify effort range Co-Authored-By: Claude Sonnet 4.6 --- .../post-mortem-mli-6574-model-engine-500s.md | 95 +++++++++---------- 1 file changed, 46 insertions(+), 49 deletions(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index 194fd651..8cad3f39 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -81,86 +81,83 @@ This pre-upgrade hook runs `alembic upgrade head` before the gateway deployment **How — Kubernetes RBAC:** -Developer kubeconfig contexts for the production namespace/cluster must not carry `patch`/`update` on `Deployment` resources. Only the CI service account (the one that runs `helm upgrade`) retains those verbs. When a developer runs `kubectl set image`, the API server returns 403 immediately — no pods are updated, no rollback is needed. +Developer kubeconfig contexts for the production cluster (`ml-serving-new`) must not carry `patch`/`update` on `Deployment` resources. Only the CI service account retains those verbs. When a developer runs `kubectl set image`, the API server returns 403 immediately — no pods are updated, no rollback is needed. 
Developers retain read access and `pods/exec`, so debugging and log inspection are unaffected. -Two files need to change: +**Finding the right file to change:** -**1. Terracode-ML — developer Role in the production namespace** +There is no existing developer `Role`/`ClusterRole` for the model-engine namespace in Terracode-ML — no such manifest was found in `scaleapi-ml-serving/`. Developer access to `ml-serving-new` is likely granted via AWS EKS access entries or the `aws-auth` ConfigMap, which may give developers a cluster-level role (e.g. `system:masters` or a custom group). Before making this change: -Find the existing developer `Role` or `ClusterRole` that grants access to the production `model-engine` namespace (likely in `scaleapi-ml-serving/` or similar). Remove `patch` and `update` from the `deployments` rule: +1. Run `kubectl describe clusterrolebinding -n default | grep -A5 ` on `ml-serving-new` to find which ClusterRole/Role your personal context binds to. +2. Locate that binding in Terracode-ML `scaleapi-ml-serving/` (or the relevant EKS access entry config). +3. Remove `patch`/`update` on `apps/deployments` from the developer role, or create a namespace-scoped `Role` that overrides the cluster-level grant for the model-engine namespace. -```yaml -# Before (grants kubectl set image): -- apiGroups: ["apps"] - resources: ["deployments"] - verbs: ["get", "list", "watch", "patch", "update"] - -# After (blocks kubectl set image): -- apiGroups: ["apps"] - resources: ["deployments"] - verbs: ["get", "list", "watch"] -``` +**Deploy path — already exists in model-engine-internal:** -Developers retain read access (`get`, `list`, `watch`) and exec access (`pods/exec`), so debugging and log inspection are unaffected. +No new script is needed. `model-engine-internal/justfile` already provides the sanctioned deploy command: -**2. llm-engine — `scripts/deploy.sh` (new file)** - -This becomes the only sanctioned path for production image updates. 
It runs `helm upgrade`, which triggers the migration pre-hook before any pod rolls:
+No new script is needed. `model-engine-internal/justfile` already provides the sanctioned deploy command:
+
+```bash
+# From model-engine-internal/
+just deploy prod   # helm upgrade with --atomic and --timeout 120m0s
+just deploy launch # same path, targets ml-launch-new cluster (staging)
+```
+
+This runs:
 ```bash
-#!/bin/bash
-# scripts/deploy.sh
-set -euo pipefail
-TAG=${1:?usage: deploy.sh <tag>}
-VALUES=${2:-charts/model-engine/values_sample.yaml}
-helm upgrade model-engine charts/model-engine \
-  --values "$VALUES" \
-  --set tag="$TAG" \
-  --wait \
-  --timeout 10m \
-  --atomic
+helm upgrade model-engine model-engine \
+  -f model-engine-internal/resources/values/values_prod.yaml \
+  --set tag=<tag> \
+  --atomic \
+  --timeout 120m0s
 ```
 
-`--atomic` ensures that if the migration job or any pod fails readiness within the timeout, Helm rolls back to the previous release automatically.
+The Helm pre-upgrade hook runs the migration job before any pod rolls. `--atomic` rolls back automatically if it fails.
 
 **Files to change:**
-- **Terracode-ML:** developer `Role`/`ClusterRole` manifest in the production namespace — remove `patch`/`update` on `apps/deployments`
-- **llm-engine:** `scripts/deploy.sh` (new)
-- **llm-engine `.circleci/config.yml`:** replace any `kubectl set image` calls in CI deploy jobs with `scripts/deploy.sh <tag>`
+- **Terracode-ML `scaleapi-ml-serving/`:** investigate and restrict developer role — remove `patch`/`update` on `apps/deployments` for the production cluster/namespace (exact file TBD after running the kubectl inspect above)
+- **model-engine-internal `justfile`:** already correct; no changes needed — `just deploy prod` is the right path
 
 **Owner:** model-engine on-call
-**Effort:** ~1 day
+**Effort:** ~1 day (mostly the RBAC investigation + review cycle)
 
 ---
 
 ### Rollout Strategy
 
-There is currently no dedicated staging environment for model-engine in this repo — `values_circleci.yaml` is ephemeral (spun up and torn down per CI run) and `values_sample.yaml` targets production. 
A stable staging environment should be added: +**Existing environments in model-engine-internal:** + +| Env | Cluster | Values file | Purpose | +|---|---|---|---| +| `training` | `ml-training-new` | `values_training.yaml` | default dev/test target | +| `launch` | `ml-launch-new` | `values_launch.yaml` | separate launch cluster | +| `prod` | `ml-serving-new` | `values_prod.yaml` | production | +| `circleci` | minikube (ephemeral) | `values_circleci.yaml` | CI integration tests only | + +The `launch` environment (`ml-launch-new`) may already serve as a staging environment for production deploys — confirm whether it is actively used as production before treating it as staging. If it is available, no new cluster is needed. -**Staging (new):** -- Add `values_staging.yaml` mirroring `values_sample.yaml` but pointing at the staging cluster and a staging DB. -- No RBAC restrictions in staging: developers can use `kubectl set image`, `helm upgrade`, or any other mechanism freely. -- Use staging to validate that a new image starts cleanly, migrations run, and endpoint smoke tests pass before touching production. +**Staging (using `launch` or `training`):** +- No RBAC restrictions — developers can use `kubectl set image`, `helm upgrade`, or any other mechanism freely. +- Use staging to validate: new image starts cleanly, migrations apply without error, smoke tests pass. +- **Effort:** if `launch` is already available and not serving production traffic, this is ~1 hour of process documentation. If a new cluster is needed, effort is significantly higher (new EKS cluster + Terraform + values file). 
**Testing a new image on staging before promoting to production:**
 ```bash
-# Any of these are fine in staging:
+# From model-engine-internal/ — anything goes in staging:
+just deploy launch   # full helm path, migration pre-hook runs
 kubectl set image deployment/model-engine-gateway gateway=<image>  # quick iteration
-scripts/deploy.sh <tag> charts/model-engine/values_staging.yaml    # full helm path
 ```
 
 **Production:**
-All image updates must go through `scripts/deploy.sh` with the production values file. `kubectl set image` is blocked at the RBAC level — developer roles do not have `patch`/`update` on `Deployment` resources in the production namespace.
+All image updates go through the existing `just deploy prod` command. `kubectl set image` is blocked at the RBAC level — developer roles do not have `patch`/`update` on `Deployment` resources in the production cluster.
 
 ```
-staging: any method — iterate freely
-  ↓
-  validate: migrations apply cleanly, smoke tests pass, no 500s
-  ↓
-production: scripts/deploy.sh (helm upgrade, migration-first, RBAC-enforced)
+staging (launch/training): any method — iterate freely
+  ↓
+  validate: migrations apply, smoke tests pass, no 500s
+  ↓
+production: just deploy prod (helm upgrade, migration-first, RBAC-enforced)
 ```
 
-Any change that requires a DB schema update (new ORM column) must have its Alembic migration merged and applied to production before the new image is deployed via `scripts/deploy.sh`.
+Any change that requires a DB schema update (new ORM column) must have its Alembic migration merged and applied to production before `just deploy prod` is run.
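The "validate" step between staging and production can be made concrete as a small smoke gate. A hedged sketch — the gateway host is a placeholder, the paths are assumptions (`/readyz` from the readiness probe discussed earlier, `/v1/model-endpoints` as the list call that returned 500s during the incident), and the fetcher is injectable so the gate can be exercised offline:

```python
from urllib import request

# Hedged sketch of the staging smoke gate. STAGING_GATEWAY is a placeholder
# host and the paths are assumptions, not confirmed routes.
STAGING_GATEWAY = "http://model-engine.staging.example.internal"

SMOKE_PATHS = ["/readyz", "/v1/model-endpoints"]

def http_status(url):
    """Default fetcher: return the HTTP status code for a GET."""
    with request.urlopen(url, timeout=10) as resp:
        return resp.status

def smoke(fetch=http_status):
    """Raise if any smoke path returns a 5xx; return the checked paths otherwise."""
    failures = {}
    for path in SMOKE_PATHS:
        status = fetch(STAGING_GATEWAY + path)
        if status >= 500:
            failures[path] = status
    if failures:
        raise RuntimeError(f"smoke gate failed: {failures}")
    return SMOKE_PATHS
```

In CI this would run after the staging deploy and block the production deploy on failure, turning the flow diagram above into an enforced gate rather than a convention.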
--- From a6a7994eedeb1170a5633b0a2e7b59d27ef143ce Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Mon, 4 May 2026 00:37:22 +0000 Subject: [PATCH 08/27] docs(post-mortem): pin actual RBAC files and add real staging effort estimate - RBAC: identify eks.tf (lines 545-561) and workloads.tf as exact files to change; explain ml_infra_admin cluster-admin grant is why kubectl set image works today - Staging: clarify all 3 clusters are prod; add component table showing RDS/Redis/VPC/DNS already exist in scaleapi-ml-serving/global/; new EKS cluster is the dominant cost; total estimate ~8-11 days Co-Authored-By: Claude Sonnet 4.6 --- .../post-mortem-mli-6574-model-engine-500s.md | 120 ++++++++++-------- 1 file changed, 69 insertions(+), 51 deletions(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index 8cad3f39..33c2d35b 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -81,83 +81,101 @@ This pre-upgrade hook runs `alembic upgrade head` before the gateway deployment **How — Kubernetes RBAC:** -Developer kubeconfig contexts for the production cluster (`ml-serving-new`) must not carry `patch`/`update` on `Deployment` resources. Only the CI service account retains those verbs. When a developer runs `kubectl set image`, the API server returns 403 immediately — no pods are updated, no rollback is needed. Developers retain read access and `pods/exec`, so debugging and log inspection are unaffected. +Developer access to `ml-serving-new` is granted via the `ml_infra_admin` EKS access entry in Terracode-ML, which binds the `AWSReservedSSO_MLInfraAdmin_*` IAM role to `AmazonEKSClusterAdminPolicy` — effectively cluster-admin. That is why `kubectl set image` currently succeeds from a developer context. 
The fix requires replacing the blanket cluster-admin grant with a scoped role that excludes `patch`/`update` on `apps/deployments`. -**Finding the right file to change:** - -There is no existing developer `Role`/`ClusterRole` for the model-engine namespace in Terracode-ML — no such manifest was found in `scaleapi-ml-serving/`. Developer access to `ml-serving-new` is likely granted via AWS EKS access entries or the `aws-auth` ConfigMap, which may give developers a cluster-level role (e.g. `system:masters` or a custom group). Before making this change: +**Files to change:** -1. Run `kubectl describe clusterrolebinding -n default | grep -A5 ` on `ml-serving-new` to find which ClusterRole/Role your personal context binds to. -2. Locate that binding in Terracode-ML `scaleapi-ml-serving/` (or the relevant EKS access entry config). -3. Remove `patch`/`update` on `apps/deployments` from the developer role, or create a namespace-scoped `Role` that overrides the cluster-level grant for the model-engine namespace. +**1. `Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf` (lines 545–561)** -**Deploy path — already exists in model-engine-internal:** +Currently grants cluster-admin to ml-infra admins: +```hcl +resource "aws_eks_access_policy_association" "ml_infra_admin" { + policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy" + access_scope { type = "cluster" } +} +``` +Change: replace `AmazonEKSClusterAdminPolicy` with a namespace-scoped policy, or keep cluster-admin for infra but add the namespace-scoped Role below to override deployment mutations in `scale-deploy`. + +**2. `Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/workloads.tf`** + +Add a namespace-scoped `Role` + `RoleBinding` in the `scale-deploy` namespace that restricts deployment mutations to the CI service account only. 
In Kubernetes, RBAC is additive — you cannot deny via a Role — so the correct approach is to ensure developers are not bound to any role that grants `patch`/`update` on `apps/deployments` in `scale-deploy`. Concretely: + +```hcl +resource "kubernetes_role" "model_engine_deployer" { + metadata { + name = "model-engine-deployer" + namespace = "scale-deploy" + } + rule { + api_groups = ["apps"] + resources = ["deployments"] + verbs = ["get", "list", "watch"] # read-only; no patch/update + } + # ... all other developer-needed verbs on pods, logs, configmaps, etc. +} +``` -No new script is needed. `model-engine-internal/justfile` already provides the sanctioned deploy command: +Only the `ml-k8s-admin` service account (already cluster-admin, used by `helm upgrade`) retains deployment mutation rights. A developer running `kubectl set image` against `scale-deploy` gets a 403. -```bash -# From model-engine-internal/ -just deploy prod # helm upgrade with --atomic and --timeout 120m0s -just deploy launch # same path, targets ml-launch-new cluster (staging) -``` +**Deploy path — already exists in model-engine-internal:** -This runs: +No new script is needed. `model-engine-internal/justfile` already provides: ```bash -helm upgrade model-engine model-engine \ - -f model-engine-internal/resources/values/values_prod.yaml \ - --set tag= \ - --atomic \ - --timeout 120m0s +just deploy prod # helm upgrade --atomic --timeout 120m0s, runs migration pre-hook first ``` -The Helm pre-upgrade hook runs the migration job before any pod rolls. `--atomic` rolls back automatically if it fails. 
- **Files to change:** -- **Terracode-ML `scaleapi-ml-serving/`:** investigate and restrict developer role — remove `patch`/`update` on `apps/deployments` for the production cluster/namespace (exact file TBD after running the kubectl inspect above) -- **model-engine-internal `justfile`:** already correct; no changes needed — `just deploy prod` is the right path +- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf`** (lines 545–561) — scope or replace the ml_infra_admin cluster-admin policy +- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/workloads.tf`** — add `model-engine-deployer` Role + RoleBinding for `scale-deploy` namespace, removing `patch`/`update` on `apps/deployments` +- **`model-engine-internal/justfile`** — already correct; `just deploy prod` is the right path **Owner:** model-engine on-call -**Effort:** ~1 day (mostly the RBAC investigation + review cycle) +**Effort:** ~2 days (Terraform RBAC changes + Atlantis plan/apply + validation) --- ### Rollout Strategy -**Existing environments in model-engine-internal:** +**Existing environments — all carry production traffic:** -| Env | Cluster | Values file | Purpose | -|---|---|---|---| -| `training` | `ml-training-new` | `values_training.yaml` | default dev/test target | -| `launch` | `ml-launch-new` | `values_launch.yaml` | separate launch cluster | -| `prod` | `ml-serving-new` | `values_prod.yaml` | production | -| `circleci` | minikube (ephemeral) | `values_circleci.yaml` | CI integration tests only | +| Env | Cluster | Purpose | +|---|---|---| +| `training` | `ml-training-new` | production training workloads | +| `launch` | `ml-launch-new` | production Launch API | +| `prod` | `ml-serving-new` | production serving | +| `circleci` | minikube (ephemeral) | CI integration tests only | -The `launch` environment (`ml-launch-new`) may already serve as a staging environment for production deploys — confirm whether it is actively used as production before treating it as staging. 
If it is available, no new cluster is needed. +There is no staging environment for model-engine. One needs to be created. -**Staging (using `launch` or `training`):** -- No RBAC restrictions — developers can use `kubectl set image`, `helm upgrade`, or any other mechanism freely. -- Use staging to validate: new image starts cleanly, migrations apply without error, smoke tests pass. -- **Effort:** if `launch` is already available and not serving production traffic, this is ~1 hour of process documentation. If a new cluster is needed, effort is significantly higher (new EKS cluster + Terraform + values file). +**Staging environment — effort estimate (~8–11 days total):** -**Testing a new image on staging before promoting to production:** -```bash -# From model-engine-internal/ — anything goes in staging: -just deploy launch # full helm path, migration pre-hook runs -kubectl set image deployment/model-engine-gateway gateway= # quick iteration -``` +Several staging infrastructure components already exist in Terracode-ML `scaleapi-ml-serving/global/`: + +| Component | Status | File | Effort | +|---|---|---|---| +| RDS Aurora PostgreSQL | **Already exists** (`ml-infra-staging`) | `global/db.tf` lines 85–164 | 0 days | +| ElastiCache Redis | **Already exists** (`staging-celery-redis-rg-1`) | `global/celery-elasticache.tf` lines 44–80 | 0 days | +| VPC / subnets | **Already exists** (reuse ml-serving VPC) | — | 0 days | +| Route53 DNS (`ml-staging-internal.scale.com`) | **Already exists** | `global/dns.tf` lines 35–49 | 0 days | +| EKS cluster (`ml-staging-new`) | **Does not exist** | new `clusters/ml-staging-new/` | ~4–5 days | +| IAM / IRSA roles | Partial (extend from prod) | `global/iam-irsa.tf` | ~1–2 days | +| K8s namespace + secrets | Does not exist | applied via Helm/kubectl post-cluster | ~0.5 day | +| `values_staging.yaml` + service/infra configs | Does not exist | `model-engine-internal/resources/values/` | ~1 day | +| `justfile` staging env wiring | Does not 
exist | `model-engine-internal/justfile` | ~0.5 day | + +The dominant cost is the new EKS cluster (node groups, GPU operators, Istio, autoscaler, IRSA wiring). The DB, Redis, VPC, and DNS are already provisioned. -**Production:** -All image updates go through the existing `just deploy prod` command. `kubectl set image` is blocked at the RBAC level — developer roles do not have `patch`/`update` on `Deployment` resources in the production cluster. +**Process once staging exists:** ``` -staging (launch/training): any method — iterate freely - ↓ - validate: migrations apply, smoke tests pass, no 500s - ↓ -production: just deploy prod (helm upgrade, migration-first, RBAC-enforced) +staging: any method — kubectl set image, just deploy staging, etc. + ↓ + validate: migrations apply cleanly, smoke tests pass, no 500s + ↓ +production: just deploy prod (helm upgrade, migration-first, RBAC-enforced) ``` -Any change that requires a DB schema update (new ORM column) must have its Alembic migration merged and applied to production before `just deploy prod` is run. +Any change requiring a DB schema update (new ORM column) must have its Alembic migration merged and applied to production before `just deploy prod` is run. 
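The migration-first rule pairs naturally with a fail-fast startup check: compare the columns the ORM declares against what the live database actually has, and crash the pod instead of 500ing on every request. A minimal sketch, assuming a hypothetical `missing_columns` helper and using in-memory SQLite as a stand-in for the real Postgres/Alembic setup (none of this is actual model-engine code):

```python
import sqlite3

# Columns the ORM layer expects, keyed by table (illustrative subset).
EXPECTED = {"endpoints": {"id", "name", "temporal_task_queue"}}

def missing_columns(conn):
    """Return {table: missing column names}. Non-empty means the deploy
    outran its migration — raise at startup rather than 500 per request."""
    missing = {}
    for table, want in EXPECTED.items():
        # Column name is field 1 of each PRAGMA table_info row.
        have = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        if want - have:
            missing[table] = want - have
    return missing

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE endpoints (id INTEGER, name TEXT)")
print(missing_columns(conn))   # migration not applied yet: gap reported
conn.execute("ALTER TABLE endpoints ADD COLUMN temporal_task_queue TEXT")
print(missing_columns(conn))   # schema and ORM agree: empty dict
```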
--- From 6e084d25347a15fe9310f5bbc4333d33abeee158 Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Mon, 4 May 2026 00:52:38 +0000 Subject: [PATCH 09/27] docs(post-mortem): write concrete RBAC diffs and fix staging estimate MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - eks.tf: show exact before/after — scope ml_infra_admin from type=cluster to type=namespace excluding scale-deploy (workload_namespaces_default minus scale-deploy) - workloads.tf: rename model_engine_deployer -> scale_deploy_developer; full Role content with deployment read-only + pod/exec/log/job rules + RoleBinding - Staging: clarify staging DB/Redis/DNS already exist (never wired up); add Option A (new namespace in ml-serving-new, ~3-4 days) vs Option B (new EKS cluster, ~8-11 days) Co-Authored-By: Claude Sonnet 4.6 --- .../post-mortem-mli-6574-model-engine-500s.md | 114 +++++++++++++----- 1 file changed, 85 insertions(+), 29 deletions(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index 33c2d35b..5880e489 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -85,37 +85,85 @@ Developer access to `ml-serving-new` is granted via the `ml_infra_admin` EKS acc **Files to change:** -**1. `Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf` (lines 545–561)** +**1. `Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf` (lines 552–561)** + +Kubernetes RBAC is purely additive — a namespace Role cannot deny what a cluster-admin grant allows. The only fix is to remove the cluster-admin grant in `scale-deploy`. 
Change the `ml_infra_admin` access scope from cluster-wide to all namespaces except `scale-deploy`: -Currently grants cluster-admin to ml-infra admins: ```hcl +# Before: resource "aws_eks_access_policy_association" "ml_infra_admin" { - policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy" - access_scope { type = "cluster" } + for_each = toset(data.aws_iam_roles.ml_infra_admin.arns) + cluster_name = local.cluster_name + policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy" + principal_arn = each.value + + access_scope { + type = "cluster" + } +} + +# After: +resource "aws_eks_access_policy_association" "ml_infra_admin" { + for_each = toset(data.aws_iam_roles.ml_infra_admin.arns) + cluster_name = local.cluster_name + policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy" + principal_arn = each.value + + access_scope { + type = "namespace" + namespaces = [for ns in local.workload_namespaces_default : ns if ns != "scale-deploy"] + # ["mlflow", "image-builder", "pyspark", "default", "gpu-operator"] + } } ``` -Change: replace `AmazonEKSClusterAdminPolicy` with a namespace-scoped policy, or keep cluster-admin for infra but add the namespace-scoped Role below to override deployment mutations in `scale-deploy`. + +With this change, ml_infra_admin has cluster-admin in all other namespaces but no implicit grant in `scale-deploy`. **2. `Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/workloads.tf`** -Add a namespace-scoped `Role` + `RoleBinding` in the `scale-deploy` namespace that restricts deployment mutations to the CI service account only. In Kubernetes, RBAC is additive — you cannot deny via a Role — so the correct approach is to ensure developers are not bound to any role that grants `patch`/`update` on `apps/deployments` in `scale-deploy`. 
Concretely: +Add a `scale-deploy`-scoped Role that gives developers what they need (read, exec, logs) without deployment mutations, and bind it to the ml_infra_admin group: ```hcl -resource "kubernetes_role" "model_engine_deployer" { +resource "kubernetes_role" "scale_deploy_developer" { metadata { - name = "model-engine-deployer" + name = "scale-deploy-developer" namespace = "scale-deploy" } rule { api_groups = ["apps"] - resources = ["deployments"] - verbs = ["get", "list", "watch"] # read-only; no patch/update + resources = ["deployments", "replicasets"] + verbs = ["get", "list", "watch"] # no patch/update — blocks kubectl set image + } + rule { + api_groups = [""] + resources = ["pods", "pods/log", "pods/exec", "configmaps", "services", "endpoints"] + verbs = ["get", "list", "watch", "create", "delete"] + } + rule { + api_groups = ["batch"] + resources = ["jobs"] + verbs = ["get", "list", "watch", "create", "delete"] + } +} + +resource "kubernetes_role_binding" "scale_deploy_developer" { + metadata { + name = "scale-deploy-developer-binding" + namespace = "scale-deploy" + } + role_ref { + api_group = "rbac.authorization.k8s.io" + kind = "Role" + name = kubernetes_role.scale_deploy_developer.metadata[0].name + } + subject { + kind = "Group" + name = "ml-infra-admin" # matches the k8s group from aws-auth ConfigMap for ml_infra_admin } - # ... all other developer-needed verbs on pods, logs, configmaps, etc. } ``` -Only the `ml-k8s-admin` service account (already cluster-admin, used by `helm upgrade`) retains deployment mutation rights. A developer running `kubectl set image` against `scale-deploy` gets a 403. +The `ml-k8s-admin` service account (already cluster-admin, used by `just deploy prod` via `helm upgrade`) is unaffected and retains full deployment mutation rights. 
**Deploy path — already exists in model-engine-internal:** @@ -125,8 +173,8 @@ just deploy prod # helm upgrade --atomic --timeout 120m0s, runs migration pre- ``` **Files to change:** -- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf`** (lines 545–561) — scope or replace the ml_infra_admin cluster-admin policy -- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/workloads.tf`** — add `model-engine-deployer` Role + RoleBinding for `scale-deploy` namespace, removing `patch`/`update` on `apps/deployments` +- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf`** (lines 552–561) — scope `ml_infra_admin` cluster-admin to all namespaces except `scale-deploy` +- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/workloads.tf`** — add `scale_deploy_developer` Role + RoleBinding for `scale-deploy` - **`model-engine-internal/justfile`** — already correct; `just deploy prod` is the right path **Owner:** model-engine on-call @@ -147,23 +195,31 @@ just deploy prod # helm upgrade --atomic --timeout 120m0s, runs migration pre- There is no staging environment for model-engine. One needs to be created. 
-**Staging environment — effort estimate (~8–11 days total):** +**Staging environment — effort estimate:** -Several staging infrastructure components already exist in Terracode-ML `scaleapi-ml-serving/global/`: +The staging backing infrastructure already exists in Terracode-ML `scaleapi-ml-serving/global/` — it was provisioned but never wired up to a model-engine deployment: -| Component | Status | File | Effort | -|---|---|---|---| -| RDS Aurora PostgreSQL | **Already exists** (`ml-infra-staging`) | `global/db.tf` lines 85–164 | 0 days | -| ElastiCache Redis | **Already exists** (`staging-celery-redis-rg-1`) | `global/celery-elasticache.tf` lines 44–80 | 0 days | -| VPC / subnets | **Already exists** (reuse ml-serving VPC) | — | 0 days | -| Route53 DNS (`ml-staging-internal.scale.com`) | **Already exists** | `global/dns.tf` lines 35–49 | 0 days | -| EKS cluster (`ml-staging-new`) | **Does not exist** | new `clusters/ml-staging-new/` | ~4–5 days | -| IAM / IRSA roles | Partial (extend from prod) | `global/iam-irsa.tf` | ~1–2 days | -| K8s namespace + secrets | Does not exist | applied via Helm/kubectl post-cluster | ~0.5 day | -| `values_staging.yaml` + service/infra configs | Does not exist | `model-engine-internal/resources/values/` | ~1 day | -| `justfile` staging env wiring | Does not exist | `model-engine-internal/justfile` | ~0.5 day | - -The dominant cost is the new EKS cluster (node groups, GPU operators, Istio, autoscaler, IRSA wiring). The DB, Redis, VPC, and DNS are already provisioned. 
+| Component | Status | File | +|---|---|---| +| RDS Aurora PostgreSQL (`ml-infra-staging`, secret `staging/ml_infra_pg`) | **Already exists** | `global/db.tf` lines 85–164 | +| ElastiCache Redis (`staging-celery-redis-rg-1`) | **Already exists** | `global/celery-elasticache.tf` lines 44–80 | +| VPC / subnets | **Already exists** (shared ml-serving VPC) | — | +| Route53 DNS (`ml-staging-internal.scale.com`) | **Already exists** | `global/dns.tf` lines 35–49 | +| EKS cluster | **Does not exist** | — | + +Two options for the missing K8s layer: + +**Option A — New namespace in ml-serving-new (~3–4 days):** Add a `scale-deploy-staging` namespace in the existing production cluster. DB and Redis already point at staging endpoints. Lower isolation (shares cluster with prod) but fastest path to a working staging deploy. +- `values_staging.yaml` pointing at `staging/ml_infra_pg` and staging Redis: ~1 day +- K8s namespace + IRSA service account + secrets: ~0.5 day +- `justfile` staging env wiring: ~0.5 day +- Validation / smoke tests: ~1 day + +**Option B — New EKS cluster (~8–11 days):** Full isolation. Dominant cost is the new EKS cluster definition (node groups, GPU operators, Istio, autoscaler IRSA). Everything else reuses existing staging infra. 
+- New `clusters/ml-staging-new/` Terraform: ~4–5 days +- IAM / IRSA: ~1–2 days +- Config files + justfile: ~2 days +- Validation: ~1–2 days **Process once staging exists:** From 5805dbb29cdf6faa09568614e546b2a70fbdca41 Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Mon, 4 May 2026 01:03:24 +0000 Subject: [PATCH 10/27] docs(post-mortem): fix deploy impact of RBAC change, rename role, trim sections - eks.tf: note that scoping away scale-deploy also blocks just deploy prod (uses dev context); production deploys must move to CI (ml_integration_test_lambda) first - Rename scale_deploy_developer -> scale_developer / scale-developer - Staging: drop Option B (new EKS cluster); only Option A (new namespace, ~3-4 days) - Remove Lessons Learned section Co-Authored-By: Claude Sonnet 4.6 --- .../post-mortem-mli-6574-model-engine-500s.md | 53 ++++++------------- 1 file changed, 16 insertions(+), 37 deletions(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index 5880e489..e735d8c9 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -81,7 +81,9 @@ This pre-upgrade hook runs `alembic upgrade head` before the gateway deployment **How — Kubernetes RBAC:** -Developer access to `ml-serving-new` is granted via the `ml_infra_admin` EKS access entry in Terracode-ML, which binds the `AWSReservedSSO_MLInfraAdmin_*` IAM role to `AmazonEKSClusterAdminPolicy` — effectively cluster-admin. That is why `kubectl set image` currently succeeds from a developer context. The fix requires replacing the blanket cluster-admin grant with a scoped role that excludes `patch`/`update` on `apps/deployments`. +Developer access to `ml-serving-new` is granted via the `ml_infra_admin` EKS access entry in Terracode-ML, which binds `AWSReservedSSO_MLInfraAdmin_*` to `AmazonEKSClusterAdminPolicy` — cluster-admin cluster-wide. 
That is why `kubectl set image` succeeds from a developer context today. + +**Important:** scoping away cluster-admin from `scale-deploy` also blocks `just deploy prod` run locally, because it uses the developer's personal kubectl context (`_set_k8s prod` switches to `ml-serving-new`). To make the RBAC change safe, **production deploys must move to CI**, where the CI runner authenticates as a service account that retains cluster-admin (`ml_integration_test_lambda`, already granted `AmazonEKSClusterAdminPolicy` in `eks.tf` lines 529–543). Developers trigger the CI deploy job rather than running `just deploy prod` locally. **Files to change:** @@ -117,16 +119,14 @@ resource "aws_eks_access_policy_association" "ml_infra_admin" { } ``` -With this change, ml_infra_admin has cluster-admin in all other namespaces but no implicit grant in `scale-deploy`. - **2. `Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/workloads.tf`** -Add a `scale-deploy`-scoped Role that gives developers what they need (read, exec, logs) without deployment mutations, and bind it to the ml_infra_admin group: +Add a `scale-deploy`-scoped Role that gives developers read/exec/log access without deployment mutations: ```hcl -resource "kubernetes_role" "scale_deploy_developer" { +resource "kubernetes_role" "scale_developer" { metadata { - name = "scale-deploy-developer" + name = "scale-developer" namespace = "scale-deploy" } rule { @@ -146,39 +146,34 @@ resource "kubernetes_role" "scale_deploy_developer" { } } -resource "kubernetes_role_binding" "scale_deploy_developer" { +resource "kubernetes_role_binding" "scale_developer" { metadata { - name = "scale-deploy-developer-binding" + name = "scale-developer-binding" namespace = "scale-deploy" } role_ref { api_group = "rbac.authorization.k8s.io" kind = "Role" - name = kubernetes_role.scale_deploy_developer.metadata[0].name + name = kubernetes_role.scale_developer.metadata[0].name } subject { kind = "Group" - name = "ml-infra-admin" # matches the 
k8s group from aws-auth ConfigMap for ml_infra_admin + name = "ml-infra-admin" } } ``` -The `ml-k8s-admin` service account (already cluster-admin, used by `just deploy prod` via `helm upgrade`) is unaffected and retains full deployment mutation rights. +**3. `model-engine-internal` — move production deploy to CI** -**Deploy path — already exists in model-engine-internal:** - -No new script is needed. `model-engine-internal/justfile` already provides: -```bash -just deploy prod # helm upgrade --atomic --timeout 120m0s, runs migration pre-hook first -``` +Add a CircleCI deploy job (or manual workflow trigger) that runs `just deploy prod` authenticated as `ml_integration_test_lambda` (already has cluster-admin on `ml-serving-new`). Developers merge to main or manually trigger the job rather than running `just deploy prod` locally. **Files to change:** - **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf`** (lines 552–561) — scope `ml_infra_admin` cluster-admin to all namespaces except `scale-deploy` -- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/workloads.tf`** — add `scale_deploy_developer` Role + RoleBinding for `scale-deploy` -- **`model-engine-internal/justfile`** — already correct; `just deploy prod` is the right path +- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/workloads.tf`** — add `scale_developer` Role + RoleBinding for `scale-deploy` +- **`model-engine-internal/.circleci/config.yml`** (or equivalent) — add CI deploy job authenticated as `ml_integration_test_lambda` **Owner:** model-engine on-call -**Effort:** ~2 days (Terraform RBAC changes + Atlantis plan/apply + validation) +**Effort:** ~3 days (RBAC Terraform + CI deploy job + Atlantis plan/apply + validation) --- @@ -207,20 +202,12 @@ The staging backing infrastructure already exists in Terracode-ML `scaleapi-ml-s | Route53 DNS (`ml-staging-internal.scale.com`) | **Already exists** | `global/dns.tf` lines 35–49 | | EKS cluster | **Does not exist** | — | 
-Two options for the missing K8s layer: - -**Option A — New namespace in ml-serving-new (~3–4 days):** Add a `scale-deploy-staging` namespace in the existing production cluster. DB and Redis already point at staging endpoints. Lower isolation (shares cluster with prod) but fastest path to a working staging deploy. +Add a `scale-deploy-staging` namespace in ml-serving-new. DB, Redis, VPC, and DNS already point at staging endpoints — the namespace is the only missing K8s layer. Estimated ~3–4 days: - `values_staging.yaml` pointing at `staging/ml_infra_pg` and staging Redis: ~1 day - K8s namespace + IRSA service account + secrets: ~0.5 day - `justfile` staging env wiring: ~0.5 day - Validation / smoke tests: ~1 day -**Option B — New EKS cluster (~8–11 days):** Full isolation. Dominant cost is the new EKS cluster definition (node groups, GPU operators, Istio, autoscaler IRSA). Everything else reuses existing staging infra. -- New `clusters/ml-staging-new/` Terraform: ~4–5 days -- IAM / IRSA: ~1–2 days -- Config files + justfile: ~2 days -- Validation: ~1–2 days - **Process once staging exists:** ``` @@ -233,11 +220,3 @@ production: just deploy prod (helm upgrade, migration-first, RBAC-enforced) Any change requiring a DB schema update (new ORM column) must have its Alembic migration merged and applied to production before `just deploy prod` is run. ---- - -## Lessons Learned - -1. **Migration hooks only protect you if you use Helm.** The project had the right guardrail (pre-upgrade hook at weight -1); it was bypassed by using `kubectl set image` directly. The operational pattern matters as much as the technical control. -2. **Fail loudly at startup, not silently on every request.** A startup check that raises an exception is strictly better than an app that boots clean and then 500s on every call. The readiness probe would have contained the blast radius to a failed rollout. -3. 
**Every ORM column needs a migration.** The ORM model is a lie until the migration runs; treat them as inseparable — the migration file is part of the same diff as the column declaration. -4. **After two failed patches, go backward.** The cost of 5 failed patches was 34 extra minutes of outage. When the root cause is unclear, rollback is always faster than forward. From 261f6abfdded5a32719ef953dbdcc6064db4bcac Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Mon, 4 May 2026 01:13:50 +0000 Subject: [PATCH 11/27] =?UTF-8?q?docs(post-mortem):=20two=20preventative?= =?UTF-8?q?=20controls=20=E2=80=94=20RBAC=20+=20master=20branch=20guard?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Control a: eks.tf scopes ml_infra_admin away from scale-deploy (blocks kubectl set image); just deploy prod compensates by minting a short-lived ml-k8s-admin SA token for helm; scale_developer role gains serviceaccounts/token create so devs can mint it - Control b: justfile _check_master_branch guard — prod deploy aborts if current commit is not merged to origin/master; unmerged local branches cannot reach production - Remove CI deploy requirement; developers still run just deploy prod locally Co-Authored-By: Claude Sonnet 4.6 --- .../post-mortem-mli-6574-model-engine-500s.md | 81 ++++++++++++------- 1 file changed, 54 insertions(+), 27 deletions(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index e735d8c9..6fca5565 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -79,49 +79,39 @@ This pre-upgrade hook runs `alembic upgrade head` before the gateway deployment **What:** `kubectl set image` updates pods directly, bypassing all Helm pre-upgrade hooks including the migration job. 
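The bypass can be pictured as two call sequences (a toy ordering sketch only, assuming the hook arrangement described in this document: migration job at `hook-weight: -1`, then the rollout):

```python
# Toy sequence of the two deploy paths — illustrates ordering, nothing more.
def helm_upgrade(log):
    log.append("pre-upgrade hook: alembic upgrade head")  # runs first (hook-weight -1)
    log.append("roll Deployment to new image")            # only after the hook succeeds

def kubectl_set_image(log):
    log.append("roll Deployment to new image")            # no hook machinery at all

helm_log, kubectl_log = [], []
helm_upgrade(helm_log)
kubectl_set_image(kubectl_log)
assert helm_log[0].startswith("pre-upgrade hook")     # migration ran before the roll
assert "alembic" not in " ".join(kubectl_log)         # migration never ran
print("hook-ordering sketch holds")
```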
The fix must be preventative: a `kubectl set image` attempt against the production deployment is rejected by the API server before it takes effect. -**How — Kubernetes RBAC:** +Two preventative controls: -Developer access to `ml-serving-new` is granted via the `ml_infra_admin` EKS access entry in Terracode-ML, which binds `AWSReservedSSO_MLInfraAdmin_*` to `AmazonEKSClusterAdminPolicy` — cluster-admin cluster-wide. That is why `kubectl set image` succeeds from a developer context today. +**a. RBAC — block `kubectl set image` at the API server** -**Important:** scoping away cluster-admin from `scale-deploy` also blocks `just deploy prod` run locally, because it uses the developer's personal kubectl context (`_set_k8s prod` switches to `ml-serving-new`). To make the RBAC change safe, **production deploys must move to CI**, where the CI runner authenticates as a service account that retains cluster-admin (`ml_integration_test_lambda`, already granted `AmazonEKSClusterAdminPolicy` in `eks.tf` lines 529–543). Developers trigger the CI deploy job rather than running `just deploy prod` locally. +Developer access to `ml-serving-new` is via the `ml_infra_admin` EKS access entry, which grants cluster-admin cluster-wide. That is why `kubectl set image` succeeds today. Scoping it away from `scale-deploy` rejects `kubectl set image` before any pod is updated. + +Scoping away cluster-admin from `scale-deploy` would also break `just deploy prod` run locally (it uses the developer's personal kubectl context). The fix: `just deploy prod` mints a short-lived `ml-k8s-admin` SA token for the helm step. That SA already has cluster-admin (via `kubernetes_cluster_role_binding.ml_k8s_admin` in `workloads.tf`), so helm upgrade continues to work. Developers retain read/exec access via the `scale_developer` role below. + +**b. justfile guard — block deploying unmerged code** + +`just deploy prod` checks that the current commit is merged to `origin/master` before proceeding. 
Deploying a local branch to production is rejected before any image is built or pushed. **Files to change:** **1. `Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf` (lines 552–561)** -Kubernetes RBAC is purely additive — a namespace Role cannot deny what a cluster-admin grant allows. The only fix is to remove the cluster-admin grant in `scale-deploy`. Change the `ml_infra_admin` access scope from cluster-wide to all namespaces except `scale-deploy`: - ```hcl # Before: -resource "aws_eks_access_policy_association" "ml_infra_admin" { - for_each = toset(data.aws_iam_roles.ml_infra_admin.arns) - cluster_name = local.cluster_name - policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy" - principal_arn = each.value - access_scope { type = "cluster" } -} # After: -resource "aws_eks_access_policy_association" "ml_infra_admin" { - for_each = toset(data.aws_iam_roles.ml_infra_admin.arns) - cluster_name = local.cluster_name - policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy" - principal_arn = each.value - access_scope { type = "namespace" namespaces = [for ns in local.workload_namespaces_default : ns if ns != "scale-deploy"] # ["mlflow", "image-builder", "pyspark", "default", "gpu-operator"] } -} ``` **2. 
`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/workloads.tf`** -Add a `scale-deploy`-scoped Role that gives developers read/exec/log access without deployment mutations: +Add a `scale-deploy`-scoped Role (read/exec/log, no deployment mutations) and grant `create` on `serviceaccounts/token` so developers can mint the SA token needed by `just deploy prod`: ```hcl resource "kubernetes_role" "scale_developer" { @@ -144,6 +134,11 @@ resource "kubernetes_role" "scale_developer" { resources = ["jobs"] verbs = ["get", "list", "watch", "create", "delete"] } + rule { + api_groups = [""] + resources = ["serviceaccounts/token"] + verbs = ["create"] # allows: kubectl create token ml-k8s-admin -n scale-deploy + } } resource "kubernetes_role_binding" "scale_developer" { @@ -163,17 +158,49 @@ resource "kubernetes_role_binding" "scale_developer" { } ``` -**3. `model-engine-internal` — move production deploy to CI** - -Add a CircleCI deploy job (or manual workflow trigger) that runs `just deploy prod` authenticated as `ml_integration_test_lambda` (already has cluster-admin on `ml-serving-new`). Developers merge to main or manually trigger the job rather than running `just deploy prod` locally. +**3. `model-engine-internal/justfile`** + +Add a master branch guard and have the prod deploy step mint a short-lived SA token: + +```just +# New recipe — fails if current commit is not merged to origin/master +_check_master_branch env: + #!/bin/bash + if [ "{{ env }}" = "prod" ]; then + git fetch origin master --quiet + if ! 
git merge-base --is-ancestor HEAD origin/master; then + echo "ERROR: current commit is not merged to origin/master; prod deploy aborted" + exit 1 + fi + fi + +# Updated _deploy: mint ml-k8s-admin token for prod helm upgrade +_deploy env use_service_identifier: build-push-model-engine (_check_master_branch env) (_set_k8s env) (_mount_aws env) + #!/bin/bash + set -euo pipefail + SA_TOKEN="" + if [ "{{ env }}" = "prod" ]; then + SA_TOKEN=$(kubectl create token ml-k8s-admin -n scale-deploy --duration=30m) + fi + pushd ../llm-engine/charts + helm {{ if use_service_identifier == 'true' { 'install model-engine-' + SERVICE_IDENTIFIER } else { 'upgrade model-engine' } }} \ + model-engine \ + -f ../../model-engine-internal/model-engine-internal/resources/values/values_{{ env }}.yaml \ + --description "{{ GIT_MESSAGE }}" \ + {{ if use_service_identifier == 'true' { '--set serviceIdentifier=' + SERVICE_IDENTIFIER } else { '' } }} \ + {{ if env == 'prod' { '--timeout 120m0s' } else { '' } }} \ + --set tag={{ GIT_SHA }} \ + ${SA_TOKEN:+--kube-token="$SA_TOKEN"} \ + --atomic +``` **Files to change:** -- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf`** (lines 552–561) — scope `ml_infra_admin` cluster-admin to all namespaces except `scale-deploy` -- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/workloads.tf`** — add `scale_developer` Role + RoleBinding for `scale-deploy` -- **`model-engine-internal/.circleci/config.yml`** (or equivalent) — add CI deploy job authenticated as `ml_integration_test_lambda` +- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf`** (lines 552–561) +- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/workloads.tf`** +- **`model-engine-internal/justfile`** **Owner:** model-engine on-call -**Effort:** ~3 days (RBAC Terraform + CI deploy job + Atlantis plan/apply + validation) +**Effort:** ~3 days (Terraform RBAC + Atlantis apply + justfile changes + validation) --- From 
18a5bf0d7922cd6c1c71d52214ce4b26d4fb5ca8 Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Mon, 4 May 2026 01:21:59 +0000 Subject: [PATCH 12/27] docs(post-mortem): explain SA token, add dedicated IAM role as elegant alternative MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Define ServiceAccount (SA) token inline where first used - Option A: mint short-lived ml-k8s-admin token at deploy time (no new AWS infra) - Option B: dedicated ml-serving-deployer IAM role — cleaner separation of concerns, no token minting, easy to audit; ~1 extra day of setup Co-Authored-By: Claude Sonnet 4.6 --- .../post-mortem-mli-6574-model-engine-500s.md | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index 6fca5565..9413401d 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -85,7 +85,15 @@ Two preventative controls: Developer access to `ml-serving-new` is via the `ml_infra_admin` EKS access entry, which grants cluster-admin cluster-wide. That is why `kubectl set image` succeeds today. Scoping it away from `scale-deploy` rejects `kubectl set image` before any pod is updated. -Scoping away cluster-admin from `scale-deploy` would also break `just deploy prod` run locally (it uses the developer's personal kubectl context). The fix: `just deploy prod` mints a short-lived `ml-k8s-admin` SA token for the helm step. That SA already has cluster-admin (via `kubernetes_cluster_role_binding.ml_k8s_admin` in `workloads.tf`), so helm upgrade continues to work. Developers retain read/exec access via the `scale_developer` role below. +Scoping away cluster-admin from `scale-deploy` would also break `just deploy prod` run locally, because it uses the developer's personal kubectl context for the helm step too. 
The fix is to have `just deploy prod` authenticate the helm upgrade using a separate identity that retains cluster-admin, rather than the developer's personal context. + +**What is a ServiceAccount (SA) token?** A Kubernetes ServiceAccount is an in-cluster identity (distinct from a human user). `ml-k8s-admin` is a ServiceAccount in `scale-deploy` that already has cluster-admin. An SA token is a short-lived credential that authenticates as that SA — helm can use it via `--kube-token`, bypassing the developer's personal RBAC entirely. + +**Option A (current proposal) — mint a short-lived SA token at deploy time:** +`just deploy prod` runs `kubectl create token ml-k8s-admin -n scale-deploy --duration=30m` to get a token, then passes it to `helm upgrade --kube-token`. Works without any new AWS infrastructure; requires adding `serviceaccounts/token create` to the `scale_developer` role so developers can mint it. + +**Option B (more elegant) — dedicated `ml-serving-deployer` IAM role:** +Create a new AWS SSO role `ml-serving-deployer` mapped to a k8s group that has deployment-mutation rights in `scale-deploy` only. Developers run `just deploy prod` with `AWS_PROFILE=ml-serving-deployer`, which updates kubeconfig to use this scoped identity. Cleanly separates "developer read access" (ml_infra_admin) from "deploy access" (ml-serving-deployer) at the IAM level — no token minting, no extra role permissions, easy to audit. Requires adding a new IAM role + EKS access entry + k8s RoleBinding in Terracode-ML (~1 extra day of setup). **b. justfile guard — block deploying unmerged code** @@ -111,7 +119,7 @@ Scoping away cluster-admin from `scale-deploy` would also break `just deploy pro **2. 
`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/workloads.tf`** -Add a `scale-deploy`-scoped Role (read/exec/log, no deployment mutations) and grant `create` on `serviceaccounts/token` so developers can mint the SA token needed by `just deploy prod`: +Add a `scale-deploy`-scoped Role (read/exec/log, no deployment mutations). The `serviceaccounts/token create` rule is only needed for Option A (SA token minting); remove it if going with Option B: ```hcl resource "kubernetes_role" "scale_developer" { From a3e83380635cd9691eb2107738f39d26368fef52 Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Mon, 4 May 2026 01:31:13 +0000 Subject: [PATCH 13/27] docs(post-mortem): go with dedicated ml-serving-deployer IAM role (Option B only) - ml_infra_admin: scoped away from scale-deploy; gets scale_developer role (read/exec/log only, cannot kubectl set image or helm upgrade) - New ml-serving-deployer IAM role: scoped to scale-deploy with cluster-admin; only this role can run just deploy prod - justfile: _check_master_branch guard + _mount_aws uses ml-serving-deployer for prod - Files: eks.tf (scope + new access entry), workloads.tf (two roles), iam.tf (new SSO role data source), justfile Co-Authored-By: Claude Sonnet 4.6 --- .../post-mortem-mli-6574-model-engine-500s.md | 129 ++++++++++-------- 1 file changed, 71 insertions(+), 58 deletions(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index 9413401d..4651a0bd 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -81,19 +81,14 @@ This pre-upgrade hook runs `alembic upgrade head` before the gateway deployment Two preventative controls: -**a. RBAC — block `kubectl set image` at the API server** +**a. 
RBAC — block `kubectl set image`, restrict deploy to a dedicated role** -Developer access to `ml-serving-new` is via the `ml_infra_admin` EKS access entry, which grants cluster-admin cluster-wide. That is why `kubectl set image` succeeds today. Scoping it away from `scale-deploy` rejects `kubectl set image` before any pod is updated. +Today `ml_infra_admin` has cluster-admin cluster-wide, which is why `kubectl set image` succeeds. The fix splits access into two roles: -Scoping away cluster-admin from `scale-deploy` would also break `just deploy prod` run locally, because it uses the developer's personal kubectl context for the helm step too. The fix is to have `just deploy prod` authenticate the helm upgrade using a separate identity that retains cluster-admin, rather than the developer's personal context. +- **`ml_infra_admin`** (developers): scoped away from `scale-deploy`; gets read/exec/log only via the `scale_developer` Role. Cannot deploy. +- **`ml-serving-deployer`** (new, deploy-only IAM role): scoped to `scale-deploy` with deployment-mutation rights. Only this role can run `just deploy prod`. -**What is a ServiceAccount (SA) token?** A Kubernetes ServiceAccount is an in-cluster identity (distinct from a human user). `ml-k8s-admin` is a ServiceAccount in `scale-deploy` that already has cluster-admin. An SA token is a short-lived credential that authenticates as that SA — helm can use it via `--kube-token`, bypassing the developer's personal RBAC entirely. - -**Option A (current proposal) — mint a short-lived SA token at deploy time:** -`just deploy prod` runs `kubectl create token ml-k8s-admin -n scale-deploy --duration=30m` to get a token, then passes it to `helm upgrade --kube-token`. Works without any new AWS infrastructure; requires adding `serviceaccounts/token create` to the `scale_developer` role so developers can mint it. 
- -**Option B (more elegant) — dedicated `ml-serving-deployer` IAM role:** -Create a new AWS SSO role `ml-serving-deployer` mapped to a k8s group that has deployment-mutation rights in `scale-deploy` only. Developers run `just deploy prod` with `AWS_PROFILE=ml-serving-deployer`, which updates kubeconfig to use this scoped identity. Cleanly separates "developer read access" (ml_infra_admin) from "deploy access" (ml-serving-deployer) at the IAM level — no token minting, no extra role permissions, easy to audit. Requires adding a new IAM role + EKS access entry + k8s RoleBinding in Terracode-ML (~1 extra day of setup). +Separating "I can read/debug prod" from "I can deploy to prod" at the IAM level — no token minting, no workarounds, easy to audit who has deploy access. **b. justfile guard — block deploying unmerged code** @@ -101,36 +96,63 @@ Create a new AWS SSO role `ml-serving-deployer` mapped to a k8s group that has d **Files to change:** -**1. `Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf` (lines 552–561)** +**1. 
`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf`** + +Scope `ml_infra_admin` away from `scale-deploy` (lines 552–561), and add a new EKS access entry for `ml-serving-deployer`: ```hcl -# Before: - access_scope { - type = "cluster" - } +# Existing — scope ml_infra_admin away from scale-deploy: +resource "aws_eks_access_policy_association" "ml_infra_admin" { + for_each = toset(data.aws_iam_roles.ml_infra_admin.arns) + cluster_name = local.cluster_name + policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy" + principal_arn = each.value -# After: access_scope { type = "namespace" namespaces = [for ns in local.workload_namespaces_default : ns if ns != "scale-deploy"] # ["mlflow", "image-builder", "pyspark", "default", "gpu-operator"] } +} + +# New — deploy-only role for scale-deploy: +data "aws_iam_roles" "ml_serving_deployer" { + name_regex = "AWSReservedSSO_MLServingDeployer.*" + path_prefix = "/aws-reserved/sso.amazonaws.com/" +} + +resource "aws_eks_access_entry" "ml_serving_deployer" { + for_each = toset(data.aws_iam_roles.ml_serving_deployer.arns) + cluster_name = local.cluster_name + principal_arn = each.value + type = "STANDARD" +} + +resource "aws_eks_access_policy_association" "ml_serving_deployer" { + for_each = toset(data.aws_iam_roles.ml_serving_deployer.arns) + cluster_name = local.cluster_name + policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy" + principal_arn = each.value + + access_scope { + type = "namespace" + namespaces = ["scale-deploy"] + } +} ``` **2. `Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/workloads.tf`** -Add a `scale-deploy`-scoped Role (read/exec/log, no deployment mutations). 
The `serviceaccounts/token create` rule is only needed for Option A (SA token minting); remove it if going with Option B: +Add `scale_developer` (read/exec/log for `ml_infra_admin`, no deploy) and `scale_deployer` (deployment mutations for `ml-serving-deployer`): ```hcl +# Developers: read/exec/log only — cannot kubectl set image or helm upgrade resource "kubernetes_role" "scale_developer" { - metadata { - name = "scale-developer" - namespace = "scale-deploy" - } + metadata { name = "scale-developer"; namespace = "scale-deploy" } rule { api_groups = ["apps"] resources = ["deployments", "replicasets"] - verbs = ["get", "list", "watch"] # no patch/update — blocks kubectl set image + verbs = ["get", "list", "watch"] } rule { api_groups = [""] @@ -142,36 +164,36 @@ resource "kubernetes_role" "scale_developer" { resources = ["jobs"] verbs = ["get", "list", "watch", "create", "delete"] } - rule { - api_groups = [""] - resources = ["serviceaccounts/token"] - verbs = ["create"] # allows: kubectl create token ml-k8s-admin -n scale-deploy - } } resource "kubernetes_role_binding" "scale_developer" { - metadata { - name = "scale-developer-binding" - namespace = "scale-deploy" - } + metadata { name = "scale-developer-binding"; namespace = "scale-deploy" } role_ref { api_group = "rbac.authorization.k8s.io" kind = "Role" name = kubernetes_role.scale_developer.metadata[0].name } - subject { - kind = "Group" - name = "ml-infra-admin" + subject { kind = "Group"; name = "ml-infra-admin" } +} + +# Deployers: full deployment mutation rights — only ml-serving-deployer can deploy +resource "kubernetes_cluster_role_binding" "scale_deployer" { + metadata { name = "scale-deployer-cluster-admin" } + role_ref { + api_group = "rbac.authorization.k8s.io" + kind = "ClusterRole" + name = "cluster-admin" } + subject { kind = "Group"; name = "ml-serving-deployer" } } ``` **3. 
`model-engine-internal/justfile`** -Add a master branch guard and have the prod deploy step mint a short-lived SA token: +Add a master branch guard; update `_mount_aws` to use `ml-serving-deployer` profile for prod: ```just -# New recipe — fails if current commit is not merged to origin/master +# New — fails if current commit is not merged to origin/master _check_master_branch env: #!/bin/bash if [ "{{ env }}" = "prod" ]; then @@ -182,33 +204,24 @@ _check_master_branch env: fi fi -# Updated _deploy: mint ml-k8s-admin token for prod helm upgrade -_deploy env use_service_identifier: build-push-model-engine (_check_master_branch env) (_set_k8s env) (_mount_aws env) - #!/bin/bash - set -euo pipefail - SA_TOKEN="" - if [ "{{ env }}" = "prod" ]; then - SA_TOKEN=$(kubectl create token ml-k8s-admin -n scale-deploy --duration=30m) - fi - pushd ../llm-engine/charts - helm {{ if use_service_identifier == 'true' { 'install model-engine-' + SERVICE_IDENTIFIER } else { 'upgrade model-engine' } }} \ - model-engine \ - -f ../../model-engine-internal/model-engine-internal/resources/values/values_{{ env }}.yaml \ - --description "{{ GIT_MESSAGE }}" \ - {{ if use_service_identifier == 'true' { '--set serviceIdentifier=' + SERVICE_IDENTIFIER } else { '' } }} \ - {{ if env == 'prod' { '--timeout 120m0s' } else { '' } }} \ - --set tag={{ GIT_SHA }} \ - ${SA_TOKEN:+--kube-token="$SA_TOKEN"} \ - --atomic +# Updated _mount_aws: prod deploys use ml-serving-deployer profile +_mount_aws env: _aws-mountable_folder + {{ if env == 'training' { '...' } else { '' } }} + {{ if env == 'prod' { 'cd .. && python3 scripts_py3/scale_scripts/exe/generate_mountable_aws_credentials.py -p ml-serving-deployer -e ~/.aws-mountable/credentials' } else { '' } }} + {{ if env == 'launch' { '...' 
} else { '' } }} + +# Updated deploy: add master branch check +deploy env='training': (_check_master_branch env) (_deploy env 'false') (_emit_deploy_event "model-engine" env) ``` **Files to change:** -- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf`** (lines 552–561) -- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/workloads.tf`** -- **`model-engine-internal/justfile`** +- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf`** — scope `ml_infra_admin`, add `ml_serving_deployer` access entry +- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/workloads.tf`** — `scale_developer` + `scale_deployer` roles +- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/iam.tf`** — add `AWSReservedSSO_MLServingDeployer` IAM role data source and aws-auth mapping +- **`model-engine-internal/justfile`** — `_check_master_branch`, `_mount_aws` prod profile **Owner:** model-engine on-call -**Effort:** ~3 days (Terraform RBAC + Atlantis apply + justfile changes + validation) +**Effort:** ~3–4 days (new IAM SSO role + Terraform RBAC + Atlantis apply + justfile + validation) --- From ebdf6a11867409fe3d98e545e96e5c9041a1ddb3 Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Mon, 4 May 2026 01:46:01 +0000 Subject: [PATCH 14/27] docs(post-mortem): note cross-service impact of scale-deploy RBAC change scale-deploy is shared; ml_infra_admin scoping affects all teams deploying there. Add pre-flight check: enumerate live deployments and confirm with other teams before applying the RBAC change. 
Co-Authored-By: Claude Sonnet 4.6 --- docs/internal/post-mortem-mli-6574-model-engine-500s.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index 4651a0bd..4c98454a 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -90,6 +90,8 @@ Today `ml_infra_admin` has cluster-admin cluster-wide, which is why `kubectl set Separating "I can read/debug prod" from "I can deploy to prod" at the IAM level — no token minting, no workarounds, easy to audit who has deploy access. +**Cross-service impact:** `scale-deploy` is a shared namespace. This RBAC change affects every team that currently deploys to it using `ml_infra_admin` credentials — not just model-engine. Terraform-managed resources (KEDA, Istio configs, audio-redis secret) are deployed via Atlantis and are unaffected. But any other application services deployed manually to `scale-deploy` would also lose deploy access for `ml_infra_admin` users. Before applying: run `kubectl get deployments -n scale-deploy` on `ml-serving-new` to enumerate all services, and confirm with those teams that they are either (a) also migrating to `ml-serving-deployer`, or (b) getting a separate deployer role for their service. + **b. justfile guard — block deploying unmerged code** `just deploy prod` checks that the current commit is merged to `origin/master` before proceeding. Deploying a local branch to production is rejected before any image is built or pushed. 
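
The merge check described above can be sketched in plain bash. This is a sketch, not the actual `_check_master_branch` recipe: `check_merged` is a hypothetical helper name, and it assumes `origin/master` has already been fetched.

```shell
#!/usr/bin/env bash
# Sketch of the guard `just deploy prod` runs before building anything:
# refuse to proceed unless the current commit is already contained in
# the given base ref (e.g. origin/master). Illustrative only.
set -euo pipefail

check_merged() {
  local base_ref="$1"  # e.g. origin/master
  # Exits 0 iff HEAD is an ancestor of (i.e. already contained in) base_ref.
  if git merge-base --is-ancestor HEAD "$base_ref"; then
    echo "ok: HEAD is merged into $base_ref"
  else
    echo "refusing to deploy: HEAD is not merged into $base_ref" >&2
    return 1
  fi
}
```

`git merge-base --is-ancestor` exits 0 when the first commit is reachable from the second, so an unmerged local branch makes the deploy abort before any image is built or pushed.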
From 5902d3f8c47bf565d3763530bdb69d620dd48781 Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Mon, 4 May 2026 02:00:10 +0000 Subject: [PATCH 15/27] docs(post-mortem): identify concrete cross-service impact from kubectl inspection - Ran kubectl get deployments -n scale-deploy on ml-serving-new - auto-hillclimb-ui (genai team, manual deploy, ml_infra_admin) would break - k8s-startup-watcher is part of llm-engine helm chart, unaffected - launch-endpoint-id-* are model-engine celery-managed via ml-k8s-admin SA, unaffected - Add pre-requisite: coordinate with genai team before applying RBAC change Co-Authored-By: Claude Sonnet 4.6 --- docs/internal/post-mortem-mli-6574-model-engine-500s.md | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index 4c98454a..895ecba7 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -90,7 +90,14 @@ Today `ml_infra_admin` has cluster-admin cluster-wide, which is why `kubectl set Separating "I can read/debug prod" from "I can deploy to prod" at the IAM level — no token minting, no workarounds, easy to audit who has deploy access. -**Cross-service impact:** `scale-deploy` is a shared namespace. This RBAC change affects every team that currently deploys to it using `ml_infra_admin` credentials — not just model-engine. Terraform-managed resources (KEDA, Istio configs, audio-redis secret) are deployed via Atlantis and are unaffected. But any other application services deployed manually to `scale-deploy` would also lose deploy access for `ml_infra_admin` users. Before applying: run `kubectl get deployments -n scale-deploy` on `ml-serving-new` to enumerate all services, and confirm with those teams that they are either (a) also migrating to `ml-serving-deployer`, or (b) getting a separate deployer role for their service. 
+**Cross-service impact:** `scale-deploy` is a shared namespace. Running `kubectl get deployments -n scale-deploy` on `ml-serving-new` shows two non-model-engine deployments: + +- `k8s-startup-watcher` — part of the llm-engine Helm chart, managed by `just deploy prod`; unaffected. +- `auto-hillclimb-ui` — a Streamlit app owned by the **genai team** (`genai/genai/applications/auto_hillclimb/`), deployed manually with `ml_infra_admin` credentials, no Terraform backing. **This deploy would break.** + +The hundreds of `launch-endpoint-id-*` deployments are model-engine user endpoints created by the celery workers via the `ml-k8s-admin` service account — not by developer kubectl — so those are unaffected. + +Before applying the RBAC change: coordinate with the genai team. They either need to be granted `ml-serving-deployer` access, or their `auto-hillclimb-ui` deploy process needs to be moved to a separate namespace or a dedicated deployer role. **b. justfile guard — block deploying unmerged code** From 988bc12bf064ef7b953bf7a22fc41b50bc02776d Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Mon, 4 May 2026 02:08:19 +0000 Subject: [PATCH 16/27] =?UTF-8?q?docs(post-mortem):=20backward-compatible?= =?UTF-8?q?=20RBAC=20=E2=80=94=20additive=20only,=20ml=5Finfra=5Fadmin=20u?= =?UTF-8?q?ntouched?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Drop all changes to ml_infra_admin; genai team and others unaffected - New ml-serving-deployer IAM role (eks.tf + iam.tf): cluster-admin scoped to scale-deploy; only model-engine prod deploys use this role - kubectl set image prevention is now process-enforced (just deploy prod + master branch guard), not RBAC-enforced; acknowledge this tradeoff explicitly - justfile: _check_master_branch + _mount_aws uses ml-serving-deployer for prod Co-Authored-By: Claude Sonnet 4.6 --- .../post-mortem-mli-6574-model-engine-500s.md | 101 ++++-------------- 1 file changed, 20 insertions(+), 81 deletions(-) diff 
--git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index 895ecba7..f0d748fd 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -79,25 +79,13 @@ This pre-upgrade hook runs `alembic upgrade head` before the gateway deployment **What:** `kubectl set image` updates pods directly, bypassing all Helm pre-upgrade hooks including the migration job. The fix must be preventative: a `kubectl set image` attempt against the production deployment is rejected by the API server before it takes effect. -Two preventative controls: +Two preventative controls. All changes are **additive** — `ml_infra_admin` is left completely untouched, so no other teams are affected. -**a. RBAC — block `kubectl set image`, restrict deploy to a dedicated role** +**a. New `ml-serving-deployer` IAM role — dedicated deploy identity for model-engine** -Today `ml_infra_admin` has cluster-admin cluster-wide, which is why `kubectl set image` succeeds. The fix splits access into two roles: +Create a new AWS SSO role `ml-serving-deployer` with cluster-admin scoped to `scale-deploy`. Model-engine prod deploys use this role exclusively. `ml_infra_admin` remains unchanged with its existing cluster-admin grant — other teams (e.g. genai's `auto-hillclimb-ui`) continue to work without modification. -- **`ml_infra_admin`** (developers): scoped away from `scale-deploy`; gets read/exec/log only via the `scale_developer` Role. Cannot deploy. -- **`ml-serving-deployer`** (new, deploy-only IAM role): scoped to `scale-deploy` with deployment-mutation rights. Only this role can run `just deploy prod`. - -Separating "I can read/debug prod" from "I can deploy to prod" at the IAM level — no token minting, no workarounds, easy to audit who has deploy access. - -**Cross-service impact:** `scale-deploy` is a shared namespace. 
Running `kubectl get deployments -n scale-deploy` on `ml-serving-new` shows two non-model-engine deployments: - -- `k8s-startup-watcher` — part of the llm-engine Helm chart, managed by `just deploy prod`; unaffected. -- `auto-hillclimb-ui` — a Streamlit app owned by the **genai team** (`genai/genai/applications/auto_hillclimb/`), deployed manually with `ml_infra_admin` credentials, no Terraform backing. **This deploy would break.** - -The hundreds of `launch-endpoint-id-*` deployments are model-engine user endpoints created by the celery workers via the `ml-k8s-admin` service account — not by developer kubectl — so those are unaffected. - -Before applying the RBAC change: coordinate with the genai team. They either need to be granted `ml-serving-deployer` access, or their `auto-hillclimb-ui` deploy process needs to be moved to a separate namespace or a dedicated deployer role. +`kubectl set image` can still be run by anyone with `ml_infra_admin` access. The prevention is process-enforced: prod deploys must go through `just deploy prod` (which uses `ml-serving-deployer` and enforces the master branch guard below). Direct `kubectl set image` by a developer is a process violation, not an RBAC rejection. **b. justfile guard — block deploying unmerged code** @@ -107,24 +95,10 @@ Before applying the RBAC change: coordinate with the genai team. They either nee **1. 
`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf`** -Scope `ml_infra_admin` away from `scale-deploy` (lines 552–561), and add a new EKS access entry for `ml-serving-deployer`: +Add new EKS access entry for `ml-serving-deployer` (no changes to existing `ml_infra_admin` block): ```hcl -# Existing — scope ml_infra_admin away from scale-deploy: -resource "aws_eks_access_policy_association" "ml_infra_admin" { - for_each = toset(data.aws_iam_roles.ml_infra_admin.arns) - cluster_name = local.cluster_name - policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy" - principal_arn = each.value - - access_scope { - type = "namespace" - namespaces = [for ns in local.workload_namespaces_default : ns if ns != "scale-deploy"] - # ["mlflow", "image-builder", "pyspark", "default", "gpu-operator"] - } -} - -# New — deploy-only role for scale-deploy: +# New — model-engine deploy role, scoped to scale-deploy only: data "aws_iam_roles" "ml_serving_deployer" { name_regex = "AWSReservedSSO_MLServingDeployer.*" path_prefix = "/aws-reserved/sso.amazonaws.com/" @@ -150,56 +124,22 @@ resource "aws_eks_access_policy_association" "ml_serving_deployer" { } ``` -**2. `Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/workloads.tf`** +**2. 
`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/iam.tf`** -Add `scale_developer` (read/exec/log for `ml_infra_admin`, no deploy) and `scale_deployer` (deployment mutations for `ml-serving-deployer`): +Add `ml-serving-deployer` to the aws-auth ConfigMap so the IAM role maps to a k8s group: ```hcl -# Developers: read/exec/log only — cannot kubectl set image or helm upgrade -resource "kubernetes_role" "scale_developer" { - metadata { name = "scale-developer"; namespace = "scale-deploy" } - rule { - api_groups = ["apps"] - resources = ["deployments", "replicasets"] - verbs = ["get", "list", "watch"] - } - rule { - api_groups = [""] - resources = ["pods", "pods/log", "pods/exec", "configmaps", "services", "endpoints"] - verbs = ["get", "list", "watch", "create", "delete"] - } - rule { - api_groups = ["batch"] - resources = ["jobs"] - verbs = ["get", "list", "watch", "create", "delete"] - } -} - -resource "kubernetes_role_binding" "scale_developer" { - metadata { name = "scale-developer-binding"; namespace = "scale-deploy" } - role_ref { - api_group = "rbac.authorization.k8s.io" - kind = "Role" - name = kubernetes_role.scale_developer.metadata[0].name - } - subject { kind = "Group"; name = "ml-infra-admin" } -} - -# Deployers: full deployment mutation rights — only ml-serving-deployer can deploy -resource "kubernetes_cluster_role_binding" "scale_deployer" { - metadata { name = "scale-deployer-cluster-admin" } - role_ref { - api_group = "rbac.authorization.k8s.io" - kind = "ClusterRole" - name = "cluster-admin" - } - subject { kind = "Group"; name = "ml-serving-deployer" } +# Add to local.cluster_role_map: +{ + rolearn = one(data.aws_iam_roles.ml_serving_deployer.arns) + username = "ml-serving-deployer" + groups = ["ml-serving-deployer"] } ``` **3. 
`model-engine-internal/justfile`** -Add a master branch guard; update `_mount_aws` to use `ml-serving-deployer` profile for prod: +Add a master branch guard and switch the prod deploy to use `ml-serving-deployer`: ```just # New — fails if current commit is not merged to origin/master @@ -215,22 +155,21 @@ _check_master_branch env: # Updated _mount_aws: prod deploys use ml-serving-deployer profile _mount_aws env: _aws-mountable_folder - {{ if env == 'training' { '...' } else { '' } }} + {{ if env == 'training' { 'cd .. && python3 scripts_py3/scale_scripts/exe/generate_mountable_aws_credentials.py -p ml-admin -e ~/.aws-mountable/credentials' } else { '' } }} {{ if env == 'prod' { 'cd .. && python3 scripts_py3/scale_scripts/exe/generate_mountable_aws_credentials.py -p ml-serving-deployer -e ~/.aws-mountable/credentials' } else { '' } }} - {{ if env == 'launch' { '...' } else { '' } }} + {{ if env == 'launch' { 'cd .. && python3 scripts_py3/scale_scripts/exe/generate_mountable_aws_credentials.py -p ml-serving-admin -e ~/.aws-mountable/credentials' } else { '' } }} # Updated deploy: add master branch check deploy env='training': (_check_master_branch env) (_deploy env 'false') (_emit_deploy_event "model-engine" env) ``` **Files to change:** -- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf`** — scope `ml_infra_admin`, add `ml_serving_deployer` access entry -- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/workloads.tf`** — `scale_developer` + `scale_deployer` roles -- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/iam.tf`** — add `AWSReservedSSO_MLServingDeployer` IAM role data source and aws-auth mapping -- **`model-engine-internal/justfile`** — `_check_master_branch`, `_mount_aws` prod profile +- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf`** — add `ml_serving_deployer` access entry (no changes to `ml_infra_admin`) +- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/iam.tf`** — add 
`ml-serving-deployer` aws-auth mapping +- **`model-engine-internal/justfile`** — `_check_master_branch`, update `_mount_aws` prod profile **Owner:** model-engine on-call -**Effort:** ~3–4 days (new IAM SSO role + Terraform RBAC + Atlantis apply + justfile + validation) +**Effort:** ~2–3 days (new IAM SSO role + Terraform + Atlantis apply + justfile + validation) --- From e1960c718478acf3c4f724131014f2d116640f0c Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Mon, 4 May 2026 02:11:33 +0000 Subject: [PATCH 17/27] docs(post-mortem): add Option A vs B comparison table for deploy identity approach Option A (scope ml_infra_admin) vs Option B (new ml-serving-deployer role); table shows hard vs soft kubectl prevention, backward compat, and known breakage (auto-hillclimb-ui); Option B chosen with explicit tradeoff acknowledged. Co-Authored-By: Claude Sonnet 4.6 --- .../post-mortem-mli-6574-model-engine-500s.md | 20 +++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index f0d748fd..78ef15f3 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -79,13 +79,25 @@ This pre-upgrade hook runs `alembic upgrade head` before the gateway deployment **What:** `kubectl set image` updates pods directly, bypassing all Helm pre-upgrade hooks including the migration job. The fix must be preventative: a `kubectl set image` attempt against the production deployment is rejected by the API server before it takes effect. -Two preventative controls. All changes are **additive** — `ml_infra_admin` is left completely untouched, so no other teams are affected. +Two preventative controls: -**a. New `ml-serving-deployer` IAM role — dedicated deploy identity for model-engine** +**a. 
Dedicated deploy identity for model-engine (chosen approach)** -Create a new AWS SSO role `ml-serving-deployer` with cluster-admin scoped to `scale-deploy`. Model-engine prod deploys use this role exclusively. `ml_infra_admin` remains unchanged with its existing cluster-admin grant — other teams (e.g. genai's `auto-hillclimb-ui`) continue to work without modification. +Two options were considered for preventing unauthorized production deploys: -`kubectl set image` can still be run by anyone with `ml_infra_admin` access. The prevention is process-enforced: prod deploys must go through `just deploy prod` (which uses `ml-serving-deployer` and enforces the master branch guard below). Direct `kubectl set image` by a developer is a process violation, not an RBAC rejection. +| | Option A: Scope down `ml_infra_admin` | Option B: New `ml-serving-deployer` role ✓ | +|---|---|---| +| **How** | Remove `patch`/`update` on `apps/deployments` in `scale-deploy` from `ml_infra_admin` | Add a new AWS SSO role with cluster-admin scoped to `scale-deploy`; prod deploy uses this role | +| **`kubectl set image` prevention** | Hard — RBAC rejects it at the API server | Soft — process-enforced; `ml_infra_admin` users can still do it | +| **Backward compatible** | No — breaks any other team deploying to `scale-deploy` with `ml_infra_admin` | Yes — `ml_infra_admin` unchanged | +| **Known breakage** | `auto-hillclimb-ui` (genai team, `genai/justfile deploy-auto-hillclimb-ui`) uses `kubectl apply` in `scale-deploy` via `ml_infra_admin`; would immediately break | None | +| **Coordination required** | Yes — must migrate every team deploying to `scale-deploy` | No | + +**Option B is chosen** because `scale-deploy` is a shared namespace and `ml_infra_admin` is used by multiple teams. Scoping it down is a cross-team breaking change. The tradeoff is that `kubectl set image` is no longer RBAC-enforced — it relies on the process (`just deploy prod` + master branch guard) instead. + +**b. 
justfile guard — block deploying unmerged code** + +`just deploy prod` checks that the current commit is merged to `origin/master` before proceeding. Deploying a local branch to production is rejected before any image is built or pushed. **b. justfile guard — block deploying unmerged code** From 636d7142edb07955aa350b07ae67f98680e5b37d Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Mon, 4 May 2026 02:21:51 +0000 Subject: [PATCH 18/27] docs(post-mortem): reframe deploy identity choice as open design decision Co-Authored-By: Claude Sonnet 4.6 --- docs/internal/post-mortem-mli-6574-model-engine-500s.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index 78ef15f3..6aaae755 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -93,7 +93,7 @@ Two options were considered for preventing unauthorized production deploys: | **Known breakage** | `auto-hillclimb-ui` (genai team, `genai/justfile deploy-auto-hillclimb-ui`) uses `kubectl apply` in `scale-deploy` via `ml_infra_admin`; would immediately break | None | | **Coordination required** | Yes — must migrate every team deploying to `scale-deploy` | No | -**Option B is chosen** because `scale-deploy` is a shared namespace and `ml_infra_admin` is used by multiple teams. Scoping it down is a cross-team breaking change. The tradeoff is that `kubectl set image` is no longer RBAC-enforced — it relies on the process (`just deploy prod` + master branch guard) instead. +**Open question for discussion:** Option A gives a hard guarantee (RBAC rejects `kubectl set image` at the API server) but requires coordinating a breaking change across every team using `ml_infra_admin` in `scale-deploy`. 
Option B is backward compatible and unblocks model-engine immediately, but relies on process discipline — a developer can still bypass it with a direct `kubectl set image`. Which guarantee level is acceptable? **b. justfile guard — block deploying unmerged code** From 25b347bd32862e681b5031bcd279bec88f79ee47 Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Mon, 4 May 2026 02:24:17 +0000 Subject: [PATCH 19/27] docs(post-mortem): remove duplicate justfile guard section Co-Authored-By: Claude Sonnet 4.6 --- docs/internal/post-mortem-mli-6574-model-engine-500s.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index 6aaae755..f8aff7ff 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -99,10 +99,6 @@ Two options were considered for preventing unauthorized production deploys: `just deploy prod` checks that the current commit is merged to `origin/master` before proceeding. Deploying a local branch to production is rejected before any image is built or pushed. -**b. justfile guard — block deploying unmerged code** - -`just deploy prod` checks that the current commit is merged to `origin/master` before proceeding. Deploying a local branch to production is rejected before any image is built or pushed. - **Files to change:** **1. 
`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf`** From 26a02b0b67d4b26f4def7a14f51ac7d0bdad8351 Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Tue, 5 May 2026 01:15:30 +0000 Subject: [PATCH 20/27] docs(post-mortem): address Bill's review comments - Clarify Option B assumes ml-serving-deployer for model-engine only - Correct Known breakage row: auto-hillclimb-ui deploys to ml-training-new, not ml-serving-new; no confirmed breakage on serving cluster - Add P1 action item: tighten Envoy 5xx PagerDuty alert threshold/window Co-Authored-By: Claude Sonnet 4.6 --- .../post-mortem-mli-6574-model-engine-500s.md | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index f8aff7ff..b16a37ce 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -87,10 +87,10 @@ Two options were considered for preventing unauthorized production deploys: | | Option A: Scope down `ml_infra_admin` | Option B: New `ml-serving-deployer` role ✓ | |---|---|---| -| **How** | Remove `patch`/`update` on `apps/deployments` in `scale-deploy` from `ml_infra_admin` | Add a new AWS SSO role with cluster-admin scoped to `scale-deploy`; prod deploy uses this role | +| **How** | Remove `patch`/`update` on `apps/deployments` in `scale-deploy` from `ml_infra_admin` | Add a new AWS SSO role scoped to `scale-deploy` on `ml-serving-new`; model-engine's `just deploy prod` (and CircleCI) assumes this role — SMIG and other services use different namespaces and are unaffected | | **`kubectl set image` prevention** | Hard — RBAC rejects it at the API server | Soft — process-enforced; `ml_infra_admin` users can still do it | -| **Backward compatible** | No — breaks any other team deploying to `scale-deploy` with `ml_infra_admin` | Yes — `ml_infra_admin` unchanged | -| **Known breakage** 
| `auto-hillclimb-ui` (genai team, `genai/justfile deploy-auto-hillclimb-ui`) uses `kubectl apply` in `scale-deploy` via `ml_infra_admin`; would immediately break | None | +| **Backward compatible** | No — breaks any other team deploying to `scale-deploy` on `ml-serving-new` with `ml_infra_admin` | Yes — `ml_infra_admin` unchanged | +| **Known breakage** | None confirmed on `ml-serving-new` — `auto-hillclimb-ui` (genai team) deploys to `scale-deploy` on `ml-training-new`, not `ml-serving-new`; a full audit is recommended before applying | None | | **Coordination required** | Yes — must migrate every team deploying to `scale-deploy` | No | **Open question for discussion:** Option A gives a hard guarantee (RBAC rejects `kubectl set image` at the API server) but requires coordinating a breaking change across every team using `ml_infra_admin` in `scale-deploy`. Option B is backward compatible and unblocks model-engine immediately, but relies on process discipline — a developer can still bypass it with a direct `kubectl set image`. Which guarantee level is acceptable? @@ -224,3 +224,12 @@ production: just deploy prod (helm upgrade, migration-first, RBAC-enforced) Any change requiring a DB schema update (new ORM column) must have its Alembic migration merged and applied to production before `just deploy prod` is run. +--- + +### P1 — Tighten 5xx alerting + +**What:** PagerDuty did not fire until 00:37Z — 47 min into the incident. Tighten the Envoy 5xx alert by lowering the error ratio threshold or reducing the evaluation window so that an outage of this magnitude pages within minutes of onset. 
+ +**Owner:** model-engine on-call +**Effort:** ~0.5 days (Datadog alert config update) + From dbdb6492ceb044065ec7d68a2d8a70b913008bdb Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Tue, 5 May 2026 01:19:17 +0000 Subject: [PATCH 21/27] docs(post-mortem): switch to Option A; complete breakage audit MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Audit of all justfiles confirmed no other service deploys to scale-deploy on ml-serving-new via ml_infra_admin — Option A has no coordination burden and gives a hard RBAC guarantee vs Option B's soft process enforcement. Co-Authored-By: Claude Sonnet 4.6 --- .../internal/post-mortem-mli-6574-model-engine-500s.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index b16a37ce..d0a2315f 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -85,15 +85,15 @@ Two preventative controls: Two options were considered for preventing unauthorized production deploys: -| | Option A: Scope down `ml_infra_admin` | Option B: New `ml-serving-deployer` role ✓ | +| | Option A: Scope down `ml_infra_admin` ✓ | Option B: New `ml-serving-deployer` role | |---|---|---| | **How** | Remove `patch`/`update` on `apps/deployments` in `scale-deploy` from `ml_infra_admin` | Add a new AWS SSO role scoped to `scale-deploy` on `ml-serving-new`; model-engine's `just deploy prod` (and CircleCI) assumes this role — SMIG and other services use different namespaces and are unaffected | | **`kubectl set image` prevention** | Hard — RBAC rejects it at the API server | Soft — process-enforced; `ml_infra_admin` users can still do it | -| **Backward compatible** | No — breaks any other team deploying to `scale-deploy` on `ml-serving-new` with `ml_infra_admin` | Yes — `ml_infra_admin` unchanged | -| **Known 
breakage** | None confirmed on `ml-serving-new` — `auto-hillclimb-ui` (genai team) deploys to `scale-deploy` on `ml-training-new`, not `ml-serving-new`; a full audit is recommended before applying | None | -| **Coordination required** | Yes — must migrate every team deploying to `scale-deploy` | No | +| **Backward compatible** | Yes — audit confirmed no other service deploys to `scale-deploy` on `ml-serving-new` via `ml_infra_admin`; services using `ml-admin` on `ml-serving-new` (`dagster`, `scaletrain-ui`, `ml-orchestration-internal`, `research_evals`) all use the `dagster` namespace; genai services (`auto-hillclimb-ui` etc.) target `ml-training-new` | Yes — `ml_infra_admin` unchanged | +| **Known breakage** | None | None | +| **Coordination required** | No | No | -**Open question for discussion:** Option A gives a hard guarantee (RBAC rejects `kubectl set image` at the API server) but requires coordinating a breaking change across every team using `ml_infra_admin` in `scale-deploy`. Option B is backward compatible and unblocks model-engine immediately, but relies on process discipline — a developer can still bypass it with a direct `kubectl set image`. Which guarantee level is acceptable? +**Chosen approach: Option A.** It gives a hard RBAC guarantee (the API server rejects `kubectl set image` outright) and the audit shows no other service deploys to `scale-deploy` on `ml-serving-new` via `ml_infra_admin`, so there is no coordination burden. **b. justfile guard — block deploying unmerged code** From 46b0d2b70a25b8a4e4bfcd1290fda5109c861f0e Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Tue, 5 May 2026 01:22:18 +0000 Subject: [PATCH 22/27] docs(post-mortem): drop Option B, update files-to-change for Option A MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Remove comparison table and Option B content. Rewrite a. as a direct description of the chosen approach (scope down ml_infra_admin). 
Update files-to-change: eks.tf RBAC narrowing + justfile _check_master_branch only — no new IAM role, no iam.tf change, no profile switch. Co-Authored-By: Claude Sonnet 4.6 --- .../post-mortem-mli-6574-model-engine-500s.md | 73 ++----------------- 1 file changed, 7 insertions(+), 66 deletions(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index d0a2315f..d18601e2 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -81,19 +81,11 @@ This pre-upgrade hook runs `alembic upgrade head` before the gateway deployment Two preventative controls: -**a. Dedicated deploy identity for model-engine (chosen approach)** +**a. Scope down `ml_infra_admin` in `scale-deploy` (chosen approach)** -Two options were considered for preventing unauthorized production deploys: +Remove `patch`/`update` on `apps/deployments` in the `scale-deploy` namespace from `ml_infra_admin`. This gives a hard RBAC guarantee — the API server rejects `kubectl set image` outright, with no reliance on process discipline. 
-| | Option A: Scope down `ml_infra_admin` ✓ | Option B: New `ml-serving-deployer` role | -|---|---|---| -| **How** | Remove `patch`/`update` on `apps/deployments` in `scale-deploy` from `ml_infra_admin` | Add a new AWS SSO role scoped to `scale-deploy` on `ml-serving-new`; model-engine's `just deploy prod` (and CircleCI) assumes this role — SMIG and other services use different namespaces and are unaffected | -| **`kubectl set image` prevention** | Hard — RBAC rejects it at the API server | Soft — process-enforced; `ml_infra_admin` users can still do it | -| **Backward compatible** | Yes — audit confirmed no other service deploys to `scale-deploy` on `ml-serving-new` via `ml_infra_admin`; services using `ml-admin` on `ml-serving-new` (`dagster`, `scaletrain-ui`, `ml-orchestration-internal`, `research_evals`) all use the `dagster` namespace; genai services (`auto-hillclimb-ui` etc.) target `ml-training-new` | Yes — `ml_infra_admin` unchanged | -| **Known breakage** | None | None | -| **Coordination required** | No | No | - -**Chosen approach: Option A.** It gives a hard RBAC guarantee (the API server rejects `kubectl set image` outright) and the audit shows no other service deploys to `scale-deploy` on `ml-serving-new` via `ml_infra_admin`, so there is no coordination burden. +Audit confirmed no other service deploys to `scale-deploy` on `ml-serving-new` via `ml_infra_admin`: services using `ml-admin` on `ml-serving-new` (`dagster`, `scaletrain-ui`, `ml-orchestration-internal`, `research_evals`) all use the `dagster` namespace; genai services (`auto-hillclimb-ui` etc.) target `ml-training-new`. No coordination required. **b. justfile guard — block deploying unmerged code** @@ -103,51 +95,11 @@ Two options were considered for preventing unauthorized production deploys: **1. 
`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf`** -Add new EKS access entry for `ml-serving-deployer` (no changes to existing `ml_infra_admin` block): - -```hcl -# New — model-engine deploy role, scoped to scale-deploy only: -data "aws_iam_roles" "ml_serving_deployer" { - name_regex = "AWSReservedSSO_MLServingDeployer.*" - path_prefix = "/aws-reserved/sso.amazonaws.com/" -} - -resource "aws_eks_access_entry" "ml_serving_deployer" { - for_each = toset(data.aws_iam_roles.ml_serving_deployer.arns) - cluster_name = local.cluster_name - principal_arn = each.value - type = "STANDARD" -} - -resource "aws_eks_access_policy_association" "ml_serving_deployer" { - for_each = toset(data.aws_iam_roles.ml_serving_deployer.arns) - cluster_name = local.cluster_name - policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy" - principal_arn = each.value - - access_scope { - type = "namespace" - namespaces = ["scale-deploy"] - } -} -``` - -**2. `Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/iam.tf`** - -Add `ml-serving-deployer` to the aws-auth ConfigMap so the IAM role maps to a k8s group: +Narrow the `ml_infra_admin` access policy association so it no longer covers `scale-deploy`, or add a restrictive Kubernetes `Role` in `scale-deploy` that removes `patch`/`update` on `apps/deployments` for the `ml_infra_admin` group. -```hcl -# Add to local.cluster_role_map: -{ - rolearn = one(data.aws_iam_roles.ml_serving_deployer.arns) - username = "ml-serving-deployer" - groups = ["ml-serving-deployer"] -} -``` - -**3. `model-engine-internal/justfile`** +**2. 
`model-engine-internal/justfile`** -Add a master branch guard and switch the prod deploy to use `ml-serving-deployer`: +Add a master branch guard so `just deploy prod` is rejected if the current commit is not merged to `origin/master`: ```just # New — fails if current commit is not merged to origin/master @@ -161,23 +113,12 @@ _check_master_branch env: fi fi -# Updated _mount_aws: prod deploys use ml-serving-deployer profile -_mount_aws env: _aws-mountable_folder - {{ if env == 'training' { 'cd .. && python3 scripts_py3/scale_scripts/exe/generate_mountable_aws_credentials.py -p ml-admin -e ~/.aws-mountable/credentials' } else { '' } }} - {{ if env == 'prod' { 'cd .. && python3 scripts_py3/scale_scripts/exe/generate_mountable_aws_credentials.py -p ml-serving-deployer -e ~/.aws-mountable/credentials' } else { '' } }} - {{ if env == 'launch' { 'cd .. && python3 scripts_py3/scale_scripts/exe/generate_mountable_aws_credentials.py -p ml-serving-admin -e ~/.aws-mountable/credentials' } else { '' } }} - # Updated deploy: add master branch check deploy env='training': (_check_master_branch env) (_deploy env 'false') (_emit_deploy_event "model-engine" env) ``` -**Files to change:** -- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf`** — add `ml_serving_deployer` access entry (no changes to `ml_infra_admin`) -- **`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/iam.tf`** — add `ml-serving-deployer` aws-auth mapping -- **`model-engine-internal/justfile`** — `_check_master_branch`, update `_mount_aws` prod profile - **Owner:** model-engine on-call -**Effort:** ~2–3 days (new IAM SSO role + Terraform + Atlantis apply + justfile + validation) +**Effort:** ~1–2 days (Terraform RBAC change + Atlantis apply + justfile + validation) --- From d3cf507ef47c3405e1a363e15b39e31916104ea5 Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Tue, 5 May 2026 01:30:10 +0000 Subject: [PATCH 23/27] docs(post-mortem): add concrete code changes for eks.tf RBAC and alerting - 
eks.tf: change ml_infra_admin access scope from cluster-wide to namespace-scoped with all namespaces except scale-deploy (enumerated from kubectl namespace audit); note RBAC is additive so scoping is the only way to block kubectl set image - P1 alert: add prod_5m entry to launch_v1_route_monitor_params in launch.tf for a 5-min evaluation window alongside the existing 15m Co-Authored-By: Claude Sonnet 4.6 --- .../post-mortem-mli-6574-model-engine-500s.md | 58 ++++++++++++++++++- 1 file changed, 55 insertions(+), 3 deletions(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index d18601e2..e80e3641 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -95,7 +95,38 @@ Audit confirmed no other service deploys to `scale-deploy` on `ml-serving-new` v **1. `Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf`** -Narrow the `ml_infra_admin` access policy association so it no longer covers `scale-deploy`, or add a restrictive Kubernetes `Role` in `scale-deploy` that removes `patch`/`update` on `apps/deployments` for the `ml_infra_admin` group. +Change `ml_infra_admin`'s access scope from cluster-wide to namespace-scoped, listing every namespace except `scale-deploy`. 
Because Kubernetes RBAC is additive (no deny rules), scoping the access entry is the only way to prevent `kubectl set image` in `scale-deploy`: + +```hcl +resource "aws_eks_access_policy_association" "ml_infra_admin" { + for_each = toset(data.aws_iam_roles.ml_infra_admin.arns) + cluster_name = local.cluster_name + policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy" + principal_arn = each.value + + access_scope { + type = "namespace" + namespaces = [ + "dagster", + "default", + "gpu-operator", + "istio-ingress", + "istio-system", + "keda", + "kube-node-lease", + "kube-public", + "kube-system", + "kubecost", + "litellm-rebuild", + "local-path-storage", + "service", + "snowflake-postgres-connector", + "workload", + # scale-deploy intentionally omitted — ml_infra_admin must not mutate deployments here + ] + } +} +``` **2. `model-engine-internal/justfile`** @@ -169,8 +200,29 @@ Any change requiring a DB schema update (new ORM column) must have its Alembic m ### P1 — Tighten 5xx alerting -**What:** PagerDuty did not fire until 00:37Z — 47 min into the incident. Tighten the Envoy 5xx alert by lowering the error ratio threshold or reducing the evaluation window so that an outage of this magnitude pages within minutes of onset. +**What:** PagerDuty did not fire until 00:37Z — 47 min into the incident. The existing monitor (`datadog_monitor.model_engine_error_rate_alert` in `scaleapi-ml-datadog/launch.tf`) evaluates a 15-minute window at a 5% threshold. Add a second, faster-firing entry with a 5-minute window so a sustained 5xx spike pages within minutes of onset. 
+ +**File to change: `Terracode-ML/scaleapi-ml-datadog/launch.tf`** + +```hcl +# In locals: +launch_v1_route_monitor_params = { + prod_15m = { + severity = "sev0" + cluster_name = "ml-serving-new" + time_window = "15m" + critical_rate = 0.05 + } + # New — fires within ~5 min of a sustained 5xx spike + prod_5m = { + severity = "sev0" + cluster_name = "ml-serving-new" + time_window = "5m" + critical_rate = 0.05 + } +} +``` **Owner:** model-engine on-call -**Effort:** ~0.5 days (Datadog alert config update) +**Effort:** ~0.5 days (Datadog Terraform change + Atlantis apply) From 5226e9a9fcac362b57efe1410a1bda476dd375c1 Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Tue, 5 May 2026 01:40:10 +0000 Subject: [PATCH 24/27] =?UTF-8?q?docs(post-mortem):=20fix=20deploy=20ident?= =?UTF-8?q?ity=20=E2=80=94=20both=20new=20role=20and=20scope-down=20requir?= =?UTF-8?q?ed?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ml-serving-admin resolves to AWSReservedSSO_MLInfraAdmin (same as ml_infra_admin), so scoping ml_infra_admin out of scale-deploy without a replacement breaks just deploy prod. Correct implementation requires: 1. New ml-serving-deployer role scoped to scale-deploy 2. ml_infra_admin scoped down to exclude scale-deploy 3. justfile prod profile switched to ml-serving-deployer Co-Authored-By: Claude Sonnet 4.6 --- .../post-mortem-mli-6574-model-engine-500s.md | 64 ++++++++++++++++--- 1 file changed, 56 insertions(+), 8 deletions(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index e80e3641..02db674d 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -81,11 +81,14 @@ This pre-upgrade hook runs `alembic upgrade head` before the gateway deployment Two preventative controls: -**a. Scope down `ml_infra_admin` in `scale-deploy` (chosen approach)** +**a. 
Create a dedicated deploy role + scope down `ml_infra_admin`** -Remove `patch`/`update` on `apps/deployments` in the `scale-deploy` namespace from `ml_infra_admin`. This gives a hard RBAC guarantee — the API server rejects `kubectl set image` outright, with no reliance on process discipline. +`just deploy prod` uses the `ml-serving-admin` AWS SSO profile, which resolves to `AWSReservedSSO_MLInfraAdmin` — the same IAM role as `ml_infra_admin`. Scoping `ml_infra_admin` out of `scale-deploy` without a replacement would break deploys. The correct implementation requires both steps: -Audit confirmed no other service deploys to `scale-deploy` on `ml-serving-new` via `ml_infra_admin`: services using `ml-admin` on `ml-serving-new` (`dagster`, `scaletrain-ui`, `ml-orchestration-internal`, `research_evals`) all use the `dagster` namespace; genai services (`auto-hillclimb-ui` etc.) target `ml-training-new`. No coordination required. +1. **Create `ml-serving-deployer`** — a new AWS SSO role scoped to `scale-deploy` on `ml-serving-new`; `just deploy prod` switches to it. +2. **Scope down `ml_infra_admin`** — remove its access to `scale-deploy`. Because Kubernetes RBAC is additive (no deny rules), scoping the EKS access entry is the only way to enforce this at the API server. + +Audit confirmed no other service deploys to `scale-deploy` on `ml-serving-new` via `ml_infra_admin`: services using `ml-admin` on `ml-serving-new` (`dagster`, `scaletrain-ui`, `ml-orchestration-internal`, `research_evals`) all use the `dagster` namespace; genai services (`auto-hillclimb-ui` etc.) target `ml-training-new`. **b. justfile guard — block deploying unmerged code** @@ -95,9 +98,35 @@ Audit confirmed no other service deploys to `scale-deploy` on `ml-serving-new` v **1. `Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf`** -Change `ml_infra_admin`'s access scope from cluster-wide to namespace-scoped, listing every namespace except `scale-deploy`. 
Because Kubernetes RBAC is additive (no deny rules), scoping the access entry is the only way to prevent `kubectl set image` in `scale-deploy`: +Add the `ml-serving-deployer` access entry scoped to `scale-deploy`, and narrow `ml_infra_admin` to every namespace except `scale-deploy`: ```hcl +# New — dedicated deploy role for model-engine, scoped to scale-deploy only +data "aws_iam_roles" "ml_serving_deployer" { + name_regex = "AWSReservedSSO_MLServingDeployer.*" + path_prefix = "/aws-reserved/sso.amazonaws.com/" +} + +resource "aws_eks_access_entry" "ml_serving_deployer" { + for_each = toset(data.aws_iam_roles.ml_serving_deployer.arns) + cluster_name = local.cluster_name + principal_arn = each.value + type = "STANDARD" +} + +resource "aws_eks_access_policy_association" "ml_serving_deployer" { + for_each = toset(data.aws_iam_roles.ml_serving_deployer.arns) + cluster_name = local.cluster_name + policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy" + principal_arn = each.value + + access_scope { + type = "namespace" + namespaces = ["scale-deploy"] + } +} + +# Updated — remove scale-deploy from ml_infra_admin scope resource "aws_eks_access_policy_association" "ml_infra_admin" { for_each = toset(data.aws_iam_roles.ml_infra_admin.arns) cluster_name = local.cluster_name @@ -122,15 +151,28 @@ resource "aws_eks_access_policy_association" "ml_infra_admin" { "service", "snowflake-postgres-connector", "workload", - # scale-deploy intentionally omitted — ml_infra_admin must not mutate deployments here + # scale-deploy intentionally omitted — use ml-serving-deployer for model-engine deploys ] } } ``` -**2. `model-engine-internal/justfile`** +**2. 
`Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/iam.tf`** -Add a master branch guard so `just deploy prod` is rejected if the current commit is not merged to `origin/master`: +Add `ml-serving-deployer` to the aws-auth ConfigMap: + +```hcl +# Add to local.cluster_role_map: +{ + rolearn = one(data.aws_iam_roles.ml_serving_deployer.arns) + username = "ml-serving-deployer" + groups = ["ml-serving-deployer"] +} +``` + +**3. `model-engine-internal/justfile`** + +Switch the prod deploy profile to `ml-serving-deployer` and add a master branch guard: ```just # New — fails if current commit is not merged to origin/master @@ -144,12 +186,18 @@ _check_master_branch env: fi fi +# Updated _mount_aws: prod deploys use ml-serving-deployer +_mount_aws env: _aws-mountable_folder + {{ if env == 'training' { 'cd .. && python3 scripts_py3/scale_scripts/exe/generate_mountable_aws_credentials.py -p ml-admin -e ~/.aws-mountable/credentials' } else { '' } }} + {{ if env == 'prod' { 'cd .. && python3 scripts_py3/scale_scripts/exe/generate_mountable_aws_credentials.py -p ml-serving-deployer -e ~/.aws-mountable/credentials' } else { '' } }} + {{ if env == 'launch' { 'cd .. 
&& python3 scripts_py3/scale_scripts/exe/generate_mountable_aws_credentials.py -p ml-serving-admin -e ~/.aws-mountable/credentials' } else { '' } }} + # Updated deploy: add master branch check deploy env='training': (_check_master_branch env) (_deploy env 'false') (_emit_deploy_event "model-engine" env) ``` **Owner:** model-engine on-call -**Effort:** ~1–2 days (Terraform RBAC change + Atlantis apply + justfile + validation) +**Effort:** ~2–3 days (new IAM SSO role + Terraform + Atlantis apply + justfile + validation) --- From 84b5b4ceb442371954f9604c493bd29dae5e6baa Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Tue, 5 May 2026 01:45:10 +0000 Subject: [PATCH 25/27] docs(post-mortem): replace prod_5m with targeted high-error-rate monitor Swap the noisy prod_5m ratio tweak for a dedicated monitor: fires when >50% of all model-engine Istio requests are 5xx over 5 min. Uses broad istio.mesh.request.count.total (not the narrow Envoy trace query) so a complete outage like this one pages within 5 min. Co-Authored-By: Claude Sonnet 4.6 --- .../post-mortem-mli-6574-model-engine-500s.md | 39 +++++++++++-------- 1 file changed, 23 insertions(+), 16 deletions(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index 02db674d..be46c94e 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -246,28 +246,35 @@ Any change requiring a DB schema update (new ORM column) must have its Alembic m --- -### P1 — Tighten 5xx alerting +### P1 — Add high-severity 5xx outage alert -**What:** PagerDuty did not fire until 00:37Z — 47 min into the incident. The existing monitor (`datadog_monitor.model_engine_error_rate_alert` in `scaleapi-ml-datadog/launch.tf`) evaluates a 15-minute window at a 5% threshold. Add a second, faster-firing entry with a 5-minute window so a sustained 5xx spike pages within minutes of onset. 
+**What:** PagerDuty did not fire until 00:37Z — 47 min into the incident. The existing monitor (`datadog_monitor.model_engine_error_rate_alert`) uses a narrow Envoy trace query and a 5% threshold over 15 minutes — too slow to catch a sudden complete outage. Add a dedicated monitor using the broader Istio mesh metrics that fires when >50% of all model-engine requests are 5xx over a 5-minute window. This would have paged within 5 minutes of the incident.
 
 **File to change: `Terracode-ML/scaleapi-ml-datadog/launch.tf`**
 
 ```hcl
-# In locals:
-launch_v1_route_monitor_params = {
-  prod_15m = {
-    severity = "sev0"
-    cluster_name = "ml-serving-new"
-    time_window = "15m"
-    critical_rate = 0.05
-  }
-  # New — fires within ~5 min of a sustained 5xx spike
-  prod_5m = {
-    severity = "sev0"
-    cluster_name = "ml-serving-new"
-    time_window = "5m"
-    critical_rate = 0.05
+resource "datadog_monitor" "model_engine_high_error_rate_alert" {
+  name = "[model engine] Model Engine has >50% 5xx error rate on prod"
+  type = "query alert"
+  query = "sum(last_5m):sum:istio.mesh.request.count.total{cluster_name:ml-serving-new AND destination_app:model-engine AND response_code:5*}.as_count() / sum:istio.mesh.request.count.total{cluster_name:ml-serving-new AND destination_app:model-engine}.as_count() > 0.5"
+  message = <<-EOF
+Model Engine on prod is returning 5xx on >50% of all requests — likely a widespread outage.
+ +Runbook: https://scale.atlassian.net/wiki/spaces/EPD/pages/340230155/Launch+has+an+elevated+error+rate + +@slack-ml-alerts @pagerduty-PagerdutyRouting +EOF + + monitor_thresholds { + critical = 0.5 } + + notify_audit = false + require_full_window = false + notify_no_data = false + renotify_interval = 0 + include_tags = true + tags = ["env:prod", "service:model-engine", "severity:sev0", "team:ml"] } ``` From 5056a47d8a9448ef7dd8f48eaefeae128b4d6e5c Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Tue, 5 May 2026 01:54:48 +0000 Subject: [PATCH 26/27] docs(post-mortem): correct genai role claim in audit note MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ml-admin is MLInfraAdmin on the ML R&D account (692474966980), not the Serving account — different IAM role, no access to ml-serving-new. Co-Authored-By: Claude Sonnet 4.6 --- docs/internal/post-mortem-mli-6574-model-engine-500s.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index be46c94e..67f16a1b 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -88,7 +88,7 @@ Two preventative controls: 1. **Create `ml-serving-deployer`** — a new AWS SSO role scoped to `scale-deploy` on `ml-serving-new`; `just deploy prod` switches to it. 2. **Scope down `ml_infra_admin`** — remove its access to `scale-deploy`. Because Kubernetes RBAC is additive (no deny rules), scoping the EKS access entry is the only way to enforce this at the API server. -Audit confirmed no other service deploys to `scale-deploy` on `ml-serving-new` via `ml_infra_admin`: services using `ml-admin` on `ml-serving-new` (`dagster`, `scaletrain-ui`, `ml-orchestration-internal`, `research_evals`) all use the `dagster` namespace; genai services (`auto-hillclimb-ui` etc.) target `ml-training-new`. 
+Audit confirmed no other service deploys to `scale-deploy` on `ml-serving-new` via `ml_infra_admin`: services on `ml-serving-new` that use `ml_infra_admin` (`dagster`, `scaletrain-ui`, `ml-orchestration-internal`, `research_evals`) all deploy to the `dagster` namespace; genai services (`auto-hillclimb-ui` etc.) deploy to `ml-training-new`, which is a separate AWS account and has no access to `ml-serving-new`. **b. justfile guard — block deploying unmerged code** From 95bb78319334b8e3fcbd5d43969fede4328b20f4 Mon Sep 17 00:00:00 2001 From: lilyz-ai Date: Tue, 5 May 2026 01:55:23 +0000 Subject: [PATCH 27/27] docs(post-mortem): drop genai from audit note Co-Authored-By: Claude Sonnet 4.6 --- docs/internal/post-mortem-mli-6574-model-engine-500s.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/internal/post-mortem-mli-6574-model-engine-500s.md b/docs/internal/post-mortem-mli-6574-model-engine-500s.md index 67f16a1b..989e7820 100644 --- a/docs/internal/post-mortem-mli-6574-model-engine-500s.md +++ b/docs/internal/post-mortem-mli-6574-model-engine-500s.md @@ -88,7 +88,7 @@ Two preventative controls: 1. **Create `ml-serving-deployer`** — a new AWS SSO role scoped to `scale-deploy` on `ml-serving-new`; `just deploy prod` switches to it. 2. **Scope down `ml_infra_admin`** — remove its access to `scale-deploy`. Because Kubernetes RBAC is additive (no deny rules), scoping the EKS access entry is the only way to enforce this at the API server. -Audit confirmed no other service deploys to `scale-deploy` on `ml-serving-new` via `ml_infra_admin`: services on `ml-serving-new` that use `ml_infra_admin` (`dagster`, `scaletrain-ui`, `ml-orchestration-internal`, `research_evals`) all deploy to the `dagster` namespace; genai services (`auto-hillclimb-ui` etc.) deploy to `ml-training-new`, which is a separate AWS account and has no access to `ml-serving-new`. 
+Audit confirmed no other service deploys to `scale-deploy` on `ml-serving-new` via `ml_infra_admin`: all other services using `ml_infra_admin` on `ml-serving-new` (`dagster`, `scaletrain-ui`, `ml-orchestration-internal`, `research_evals`) deploy to the `dagster` namespace. **b. justfile guard — block deploying unmerged code**
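---

Editor's note: the diffs above elide the body of the `_check_master_branch` justfile recipe. For readers who want the underlying mechanism, the core of such a guard is a single git ancestry query. This is an illustrative standalone sketch only — the function name and messages are assumptions, not the actual justfile code:

```shell
# Illustrative sketch of the master-branch deploy guard (not the real recipe).
# A commit is considered deployable only if it is already an ancestor of
# origin/master, i.e. it has been merged and pushed.
check_master_branch() {
  ref="${1:-HEAD}"
  if git merge-base --is-ancestor "$ref" origin/master; then
    echo "ok: $ref is merged to origin/master"
  else
    echo "refusing to deploy: $ref is not merged to origin/master" >&2
    return 1
  fi
}
```

`git merge-base --is-ancestor A B` exits 0 only when commit A is reachable from B, so a commit that exists only on a local branch fails the check before any image is built or pushed.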