Changes from all commits (27 commits)
9a10495
chore: bump model-engine image tag to latest main (2e9d0078)
lilyz-ai Apr 21, 2026
4332cad
docs: add SEV1 post-mortem for model-engine 500s incident (MLI-6574)
lilyz-ai Apr 29, 2026
de0c354
docs: expand post-mortem with concrete action item plans (MLI-6574)
lilyz-ai Apr 29, 2026
413ea98
docs: add kubectl ban enforcement options and image testing workflow …
lilyz-ai Apr 29, 2026
ab7cee4
docs(post-mortem): clarify terms, fix action items, replace P1/P2 wit…
lilyz-ai May 3, 2026
97fa123
docs(post-mortem): streamline action items and expand rollout strategy
lilyz-ai May 4, 2026
cb1b3a3
docs(post-mortem): fix RBAC location, drop scripts/deploy.sh, expand …
lilyz-ai May 4, 2026
a6a7994
docs(post-mortem): pin actual RBAC files and add real staging effort …
lilyz-ai May 4, 2026
6e084d2
docs(post-mortem): write concrete RBAC diffs and fix staging estimate
lilyz-ai May 4, 2026
5805dbb
docs(post-mortem): fix deploy impact of RBAC change, rename role, tri…
lilyz-ai May 4, 2026
261f6ab
docs(post-mortem): two preventative controls — RBAC + master branch g…
lilyz-ai May 4, 2026
18a5bf0
docs(post-mortem): explain SA token, add dedicated IAM role as elegan…
lilyz-ai May 4, 2026
a3e8338
docs(post-mortem): go with dedicated ml-serving-deployer IAM role (Op…
lilyz-ai May 4, 2026
ebdf6a1
docs(post-mortem): note cross-service impact of scale-deploy RBAC change
lilyz-ai May 4, 2026
5902d3f
docs(post-mortem): identify concrete cross-service impact from kubect…
lilyz-ai May 4, 2026
988bc12
docs(post-mortem): backward-compatible RBAC — additive only, ml_infra…
lilyz-ai May 4, 2026
e1960c7
docs(post-mortem): add Option A vs B comparison table for deploy iden…
lilyz-ai May 4, 2026
636d714
docs(post-mortem): reframe deploy identity choice as open design deci…
lilyz-ai May 4, 2026
25b347b
docs(post-mortem): remove duplicate justfile guard section
lilyz-ai May 4, 2026
26a02b0
docs(post-mortem): address Bill's review comments
lilyz-ai May 5, 2026
dbdb649
docs(post-mortem): switch to Option A; complete breakage audit
lilyz-ai May 5, 2026
46b0d2b
docs(post-mortem): drop Option B, update files-to-change for Option A
lilyz-ai May 5, 2026
d3cf507
docs(post-mortem): add concrete code changes for eks.tf RBAC and aler…
lilyz-ai May 5, 2026
5226e9a
docs(post-mortem): fix deploy identity — both new role and scope-down…
lilyz-ai May 5, 2026
84b5b4c
docs(post-mortem): replace prod_5m with targeted high-error-rate monitor
lilyz-ai May 5, 2026
5056a47
docs(post-mortem): correct genai role claim in audit note
lilyz-ai May 5, 2026
95bb783
docs(post-mortem): drop genai from audit note
lilyz-ai May 5, 2026
2 changes: 1 addition & 1 deletion charts/model-engine/values_sample.yaml
@@ -24,7 +24,7 @@ celery_broker_type_redis: null
# - ALL

# tag [required] is the LLM Engine docker image tag
- tag: e360bfb1d21d9d4e7b7fcb6b29ca752095b4d0f4
+ tag: 2e9d00786419ef44ec5c9d3305d8d6451d6aabfb
> **P1: Undocumented production image tag change**
>
> The PR description mentions only the post-mortem doc addition, but `values_sample.yaml` is quietly updated to a new image tag (`2e9d00786419ef44ec5c9d3305d8d6451d6aabfb`). The post-mortem itself notes that `values_sample.yaml` targets production (no separate staging config exists). Deploying a new image tag here without an explicit mention in the PR description or test plan risks repeating the exact pattern documented in the incident: an image promotion that bypasses scrutiny of whether the accompanying Alembic migration is in place. The test plan should confirm that the migration for this new image has been applied to production before this config lands.


# context is a user-specified deployment tag. Can be used to
context: production
image:
283 changes: 283 additions & 0 deletions docs/internal/post-mortem-mli-6574-model-engine-500s.md
@@ -0,0 +1,283 @@
# SEV1 Post-Mortem: model-engine Deployment Caused 500s on All Endpoint Operations

**Incident ID:** MLI-6574
**Severity:** SEV1
**Date:** Apr 24–25, 2026
**Duration:** ~58 min (23:50Z deployment → 00:48Z rollback)
**Customer-facing 500s:** ~47 min
**Status:** Resolved; follow-up actions tracked below

---

## Summary

At 23:50Z on Apr 24, a direct `kubectl set image` pushed a model-engine image (`04729cef…`) that declared a new ORM column `endpoints.temporal_task_queue` with no corresponding Alembic migration applied to production. Every `list_model_endpoints` and `get_model_endpoint` call immediately returned 500. Five patch images over ~34 minutes failed to resolve the issue before a rollback to `f395ffa6…` cleared errors at 00:48Z.
> **Review comment:** could you add some context on who the entity was who ran the `set image` - human user or some automated process?


---

## Timeline

| Time (UTC) | Event |
|---|---|
| Feb 27, 05:54Z | Stable image `f395ffa6…` deployed; ran cleanly for ~57 days |
| **Apr 24, 23:50:25Z** | **`kubectl set image` → `04729cef…` (rev 293); ORM references missing `temporal_task_queue` column; all endpoint list/get → 500** |
| 23:55:43Z | `-internal` tag rolled (rev 294); same bug, same 500s |
| Apr 25, 00:09:32Z | `-patch` (rev 296) pushed; does not include migration; 500s continue |
| 00:15:29Z | `-patch2` (rev 297); same |
| 00:20:27Z | `-patch3` (rev 298); same |
| 00:24:11Z | `-patch4` (rev 299); same |
| **00:37Z** | **PagerDuty fires; Envoy 5xx ratio = 0.052** |
| **00:48Z** | **Rollback to `f395ffa6…` (rev 300); errors clear; alert auto-resolves** |
| 01:22Z | 0 errors in last 2 min; incident closed |

---

## Root Cause

**Background terms:**
- **ORM column** — a Python class attribute like `temporal_task_queue = Column(Text)` on the `Endpoint` model. SQLAlchemy (our ORM) translates these attributes into SQL; every declared column is included in the `SELECT` it generates when loading `Endpoint` objects.
- **Alembic migration** — a versioned SQL script (e.g. `ALTER TABLE endpoints ADD COLUMN temporal_task_queue TEXT`) that updates the live database schema to match what the application code expects. Without it, the DB is missing the column the ORM declares.

The `temporal_task_queue` column was added to the `Endpoint` ORM model (`hosted_model_inference.py`) as part of the temporal endpoint type feature (MLI-6425), but **no Alembic migration was written for it**. Because SQLAlchemy includes every ORM-declared column in its generated `SELECT` statement, all endpoint operations — list, get, create, update, delete — issued a query referencing `temporal_task_queue`. PostgreSQL returned `column "temporal_task_queue" does not exist` on every call, causing a 500 across the board.
> **Review comment (Collaborator):** do we have an alert ready to capture if similar things happen in the future?
>
> **Author reply:** added a new section for alerting on failures over 50% in data plane Istio mesh metrics.


### Why the existing migration gate didn't protect us

The Helm chart already enforces migration-first ordering:

```yaml
# charts/model-engine/templates/database_migration_job.yaml
annotations:
  "helm.sh/hook": pre-install,pre-upgrade
  "helm.sh/hook-weight": "-1"  # runs before any pod rollout
```

This pre-upgrade hook runs `alembic upgrade head` before the gateway deployment starts. **The incident bypassed this entirely because the deployment used `kubectl set image`, not `helm upgrade`.** `kubectl set image` updates the pod spec directly, skipping all Helm hooks.

### Contributing factors

1. **`kubectl set image` used for production image updates** — bypasses the migration pre-hook built into the Helm chart.
2. **No ORM/DB schema validation at startup** — the app started successfully and served traffic with an invalid schema; errors only appeared at query time. The `/readyz` probe passes as long as the process is up, regardless of DB state.
3. **No migration written for the new column** — the ORM model was updated without a paired migration file.
4. **No rollback SLA** — 5 patch attempts over 34 minutes before rollback was chosen.
5. **Delayed alerting** — PagerDuty did not fire until 00:37Z, 47 min into the incident.

---

## Impact

| Dimension | Detail |
|---|---|
| Customer-facing 500s | ~47 min (23:50Z → 00:37Z alert; resolved 00:48Z) |
| Affected operations | All endpoint list, get, create, update, delete |
| Affected users | All model-engine users |

---

## Action Items

### P0 — Prevent `kubectl set image` in production (preventative, not reactive)

**What:** `kubectl set image` updates pods directly, bypassing all Helm pre-upgrade hooks including the migration job. The fix must be preventative: a `kubectl set image` attempt against the production deployment is rejected by the API server before it takes effect.

Two preventative controls:
> **Review comment (Collaborator):** A quick chat told me it's possible to have a guardrail which ensures the Alembic migrations are in sync with the ORM models:
>
> ```bash
> #!/usr/bin/env bash
> set -euo pipefail
>
> # Run against a clean test DB
> alembic upgrade head
>
> # Make sure alembic/env.py points at your ORM metadata:
> # fail if models contain schema changes not represented by migrations
> alembic check
> ```


**a. Create a dedicated deploy role + scope down `ml_infra_admin`**

`just deploy prod` uses the `ml-serving-admin` AWS SSO profile, which resolves to `AWSReservedSSO_MLInfraAdmin` — the same IAM role as `ml_infra_admin`. Scoping `ml_infra_admin` out of `scale-deploy` without a replacement would break deploys. The correct implementation requires both steps:

1. **Create `ml-serving-deployer`** — a new AWS SSO role scoped to `scale-deploy` on `ml-serving-new`; `just deploy prod` switches to it.
2. **Scope down `ml_infra_admin`** — remove its access to `scale-deploy`. Because Kubernetes RBAC is additive (no deny rules), scoping the EKS access entry is the only way to enforce this at the API server.

Audit confirmed no other service deploys to `scale-deploy` on `ml-serving-new` via `ml_infra_admin`: all other services using `ml_infra_admin` on `ml-serving-new` (`dagster`, `scaletrain-ui`, `ml-orchestration-internal`, `research_evals`) deploy to the `dagster` namespace.

**b. justfile guard — block deploying unmerged code**
> **Review comment (Member):** This can work, but if we're going to do this, then we'd also want to make sure that people can force-merge things in the models repo. Otherwise, there's no available way to do emergency deploys.
>
> **Review comment (Member):** Think another issue is: while this is good in the sense that it would force a review of the PR, it assumes that everyone follows the same release checklist. Would something like Greptile have caught the fact that we were trying to push something without a corresponding migration?


`just deploy prod` checks that the current commit is merged to `origin/master` before proceeding. Deploying a local branch to production is rejected before any image is built or pushed.

**Files to change:**

**1. `Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf`**

Add the `ml-serving-deployer` access entry scoped to `scale-deploy`, and narrow `ml_infra_admin` to every namespace except `scale-deploy`:

```hcl
# New — dedicated deploy role for model-engine, scoped to scale-deploy only
# (Review comment from a Member on this line: "Do other teams do this?")

data "aws_iam_roles" "ml_serving_deployer" {
  name_regex  = "AWSReservedSSO_MLServingDeployer.*"
  path_prefix = "/aws-reserved/sso.amazonaws.com/"
}

resource "aws_eks_access_entry" "ml_serving_deployer" {
  for_each      = toset(data.aws_iam_roles.ml_serving_deployer.arns)
  cluster_name  = local.cluster_name
  principal_arn = each.value
  type          = "STANDARD"
}

resource "aws_eks_access_policy_association" "ml_serving_deployer" {
  for_each      = toset(data.aws_iam_roles.ml_serving_deployer.arns)
  cluster_name  = local.cluster_name
  policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"
  principal_arn = each.value

  access_scope {
    type       = "namespace"
    namespaces = ["scale-deploy"]
  }
}

# Updated — remove scale-deploy from ml_infra_admin scope
resource "aws_eks_access_policy_association" "ml_infra_admin" {
  for_each      = toset(data.aws_iam_roles.ml_infra_admin.arns)
  cluster_name  = local.cluster_name
  policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"
  principal_arn = each.value

  access_scope {
    type = "namespace"
    namespaces = [
      "dagster",
      "default",
      "gpu-operator",
      "istio-ingress",
      "istio-system",
      "keda",
      "kube-node-lease",
      "kube-public",
      "kube-system",
      "kubecost",
      "litellm-rebuild",
      "local-path-storage",
      "service",
      "snowflake-postgres-connector",
      "workload",
      # scale-deploy intentionally omitted — use ml-serving-deployer for model-engine deploys
    ]
  }
}
```

**2. `Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/iam.tf`**

Add `ml-serving-deployer` to the aws-auth ConfigMap:

```hcl
# Add to local.cluster_role_map:
{
  rolearn  = one(data.aws_iam_roles.ml_serving_deployer.arns)
  username = "ml-serving-deployer"
  groups   = ["ml-serving-deployer"]
}
```

**3. `model-engine-internal/justfile`**

Switch the prod deploy profile to `ml-serving-deployer` and add a master branch guard:

```just
# New — fails if current commit is not merged to origin/master
_check_master_branch env:
    #!/bin/bash
    if [ "{{ env }}" = "prod" ]; then
        git fetch origin master --quiet
        if ! git merge-base --is-ancestor HEAD origin/master; then
            echo "ERROR: current commit is not merged to origin/master; prod deploy aborted"
            exit 1
        fi
    fi

# Updated _mount_aws: prod deploys use ml-serving-deployer
_mount_aws env: _aws-mountable_folder
    {{ if env == 'training' { 'cd .. && python3 scripts_py3/scale_scripts/exe/generate_mountable_aws_credentials.py -p ml-admin -e ~/.aws-mountable/credentials' } else { '' } }}
    {{ if env == 'prod' { 'cd .. && python3 scripts_py3/scale_scripts/exe/generate_mountable_aws_credentials.py -p ml-serving-deployer -e ~/.aws-mountable/credentials' } else { '' } }}
    {{ if env == 'launch' { 'cd .. && python3 scripts_py3/scale_scripts/exe/generate_mountable_aws_credentials.py -p ml-serving-admin -e ~/.aws-mountable/credentials' } else { '' } }}

# Updated deploy: add master branch check
deploy env='training': (_check_master_branch env) (_deploy env 'false') (_emit_deploy_event "model-engine" env)
```

**Owner:** model-engine on-call
**Effort:** ~2–3 days (new IAM SSO role + Terraform + Atlantis apply + justfile + validation)

---

### Rollout Strategy

**Existing environments — all carry production traffic:**

| Env | Cluster | Purpose |
|---|---|---|
| `training` | `ml-training-new` | production training workloads |
| `launch` | `ml-launch-new` | production Launch API |
| `prod` | `ml-serving-new` | production serving |
| `circleci` | minikube (ephemeral) | CI integration tests only |
> **Review comment (Collaborator, on the table above):** is this related to the scope of this doc? `ml-training-new` and `circleci` seem irrelevant.


There is no staging environment for model-engine. One needs to be created.
> **Review comment (Collaborator):** +1, we need a canary environment to test production changes before actually releasing to production.
>
> **Review comment (Member):** I don't disagree, but it seems like once we're running kubectl commands directly, that's already incorrect, regardless of where the command is pointed.


**Staging environment — effort estimate:**

The staging backing infrastructure already exists in Terracode-ML `scaleapi-ml-serving/global/` — it was provisioned but never wired up to a model-engine deployment:

| Component | Status | File |
|---|---|---|
| RDS Aurora PostgreSQL (`ml-infra-staging`, secret `staging/ml_infra_pg`) | **Already exists** | `global/db.tf` lines 85–164 |
| ElastiCache Redis (`staging-celery-redis-rg-1`) | **Already exists** | `global/celery-elasticache.tf` lines 44–80 |
| VPC / subnets | **Already exists** (shared ml-serving VPC) | — |
| Route53 DNS (`ml-staging-internal.scale.com`) | **Already exists** | `global/dns.tf` lines 35–49 |
| EKS cluster | **Does not exist** | — |

Add a `scale-deploy-staging` namespace in ml-serving-new. DB, Redis, VPC, and DNS already point at staging endpoints — the namespace is the only missing K8s layer. Estimated ~3–4 days:
- `values_staging.yaml` pointing at `staging/ml_infra_pg` and staging Redis: ~1 day
- K8s namespace + IRSA service account + secrets: ~0.5 day
- `justfile` staging env wiring: ~0.5 day
- Validation / smoke tests: ~1 day

**Process once staging exists:**

```
staging: any method — kubectl set image, just deploy staging, etc.
    ↓
validate: migrations apply cleanly, smoke tests pass, no 500s
    ↓
production: just deploy prod (helm upgrade, migration-first, RBAC-enforced)
```

> **Review comment (@billyang-scale, May 5, 2026):** we should propose a release process to make sure a release is ready for prod release, e.g.:
> - no auto deployment in production
> - make a cutline to a release candidate in the staging env, and let it bake for a certain time, e.g. 24h
> - have monitors/tests running to make sure there is no regression
> - release to prod explicitly with the release candidate version

Any change requiring a DB schema update (new ORM column) must have its Alembic migration merged and applied to production before `just deploy prod` is run.

---

### P1 — Add high-severity 5xx outage alert

> **Review comment (@sandeshghanta, May 5, 2026):** Confirming - can we do this today without being spammy? E.g. does model engine throw 5xx if a user request is overwhelming a model?


**What:** PagerDuty did not fire until 00:37Z — 47 min into the incident. The existing monitor (`datadog_monitor.model_engine_error_rate_alert`) uses a narrow Envoy trace query and a 5% threshold over 15 minutes — too slow to catch a sudden complete outage. Add a dedicated monitor using the broader Istio mesh metrics that fires when >50% of all model-engine requests are 5xx over a 5-minute window. This would have paged within 5 minutes of the incident.
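The intended alert semantics are simple enough to sanity-check offline; a sketch of the paging rule (illustrative only, since Datadog evaluates the real monitor query server-side):

```python
# Sketch of the alert semantics (illustrative only; Datadog evaluates the
# real monitor server-side): page when >50% of requests in the window are 5xx.
def should_page(status_counts: dict, threshold: float = 0.5) -> bool:
    """status_counts maps HTTP status code -> request count over the window."""
    total = sum(status_counts.values())
    if total == 0:
        return False  # mirrors notify_no_data = false: no traffic, no page
    errors = sum(n for code, n in status_counts.items() if 500 <= code < 600)
    return errors / total > threshold


# During the incident every endpoint list/get returned 500:
print(should_page({500: 980, 200: 20}))   # True: complete outage
print(should_page({200: 950, 503: 50}))   # False: 5% errors, below threshold
```

A transient spike on a single model would have to push the service-wide 5xx ratio above 50% to page, which addresses the spamminess concern raised in review.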

**File to change: `Terracode-ML/scaleapi-ml-datadog/launch.tf`**

```hcl
resource "datadog_monitor" "model_engine_high_error_rate_alert" {
  name    = "[model engine] Model Engine has >50% 5xx error rate on prod"
  type    = "query alert"
  query   = "sum(last_5m):sum:istio.mesh.request.count.total{cluster_name:ml-serving-new AND destination_app:model-engine AND response_code:5*}.as_count() / sum:istio.mesh.request.count.total{cluster_name:ml-serving-new AND destination_app:model-engine}.as_count() > 0.5"
  message = <<EOF
Model Engine is returning 5xx on >50% of all requests — likely a widespread outage.

Runbook: https://scale.atlassian.net/wiki/spaces/EPD/pages/340230155/Launch+has+an+elevated+error+rate

@slack-ml-alerts @pagerduty-PagerdutyRouting
EOF

  monitor_thresholds {
    critical = 0.5
  }

  notify_audit        = false
  require_full_window = false
  notify_no_data      = false
  renotify_interval   = 0
  include_tags        = true
  tags                = ["env:prod", "service:model-engine", "severity:sev0", "team:ml"]
}
```

**Owner:** model-engine on-call
**Effort:** ~0.5 days (Datadog Terraform change + Atlantis apply)