docs: SEV1 post-mortem for model-engine 500s incident (MLI-6574) #817
# SEV1 Post-Mortem: model-engine Deployment Caused 500s on All Endpoint Operations

**Incident ID:** MLI-6574
**Severity:** SEV1
**Date:** Apr 24–25, 2026
**Duration:** ~58 min (23:50Z deployment → 00:48Z rollback)
**Customer-facing 500s:** ~47 min
**Status:** Resolved; follow-up actions tracked below

---

## Summary

At 23:50Z on Apr 24, a direct `kubectl set image` pushed a model-engine image (`04729cef…`) that declared a new ORM column `endpoints.temporal_task_queue` with no corresponding Alembic migration applied to production. Every `list_model_endpoints` and `get_model_endpoint` call immediately returned 500. Five patch images over ~34 minutes failed to resolve the issue before a rollback to `f395ffa6…` cleared errors at 00:48Z.

---

## Timeline

| Time (UTC) | Event |
|---|---|
| Feb 27, 05:54Z | Stable image `f395ffa6…` deployed; ran cleanly for ~57 days |
| **Apr 24, 23:50:25Z** | **`kubectl set image` → `04729cef…` (rev 293); ORM references missing `temporal_task_queue` column; all endpoint list/get → 500** |
| 23:55:43Z | `-internal` tag rolled (rev 294); same bug, same 500s |
| Apr 25, 00:09:32Z | `-patch` (rev 296) pushed; does not include migration; 500s continue |
| 00:15:29Z | `-patch2` (rev 297); same |
| 00:20:27Z | `-patch3` (rev 298); same |
| 00:24:11Z | `-patch4` (rev 299); same |
| **00:37Z** | **PagerDuty fires; Envoy 5xx ratio = 0.052** |
| **00:48Z** | **Rollback to `f395ffa6…` (rev 300); errors clear; alert auto-resolves** |
| 01:22Z | 0 errors in last 2 min; incident closed |

---

## Root Cause
**Background terms:**
- **ORM column** – a Python class attribute like `temporal_task_queue = Column(Text)` on the `Endpoint` model. SQLAlchemy (our ORM) translates these attributes into SQL; every declared column is included in the `SELECT` it generates when loading `Endpoint` objects.
- **Alembic migration** – a versioned SQL script (e.g. `ALTER TABLE endpoints ADD COLUMN temporal_task_queue TEXT`) that updates the live database schema to match what the application code expects. Without it, the DB is missing the column the ORM declares.

The `temporal_task_queue` column was added to the `Endpoint` ORM model (`hosted_model_inference.py`) as part of the temporal endpoint type feature (MLI-6425), but **no Alembic migration was written for it**. Because SQLAlchemy includes every ORM-declared column in its generated `SELECT` statement, all endpoint operations – list, get, create, update, delete – issued a query referencing `temporal_task_queue`. PostgreSQL returned `column "temporal_task_queue" does not exist` on every call, causing a 500 across the board.
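
The failure mode is easy to reproduce outside production. A minimal sketch, with stdlib sqlite3 standing in for PostgreSQL and a hand-written `SELECT` standing in for the one SQLAlchemy generates (table and column names are from the incident; everything else is illustrative):

```python
import sqlite3

# Schema as it existed in production: no temporal_task_queue column,
# because the Alembic migration was never written.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE endpoints (id TEXT PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO endpoints VALUES ('ep-1', 'demo')")

# The ORM-generated SELECT includes every declared column, including the
# new one -- the equivalent of what SQLAlchemy emitted after the deploy.
try:
    db.execute("SELECT id, name, temporal_task_queue FROM endpoints").fetchall()
    outcome = "ok"
except sqlite3.OperationalError as exc:
    # In the service, this exception surfaced to callers as an HTTP 500.
    outcome = f"500: {exc}"

print(outcome)
```

Every read path hits the same generated query, which is why list, get, create, update, and delete all failed at once.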
### Why the existing migration gate didn't protect us

The Helm chart already enforces migration-first ordering:

```yaml
# charts/model-engine/templates/database_migration_job.yaml
annotations:
  "helm.sh/hook": pre-install,pre-upgrade
  "helm.sh/hook-weight": "-1" # runs before any pod rollout
```

This pre-upgrade hook runs `alembic upgrade head` before the gateway deployment starts. **The incident bypassed this entirely because the deployment used `kubectl set image`, not `helm upgrade`.** `kubectl set image` updates the pod spec directly, skipping all Helm hooks.
### Contributing factors

1. **`kubectl set image` used for production image updates** – bypasses the migration pre-hook built into the Helm chart.
2. **No ORM/DB schema validation at startup** – the app started successfully and served traffic with an invalid schema; errors only appeared at query time. The `/readyz` probe passes as long as the process is up, regardless of DB state.
3. **No migration written for the new column** – the ORM model was updated without a paired migration file.
4. **No rollback SLA** – 5 patch attempts over 34 minutes before rollback was chosen.
5. **Delayed alerting** – PagerDuty did not fire until 00:37Z, 47 min into the incident.
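
Factor 2 can be closed with a startup check that compares the columns the ORM declares against what the live database actually has, and fails readiness on any mismatch. A minimal sketch under assumptions: stdlib sqlite3 stands in for the real DB, and the hardcoded `EXPECTED` set stands in for what would be derived from SQLAlchemy metadata (e.g. `{c.name for c in Endpoint.__table__.columns}`):

```python
import sqlite3

# Columns the application code expects (hardcoded here for illustration).
EXPECTED = {"id", "name", "temporal_task_queue"}

db = sqlite3.connect(":memory:")
# Migration missing: the live table lacks temporal_task_queue.
db.execute("CREATE TABLE endpoints (id TEXT PRIMARY KEY, name TEXT)")

def schema_drift(conn, table, expected):
    """Return expected columns that the live table is missing."""
    actual = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    return expected - actual

missing = schema_drift(db, "endpoints", EXPECTED)
if missing:
    # In the service, this would fail /readyz instead of printing,
    # so a pod with schema drift never receives traffic.
    print(f"schema drift, refusing to serve: missing {sorted(missing)}")
```

With a check like this, the bad image would have failed its readiness probe at rollout time instead of serving 500s for 47 minutes.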

---

## Impact

| Dimension | Detail |
|---|---|
| Customer-facing 500s | ~47 min (23:50Z → 00:37Z alert; resolved 00:48Z) |
| Affected operations | All endpoint list, get, create, update, delete |
| Affected users | All model-engine users |

---
## Action Items

### P0 – Prevent `kubectl set image` in production (preventative, not reactive)

**What:** `kubectl set image` updates pods directly, bypassing all Helm pre-upgrade hooks including the migration job. The fix must be preventative: a `kubectl set image` attempt against the production deployment is rejected by the API server before it takes effect.
Two preventative controls:

**a. Create a dedicated deploy role + scope down `ml_infra_admin`**

`just deploy prod` uses the `ml-serving-admin` AWS SSO profile, which resolves to `AWSReservedSSO_MLInfraAdmin` – the same IAM role as `ml_infra_admin`. Scoping `ml_infra_admin` out of `scale-deploy` without a replacement would break deploys. The correct implementation requires both steps:
1. **Create `ml-serving-deployer`** – a new AWS SSO role scoped to `scale-deploy` on `ml-serving-new`; `just deploy prod` switches to it.
2. **Scope down `ml_infra_admin`** – remove its access to `scale-deploy`. Because Kubernetes RBAC is additive (no deny rules), scoping the EKS access entry is the only way to enforce this at the API server.

Audit confirmed no other service deploys to `scale-deploy` on `ml-serving-new` via `ml_infra_admin`: all other services using `ml_infra_admin` on `ml-serving-new` (`dagster`, `scaletrain-ui`, `ml-orchestration-internal`, `research_evals`) deploy to the `dagster` namespace.
**b. justfile guard – block deploying unmerged code**

`just deploy prod` checks that the current commit is merged to `origin/master` before proceeding. Deploying a local branch to production is rejected before any image is built or pushed.
**Files to change:**

**1. `Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/eks.tf`**

Add the `ml-serving-deployer` access entry scoped to `scale-deploy`, and narrow `ml_infra_admin` to every namespace except `scale-deploy`:

```hcl
# New – dedicated deploy role for model-engine, scoped to scale-deploy only
data "aws_iam_roles" "ml_serving_deployer" {
  name_regex  = "AWSReservedSSO_MLServingDeployer.*"
  path_prefix = "/aws-reserved/sso.amazonaws.com/"
}
resource "aws_eks_access_entry" "ml_serving_deployer" {
  for_each      = toset(data.aws_iam_roles.ml_serving_deployer.arns)
  cluster_name  = local.cluster_name
  principal_arn = each.value
  type          = "STANDARD"
}
resource "aws_eks_access_policy_association" "ml_serving_deployer" {
  for_each      = toset(data.aws_iam_roles.ml_serving_deployer.arns)
  cluster_name  = local.cluster_name
  policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"
  principal_arn = each.value

  access_scope {
    type       = "namespace"
    namespaces = ["scale-deploy"]
  }
}
# Updated – remove scale-deploy from ml_infra_admin scope
resource "aws_eks_access_policy_association" "ml_infra_admin" {
  for_each      = toset(data.aws_iam_roles.ml_infra_admin.arns)
  cluster_name  = local.cluster_name
  policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"
  principal_arn = each.value

  access_scope {
    type = "namespace"
    namespaces = [
      "dagster",
      "default",
      "gpu-operator",
      "istio-ingress",
      "istio-system",
      "keda",
      "kube-node-lease",
      "kube-public",
      "kube-system",
      "kubecost",
      "litellm-rebuild",
      "local-path-storage",
      "service",
      "snowflake-postgres-connector",
      "workload",
      # scale-deploy intentionally omitted – use ml-serving-deployer for model-engine deploys
    ]
  }
}
```
**2. `Terracode-ML/scaleapi-ml-serving/clusters/ml-serving-new/iam.tf`**

Add `ml-serving-deployer` to the aws-auth ConfigMap:

```hcl
# Add to local.cluster_role_map:
{
  rolearn  = one(data.aws_iam_roles.ml_serving_deployer.arns)
  username = "ml-serving-deployer"
  groups   = ["ml-serving-deployer"]
}
```
**3. `model-engine-internal/justfile`**

Switch the prod deploy profile to `ml-serving-deployer` and add a master branch guard:

```just
# New – fails if current commit is not merged to origin/master
_check_master_branch env:
    #!/bin/bash
    if [ "{{ env }}" = "prod" ]; then
      git fetch origin master --quiet
      if ! git merge-base --is-ancestor HEAD origin/master; then
        echo "ERROR: current commit is not merged to origin/master; prod deploy aborted"
        exit 1
      fi
    fi

# Updated _mount_aws: prod deploys use ml-serving-deployer
_mount_aws env: _aws-mountable_folder
    {{ if env == 'training' { 'cd .. && python3 scripts_py3/scale_scripts/exe/generate_mountable_aws_credentials.py -p ml-admin -e ~/.aws-mountable/credentials' } else { '' } }}
    {{ if env == 'prod' { 'cd .. && python3 scripts_py3/scale_scripts/exe/generate_mountable_aws_credentials.py -p ml-serving-deployer -e ~/.aws-mountable/credentials' } else { '' } }}
    {{ if env == 'launch' { 'cd .. && python3 scripts_py3/scale_scripts/exe/generate_mountable_aws_credentials.py -p ml-serving-admin -e ~/.aws-mountable/credentials' } else { '' } }}

# Updated deploy: add master branch check
deploy env='training': (_check_master_branch env) (_deploy env 'false') (_emit_deploy_event "model-engine" env)
```
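
The guard's core is `git merge-base --is-ancestor`, which exits 0 only when `HEAD` is already contained in the target branch. A throwaway-repo sketch of the same check (assumes `git` is installed; repo layout and messages are illustrative, not the real justfile):

```shell
# Build a scratch repo where HEAD is NOT merged to master, so the
# guard should reject the deploy.
tmp=$(mktemp -d)
git -C "$tmp" init -q
git -C "$tmp" -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m "on master"
git -C "$tmp" branch -M master
git -C "$tmp" checkout -q -b feature
git -C "$tmp" -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m "unmerged work"

# Same test _check_master_branch runs against origin/master.
if git -C "$tmp" merge-base --is-ancestor HEAD master; then
  GUARD_MSG="deploy allowed"
else
  GUARD_MSG="prod deploy aborted"
fi
echo "$GUARD_MSG"
```

Once the feature branch is merged, the same check exits 0 and the deploy proceeds.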
**Owner:** model-engine on-call
**Effort:** ~2–3 days (new IAM SSO role + Terraform + Atlantis apply + justfile + validation)

---

### Rollout Strategy
**Existing environments – all carry production traffic:**

| Env | Cluster | Purpose |
|---|---|---|
| `training` | `ml-training-new` | production training workloads |
| `launch` | `ml-launch-new` | production Launch API |
| `prod` | `ml-serving-new` | production serving |
| `circleci` | minikube (ephemeral) | CI integration tests only |

There is no staging environment for model-engine. One needs to be created.
**Staging environment – effort estimate:**

The staging backing infrastructure already exists in Terracode-ML `scaleapi-ml-serving/global/` – it was provisioned but never wired up to a model-engine deployment:

| Component | Status | File |
|---|---|---|
| RDS Aurora PostgreSQL (`ml-infra-staging`, secret `staging/ml_infra_pg`) | **Already exists** | `global/db.tf` lines 85–164 |
| ElastiCache Redis (`staging-celery-redis-rg-1`) | **Already exists** | `global/celery-elasticache.tf` lines 44–80 |
| VPC / subnets | **Already exists** (shared ml-serving VPC) | – |
| Route53 DNS (`ml-staging-internal.scale.com`) | **Already exists** | `global/dns.tf` lines 35–49 |
| EKS cluster | **Does not exist** | – |

Add a `scale-deploy-staging` namespace in ml-serving-new. DB, Redis, VPC, and DNS already point at staging endpoints – the namespace is the only missing K8s layer. Estimated ~3–4 days:
- `values_staging.yaml` pointing at `staging/ml_infra_pg` and staging Redis: ~1 day
- K8s namespace + IRSA service account + secrets: ~0.5 day
- `justfile` staging env wiring: ~0.5 day
- Validation / smoke tests: ~1 day
**Process once staging exists:**

```
staging: any method – kubectl set image, just deploy staging, etc.
        ↓
validate: migrations apply cleanly, smoke tests pass, no 500s
        ↓
production: just deploy prod (helm upgrade, migration-first, RBAC-enforced)
```

Any change requiring a DB schema update (new ORM column) must have its Alembic migration merged and applied to production before `just deploy prod` is run.
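
That rule can itself be mechanized: before a prod deploy proceeds, compare the `alembic_version` row in the production database against the head revision the code expects. A minimal sketch, with stdlib sqlite3 standing in for the prod DB and a hardcoded `EXPECTED_HEAD` standing in for the output of `alembic heads` (both revision strings are made up for illustration):

```python
import sqlite3

EXPECTED_HEAD = "abc123def456"  # hypothetical; in practice read from `alembic heads`

# Simulate a prod DB whose migrations lag behind the code being deployed.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE alembic_version (version_num TEXT NOT NULL)")
db.execute("INSERT INTO alembic_version VALUES ('aaa000bbb111')")  # older revision

row = db.execute("SELECT version_num FROM alembic_version").fetchone()
applied = row[0] if row else None

deploy_allowed = applied == EXPECTED_HEAD
if not deploy_allowed:
    # A deploy wrapper would exit non-zero here, like the justfile guard does.
    print(f"DB at {applied}, code expects {EXPECTED_HEAD}; run migrations first")
```

Wired into `just deploy prod`, this turns the "migration must be applied first" convention into a hard gate.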
---

### P1 – Add high-severity 5xx outage alert

**What:** PagerDuty did not fire until 00:37Z – 47 min into the incident. The existing monitor (`datadog_monitor.model_engine_error_rate_alert`) uses a narrow Envoy trace query and a 5% threshold over 15 minutes – too slow to catch a sudden complete outage. Add a dedicated monitor using the broader Istio mesh metrics that fires when >50% of all model-engine requests are 5xx over a 5-minute window. This would have paged within 5 minutes of the incident.
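
The two monitors differ only in the ratio and window they evaluate. A sketch of the threshold logic with illustrative request counts (the 0.052 figure is from the timeline; the others are made up to show the contrast):

```python
def fires(total_5xx: int, total_requests: int, threshold: float) -> bool:
    """True when the 5xx ratio over the window exceeds the threshold."""
    if total_requests == 0:
        return False
    return total_5xx / total_requests > threshold

# Old monitor: 5% over 15 min. The broad Envoy 5xx ratio only reached
# 0.052 at 00:37Z, which is why the page arrived 47 minutes in.
old_fires = fires(52, 1000, 0.05)      # 0.052 > 0.05 -> pages, but late

# New monitor: >50% of model-engine mesh requests over 5 min. Since the
# incident failed essentially every endpoint call, it would have fired
# within one evaluation window.
new_fires_outage = fires(980, 1000, 0.5)  # widespread outage -> pages
new_fires_noise = fires(52, 1000, 0.5)    # elevated-error noise -> silent
```

Scoping the new monitor to `destination_app:model-engine` is what keeps the 50% threshold meaningful: the endpoint-operation failure rate is measured directly rather than diluted across all Envoy traffic.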
**File to change: `Terracode-ML/scaleapi-ml-datadog/launch.tf`**

```hcl
resource "datadog_monitor" "model_engine_high_error_rate_alert" {
  name    = "[model engine] Model Engine has >50% 5xx error rate on prod"
  type    = "query alert"
  query   = "sum(last_5m):sum:istio.mesh.request.count.total{cluster_name:ml-serving-new AND destination_app:model-engine AND response_code:5*}.as_count() / sum:istio.mesh.request.count.total{cluster_name:ml-serving-new AND destination_app:model-engine}.as_count() > 0.5"
  message = <<EOF
Model Engine is returning 5xx on >50% of all requests – likely a widespread outage.

Runbook: https://scale.atlassian.net/wiki/spaces/EPD/pages/340230155/Launch+has+an+elevated+error+rate

@slack-ml-alerts @pagerduty-PagerdutyRouting
EOF

  monitor_thresholds {
    critical = 0.5
  }

  notify_audit        = false
  require_full_window = false
  notify_no_data      = false
  renotify_interval   = 0
  include_tags        = true
  tags                = ["env:prod", "service:model-engine", "severity:sev0", "team:ml"]
}
```
**Owner:** model-engine on-call
**Effort:** ~0.5 days (Datadog Terraform change + Atlantis apply)