refactor: version PyTorch directory structure for multi-version support #6091

Open
Eren-Jeager123 wants to merge 34 commits into main from refactor/pytorch-versioned-structure

Eren-Jeager123 (Contributor) commented May 12, 2026

Summary

Restructure PyTorch to support maintaining multiple versions concurrently (1-year support window). Versioned directories isolate each version's build artifacts. Workflows are version-agnostic — no new workflow files needed when adding a future version.

1. Versioned directory structure

docker/pytorch/2.11/
├── Dockerfile.cuda
├── Dockerfile.cpu
├── versions-cuda.env      # Build-time dependency versions (sourced by workflows + Dockerfiles)
├── versions-cpu.env
├── cuda/pyproject.toml
├── cuda/uv.lock
├── cpu/pyproject.toml
└── cpu/uv.lock

.github/config/image/
├── pytorch-2.11-ec2-cuda.yml       # Workflow metadata (framework, labels, prod_image)
├── pytorch-2.11-ec2-cpu.yml
├── pytorch-2.11-sagemaker-cuda.yml
└── pytorch-2.11-sagemaker-cpu.yml

Shared scripts (scripts/pytorch/) and tests (test/pytorch/) stay in place.
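
As a rough illustration of the layout above, the per-version build files can be derived from the version string alone. This is a minimal sketch: the helper name `versioned_docker_files` is hypothetical and not part of the PR; only the paths themselves come from the tree shown above.

```python
from pathlib import PurePosixPath


# Hypothetical helper: the name and return shape are illustrative only.
def versioned_docker_files(version: str) -> list[str]:
    """List the build files expected under docker/pytorch/<version>/."""
    base = PurePosixPath("docker/pytorch") / version
    files = []
    for device in ("cuda", "cpu"):
        files += [
            str(base / f"Dockerfile.{device}"),
            str(base / f"versions-{device}.env"),
            str(base / device / "pyproject.toml"),
            str(base / device / "uv.lock"),
        ]
    return files
```

Calling `versioned_docker_files("2.12")` would list the same eight files under `docker/pytorch/2.12/`, which is the isolation the versioned directories are meant to provide.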

2. PR workflows — configs matrix (zero changes to reusable workflows)

gatekeeper → detect-versions → build-images / unit-test / sanity / security / telemetry / single-gpu / efa

Key design: detect-versions reads each version's config YAML and outputs a rich configs matrix where each entry carries all metadata fields. Downstream test jobs iterate over this matrix and pass matrix.* fields directly as individual inputs to unmodified reusable workflows.

# detect-versions outputs:
#   configs: [{"version":"2.11", "framework":"pytorch_runtime", ..., "prod_image":"pytorch:2.11-cu130-amzn2023"}]

sanity-test:
  strategy:
    matrix:
      include: ${{ fromJson(needs.detect-versions.outputs.configs) }}
  uses: ./.github/workflows/reusable-sanity-tests.yml
  with:
    image-uri: ${{ needs.build-images.result == 'success' && CI_IMAGE || PROD_IMAGE }}
    framework: ${{ matrix.framework }}
    framework-version: ${{ matrix.framework_version }}
    ...

This solves the GitHub Actions limitation where matrix job outputs only expose the last-finishing leg's values — metadata travels with the matrix, not as job outputs.
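
The idea of metadata travelling with the matrix can be sketched in a few lines. This assumes each version's config has already been parsed into a dict (the real job reads YAML files), and the function name `build_configs_matrix` is hypothetical.

```python
import json


# Assumption: configs are pre-parsed dicts; the actual job reads config YAML.
def build_configs_matrix(configs_by_version: dict[str, dict]) -> str:
    """Serialize one matrix entry per version, each carrying its own metadata."""

    def version_key(item):
        # Sort numerically so that "2.9" would come before "2.11".
        return tuple(int(part) for part in item[0].split("."))

    entries = [
        {"version": version, **config}
        for version, config in sorted(configs_by_version.items(), key=version_key)
    ]
    return json.dumps(entries)
```

Because every entry is self-describing, a downstream job that iterates over the parsed JSON never needs to read another job's outputs to learn which framework or prod image belongs to its leg.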

Version detection combines both path patterns in a single grep:

git diff --name-only origin/main...HEAD \
  | grep -oP '(?:docker/pytorch/|pytorch-)\K[0-9]+\.[0-9]+'
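
For readers less familiar with `\K`, the same extraction can be sketched in Python. Python's `re` module has no `\K`, so a capture group plays that role here. De-duplicating the result is an assumption about what the workflow does after the grep; the function name is hypothetical.

```python
import re

# Same two path patterns as the grep above, with a capture group instead of \K.
VERSION_RE = re.compile(r"(?:docker/pytorch/|pytorch-)([0-9]+\.[0-9]+)")


def detect_versions(changed_files: list[str]) -> list[str]:
    """Return the distinct X.Y versions found in changed paths, in first-seen order."""
    seen: dict[str, None] = {}
    for path in changed_files:
        for version in VERSION_RE.findall(path):
            seen.setdefault(version, None)
    return list(seen)
```

Paths such as `test/pytorch/test_versions.py` match neither pattern, so a test-only change yields an empty list.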

Prod image fallback: When only test files change (build-change: false), build-images is skipped and test jobs automatically fall back to the prod image from ECR:

image-uri: ${{ needs.build-images.result == 'success' && CI_IMAGE || PROD_IMAGE }}
aws-account-id: ${{ needs.build-images.result == 'success' && CI_ACCOUNT || PROD_ACCOUNT }}
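
The `cond && A || B` idiom behaves like Python's `and`/`or` chain, including one edge case worth keeping in mind: if `A` is empty, the expression falls through to `B` even when the condition is true. A minimal sketch, with a hypothetical function name:

```python
def pick_image(build_result: str, ci_image: str, prod_image: str) -> str:
    """Mirror `build_result == 'success' && CI_IMAGE || PROD_IMAGE`."""
    # As with the Actions expression, an empty ci_image also falls through to prod.
    return (build_result == "success" and ci_image) or prod_image
```

With `build_result` set to `"skipped"`, as happens on a test-only change, the prod image is selected.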

3. Autorelease workflows — simplified

  • Single load-config job (merged determine-config + load-config to save an instance)
  • version dispatch input instead of full config file path (e.g., 2.11)
  • framework-short-version output (e.g., "2.11") parsed once, used everywhere — no repeated cut -d. -f1,2 in downstream jobs
  • Cron schedules map version → config file internally
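
The resolution logic described in these bullets might look roughly like the following. The cron expressions are placeholders rather than the schedules used in the actual workflows, and all function names are hypothetical.

```python
from typing import Optional

# Placeholder schedule table: these cron strings are illustrative only.
CRON_TO_VERSION = {
    "0 8 * * 1": "2.11",
    "10 8 * * 1": "2.12",
}


def resolve_version(schedule: Optional[str], dispatch_version: Optional[str]) -> str:
    """Prefer the manual `version` input, otherwise map the cron expression."""
    if dispatch_version:
        return dispatch_version
    if schedule is None or schedule not in CRON_TO_VERSION:
        raise ValueError(f"no version mapped for schedule {schedule!r}")
    return CRON_TO_VERSION[schedule]


def config_file_for(version: str, platform: str, device: str) -> str:
    """Build the config path from the version, following the naming shown earlier."""
    return f".github/config/image/pytorch-{version}-{platform}-{device}.yml"


def short_version(framework_version: str) -> str:
    """Equivalent of `cut -d. -f1,2`: '2.11.0' becomes '2.11'."""
    return ".".join(framework_version.split(".")[:2])
```

Computing `short_version` once in the config job and exposing it as an output is what removes the repeated `cut` calls downstream.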

4. Dockerfile ARG for version paths

Dockerfiles use ARG DLC_PYTORCH_VERSION=2.11 instead of hardcoding:

COPY docker/pytorch/${DLC_PYTORCH_VERSION}/cuda/pyproject.toml /tmp/build/
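
To make the substitution concrete, the sketch below expands the COPY source with `string.Template`. It only imitates variable expansion with a default value; it is not how Docker itself resolves build args, and the function name is hypothetical.

```python
from string import Template

COPY_SOURCE = Template("docker/pytorch/${DLC_PYTORCH_VERSION}/cuda/pyproject.toml")


def resolve_copy_source(build_args: dict[str, str], default: str = "2.11") -> str:
    """Expand the COPY source using the build arg, or the default when it is absent."""
    version = build_args.get("DLC_PYTORCH_VERSION", default)
    return COPY_SOURCE.substitute(DLC_PYTORCH_VERSION=version)
```

An empty value expands to `docker/pytorch//cuda/pyproject.toml`, a path that does not exist in the layout above.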

5. Additional improvements

  • Unit tests as separate job (parallel with sanity/security/telemetry)
  • No unused matrix outputs — removed dead runtime-image-uri / sagemaker-image-uri outputs from build jobs
  • Trailing backslash fix in wheel fetch/upload commands
  • upload_cached_wheels.sh accepts Dockerfile path as parameter
  • test_versions.py reads DLC_PYTORCH_VERSION env var to locate versioned env files
  • libcap security patch (CVE-2026-33814, CVE-2026-39820, CVE-2026-33811, CVE-2026-42499)
  • Zero changes to reusable workflows — no blast radius to other frameworks
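
The `test_versions.py` item above could work along these lines. This is a sketch rather than the file's actual contents: the function names are hypothetical, and the parser assumes the env files hold plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path


def versions_env_path(device: str, repo_root: Path = Path(".")) -> Path:
    """Locate versions-<device>.env under the directory named by DLC_PYTORCH_VERSION."""
    version = os.environ.get("DLC_PYTORCH_VERSION")
    if not version:
        raise RuntimeError("DLC_PYTORCH_VERSION must be set, e.g. '2.11'")
    return repo_root / "docker" / "pytorch" / version / f"versions-{device}.env"


def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        pins[key.strip()] = value.strip().strip('"')
    return pins
```

Failing loudly when the variable is unset avoids silently testing against the wrong version's pins.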

Adding PyTorch 2.12

  1. Copy docker/pytorch/2.11/ → docker/pytorch/2.12/, update version pins
  2. Create 4 config files (pytorch-2.12-{ec2,sagemaker}-{cuda,cpu}.yml)
  3. Autorelease: add one cron line + one case entry per workflow
  4. PR workflows: auto-detect from diff — no changes needed
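
Step 2 can be checked mechanically by expanding the brace pattern and comparing it with what already exists. A small sketch, where `missing_configs` is a hypothetical pre-flight helper rather than anything in the PR:

```python
from itertools import product


def config_names(version: str) -> list[str]:
    """Expand pytorch-<version>-{ec2,sagemaker}-{cuda,cpu}.yml."""
    return [
        f"pytorch-{version}-{platform}-{device}.yml"
        for platform, device in product(("ec2", "sagemaker"), ("cuda", "cpu"))
    ]


def missing_configs(version: str, existing: set[str]) -> list[str]:
    """Return the config file names that still need to be created."""
    return [name for name in config_names(version) if name not in existing]
```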

Test plan

  • PR workflow detects version from docker/pytorch/2.11/** and pytorch-2.11-* changes
  • Configs matrix passes correct metadata per version to reusable workflows
  • Unit tests run as separate job per version
  • Reusable workflows (vLLM, SGLang, Ray, Base) work unchanged
  • DLC_PYTORCH_VERSION build-arg resolves correctly in Dockerfile COPY paths
  • Security scan passes (libcap patched)
  • Prod image fallback works when build is skipped (test-only change PR)
  • Autorelease workflow_dispatch with version input works
  • Multi-version PR builds all versions in parallel

Move all PyTorch build artifacts under docker/pytorch/2.11/ to support
maintaining multiple PyTorch versions concurrently (1-year support
window per version).

Structure change:
  docker/pytorch/Dockerfile.cuda      → docker/pytorch/2.11/Dockerfile.cuda
  docker/pytorch/versions-cuda.env    → docker/pytorch/2.11/versions-cuda.env
  docker/pytorch/cuda/pyproject.toml  → docker/pytorch/2.11/cuda/pyproject.toml
  (same pattern for CPU)

  .github/config/image/pytorch-ec2-cuda.yml → pytorch-2.11-ec2-cuda.yml
  .github/workflows/pr-pytorch-ec2-cuda.yml → pr-pytorch-2.11-ec2-cuda.yml
  (same pattern for all 4 variants × PR + autorelease)

Adding PyTorch 2.12 when it releases means creating docker/pytorch/2.12/,
new configs, and new workflows — without touching 2.11.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eren-Jeager123 and others added 6 commits May 12, 2026 23:17
The versions env file moved from docker/pytorch/versions-cuda.env to
docker/pytorch/2.11/versions-cuda.env. Use glob to find the file under
any version directory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
test_versions.py now requires DLC_PYTORCH_VERSION env var (e.g., "2.11")
to locate the correct versions-{cuda,cpu}.env under the versioned
directory. All 4 PR workflows pass it via docker run -e.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same fix as the PR workflows — autorelease workflows also run unit
tests that need the versioned env file path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Only unit tests (test_versions.py) need this env var. Removed from
single-GPU and multi-GPU test containers in EC2 CUDA workflows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The script had a hardcoded docker/pytorch/Dockerfile path that no
longer exists after the versioned restructure. Accept the Dockerfile
path as a parameter and update all 3 callers to pass it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR workflows detect the PyTorch version from changed file paths
(docker/pytorch/X.Y/ or pytorch-X.Y-*.yml), falling back to
LATEST_PYTORCH_VERSION env var for shared file changes.

Autorelease workflows use multi-cron scheduling with a case mapping
from cron expression to version. Staggered 10 min apart per version.
Also supports workflow_dispatch with an explicit pytorch-version input.

Adding PyTorch 2.12: create docker/pytorch/2.12/ + config files, then
add one cron line + one case entry per autorelease workflow. No new
workflow files needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eren-Jeager123 and others added 9 commits May 19, 2026 19:56
Per team feedback: remove the redundant version output from the
determine-config job. Map cron directly to config file path. Derive
the docker directory version from load-config's framework-version
output (cut major.minor from "2.11.0" → "2.11").

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses team feedback:
- Multi-version PRs now build ALL changed versions in parallel (not
  just the first detected one)
- Removed separate load-config job — config parsing inlined into
  build-images and detect-versions (eliminates 30s serialization)
- Uses strategy.matrix with fail-fast: false for parallel builds

Structure:
  gatekeeper → detect-versions → build-images (matrix) → test jobs

Only the latest version runs the full test suite (sanity, security,
telemetry, single-gpu). All versions validate that the build compiles.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s outputs

Per team feedback: detect-versions should only detect versions and
path changes. Config values are now output by the build-images matrix
job (which already loads config per version). Downstream test jobs
reference build-images outputs instead.

Also removes the latest-version guard on unit tests — all matrix
versions now run unit tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…oded paths

Per team feedback: Dockerfiles now use ARG DLC_PYTORCH_VERSION=2.11
for COPY paths instead of hardcoding "docker/pytorch/2.11/..."
throughout. Workflows pass --build-arg DLC_PYTORCH_VERSION to ensure
the value matches the matrix version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CUDA workflows only trigger on CUDA-related paths:
  docker/pytorch/*/Dockerfile.cuda, docker/pytorch/*/cuda/**, versions-cuda.env

CPU workflows only trigger on CPU-related paths:
  docker/pytorch/*/Dockerfile.cpu, docker/pytorch/*/cpu/**, versions-cpu.env

Previously all 4 workflows used docker/pytorch/** which meant a CUDA
Dockerfile change triggered the CPU workflow (and vice versa).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
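
The effect of the narrower path filters in the commit above can be approximated with two regular expressions. These are stand-ins for the quoted globs, not a reimplementation of GitHub's path matching, and they assume the `versions-*.env` files sit directly under the version directory.

```python
import re

# Regex approximations of the CUDA and CPU path filters quoted above.
CUDA_RE = re.compile(r"^docker/pytorch/[^/]+/(Dockerfile\.cuda|cuda/.+|versions-cuda\.env)$")
CPU_RE = re.compile(r"^docker/pytorch/[^/]+/(Dockerfile\.cpu|cpu/.+|versions-cpu\.env)$")


def triggered_workflows(changed_files: list[str]) -> set[str]:
    """Return which workflow families ('cuda', 'cpu') a change set would trigger."""
    triggered = set()
    for path in changed_files:
        if CUDA_RE.match(path):
            triggered.add("cuda")
        if CPU_RE.match(path):
            triggered.add("cpu")
    return triggered
```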
Docker requires ARGs to be re-declared after each FROM — global ARGs
are only available in FROM lines, not in stage instructions like COPY.
Without the re-declaration, DLC_PYTORCH_VERSION resolves to empty
string causing "not found" errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All test jobs (sanity, security, telemetry, single-gpu, efa, sagemaker)
now matrix over detected versions. Each version gets its own test run
with a constructed image-uri based on the version-specific CI tag.

GitHub Actions supports strategy.matrix on reusable workflow calls.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The load-config action installs yq but our inlined config step didn't.
yq is not available by default on CodeBuild runners, causing all config
outputs (framework, container-type, etc.) to be empty.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eren-Jeager123 and others added 5 commits May 19, 2026 22:50
When a workflow is cancelled mid-EFA-test, the finally block may not
execute, leaking p4d.24xlarge instances and EIPs. This adds a
cleanup_stale_efa_instances() call at the start of each test run that
terminates instances tagged "CI-CD EFA efa-test" older than 4 hours
and releases orphaned EIPs.

Prevents: AddressLimitExceeded errors from accumulated leaked EIPs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
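
The selection rule behind the stale-instance cleanup in the commit above can be sketched without any AWS calls. Here instances are assumed to be plain dicts with `id`, `name`, and `launch_time` keys; the real `cleanup_stale_efa_instances()` queries EC2 and also releases EIPs, both of which are omitted.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=4)


# Assumption: the dict shape is illustrative, not the EC2 API response format.
def select_stale(instances: list[dict], now: datetime,
                 name_tag: str = "CI-CD EFA efa-test") -> list[str]:
    """Return IDs of instances with the given tag launched more than 4 hours ago."""
    return [
        inst["id"]
        for inst in instances
        if inst["name"] == name_tag and now - inst["launch_time"] > MAX_AGE
    ]
```

Filtering on both tag and age keeps the cleanup from touching instances that belong to a test run still in progress.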
Logs LD_LIBRARY_PATH, ofi-nccl lib presence, all_reduce_perf binary,
fi_info output, NCCL lib path, and full NCCL log on failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Solve the GitHub Actions matrix output clobbering problem without
modifying any reusable workflow. The detect-versions job now reads
each version's config YAML and outputs a rich configs matrix where
each entry carries all metadata fields (framework, python-version,
cuda-version, etc.).

Downstream test jobs iterate over this configs matrix and pass
matrix.* fields directly as individual inputs to the unmodified
reusable workflows. Each matrix leg carries its own correct metadata
— no dependency on build-images outputs.

Also extract unit-test from build-images into its own parallel job.
@Eren-Jeager123 Eren-Jeager123 force-pushed the refactor/pytorch-versioned-structure branch 6 times, most recently from 12bbb31 to 421e17c Compare May 22, 2026 16:14
Solve the matrix output clobbering problem without modifying any
reusable workflow:

- detect-versions: combine both path patterns (docker/pytorch/X.Y and
  pytorch-X.Y-*) in a single grep to avoid missing versions when both
  patterns are present in the same PR. Read each version's config YAML
  and output a rich configs matrix with all metadata fields per entry.

- Downstream test jobs (sanity, security, telemetry) iterate over the
  configs matrix and pass matrix.* fields directly as individual inputs
  to the unmodified reusable workflows.

- Extract unit-test from build-images into its own parallel job.

- Autorelease workflows: accept `version` (e.g., 2.11) instead of
  full config-file path for simpler manual dispatch.
@Eren-Jeager123 Eren-Jeager123 force-pushed the refactor/pytorch-versioned-structure branch from 421e17c to 668831f Compare May 22, 2026 16:42
…ioned-structure

# Conflicts:
#	.github/workflows/pr-pytorch-ec2-cuda.yml
…CVE-2026-33811, CVE-2026-42499

Explicitly upgrade libcap in security patch layer. The --security flag
alone doesn't pick these up yet as they may not be classified as
security advisories in the AL2023 metadata.

Member

I'm actually curious about this. What's the reason for not combining this with the config file? That way we'd only have one file to track all the package versions, etc. You don't necessarily need to change this in this PR because it's out of scope, but I'm curious.

Contributor Author

This file was added by Junpu before I implemented the PyTorch 2.11 currency & migration. I think those values are mainly used for Dockerfile args, but I don't see any blockers to merging them into the config file. Let's discuss this in the future.

Combine the determine-config and load-config jobs into a single job
to avoid spinning up two ubuntu-latest instances sequentially. The
config file path determination is now just the first step in
load-config.

These outputs are unused — downstream jobs construct image URIs
directly from matrix.version. The outputs also suffered from the
matrix clobbering problem (only last leg exposed).

When only test files change (no build-change), build-images is skipped
but test jobs still run. Previously they'd fail trying to pull a
non-existent CI image. Now they fall back to the prod image from ECR
using the conditional:

  needs.build-images.result == 'success' && CI_IMAGE || PROD_IMAGE

Added prod_image field to the configs matrix (read from config YAML).
@Eren-Jeager123 Eren-Jeager123 force-pushed the refactor/pytorch-versioned-structure branch from a78fe00 to 6d41163 Compare May 28, 2026 23:53
Add a `version` output (major.minor) to the load-config job so
downstream jobs can use needs.load-config.outputs.version directly
instead of repeating the cut -d. -f1,2 parsing in every step.
@Eren-Jeager123 Eren-Jeager123 force-pushed the refactor/pytorch-versioned-structure branch from 6d41163 to 1e13cec Compare May 28, 2026 23:55