refactor: version PyTorch directory structure for multi-version support #6091
Open
Eren-Jeager123 wants to merge 34 commits into
Move all PyTorch build artifacts under docker/pytorch/2.11/ to support maintaining multiple PyTorch versions concurrently (1-year support window per version).
Structure change:
docker/pytorch/Dockerfile.cuda → docker/pytorch/2.11/Dockerfile.cuda
docker/pytorch/versions-cuda.env → docker/pytorch/2.11/versions-cuda.env
docker/pytorch/cuda/pyproject.toml → docker/pytorch/2.11/cuda/pyproject.toml
(same pattern for CPU)
.github/config/image/pytorch-ec2-cuda.yml → pytorch-2.11-ec2-cuda.yml
.github/workflows/pr-pytorch-ec2-cuda.yml → pr-pytorch-2.11-ec2-cuda.yml
(same pattern for all 4 variants × PR + autorelease)
Adding PyTorch 2.12 when it releases means creating docker/pytorch/2.12/, new configs, and new workflows, without touching 2.11.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The versions env file moved from docker/pytorch/versions-cuda.env to docker/pytorch/2.11/versions-cuda.env. Use glob to find the file under any version directory. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
test_versions.py now requires DLC_PYTORCH_VERSION env var (e.g., "2.11")
to locate the correct versions-{cuda,cpu}.env under the versioned
directory. All 4 PR workflows pass it via docker run -e.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
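A minimal sketch of the lookup this commit describes, assuming the env file sits at docker/pytorch/<version>/versions-<device>.env relative to the repository root. The helper name `versions_env_path` and its `repo_root` parameter are hypothetical; only the `DLC_PYTORCH_VERSION` variable and the directory layout come from the commit messages, and the real test_versions.py may resolve the path differently.

```python
import os
from pathlib import Path


def versions_env_path(device: str, repo_root: Path = Path(".")) -> Path:
    """Return the versions-<device>.env path for the configured PyTorch version.

    Hypothetical helper, not the code from test_versions.py.
    """
    if device not in ("cuda", "cpu"):
        raise ValueError(f"unknown device: {device!r}")
    # Fail loudly when the variable is missing instead of guessing a version.
    version = os.environ.get("DLC_PYTORCH_VERSION")
    if not version:
        raise RuntimeError("DLC_PYTORCH_VERSION must be set, e.g. '2.11'")
    return repo_root / "docker" / "pytorch" / version / f"versions-{device}.env"
```

Raising on a missing variable matches the commit's wording that the test "requires" the variable; a silent default would hide a misconfigured workflow.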
Same fix as the PR workflows — autorelease workflows also run unit tests that need the versioned env file path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Only unit tests (test_versions.py) need this env var. Removed from single-GPU and multi-GPU test containers in EC2 CUDA workflows. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The script had a hardcoded docker/pytorch/Dockerfile path that no longer exists after the versioned restructure. Accept the Dockerfile path as a parameter and update all 3 callers to pass it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR workflows detect the PyTorch version from changed file paths (docker/pytorch/X.Y/ or pytorch-X.Y-*.yml), falling back to the LATEST_PYTORCH_VERSION env var for shared file changes.
Autorelease workflows use multi-cron scheduling with a case mapping from cron expression to version, staggered 10 min apart per version. They also support workflow_dispatch with an explicit pytorch-version input.
Adding PyTorch 2.12: create docker/pytorch/2.12/ + config files, then add one cron line + one case entry per autorelease workflow. No new workflow files needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sirutBuasai
reviewed
May 19, 2026
Per team feedback: remove the redundant version output from the determine-config job. Map cron directly to config file path. Derive the docker directory version from load-config's framework-version output (cut major.minor from "2.11.0" → "2.11"). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
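The major.minor derivation mentioned in this commit can be sketched as follows. The function name `short_version` is hypothetical, and this version is deliberately stricter than a plain `cut -d. -f1,2`: it rejects inputs that do not start with two numeric components.

```python
def short_version(framework_version: str) -> str:
    """Return 'major.minor' from a full version string such as '2.11.0'."""
    parts = framework_version.split(".")
    # Unlike `cut`, refuse inputs without a numeric major and minor component.
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError(f"not a major.minor[.patch] version: {framework_version!r}")
    return ".".join(parts[:2])
```

The result is the directory name under docker/pytorch/, so a malformed framework-version fails at this step rather than later as a missing path.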
Addresses team feedback:
- Multi-version PRs now build ALL changed versions in parallel (not just the first detected one)
- Removed separate load-config job: config parsing is inlined into build-images and detect-versions (eliminates 30s serialization)
- Uses strategy.matrix with fail-fast: false for parallel builds
Structure: gatekeeper → detect-versions → build-images (matrix) → test jobs
Only the latest version runs the full test suite (sanity, security, telemetry, single-gpu). All versions validate that the build compiles.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s outputs
Per team feedback: detect-versions should only detect versions and path changes. Config values are now output by the build-images matrix job (which already loads config per version). Downstream test jobs reference build-images outputs instead.
Also removes the latest-version guard on unit tests: all matrix versions now run unit tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…oded paths
Per team feedback: Dockerfiles now use ARG DLC_PYTORCH_VERSION=2.11 for COPY paths instead of hardcoding "docker/pytorch/2.11/..." throughout. Workflows pass --build-arg DLC_PYTORCH_VERSION to ensure the value matches the matrix version.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CUDA workflows only trigger on CUDA-related paths: docker/pytorch/*/Dockerfile.cuda, docker/pytorch/*/cuda/**, versions-cuda.env
CPU workflows only trigger on CPU-related paths: docker/pytorch/*/Dockerfile.cpu, docker/pytorch/*/cpu/**, versions-cpu.env
Previously all 4 workflows used docker/pytorch/**, which meant a CUDA Dockerfile change triggered the CPU workflow (and vice versa).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
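To illustrate why the narrower filters stop cross-triggering, here is a rough Python approximation of path-filter matching. It is a sketch under stated assumptions: `*` is treated as "anything except /" and `**` as "anything", which covers only part of the GitHub Actions filter syntax. The full docker/pytorch/*/versions-{cuda,cpu}.env patterns are assumed (the commit message lists only the file names), and the names `CUDA_PATHS`, `CPU_PATHS`, and `triggers` are hypothetical.

```python
import re

# Assumed expansions of the filters named in the commit message.
CUDA_PATHS = [
    "docker/pytorch/*/Dockerfile.cuda",
    "docker/pytorch/*/cuda/**",
    "docker/pytorch/*/versions-cuda.env",
]
CPU_PATHS = [
    "docker/pytorch/*/Dockerfile.cpu",
    "docker/pytorch/*/cpu/**",
    "docker/pytorch/*/versions-cpu.env",
]


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified path filter into a compiled regex."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")      # '**' crosses directory boundaries
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")   # '*' stays within one path segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def triggers(filters: list[str], changed_file: str) -> bool:
    """Return True if any filter fully matches the changed file path."""
    return any(glob_to_regex(f).fullmatch(changed_file) for f in filters)
```

With these filters a change to docker/pytorch/2.11/Dockerfile.cuda matches only the CUDA list, whereas the old docker/pytorch/** filter would have matched in every workflow.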
Docker requires ARGs to be re-declared after each FROM — global ARGs are only available in FROM lines, not in stage instructions like COPY. Without the re-declaration, DLC_PYTORCH_VERSION resolves to empty string causing "not found" errors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All test jobs (sanity, security, telemetry, single-gpu, efa, sagemaker) now matrix over detected versions. Each version gets its own test run with a constructed image-uri based on the version-specific CI tag. GitHub Actions supports strategy.matrix on reusable workflow calls. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The load-config action installs yq but our inlined config step didn't. yq is not available by default on CodeBuild runners, causing all config outputs (framework, container-type, etc.) to be empty. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a workflow is cancelled mid-EFA-test, the finally block may not execute, leaking p4d.24xlarge instances and EIPs. This adds a cleanup_stale_efa_instances() call at the start of each test run that terminates instances tagged "CI-CD EFA efa-test" older than 4 hours and releases orphaned EIPs. Prevents: AddressLimitExceeded errors from accumulated leaked EIPs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
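The selection step of such a cleanup could look roughly like the sketch below. It only picks out stale instances from records shaped like the entries boto3's `describe_instances` returns (`InstanceId`, `LaunchTime`, `Tags`); it does not call AWS, terminate anything, or release EIPs. The function name `select_stale_instance_ids` and the tag key `Name` are assumptions, since the commit message gives only the tag value and the 4-hour threshold.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=4)        # threshold from the commit message
EFA_TAG_VALUE = "CI-CD EFA efa-test"    # tag value from the commit message


def select_stale_instance_ids(instances, now, tag_key="Name"):
    """Return IDs of EFA test instances launched more than STALE_AFTER ago.

    `tag_key` is an assumption; the commit does not say which key carries the tag.
    """
    stale = []
    for inst in instances:
        tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
        if tags.get(tag_key) != EFA_TAG_VALUE:
            continue  # never touch instances that are not EFA test instances
        if now - inst["LaunchTime"] > STALE_AFTER:
            stale.append(inst["InstanceId"])
    return stale
```

Keeping the selection pure makes the age and tag logic easy to unit test without AWS credentials; the caller would pass `datetime.now(timezone.utc)` as `now`.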
Logs LD_LIBRARY_PATH, ofi-nccl lib presence, all_reduce_perf binary, fi_info output, NCCL lib path, and full NCCL log on failure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Solve the GitHub Actions matrix output clobbering problem without modifying any reusable workflow. The detect-versions job now reads each version's config YAML and outputs a rich configs matrix where each entry carries all metadata fields (framework, python-version, cuda-version, etc.). Downstream test jobs iterate over this configs matrix and pass matrix.* fields directly as individual inputs to the unmodified reusable workflows. Each matrix leg carries its own correct metadata — no dependency on build-images outputs. Also extract unit-test from build-images into its own parallel job.
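A minimal sketch of how a configs matrix like this might be assembled, assuming each version's config YAML has already been parsed into a dict. The function name `build_configs_matrix` and the sample field values in the usage note are hypothetical; only the idea of one entry per version carrying its own metadata comes from the commit message.

```python
import json


def _version_key(version: str) -> tuple:
    """Sort '2.9' before '2.11' by comparing numeric components."""
    return tuple(int(part) for part in version.split("."))


def build_configs_matrix(configs: dict) -> str:
    """Serialize one matrix entry per version, each carrying its own metadata.

    `configs` maps a version such as '2.11' to its parsed config fields.
    """
    entries = []
    for version in sorted(configs, key=_version_key):
        # The explicit version wins over any 'version' key inside the config.
        entries.append({**configs[version], "version": version})
    return json.dumps(entries, separators=(",", ":"))
```

For example, `build_configs_matrix({"2.11": {"framework": "pytorch", "cuda-version": "cu130"}})` yields a one-element JSON list that a workflow could feed to `fromJson`. Because each leg reads its fields from `matrix.*`, no leg depends on job outputs that another leg may overwrite.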
12bbb31 to 421e17c
Solve the matrix output clobbering problem without modifying any reusable workflow:
- detect-versions: combine both path patterns (docker/pytorch/X.Y and pytorch-X.Y-*) in a single grep to avoid missing versions when both patterns are present in the same PR. Read each version's config YAML and output a rich configs matrix with all metadata fields per entry.
- Downstream test jobs (sanity, security, telemetry) iterate over the configs matrix and pass matrix.* fields directly as individual inputs to the unmodified reusable workflows.
- Extract unit-test from build-images into its own parallel job.
- Autorelease workflows: accept `version` (e.g., 2.11) instead of full config-file path for simpler manual dispatch.
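The combined-pattern detection can be sketched in Python as below. This is an illustration, not the workflow's actual grep: the regex, the function name `detect_versions`, and the `latest_version` parameter (standing in for the LATEST_PYTORCH_VERSION fallback mentioned in an earlier commit) are assumptions.

```python
import re

# One alternation covers both docker/pytorch/X.Y/ and pytorch-X.Y-* paths.
VERSION_PATTERN = re.compile(r"docker/pytorch/(\d+\.\d+)/|pytorch-(\d+\.\d+)-")


def detect_versions(changed_files, latest_version):
    """Return the sorted PyTorch versions touched by a list of changed paths.

    Falls back to [latest_version] when only shared files changed.
    """
    found = set()
    for path in changed_files:
        for match in VERSION_PATTERN.finditer(path):
            # Exactly one of the two groups is set for each match.
            found.add(match.group(1) or match.group(2))
    if not found:
        return [latest_version]
    return sorted(found, key=lambda v: tuple(int(p) for p in v.split(".")))
```

Scanning every path against both alternatives in one pass means a PR that touches docker/pytorch/2.11/ and a pytorch-2.12-* config reports both versions, which is the case the commit says separate greps could miss.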
421e17c to 668831f
…ioned-structure
# Conflicts:
# .github/workflows/pr-pytorch-ec2-cuda.yml
…CVE-2026-33811, CVE-2026-42499
Explicitly upgrade libcap in the security patch layer. The --security flag alone doesn't pick these up yet, as they may not be classified as security advisories in the AL2023 metadata.
sirutBuasai
reviewed
May 28, 2026
Member
I'm actually curious about this. What's the reason for not combining this with the config file? That way we'd only have one file to track all the package versions, etc. You don't necessarily need to change this in this PR since it's out of scope, but I'm curious.
Contributor
Author
This file was added by Junpu before I implemented the PyTorch 2.11 currency & migration. I think it is mainly used for Dockerfile args, but I don't see any blockers to merging it into the config file. Let's discuss this in the future.
Combine the determine-config and load-config jobs into a single job to avoid spinning up two ubuntu-latest instances sequentially. The config file path determination is now just the first step in load-config.
These outputs are unused — downstream jobs construct image URIs directly from matrix.version. The outputs also suffered from the matrix clobbering problem (only last leg exposed).
When only test files change (no build-change), build-images is skipped but test jobs still run. Previously they'd fail trying to pull a non-existent CI image. Now they fall back to the prod image from ECR using the conditional:
needs.build-images.result == 'success' && CI_IMAGE || PROD_IMAGE
Added prod_image field to the configs matrix (read from config YAML).
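The fallback conditional behaves like Python's `and`/`or` chain, which the sketch below uses to mirror it. The function name `select_image` and its parameters are hypothetical stand-ins for the workflow expression's operands.

```python
def select_image(build_result: str, ci_image: str, prod_image: str) -> str:
    """Pick the image a test job should pull.

    Mirrors `result == 'success' && CI_IMAGE || PROD_IMAGE`: both operators
    return one of their operands, so an empty CI image also falls through
    to the prod image.
    """
    return (build_result == "success" and ci_image) or prod_image
```

A skipped or failed build yields the prod image. The same fall-through happens if the build succeeded but the CI image string is empty, which is worth knowing when debugging why a job pulled the prod image.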
a78fe00 to 6d41163
Add a `version` output (major.minor) to the load-config job so downstream jobs can use needs.load-config.outputs.version directly instead of repeating the cut -d. -f1,2 parsing in every step.
6d41163 to 1e13cec
Summary
Restructure PyTorch to support maintaining multiple versions concurrently (1-year support window). Versioned directories isolate each version's build artifacts. Workflows are version-agnostic — no new workflow files needed when adding a future version.
1. Versioned directory structure
Shared scripts (scripts/pytorch/) and tests (test/pytorch/) stay in place.

2. PR workflows: configs matrix (zero changes to reusable workflows)
Key design: detect-versions reads each version's config YAML and outputs a rich configs matrix where each entry carries all metadata fields. Downstream test jobs iterate over this matrix and pass matrix.* fields directly as individual inputs to unmodified reusable workflows.
This solves the GitHub Actions limitation where matrix job outputs only expose the last-finishing leg's values: metadata travels with the matrix, not as job outputs.
Version detection combines both path patterns (docker/pytorch/X.Y and pytorch-X.Y-*) in a single grep.
Prod image fallback: When only test files change (build-change: false), build-images is skipped and test jobs automatically fall back to the prod image from ECR, using needs.build-images.result == 'success' && CI_IMAGE || PROD_IMAGE.

3. Autorelease workflows: simplified
- load-config job (merged determine-config + load-config to save an instance)
- version dispatch input instead of full config file path (e.g., 2.11)
- framework-short-version output (e.g., "2.11") parsed once, used everywhere; no repeated cut -d. -f1,2 in downstream jobs

4. Dockerfile ARG for version paths
Dockerfiles use ARG DLC_PYTORCH_VERSION=2.11 instead of hardcoding:
COPY docker/pytorch/${DLC_PYTORCH_VERSION}/cuda/pyproject.toml /tmp/build/

5. Additional improvements
- runtime-image-uri/sagemaker-image-uri outputs from build jobs
- DLC_PYTORCH_VERSION env var to locate versioned env files

Adding PyTorch 2.12
- docker/pytorch/2.11/ → docker/pytorch/2.12/, update version pins
- pytorch-2.12-{ec2,sagemaker}-{cuda,cpu}.yml

Test plan
- docker/pytorch/2.11/** and pytorch-2.11-* changes
- DLC_PYTORCH_VERSION build-arg resolves correctly in Dockerfile COPY paths
- workflow_dispatch with version input works