Dev/multiple gpu #230

Draft
hzhangxyz wants to merge 5 commits into main from dev/multiple-gpu

@hzhangxyz
Member

No description provided.

hzhangxyz and others added 5 commits April 17, 2026 12:34
Add `devices` parameter to RuntimeContext and propagate through the
entire computation pipeline. Hamiltonian data is cached per device
and computations are split across GPUs for parallel execution.

- Add `devices` attribute to RuntimeContext (defaults to [device])
- Add device caching in Hamiltonian with _site_dict, _kind_dict, _coef_dict
- Update ModelProto protocol with devices parameter
- Update all Model implementations (fcidump, hubbard, ising, etc.)
- Update all algorithms (haar, vmc, guide, pert, chop_imag)
- Multi-device parallel execution for apply_within, find_relative,
  list_relative, and diagonal_term operations

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Change condition from len(devices) == 1 to len(devices) <= 1
  to handle an empty devices list ([])
- Add empty results check before torch.cat in find_relative,
  list_relative, and diagonal_term
- Return appropriate empty tensors when all chunks are skipped

Fixes issues identified in correctness review.
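
The guard described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `split_for_devices` and `gather_results` are hypothetical helper names, and the sketch assumes results are concatenated along dimension 0. It relies on the fact that `torch.cat` raises an error when given an empty list, so the empty case must be handled before the call.

```python
from typing import List, Sequence

import torch


def split_for_devices(batch: torch.Tensor, devices: Sequence[torch.device]) -> List[torch.Tensor]:
    """Split `batch` along dim 0 into at most len(devices) chunks (hypothetical helper)."""
    # `<= 1` rather than `== 1`, so an empty devices list also takes the
    # single-chunk path instead of asking torch.chunk for zero chunks.
    if len(devices) <= 1:
        return [batch]
    return list(torch.chunk(batch, len(devices), dim=0))


def gather_results(results: List[torch.Tensor], like: torch.Tensor) -> torch.Tensor:
    """Concatenate per-device results on the device of `like` (hypothetical helper)."""
    # torch.cat([]) raises, so return an empty tensor with matching
    # trailing shape, dtype, and device when every chunk was skipped.
    if not results:
        return like.new_empty((0, *like.shape[1:]))
    return torch.cat([r.to(like.device) for r in results], dim=0)
```

Returning an empty tensor with the same trailing shape and dtype keeps callers from needing a special case for "nothing was computed".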

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Use CUDA streams to achieve true parallel execution across multiple GPUs.
Each device gets its own stream, enabling concurrent kernel execution.
Data transfers use non_blocking=True for asynchronous movement.

Key changes:
- Add global _stream_cache and _get_stream() helper for stream management
- All 4 functions (apply_within, find_relative, list_relative, diagonal_term)
  now use `with torch.cuda.stream(stream)` for parallel execution
- Results collected in pending_results, synchronized with stream.synchronize()

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Without torch.cuda.device(device.index), stream operations execute on
the current device rather than the intended device, causing serial
execution on a single GPU. Wrap stream context with device context to
ensure each GPU executes its kernel concurrently.
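
A minimal sketch of the nested device and stream contexts described above, under stated assumptions: `launch_on_device` is a hypothetical name, the doubling stands in for the real kernel, and the input is assumed to be a CPU tensor that is already fully materialized. The sketch passes the `torch.device` object to `torch.cuda.device` where the commit message uses `device.index`. It falls back to a plain computation on non-CUDA devices so it also runs on a CPU-only machine.

```python
import torch


def launch_on_device(x: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Run a placeholder computation on `device` inside its own stream (hypothetical helper)."""
    if device.type != "cuda":
        # No streams outside CUDA: compute directly.
        return x.to(device) * 2

    stream = torch.cuda.Stream(device=device)
    # Make `device` the current device first, then make `stream` the
    # current stream, so the copy and the kernel are queued on that GPU.
    with torch.cuda.device(device), torch.cuda.stream(stream):
        y = x.to(device, non_blocking=True) * 2
    # Wait for the queued work before handing the result back.
    stream.synchronize()
    return y
```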

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Replace sequential CUDA stream launching with ThreadPoolExecutor.
Each thread handles one GPU with its own CUDA stream, enabling
true parallel execution across multiple GPUs.

Key changes:
- Add ThreadPoolExecutor import
- Each compute_chunk runs in a separate thread
- Thread uses torch.cuda.device() + torch.cuda.stream()
- All GPUs launch simultaneously via ThreadPoolExecutor
- Stream.synchronize() called within thread before returning result
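
The one-thread-per-device pattern listed above might look roughly like this. It is a device-agnostic sketch using only the standard library: `run_per_device` is a hypothetical name, and `compute_chunk` is passed in as a plain callable. In the setting this commit describes, that callable would enter `torch.cuda.device()` and `torch.cuda.stream()` and synchronize its stream before returning.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

D = TypeVar("D")  # device handle, e.g. a torch.device
C = TypeVar("C")  # one chunk of input
R = TypeVar("R")  # per-chunk result


def run_per_device(
    devices: Sequence[D],
    chunks: Sequence[C],
    compute_chunk: Callable[[D, C], R],
) -> List[R]:
    """Run compute_chunk(device, chunk) in one thread per device (hypothetical helper)."""
    # ThreadPoolExecutor rejects max_workers=0, so handle the empty case first.
    if not devices:
        return []
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        # map() submits all calls up front and yields results in input
        # order, so results line up with `devices` regardless of which
        # thread finishes first.
        return list(pool.map(compute_chunk, devices, chunks))
```

Because each call blocks inside its own thread until its work is done, the caller gets back finished results and needs no further synchronization.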

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>