Dev/multiple gpu #230
Draft
hzhangxyz wants to merge 5 commits into
Add `devices` parameter to RuntimeContext and propagate it through the entire computation pipeline. Hamiltonian data is cached per device and computations are split across GPUs for parallel execution.

- Add `devices` attribute to RuntimeContext (defaults to [device])
- Add device caching in Hamiltonian with _site_dict, _kind_dict, _coef_dict
- Update ModelProto protocol with devices parameter
- Update all Model implementations (fcidump, hubbard, ising, etc.)
- Update all algorithms (haar, vmc, guide, pert, chop_imag)
- Multi-device parallel execution for apply_within, find_relative, list_relative, and diagonal_term operations

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
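A minimal sketch of the two ideas in this commit: caching a tensor once per device and splitting a batch across the `devices` list. The class `DeviceCache` and the function `split_for_devices` are hypothetical names for illustration; only `_site_dict` and `devices` come from the commit message, and the real Hamiltonian code may be organised differently.

```python
import torch


class DeviceCache:
    """Hypothetical stand-in for the per-device cache described above."""

    def __init__(self, site):
        self._site_source = site  # original tensor, on whatever device it was built
        self._site_dict = {}      # torch.device -> copy of the tensor on that device

    def site_on(self, device):
        # Copy to the device only on first request, then reuse the cached copy.
        if device not in self._site_dict:
            self._site_dict[device] = self._site_source.to(device)
        return self._site_dict[device]


def split_for_devices(configs, devices):
    """Split rows of `configs` into one chunk per device, dropping empty chunks."""
    chunks = torch.tensor_split(configs, len(devices))
    return [(d, c) for d, c in zip(devices, chunks) if c.shape[0] > 0]
```

With two devices and five rows, this yields chunks of three and two rows; with a single row, the second device receives nothing and is skipped.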
- Change condition from len(devices) == 1 to len(devices) <= 1 to handle empty devices list []
- Add empty results check before torch.cat in find_relative, list_relative, and diagonal_term
- Return appropriate empty tensors when all chunks are skipped

Fixes issues identified in correctness review.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
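The guard matters because `torch.cat` raises an error when given an empty list. A sketch of both checks follows; the helper names `use_single_device_path` and `concat_or_empty`, and the `(0, width)` shape of the empty result, are assumptions for illustration rather than the code in this PR.

```python
import torch


def use_single_device_path(devices):
    # `<= 1` also covers devices == [], which `== 1` would miss.
    return len(devices) <= 1


def concat_or_empty(results, width, dtype, device):
    """Concatenate per-device results, or return an empty tensor if none exist."""
    if not results:
        # All chunks were skipped: return a well-typed empty tensor.
        return torch.empty((0, width), dtype=dtype, device=device)
    return torch.cat([r.to(device) for r in results], dim=0)
```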
Use CUDA streams to achieve true parallel execution across multiple GPUs. Each device gets its own stream, enabling concurrent kernel execution. Data transfers use non_blocking=True for asynchronous movement.

Key changes:
- Add global _stream_cache and _get_stream() helper for stream management
- All 4 functions (apply_within, find_relative, list_relative, diagonal_term) now use `with torch.cuda.stream(stream)` for parallel execution
- Results collected in pending_results, synchronized with stream.synchronize()

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Without torch.cuda.device(device.index), stream operations execute on the current device rather than the intended device, causing serial execution on a single GPU.

Wrap stream context with device context to ensure each GPU executes its kernel concurrently.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
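The launch-then-synchronize pattern with the device context wrapped around the stream context might look like the sketch below. The names `_stream_cache`, `_get_stream`, and `pending_results` are taken from the commit messages, but their bodies here are assumptions, as is the function `run_all`. The sketch also assumes each input chunk is fully materialized before launch (for example, a CPU tensor), and it falls back to a plain call on non-CUDA devices so it can run on a CPU-only machine.

```python
import torch

_stream_cache = {}  # torch.device -> torch.cuda.Stream


def _get_stream(device):
    # Create one stream per CUDA device and reuse it on later calls.
    if device not in _stream_cache:
        _stream_cache[device] = torch.cuda.Stream(device=device)
    return _stream_cache[device]


def run_all(fn, pairs):
    """Launch fn on every (device, chunk) pair, then wait for all of them."""
    pending_results = []
    for device, chunk in pairs:
        if device.type == "cuda":
            stream = _get_stream(device)
            # The device context makes `device` current; the stream context
            # then queues the copy and the kernels on that device's stream.
            with torch.cuda.device(device.index), torch.cuda.stream(stream):
                out = fn(chunk.to(device, non_blocking=True))
            pending_results.append((out, stream))
        else:
            pending_results.append((fn(chunk.to(device)), None))
    results = []
    for out, stream in pending_results:
        if stream is not None:
            stream.synchronize()  # block until this device has finished
        results.append(out)
    return results
```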
Replace sequential CUDA stream launching with ThreadPoolExecutor. Each thread handles one GPU with its own CUDA stream, enabling true parallel execution across multiple GPUs.

Key changes:
- Add ThreadPoolExecutor import
- Each compute_chunk runs in a separate thread
- Thread uses torch.cuda.device() + torch.cuda.stream()
- All GPUs launch simultaneously via ThreadPoolExecutor
- Stream.synchronize() called within thread before returning result

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
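A rough sketch of the thread-per-device variant is shown below. `compute_chunk` is named in the commit message, but its signature and body here are assumptions, and `parallel_apply` is a hypothetical wrapper. As before, non-CUDA devices take a plain code path so the example runs without a GPU.

```python
from concurrent.futures import ThreadPoolExecutor

import torch


def compute_chunk(fn, chunk, device):
    """Run fn on one chunk; intended to be called from a worker thread."""
    if device.type != "cuda":
        return fn(chunk.to(device))
    stream = torch.cuda.Stream(device=device)
    with torch.cuda.device(device.index), torch.cuda.stream(stream):
        out = fn(chunk.to(device, non_blocking=True))
    stream.synchronize()  # finish this device's work before returning
    return out


def parallel_apply(fn, pairs):
    """Submit one task per (device, chunk) pair and collect results in order."""
    if not pairs:
        return []
    with ThreadPoolExecutor(max_workers=len(pairs)) as pool:
        futures = [pool.submit(compute_chunk, fn, c, d) for d, c in pairs]
        return [f.result() for f in futures]
```

Collecting `f.result()` in submission order keeps the output aligned with the input chunks, so a later concatenation preserves row order.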
No description provided.