Dev/multiple gpu by hzhangxyz · Pull Request #230 · USTC-KnowledgeComputingLab/qmp-kit

hzhangxyz · 2026-04-17T05:04:35Z

No description provided.

Add `devices` parameter to RuntimeContext and propagate through the entire computation pipeline. Hamiltonian data is cached per device and computations are split across GPUs for parallel execution.

- Add `devices` attribute to RuntimeContext (defaults to [device])
- Add device caching in Hamiltonian with _site_dict, _kind_dict, _coef_dict
- Update ModelProto protocol with devices parameter
- Update all Model implementations (fcidump, hubbard, ising, etc.)
- Update all algorithms (haar, vmc, guide, pert, chop_imag)
- Multi-device parallel execution for apply_within, find_relative, list_relative, and diagonal_term operations

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
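The per-device caching and batch splitting described in this commit could look roughly like the sketch below. It is not the code from the PR: `DeviceCache` and `split_for_devices` are hypothetical names, and the sketch assumes the cached data is a single tensor and that work is split along the first dimension.

```python
import torch


class DeviceCache:
    """Hypothetical helper: keeps one copy of a tensor per device."""

    def __init__(self, source):
        self._source = source
        self._copies = {}  # maps torch.device -> tensor living on that device

    def get(self, device):
        # Copy to the device only on first request, then reuse the copy.
        if device not in self._copies:
            self._copies[device] = self._source.to(device)
        return self._copies[device]


def split_for_devices(batch, devices):
    """Pair each device with one chunk of the batch (split along dim 0)."""
    # torch.chunk may return fewer chunks than requested for small batches;
    # zip then simply leaves the extra devices without work.
    chunks = torch.chunk(batch, len(devices), dim=0)
    return list(zip(devices, chunks))


if __name__ == "__main__":
    cpu = torch.device("cpu")  # stand-in so the sketch runs without a GPU
    cache = DeviceCache(torch.arange(4))
    print(cache.get(cpu) is cache.get(cpu))  # the second call reuses the copy
    for device, chunk in split_for_devices(torch.arange(10).reshape(5, 2), [cpu, cpu]):
        print(device, tuple(chunk.shape))
```

With real hardware, the device list would hold entries such as `torch.device("cuda:0")` and `torch.device("cuda:1")` instead of the CPU stand-ins.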

- Change condition from len(devices) == 1 to len(devices) <= 1 to handle empty devices list []
- Add empty results check before torch.cat in find_relative, list_relative, and diagonal_term
- Return appropriate empty tensors when all chunks are skipped

Fixes issues identified in correctness review.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
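The two guards in this commit can be illustrated with a minimal sketch. `run_chunked` and `gather_chunks` are hypothetical names, and the `(0, width)` shape of the empty result is an assumption made for illustration; the shapes returned by the PR's functions are not shown in this page.

```python
import torch


def gather_chunks(results, width, dtype=torch.int64, device="cpu"):
    """Concatenate per-device results; torch.cat rejects an empty list."""
    if not results:
        # Assumed shape for "no results": zero rows, same number of columns.
        return torch.empty((0, width), dtype=dtype, device=device)
    return torch.cat([r.to(device) for r in results], dim=0)


def run_chunked(batch, devices, fn):
    """Apply fn to batch, splitting across devices when there are several."""
    if len(devices) <= 1:
        # Covers both [] and a single device: no splitting needed.
        target = devices[0] if devices else batch.device
        return fn(batch.to(target))
    results = []
    for device, chunk in zip(devices, torch.chunk(batch, len(devices), dim=0)):
        if chunk.shape[0] == 0:
            continue  # skip empty chunks instead of launching empty work
        results.append(fn(chunk.to(device)))
    return gather_chunks(results, batch.shape[1], batch.dtype, batch.device)


if __name__ == "__main__":
    cpu = torch.device("cpu")
    batch = torch.arange(6).reshape(3, 2)
    print(run_chunked(batch, [], lambda x: x * 2))          # empty device list
    print(run_chunked(batch, [cpu, cpu], lambda x: x * 2))  # two chunks
    print(gather_chunks([], width=2).shape)                 # empty result
```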

Use CUDA streams to achieve true parallel execution across multiple GPUs. Each device gets its own stream, enabling concurrent kernel execution. Data transfers use non_blocking=True for asynchronous movement.

Key changes:

- Add global _stream_cache and _get_stream() helper for stream management
- All 4 functions (apply_within, find_relative, list_relative, diagonal_term) now use `with torch.cuda.stream(stream)` for parallel execution
- Results collected in pending_results, synchronized with stream.synchronize()

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Without torch.cuda.device(device.index), stream operations execute on the current device rather than the intended device, causing serial execution on a single GPU. Wrap stream context with device context to ensure each GPU executes its kernel concurrently.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
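One way to express the nesting described here is a small context manager that enters the device context before the stream context. This is a sketch only: `device_and_stream` is a hypothetical helper, and the CPU branch is added so the example runs without a GPU.

```python
import contextlib

import torch


@contextlib.contextmanager
def device_and_stream(device, stream=None):
    """Make `device` current, then make `stream` current on it."""
    if device.type != "cuda":
        yield  # nothing to select for non-CUDA devices
        return
    # Outer context: select the GPU. Inner context: select its stream.
    # torch.cuda.stream(None) is documented as a no-op.
    with torch.cuda.device(device.index):
        with torch.cuda.stream(stream):
            yield


def scaled_sum(x, device, stream=None):
    """Example workload run under the combined context."""
    with device_and_stream(device, stream):
        return (x.to(device) * 2).sum()


if __name__ == "__main__":
    print(scaled_sum(torch.ones(3), torch.device("cpu")).item())
```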

Replace sequential CUDA stream launching with ThreadPoolExecutor. Each thread handles one GPU with its own CUDA stream, enabling true parallel execution across multiple GPUs.

Key changes:

- Add ThreadPoolExecutor import
- Each compute_chunk runs in a separate thread
- Thread uses torch.cuda.device() + torch.cuda.stream()
- All GPUs launch simultaneously via ThreadPoolExecutor
- Stream.synchronize() called within thread before returning result

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
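The thread-per-GPU design listed above might be sketched as follows. `compute_chunk` is named in the commit message, but its signature and body here are assumptions, and `parallel_apply` is a hypothetical driver. For brevity the sketch creates a fresh stream per call rather than reusing a cached one, and it assumes a non-empty device list.

```python
from concurrent.futures import ThreadPoolExecutor

import torch


def compute_chunk(device, chunk, fn):
    """Run fn on one chunk; intended to execute in its own worker thread."""
    if device.type != "cuda":
        return fn(chunk.to(device))  # fallback so the sketch runs on CPU
    stream = torch.cuda.Stream(device=device)
    with torch.cuda.device(device.index), torch.cuda.stream(stream):
        out = fn(chunk.to(device, non_blocking=True))
    # Wait inside the thread so the caller only ever sees finished results.
    stream.synchronize()
    return out


def parallel_apply(batch, devices, fn):
    """Split batch along dim 0 and process one chunk per device in parallel."""
    chunks = torch.chunk(batch, len(devices), dim=0)
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        futures = [
            pool.submit(compute_chunk, device, chunk, fn)
            for device, chunk in zip(devices, chunks)
        ]
        # Collecting in submission order keeps the rows in their original order.
        outs = [future.result() for future in futures]
    return torch.cat([out.to(batch.device) for out in outs], dim=0)


if __name__ == "__main__":
    cpu = torch.device("cpu")
    batch = torch.arange(10, dtype=torch.float32).reshape(5, 2)
    print(parallel_apply(batch, [cpu, cpu], lambda x: x.sum(dim=1, keepdim=True)))
```

Because each worker enters its own device and stream contexts, one thread's selection does not interfere with another's, which is the property this commit relies on.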

hzhangxyz and others added 5 commits April 17, 2026 12:34

hzhangxyz wants to merge 5 commits into main from dev/multiple-gpu
