feat: CUDA IPC zero-copy GPU transport (TD↔SD input, output, ControlNet) #15
Open
forkni wants to merge 17 commits into
…code error on Windows
… batch mismatch in calibration
kvo_cache_in_* tensors have ONNX dim 0 = 2 (hard-static K/V pair), not a symbolic batch dim. The previous naïve _max_rows tile pumped sample to 2×_n_itr rows, causing modelopt's CalibrationDataProvider to compute n_itr=2×_n_itr (sample's symbolic dim 0 resolves to 1) and split kvo into chunks of shape (1,...) instead of (2,...). ORT then rejected them with "Got 1 Expected 2".

Fix: compute per-input target_rows = n_itr × resolved_dim0(name), mirroring modelopt's symbolic→1/static-kept substitution, so every input splits into exactly n_itr uniform chunks.

Adds regression test in tests/quality/.

Fixes SDXL-Turbo + use_cached_attn=True + cfg_type=self + use_controlnet TD config.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
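The per-input row rule described in this commit message can be illustrated with a minimal sketch. It is not the code from this PR: the helper names and the example shapes below are hypothetical, and the only behavior assumed is the one stated above (a symbolic leading dim resolves to 1, a static leading dim is kept).

```python
from typing import Dict, Sequence, Union

# A dim is either a static int or a symbolic name such as "batch".
Dim = Union[int, str]


def resolved_dim0(shape: Sequence[Dim]) -> int:
    """Resolve the leading dim: symbolic dims become 1, static dims are kept."""
    d0 = shape[0]
    return d0 if isinstance(d0, int) else 1


def target_rows(input_shapes: Dict[str, Sequence[Dim]], n_itr: int) -> Dict[str, int]:
    """Rows each calibration input needs so it splits into n_itr equal chunks."""
    return {name: n_itr * resolved_dim0(shape) for name, shape in input_shapes.items()}


# Hypothetical shapes, chosen only to show one symbolic and one static dim 0.
shapes = {
    "sample": ("batch", 4, 64, 64),    # symbolic dim 0 -> chunk height 1
    "kvo_cache_in_0": (2, 1, 10, 640),  # static dim 0 = 2 -> chunk height 2
}
rows = target_rows(shapes, n_itr=8)
```

With these assumed shapes, every input divides into 8 chunks whose height matches its own leading dim, which is the property the fix relies on.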
…kward-compat return arity
…pes missing in SD venv)
…eam-start YAML

Resolves cudaErrorStreamCaptureInvalidated (901) on first CN TRT inference when use_cuda_ipc_controlnet is active. Root cause and runtime fix live in the dotsimulate/StreamDiffusionTD repo (StreamDiffusionTD/td_manager.py: drop stream= arg from CUDAIPCImporter.get_frame, use CPU eager-sync via _wait_for_slot to avoid pending GPU work on the legacy stream).

This commit covers the tracked-side changes:
- cuda_ipc_exporter: capture mode Global->ThreadLocal (defensive hardening)
- cuda_graphs: docstring correction for multi-engine processes
- _plans: add 2026-05-17 emitter session + 2026-05-18 capture-fix session

YAML emitter (use_cuda_ipc_controlnet + cuda_ipc_control_shm_name keys) was applied 2026-05-17 to Scripts/StreamDiffusionTD__Text__StreamDiffusionExt__td.py (outside this repo, lives in dotsimulate/StreamDiffusionTD).

Verified: 19-28 FPS sustained with CN canny SDXL-Turbo 512x512, OSC enable/scale changes accepted, no 901, TD-side Receiver healthy.
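The "CPU eager-sync" idea mentioned in this commit message (poll for readiness on the host instead of enqueuing a wait on a GPU stream) can be sketched roughly as follows. This is an illustration only, not the actual _wait_for_slot from td_manager.py: the function signature, the timeout values, and the FakeEvent stand-in are all assumptions, and is_ready stands in for a cudaEventQuery-style check.

```python
import time
from typing import Callable


def wait_for_slot(is_ready: Callable[[], bool],
                  timeout_s: float = 0.005,
                  poll_interval_s: float = 0.0) -> bool:
    """Poll is_ready() on the CPU until it returns True or the timeout expires.

    No work is queued on any GPU stream; the caller simply blocks on the host.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        if poll_interval_s > 0:
            time.sleep(poll_interval_s)


class FakeEvent:
    """Hypothetical stand-in for a GPU event that completes after N queries."""

    def __init__(self, ready_after: int) -> None:
        self.ready_after = ready_after
        self.polls = 0

    def query(self) -> bool:
        self.polls += 1
        return self.polls >= self.ready_after


event = FakeEvent(ready_after=3)
ok = wait_for_slot(event.query, timeout_s=1.0)
timed_out = wait_for_slot(lambda: False, timeout_s=0.01)
```

The design point is that a host-side poll leaves no pending dependency on the legacy stream, at the cost of briefly occupying a CPU thread.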
…901)

ControlNet TRT engines fail cudaStreamEndCapture with 901 (cudaErrorStreamCaptureInvalidated) on cold start when controlnet_scale > 0 in td_config.yaml.

Root cause: TRT's internal genericReformat::copyPackedRunKernel submits work to the legacy/NULL stream during execute_async_v3 inside the graph-capture window on the engine's (polygraphy blocking) stream. wrapper.py:2208 hard-coded use_cuda_graph=True for every CN engine. Setting it to False keeps TRT acceleration for CN but skips the CUDA-graph wrapper, eliminating the capture-window conflict.

Cost: ~hundreds of µs per CN forward on WDDM (no graph batch-submission); steady-state FPS 18-25 vs 19-28.

Also:
- utilities.py: defensive torch.cuda.current_stream().synchronize() before cudaStreamBeginCapture, gated on first capture per engine. Covers the broader polygraphy blocking-stream / legacy-stream race for future TRT engines.

Diagnosis trail: v0 (streamWaitEvent on legacy), v1 (wait_stream bridge), v2 (CPU cudaEventQuery - fixes warm-activation), Stage A (CUDALINK_USE_GRAPHS=0 - disproved), v3 (drain legacy pre-capture - disproved), v4 (this commit).

Verified: cold-start with CN scale=0.577 + use_cuda_ipc_controlnet=true, no 901, CN active from frame 1, steady FPS sustained.
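The "gated on first capture per engine" behavior described for utilities.py can be sketched as a small gate object. This is a hypothetical illustration, not the code in this PR: the class name, the engine ids, and the injected synchronize callable are assumptions. In a real process the callable would presumably be something like torch.cuda.current_stream().synchronize, as named in the commit message; a recording lambda is used here so the sketch runs without a GPU.

```python
from typing import Callable, Set


class FirstCaptureGate:
    """Run a synchronize callback once per engine, before its first graph capture."""

    def __init__(self, synchronize: Callable[[], None]) -> None:
        self._synchronize = synchronize
        self._seen: Set[str] = set()

    def before_capture(self, engine_id: str) -> bool:
        """Return True if a sync was performed for this engine on this call."""
        if engine_id in self._seen:
            return False  # later captures for this engine skip the extra sync
        self._synchronize()
        self._seen.add(engine_id)
        return True


# Record sync calls instead of touching a real CUDA stream.
calls = []
gate = FirstCaptureGate(synchronize=lambda: calls.append("sync"))
results = [
    gate.before_capture("cn_engine_0"),  # first capture: syncs
    gate.before_capture("cn_engine_0"),  # repeat capture: no sync
    gate.before_capture("unet"),         # different engine: syncs
]
```

Gating on the first capture keeps the steady-state path free of the extra host-side synchronization while still draining pending work before an engine's initial capture window opens.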
Summary
Full zero-copy GPU transport between TouchDesigner and StreamDiffusion using CUDA IPC (cuda-link), replacing legacy shared-memory CPU copies in all three data directions:
- CUDAIPCExporter: ring-buffer IPC with CUDA graph memcpy, activation barrier, WDDM HW scheduling support
- CUDAIPCImporter: zero-copy GPU read of TD's render output; CPU cudaEventQuery sync (no GPU-stream entanglement)
- CUDAIPCImporter: same zero-copy path for canny/depth control image; activated via use_cuda_ipc_controlnet YAML key emitted by stream-start YAML emitter

ControlNet TRT 901 fix (core bug resolved in this PR)
cudaErrorStreamCaptureInvalidated (901) fired on every cold start when controlnet_scale > 0 was saved in td_config.yaml. Full diagnosis trail:
- v0: get_frame(stream=current_stream()); cudaStreamWaitEvent on legacy → 901
- v1: wait_stream bridge
- v2: get_frame() with no stream= arg, CPU cudaEventQuery poll (fixes warm activation)
- Stage A: CUDALINK_USE_GRAPHS=0 (disproved)
- v3: drain legacy stream before cudaStreamBeginCapture (disproved)
- v4: use_cuda_graph=False for CN engines in wrapper.py (this fix)

Root cause: TRT's internal genericReformat::copyPackedRunKernel submits work to the legacy/NULL stream during execute_async_v3 inside the graph-capture window. wrapper.py:2208 had use_cuda_graph=True hard-coded for every CN engine. Setting it to False keeps TRT acceleration but skips the CUDA-graph wrapper: no capture window, no 901.

Key files changed
- src/streamdiffusion/wrapper.py: use_cuda_graph=False for CN engines (v4 fix)
- src/streamdiffusion/acceleration/tensorrt/utilities.py: defensive synchronize before cudaStreamBeginCapture
- src/streamdiffusion/_compat/cuda_ipc/cuda_ipc_exporter.py: capture mode Global->ThreadLocal
- src/streamdiffusion/_compat/cuda_ipc/cuda_graphs.py: docstring correction for multi-engine processes
- Scripts/StreamDiffusionTD__Text__StreamDiffusionExt__td.py: YAML emitter keys use_cuda_ipc_controlnet + cuda_ipc_control_shm_name
- StreamDiffusionTD/td_manager.py: drop stream= arg from CUDAIPCImporter.get_frame, CPU eager-sync via _wait_for_slot

Test plan
- … .toe with controlnet_scale: 0.577 and use_cuda_ipc_controlnet: true; confirm no 901, CN active from frame 1
- … event=YES, stream_wait < 0.1 ms
- … [E] IExecutionContext::enqueueV3 errors
- … copyCUDAMemory < 0.15 ms per frame

🤖 Generated with Claude Code