feat: CUDA IPC zero-copy GPU transport (TD↔SD input, output, ControlNet) by forkni · Pull Request #15 · dotsimulate/StreamDiffusion

forkni · 2026-05-18T05:42:54Z

Summary

Full zero-copy GPU transport between TouchDesigner and StreamDiffusion using CUDA IPC (cuda-link), replacing legacy shared-memory CPU copies in all three data directions:

- SD→TD output (CUDAIPCExporter): ring-buffer IPC with CUDA graph memcpy, activation barrier, WDDM HW scheduling support
- TD→SD input (CUDAIPCImporter): zero-copy GPU read of TD's render output; CPU cudaEventQuery sync (no GPU-stream entanglement)
- TD→SD ControlNet (CUDAIPCImporter): same zero-copy path for the canny/depth control image; activated via the use_cuda_ipc_controlnet YAML key emitted by the stream-start YAML emitter
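The CPU-side sync used on the input paths can be sketched as a plain polling loop. This is a minimal illustration, not the cuda-link implementation: `wait_until_ready`, `FakeEvent`, and the timeout values are hypothetical, and `is_ready` stands in for a non-blocking query such as cudaEventQuery.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float = 0.05,
    poll_interval_s: float = 0.0002,
) -> bool:
    """Poll `is_ready` from the CPU until it returns True or the timeout expires.

    Nothing is enqueued on any GPU stream while waiting, so the consumer's
    streams never pick up a dependency on the producer's event.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


class FakeEvent:
    """Hypothetical stand-in for a producer event; completes on the N-th query."""

    def __init__(self, ready_after: int) -> None:
        self.remaining = ready_after

    def query(self) -> bool:
        self.remaining -= 1
        return self.remaining <= 0


event = FakeEvent(ready_after=3)
frame_ready = wait_until_ready(event.query)
```

The trade-off is a short CPU wait in exchange for never issuing a stream-level wait, which is what the v0 and v1 attempts in the diagnosis trail did.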

ControlNet TRT 901 fix (core bug resolved in this PR)

cudaErrorStreamCaptureInvalidated (901) fired on every cold-start when controlnet_scale > 0 was saved in td_config.yaml. Full diagnosis trail:

| Attempt | What | Result |
| --- | --- | --- |
| v0 | `get_frame(stream=current_stream())` | `cudaStreamWaitEvent` on legacy → 901 |
| v1 | dedicated non-blocking import stream + `wait_stream` | re-coupled legacy → 901 |
| v2 | `get_frame()` with no `stream=` arg, CPU `cudaEventQuery` poll | fixes warm-activation; fails cold-start |
| Stage A | `CUDALINK_USE_GRAPHS=0` | disproved: exporter graphs not involved |
| v3 | drain legacy stream before `cudaStreamBeginCapture` | disproved: problem is inside capture window |
| v4 | `use_cuda_graph=False` for CN engines in `wrapper.py` | verified ✓ |

Root cause: TRT's internal genericReformat::copyPackedRunKernel submits work to the legacy/NULL stream during execute_async_v3 inside the graph-capture window. wrapper.py:2208 had use_cuda_graph=True hard-coded for every CN engine. Setting it to False keeps TRT acceleration but skips the CUDA-graph wrapper — no capture window, no 901.
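The shape of the v4 fix can be sketched as a per-engine option choice. This is an assumption-laden illustration: `EngineOptions` and `engine_options` are hypothetical names, and the actual structure of wrapper.py is not shown in this PR description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineOptions:
    """Hypothetical container for per-engine runtime flags."""

    name: str
    use_tensorrt: bool
    use_cuda_graph: bool


def engine_options(name: str, is_controlnet: bool) -> EngineOptions:
    # ControlNet engines still run through TensorRT, but are not wrapped in a
    # CUDA graph, so no stream-capture window is opened around their forward.
    return EngineOptions(
        name=name,
        use_tensorrt=True,
        use_cuda_graph=not is_controlnet,
    )


unet_opts = engine_options("unet", is_controlnet=False)
cn_opts = engine_options("controlnet_canny", is_controlnet=True)
```

Keeping the flag per engine rather than global leaves graph capture enabled for engines that were never affected by the 901 error.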

Key files changed

| File | Change |
| --- | --- |
| `src/streamdiffusion/wrapper.py` | `use_cuda_graph=False` for CN engines (v4 fix) |
| `src/streamdiffusion/acceleration/tensorrt/utilities.py` | defensive legacy-stream drain before `cudaStreamBeginCapture` |
| `src/streamdiffusion/_compat/cuda_ipc/cuda_ipc_exporter.py` | ThreadLocal capture mode (was Global) |
| `src/streamdiffusion/_compat/cuda_ipc/cuda_graphs.py` | docstring correction |
| `Scripts/StreamDiffusionTD__Text__StreamDiffusionExt__td.py` | YAML emitter: emit `use_cuda_ipc_controlnet` + `cuda_ipc_control_shm_name` |
| `StreamDiffusionTD/td_manager.py` | v2 runtime fix (gitignored; live via Scripts/ sync) |

Test plan

- Cold-start .toe with controlnet_scale: 0.577 and use_cuda_ipc_controlnet: true: confirm no 901, CN active from frame 1
- Toggle CN via OSC (enable/disable, scale 0→0.5→0.8): confirm no 901
- Warm-activation path: start with CN scale=0, enable via OSC mid-stream, confirm it still works
- TD-side IPC Receiver: all 3 slots open, event=YES, stream_wait < 0.1 ms
- Sustained 3+ min run: steady FPS ≥ 15, no [E] IExecutionContext::enqueueV3 errors
- Output IPC: TD Receiver consuming SD diffusion output, copyCUDAMemory < 0.15 ms per frame
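The numeric thresholds in the test plan could be checked mechanically along these lines. This is only a sketch: `RunMetrics` and `failed_checks` are hypothetical names, and the sample values are made up for illustration rather than measured.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunMetrics:
    error_901_count: int
    open_slots: int
    stream_wait_ms: float
    steady_fps: float
    copy_cuda_memory_ms: float


def failed_checks(m: RunMetrics) -> List[str]:
    """Return a message for every test-plan threshold the run did not meet."""
    checks = [
        (m.error_901_count == 0, "901 errors present"),
        (m.open_slots == 3, "not all 3 IPC slots open"),
        (m.stream_wait_ms < 0.1, "stream_wait >= 0.1 ms"),
        (m.steady_fps >= 15, "steady FPS below 15"),
        (m.copy_cuda_memory_ms < 0.15, "copyCUDAMemory >= 0.15 ms"),
    ]
    return [message for ok, message in checks if not ok]


# Made-up numbers for illustration only, not measurements from this PR.
good = RunMetrics(0, 3, 0.04, 21.0, 0.09)
bad = RunMetrics(1, 3, 0.04, 12.0, 0.09)
```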

🤖 Generated with Claude Code

kvo_cache_in_* tensors have ONNX dim 0 = 2 (hard-static K/V pair), not a symbolic batch dim. The previous naïve _max_rows tile inflated sample to 2×_n_itr rows, causing modelopt's CalibrationDataProvider to compute n_itr=2×_n_itr (sample's symbolic dim 0 resolves to 1) and split kvo into chunks of shape (1,...) instead of (2,...). ORT then rejected them with "Got 1 Expected 2".

Fix: compute per-input target_rows = n_itr × resolved_dim0(name), mirroring modelopt's symbolic→1/static-kept substitution, so every input splits into exactly n_itr uniform chunks.

Adds regression test in tests/quality/. Fixes SDXL-Turbo + use_cached_attn=True + cfg_type=self + use_controlnet TD config.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
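The per-input row computation described in this commit message can be sketched as follows. The helper names, tensor shapes, and the symbolic dimension label are illustrative assumptions; only the rule target_rows = n_itr × resolved_dim0(name) comes from the commit message, and this is not the repository's code.

```python
import numpy as np


def resolved_dim0(onnx_dim0) -> int:
    """Symbolic dims (strings) resolve to 1; static integer dims are kept."""
    return onnx_dim0 if isinstance(onnx_dim0, int) else 1


def tile_for_calibration(arrays, onnx_dim0, n_itr):
    """Tile each input along axis 0 so it splits into n_itr uniform chunks."""
    tiled = {}
    for name, arr in arrays.items():
        target_rows = n_itr * resolved_dim0(onnx_dim0[name])
        reps = -(-target_rows // arr.shape[0])  # ceiling division
        tiled[name] = np.concatenate([arr] * reps, axis=0)[:target_rows]
    return tiled


n_itr = 4
arrays = {
    "sample": np.zeros((1, 4, 64, 64), dtype=np.float32),
    "kvo_cache_in_0": np.zeros((2, 1, 8, 16), dtype=np.float32),
}
# "batch" is a placeholder symbolic name; 2 is the static K/V pair dimension.
onnx_dim0 = {"sample": "batch", "kvo_cache_in_0": 2}

tiled = tile_for_calibration(arrays, onnx_dim0, n_itr)
chunks = {name: np.split(arr, n_itr, axis=0) for name, arr in tiled.items()}
```

With these inputs, sample is tiled to n_itr rows and kvo_cache_in_0 to 2×n_itr rows, so splitting either into n_itr pieces yields chunks whose leading dimension matches the ONNX declaration.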

…eam-start YAML

Resolves cudaErrorStreamCaptureInvalidated (901) on first CN TRT inference when use_cuda_ipc_controlnet is active. Root cause and runtime fix live in the dotsimulate/StreamDiffusionTD repo (StreamDiffusionTD/td_manager.py: drop stream= arg from CUDAIPCImporter.get_frame, use CPU eager-sync via _wait_for_slot to avoid pending GPU work on the legacy stream).

This commit covers the tracked-side changes:
- cuda_ipc_exporter: capture mode Global->ThreadLocal (defensive hardening)
- cuda_graphs: docstring correction for multi-engine processes
- _plans: add 2026-05-17 emitter session + 2026-05-18 capture-fix session

YAML emitter (use_cuda_ipc_controlnet + cuda_ipc_control_shm_name keys) was applied 2026-05-17 to Scripts/StreamDiffusionTD__Text__StreamDiffusionExt__td.py (outside this repo, lives in dotsimulate/StreamDiffusionTD).

Verified: 19-28 FPS sustained with CN canny SDXL-Turbo 512x512, OSC enable/scale changes accepted, no 901, TD-side Receiver healthy.

…901)

ControlNet TRT engines fail cudaStreamEndCapture with 901 (cudaErrorStreamCaptureInvalidated) on cold start when controlnet_scale > 0 in td_config.yaml.

Root cause: TRT's internal genericReformat::copyPackedRunKernel submits work to the legacy/NULL stream during execute_async_v3 inside the graph-capture window on the engine's (polygraphy blocking) stream. wrapper.py:2208 hard-coded use_cuda_graph=True for every CN engine. Setting it to False keeps TRT acceleration for CN but skips the CUDA-graph wrapper, eliminating the capture-window conflict.

Cost: ~hundreds of µs per CN forward on WDDM (no graph batch-submission); steady-state FPS 18-25 vs 19-28.

Also:
- utilities.py: defensive torch.cuda.current_stream().synchronize() before cudaStreamBeginCapture, gated on first capture per engine. Covers the broader polygraphy blocking-stream / legacy-stream race for future TRT engines.

Diagnosis trail: v0 (streamWaitEvent on legacy), v1 (wait_stream bridge), v2 (CPU cudaEventQuery - fixes warm-activation), Stage A (CUDALINK_USE_GRAPHS=0 - disproved), v3 (drain legacy pre-capture - disproved), v4 (this commit).

Verified: cold-start with CN scale=0.577 + use_cuda_ipc_controlnet=true, no 901, CN active from frame 1, steady FPS sustained.
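The "gated on first capture per engine" drain mentioned for utilities.py might look roughly like the sketch below. `FirstCaptureDrain` is a hypothetical name, and the synchronize callable is injected so the example runs without a GPU; in the real code path it would be `torch.cuda.current_stream().synchronize`, as the commit message states.

```python
from typing import Callable, Set


class FirstCaptureDrain:
    """Run a synchronize callback once per engine, before its first capture."""

    def __init__(self, synchronize: Callable[[], None]) -> None:
        self._synchronize = synchronize
        self._seen: Set[str] = set()

    def before_capture(self, engine_name: str) -> bool:
        """Drain pending work if this is the engine's first capture.

        Returns True when a drain was performed, False on later captures.
        """
        if engine_name in self._seen:
            return False
        self._seen.add(engine_name)
        self._synchronize()
        return True


calls = []
drain = FirstCaptureDrain(synchronize=lambda: calls.append("sync"))
results = [
    drain.before_capture("unet"),
    drain.before_capture("unet"),
    drain.before_capture("vae_decoder"),
]
```

Gating on the first capture keeps the synchronization cost out of the steady-state loop, since a captured graph is normally replayed rather than re-captured.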

forkni and others added 17 commits May 16, 2026 08:36

chore: gitignore local cgw tooling; add PR audit doc

5c5d5a6

feat: add FP8 QDQ finite-scale gate and fused-MHA layer count

275f293

feat: add quality regression harness with FP16-TRT goldens and SSIM/LPIPS metrics

90091d5

feat: port varshith15 kvo_cache patch onto diffusers 0.38.0 via runtime monkey-patch

d1763bb

fix: replace emoji chars in SDXL ONNX size warning to avoid cp1252 encode error on Windows

62548fb

feat: seed quality-harness goldens, manifest, thresholds; fix FP8 CFG batch mismatch in calibration

67c74c2

chore: stage pre-existing formatter diffs (quote/whitespace normalisation)

4b4aaf7

fix: kvo_cache patch breaks ControlNet ONNX export — sentinel for backward-compat return arity

1a8065f

feat: add CUDA IPC output direction via cuda-link (SD-to-TD zero-copy GPU transport)

4c2a742

feat: add CUDA IPC input direction via cuda-link (TD->SD zero-copy GPU transport)

72dc7cc

docs: add CUDA IPC input direction plan with next-session log review handoff

52c4a68

fix: use relative imports in vendored _compat/cuda_ipc (CUDARuntimeTypes missing in SD venv)

eecb9f5

docs: add plans for CUDARuntimeTypes fix and zero-copy GPU input

02911e5

docs: add plan for ControlNet zero-copy GPU input

59f2caa

forkni wants to merge 17 commits into SDTD_031_dev from feat/cuda-ipc-output
