An XML-RPC server that exposes a GPU-accelerated colorization pipeline for black-and-white images and video frames.
Built on top of the Nunchaku SVDQuant FP4/INT4 transformer and the Qwen-Image-Edit-2511 diffusion model.
Optimized for NVIDIA RTX 50-Series (Blackwell) & CUDA 12.8.
- 🎨 Batch colorization — process entire directories of B&W images via filesystem paths
- 🖼️ Paired inference — colorize two images in a single forward pass (faster, temporally consistent)
- 📡 In-memory RPC — pass raw PNG frames over XML-RPC without touching the filesystem (ideal for video pipelines)
- ⚡ 4-step lightning model — SVDQuant FP4 quantized transformer for maximum throughput
- 🔒 Thread-safe — pipeline loading and stop control are protected by locks; every RPC call runs in its own thread
- ⚙️ Startup preload — optional `--load-pipeline` flag loads the model at boot from a JSON config file
- 🚀 Shared memory transport — zero-copy image transfer for same-host deployments (~23% faster than standard RPC)

| Requirement | Details |
|---|---|
| OS | Windows 10/11 or Linux |
| Python | 3.12 |
| GPU | NVIDIA RTX 3090 / 4090 / 5070 Ti / 5090 (16 GB+ VRAM recommended) |
| CUDA | 12.8 or newer |
| CUDA Toolkit | Must match the PyTorch build (see below) |
RTX 40-Series and older: use `"model_precision": "int4"` in the pipeline config file. FP4 quantization requires Blackwell hardware; INT4 is the correct precision for Ampere (RTX 30) and Ada Lovelace (RTX 40) GPUs.
Before creating the virtual environment, download this repository as a ZIP file and extract it. Once the archive has been extracted, open a terminal, move into the project directory, and create the venv there:
```
cd C:\path\to\DiTServerRPC-main
python -m venv .venv

# Windows
.venv\Scripts\activate

# Linux / macOS
source .venv/bin/activate
```

Windows quick-start: once the venv is active you can run `install.cmd` to execute steps 2–6 automatically instead of running them one by one.
Use the stable build for all GPU generations (RTX 30 / 40 / 50):
```
pip install torch==2.9.1+cu128 torchvision==0.24.1+cu128 torchaudio==2.9.1+cu128 \
  --index-url https://download.pytorch.org/whl/cu128
```

Verify the installation:

```
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
```
⚠️ Do NOT use `pip install nunchaku` — that installs an unrelated package from PyPI with the same name that will fail with `ModuleNotFoundError: No module named 'nunchaku.models'`.
Install the correct MIT Han Lab build directly from the GitHub release:
```
# Windows / Python 3.12 / CUDA 12.8 / PyTorch 2.9
pip install https://github.com/nunchaku-ai/nunchaku/releases/download/v1.2.1/nunchaku-1.2.1+cu12.8torch2.9-cp312-cp312-win_amd64.whl
```

For other platforms or Python versions, browse the full list of available wheels on the Nunchaku releases page and replace the filename accordingly.
Verify the correct package is installed — the version string must contain the build tags:
```
pip show nunchaku
# Expected: Version: 1.2.1+cu12.8torch2.9
```

Nunchaku 1.2.1 contains a bug in its transformer forward pass: `txt_seq_lens` is always `None` at the point where it is passed to `pos_embed`, causing a `ValueError` with diffusers >= 0.37.0.dev0. The included `patch_nunchaku.py` fixes this by deriving `max_txt_seq_len` directly from `encoder_hidden_states`:
```
python patch_nunchaku.py
```

On Windows you can also double-click `patch_nunchaku.cmd` or run it from a terminal:

```
patch_nunchaku.cmd            # apply the patch
patch_nunchaku.cmd --check    # check status without modifying files
patch_nunchaku.cmd --revert   # revert to original (.bak backup)
```
You can verify the patch status at any time:

```
python patch_nunchaku.py --check
```

And revert to the original if needed (a .bak backup is created automatically):

```
python patch_nunchaku.py --revert
```
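
For context, the fix is essentially a one-line fallback: when `txt_seq_lens` is `None`, the text sequence length is read from the second dimension of `encoder_hidden_states`. A standalone illustration of that idea (the tensor shape below is a made-up placeholder, not taken from the real model):

```
import torch

# Illustration of the fallback the patch adds: recover the text sequence
# length from encoder_hidden_states when txt_seq_lens is None.
encoder_hidden_states = torch.zeros(1, 77, 3584)   # (batch, txt_seq_len, dim), dummy values
txt_seq_lens = None

if txt_seq_lens is None:
    max_txt_seq_len = encoder_hidden_states.shape[1]
else:
    max_txt_seq_len = max(txt_seq_lens)

print(max_txt_seq_len)   # 77
```
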
⚠️ Do NOT install diffusers from GitHub (`pip install git+https://...`). Nunchaku 1.2.1 requires exactly `0.37.0.dev0`. Later dev builds (≥ 0.39.0) changed the `QwenEmbedRope` API in a way that is incompatible even after the nunchaku patch.
A tested compatible wheel is included in the packages/ folder.
Install it directly:
```
pip install packages\diffusers-0.37.0.dev0-py3-none-any.whl
```

Verify:

```
python -c "import diffusers; print(diffusers.__version__)"
# Expected: 0.37.0.dev0
```

Pin the versions to match the tested working environment:
```
pip install \
  transformers==4.57.6 \
  accelerate==1.12.0 \
  huggingface_hub>=0.26.0 \
  Pillow>=10.0.0
```
`safetensors` is intentionally not pinned here — diffusers pulls the correct version automatically as a dependency (>=0.8.0-rc.0).
```
dit-colorize-rpc/
├── dit_rpc_server.py # XML-RPC server (entry point)
├── dit_colorize_main.py # Colorization pipeline and image utilities
├── dit_client_example.py # Example RPC client — single frame
├── dit_client_pair_example.py # Example RPC client — paired inference
├── patch_nunchaku.py # Compatibility patch for nunchaku 1.2.1
├── qwen_config_fp4.json # Config for RTX 50-Series (FP4)
├── qwen_config_int4.json # Config for RTX 30 / 40-Series (INT4)
├── install.cmd # Windows automated installer
├── start_server.cmd # Windows launcher — server
├── run_client_example.cmd # Windows launcher — single frame example
├── run_client_pair_example.cmd # Windows launcher — paired inference example
├── patch_nunchaku.cmd # Windows launcher — nunchaku patch
├── assets/
│ ├── santa_bw.png # Sample B&W image (single frame test)
│ ├── sample1_bw.jpg # Sample B&W image 1 (paired inference test)
│ └── sample2_bw.jpg # Sample B&W image 2 (paired inference test)
├── packages/
│ └── diffusers-0.37.0.dev0-py3-none-any.whl # Tested compatible diffusers build
└── README.md
```
Two ready-to-use config files are provided. Pick the one that matches your GPU and pass it to `--pipeline-config`.

`qwen_config_fp4.json` (RTX 50-Series):

```
{
  "model_name": "nunchaku-qwen",
  "model_precision": "fp4",
  "model_rank": "32",
  "model_inference_steps": "4",
  "cache_dir": "",
  "full_model_path": ""
}
```

`qwen_config_int4.json` (RTX 30 / 40-Series):

```
{
  "model_name": "nunchaku-qwen",
  "model_precision": "int4",
  "model_rank": "32",
  "model_inference_steps": "4",
  "cache_dir": "",
  "full_model_path": ""
}
```
⚠️ `model_precision`: use `"fp4"` only on RTX 50-Series (Blackwell). On RTX 30 / 40-Series use `"int4"` — FP4 kernels require sm_120 and will fail on older architectures.
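
If you are unsure which precision to pick, a quick check from Python (assuming PyTorch is already installed) is to read the CUDA compute capability: the FP4 kernels need sm_120, i.e. compute capability 12.x.

```
import torch

# Blackwell (RTX 50) reports compute capability 12.x -> fp4 kernels are available.
# Ampere (RTX 30, 8.6) and Ada Lovelace (RTX 40, 8.9) -> use int4 instead.
major, minor = torch.cuda.get_device_capability(0)
print("fp4" if major >= 12 else "int4")
```
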
| Key | Required | Description |
|---|---|---|
| `model_name` | ✅ | Must be `"nunchaku-qwen"` |
| `model_precision` | ✅ | `"fp4"` (RTX 50) or `"int4"` (RTX 30/40) |
| `model_rank` | ✅ | SVD rank — `"32"` is a good default |
| `model_inference_steps` | ✅ | Diffusion steps used to select the model file to download — must be `"4"` (no 2-step model file exists). To run inference faster, pass `steps=2` in the RPC call — this is independent of the downloaded model and reduces latency by ~40% |
| `cache_dir` | ➖ | HuggingFace cache directory. Omit or set to `""` to use the default (`~/.cache/huggingface`) |
| `full_model_path` | ➖ | Absolute path to a local `.safetensors` file. Omit or set to `""` to download from HuggingFace |

```
# Start without preloading the pipeline
python dit_rpc_server.py

# Preload at startup, RTX 50-Series
python dit_rpc_server.py --load-pipeline --pipeline-config qwen_config_fp4.json

# Preload at startup, RTX 30 / 40-Series
python dit_rpc_server.py --load-pipeline --pipeline-config qwen_config_int4.json
```

On Windows you can also use the provided `start_server.cmd` (see Windows launch script).
```
usage: dit_rpc_server.py [-h] [--host HOST] [--port PORT]
                         [--logfile LOGFILE] [--module-dir MODULE_DIR]
                         [--load-pipeline] [--pipeline-config CONFIG.json]

options:
  --host HOST               Address to listen on (default: 127.0.0.1)
  --port PORT               TCP port (default: 8765)
  --logfile LOGFILE         Optional path for a log file
  --module-dir MODULE_DIR   Directory containing dit_colorize_main.py
                            (default: same directory as this script)
  --load-pipeline           Load the colorization pipeline at startup
  --pipeline-config CONFIG.json
                            Path to the JSON pipeline config file
                            (required when --load-pipeline is set)
```
Connect from any Python client using `xmlrpc.client`:

```
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://127.0.0.1:8765/", use_builtin_types=True)
```

All methods return a dict with at least `{"ok": bool, "msg": str}`.

| Method | Returns | Description |
|---|---|---|
| `ping()` | `"pong"` | Connectivity check |

| Method | Returns | Description |
|---|---|---|
| `load_pipeline(model_name, model_precision, model_rank, model_inference_steps, cache_dir="", full_model_path="")` | `{"ok", "msg"}` | Load the model into VRAM |
| `is_pipeline_loaded()` | `bool` | True if the pipeline is ready |
| `unload_pipeline()` | `{"ok", "msg"}` | Release VRAM |
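
For example, a client can load the FP4 pipeline explicitly when the server was started without `--load-pipeline` (the argument values below mirror `qwen_config_fp4.json`; use `"int4"` on RTX 30 / 40-Series):

```
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://127.0.0.1:8765/", use_builtin_types=True)

if not proxy.is_pipeline_loaded():
    # Same values as qwen_config_fp4.json; cache_dir and full_model_path left empty
    result = proxy.load_pipeline("nunchaku-qwen", "fp4", "32", "4", "", "")
    print(result["ok"], result["msg"])
```
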
| Method | Returns | Description |
|---|---|---|
| `request_stop()` | `bool` | Ask the server to refuse new colorization calls |
| `clear_stop()` | `bool` | Reset the stop flag before a new batch |
| `is_stop_requested()` | `bool` | Check the current stop flag |
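
A typical batch driver resets the flag before starting and polls it between calls, so a second client (for example a UI cancel button) can interrupt the run with `request_stop()`. A minimal sketch; the job list and prompt below are placeholders:

```
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://127.0.0.1:8765/", use_builtin_types=True)

jobs = [("/data/bw/0001.png", "/data/color/0001.png")]   # placeholder (input, output) path pairs
prompt = "colorize this black and white photo"           # placeholder prompt

proxy.clear_stop()                    # reset the flag before a new batch
for in_path, out_path in jobs:
    if proxy.is_stop_requested():     # set by another client via request_stop()
        break
    proxy.colorize_image(in_path, out_path, prompt, 0, 2)
```
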
| Method | Returns | Description |
|---|---|---|
| `colorize_image(in_path, out_path, prompt, img_size=0, steps=2)` | `{"ok", "elapsed", "skipped", "msg"}` | Single image, paths on the server filesystem |
| `colorize_image_pair(img1_path, img2_path, out_dir, prompt, gap_px=8)` | `{"ok", "elapsed", "msg"}` | Two images, single inference pass |
| `colorize_single_image(img_path, out_dir, prompt)` | `{"ok", "elapsed", "msg"}` | Single image fallback (odd batch end) |
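
To batch a whole directory with paired inference, one possible driver loop looks like this (a sketch only; it assumes the client can see the same paths the server uses, e.g. same host or a shared mount, and the directory names and prompt are placeholders):

```
import xmlrpc.client
from pathlib import Path

proxy = xmlrpc.client.ServerProxy("http://127.0.0.1:8765/", use_builtin_types=True)

in_dir, out_dir = "/data/bw_frames", "/data/color_frames"    # placeholder server-side paths
prompt = "colorize this black and white photo"               # placeholder prompt
frames = sorted(str(p) for p in Path(in_dir).glob("*.png"))

# Two frames per inference call; an odd final frame falls back to the single-image method.
for i in range(0, len(frames) - 1, 2):
    proxy.colorize_image_pair(frames[i], frames[i + 1], out_dir, prompt, 8)
if len(frames) % 2:
    proxy.colorize_single_image(frames[-1], out_dir, prompt)
```
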
| Method | Returns | Description |
|---|---|---|
| `colorize_frame(img_data, prompt, img_size=0, steps=2)` | `{"ok", "data", "elapsed", "skipped", "msg"}` | Single frame as raw PNG bytes |
| `colorize_frame_pair(img1_data, img2_data, prompt, gap_px=8)` | `{"ok", "data1", "data2", "elapsed", "skipped1", "skipped2", "msg"}` | Two frames, single inference pass |

`skipped=True` means the frame was too dark to colorize (average brightness < 9/255). The returned `data` field contains the unchanged input in that case.
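
A minimal in-memory round trip looks roughly like this (the prompt is a placeholder; `use_builtin_types=True` makes the returned base64 payload arrive as `bytes`):

```
import io
import xmlrpc.client
from PIL import Image

proxy = xmlrpc.client.ServerProxy("http://127.0.0.1:8765/", use_builtin_types=True)

# Encode the B&W frame as PNG bytes; nothing is written to disk on either side
buf = io.BytesIO()
Image.open("assets/santa_bw.png").save(buf, format="PNG")

result = proxy.colorize_frame(buf.getvalue(), "colorize this black and white photo", 0, 2)
if result["ok"] and not result["skipped"]:
    Image.open(io.BytesIO(result["data"])).save("santa_colorized.png")
```
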
| Method | Returns | Description |
|---|---|---|
| `colorize_frame_shm(shm_in, shm_out, h, w, prompt, img_size=0, steps=2)` | `{"ok", "elapsed", "skipped", "msg"}` | Single frame via shared memory |
| `colorize_frame_pair_shm(shm_in1, shm_out1, h1, w1, shm_in2, shm_out2, h2, w2, prompt, gap_px=8)` | `{"ok", "elapsed", "skipped1", "skipped2", "msg"}` | Two frames via shared memory, single inference pass |

See Shared Memory Transport for usage details.
Both clients support two transport modes selectable via `--use-shm`:

| Mode | Flag | When to use | Measured speed (1480×1080 px pair) |
|---|---|---|---|
| Standard RPC | (default) | Any deployment, including remote server | ~5.25s/image |
| Shared memory | `--use-shm` | Server and client on the same host only | ~4.06s/image (~23% faster) |

Colorizes `assets/santa_bw.png` and saves the result as `assets/santa_colorized.png`.
```
# standard RPC — works with local and remote server
python dit_client_example.py --pipeline-config qwen_config_fp4.json

# shared memory — same-host only, lower latency
python dit_client_example.py --pipeline-config qwen_config_fp4.json --use-shm
```

Windows: `run_client_example.cmd [fp4|int4]`. To enable shared memory edit `run_client_example.cmd` and set `USE_SHM=1`.
Colorizes `assets/sample1_bw.jpg` and `assets/sample2_bw.jpg` in a single forward pass, saving `assets/sample1_colorized.jpg` and `assets/sample2_colorized.jpg`.
Paired inference places the two images side-by-side and runs one inference instead of two, roughly halving the per-image cost (~5.25s/image vs ~11s standalone). Combined with shared memory transport this reaches ~4.06s/image.
```
# standard RPC
python dit_client_pair_example.py --pipeline-config qwen_config_fp4.json

# shared memory — same-host only
python dit_client_pair_example.py --pipeline-config qwen_config_fp4.json --use-shm
```

Windows: `run_client_pair_example.cmd [fp4|int4]`. To enable shared memory edit `run_client_pair_example.cmd` and set `USE_SHM=1`.
```
--host HOST                Server host (default: 127.0.0.1)
--port PORT                Server port (default: 8765)
--pipeline-config CONFIG   Load pipeline before colorizing (skipped if already loaded)
--prompt PROMPT            Text prompt for the model
--use-shm                  Use shared memory transport (same-host only)
```

Additional argument for the paired client:

```
--gap-px N                 Separator width in pixels between the two
                           images in the merged input (default: 8)
```
The standard RPC transport serializes each image as a PNG byte stream, encodes it in Base64, sends it over a TCP socket, and decodes it on the other side. For a 1480×1080 frame this is roughly 4–5 MB per round trip.
The shared memory transport bypasses the network entirely. The client writes the raw pixel array directly into a shared memory segment; the server attaches to the same segment and reads the pixels without any copy. Only the metadata (segment name, dimensions, prompt) travels over the XML-RPC socket.
Requirement: server and client must run on the same machine. If the server is on a dedicated GPU machine and the client is on a separate workstation, shared memory is not available — use the standard RPC transport instead (default). The clients detect this automatically: passing `--use-shm` when the host is not `127.0.0.1` / `localhost` prints a warning and falls back to standard RPC.
Measured on a 1480×1080 pixel pair (RTX 5070 Ti, FP4, paired inference):
| Transport | Per-image time | Round-trip overhead |
|---|---|---|
| Standard RPC (PNG) | ~5.25s | ~1.1s |
| Shared memory | ~4.06s | ~0.16s |
| Gain | ~23% faster | ~7× less overhead |
The round-trip overhead with shared memory is essentially zero — the 0.16s gap between inference time and wall-clock time is just Python function call and numpy overhead.
On a 100k-frame video processed as pairs (50k inference calls) the cumulative saving is:
(5.25 s - 4.06 s) × 50,000 ≈ 59,500 s ≈ 16.5 hours
The client owns and manages all shared memory segments. The server is fully stateless with respect to shared memory — it only attaches, reads/writes, and detaches.
```
Client                                           Server
  │                                             │
  │ create shm_in  (h × w × 3 bytes)            │
  │ create shm_out (h × w × 3 bytes)            │
  │ write raw RGB pixels → shm_in               │
  │                                             │
  │ RPC(shm_in_name, shm_out_name, h, w, …) ───►│
  │                                             │ attach shm_in → PIL Image
  │                                             │ inference
  │                                             │ result → shm_out
  │◄── return {elapsed, skipped, …} ────────────│
  │                                             │ detach both segments
  │ read shm_out → PIL Image                    │
  │ unlink shm_in + shm_out                     │
```
From the command line:
```
python dit_client_pair_example.py --pipeline-config qwen_config_fp4.json --use-shm
python dit_client_example.py --pipeline-config qwen_config_fp4.json --use-shm
```

From the Windows .cmd launchers, edit the user configuration block and set:

```
set USE_SHM=1
```

The banner will confirm the active transport:

```
Transport : 1  (0=RPC 1=shared memory)
```

And the Python client will print:

```
[INFO] Transport: shared memory
```
```
import uuid
import numpy as np
from multiprocessing.shared_memory import SharedMemory
from PIL import Image

def colorize_pair_shm(proxy, img1: Image.Image, img2: Image.Image, prompt: str):
    arr1, arr2 = np.array(img1), np.array(img2)
    h1, w1 = arr1.shape[:2]
    h2, w2 = arr2.shape[:2]
    uid = uuid.uuid4().hex[:12]

    # Create all four segments (client owns them)
    segs = {
        tag: SharedMemory(name=f"dit_{tag}_{uid}", create=True, size=h*w*3)
        for tag, h, w in [("in1",h1,w1),("out1",h1,w1),("in2",h2,w2),("out2",h2,w2)]
    }
    try:
        # Write the raw RGB pixels of both inputs into their shared segments
        np.ndarray((h1,w1,3), dtype=np.uint8, buffer=segs["in1"].buf)[:] = arr1
        np.ndarray((h2,w2,3), dtype=np.uint8, buffer=segs["in2"].buf)[:] = arr2

        # Only segment names, dimensions and the prompt travel over XML-RPC
        result = proxy.colorize_frame_pair_shm(
            segs["in1"].name, segs["out1"].name, h1, w1,
            segs["in2"].name, segs["out2"].name, h2, w2,
            prompt, 8,  # gap_px
        )

        # Read the colorized results back out of the output segments
        out1 = Image.fromarray(
            np.ndarray((h1,w1,3), dtype=np.uint8, buffer=segs["out1"].buf).copy())
        out2 = Image.fromarray(
            np.ndarray((h2,w2,3), dtype=np.uint8, buffer=segs["out2"].buf).copy())
        return result, out1, out2
    finally:
        for shm in segs.values():
            shm.close(); shm.unlink()
```

`start_server.cmd` is a ready-to-use launcher for Windows.
Edit the variables at the top of the file to match your setup, then double-click it or run it from a terminal.
```
start_server.cmd [fp4|int4]
```

If no argument is passed it defaults to `fp4`. Pass `int4` for RTX 30 / 40-Series:

```
start_server.cmd int4
```
**CUDA out of memory**

Close other GPU applications. On 16 GB cards the server automatically enables sequential CPU offload for layers that do not fit in VRAM.
**`dit_colorize_main.py` NOT FOUND**

Use `--module-dir` to point the server to the directory that contains `dit_colorize_main.py`:

```
python dit_rpc_server.py --module-dir /path/to/dit_colorize_main
```

**Model 'xxx' is not supported**

The only supported value for `model_name` is `"nunchaku-qwen"`.

**Pipeline takes a long time to load**

On the first run the model weights (~15–30 GB) are downloaded from HuggingFace.
Subsequent runs load from the local cache. Set `cache_dir` in the config to control where the cache is stored.
- Model: Qwen/Qwen-Image-Edit-2509
- Quantization: Nunchaku / SVDQuant
- Pipeline: Hugging Face Diffusers