27 commits
a8296fb
Add Cosmos3 action generation support
yzhautouskay May 28, 2026
2fcef5b
Add README action examples
yzhautouskay May 28, 2026
40ea973
Use do_classifier_free_guidance property
yzhautouskay May 28, 2026
591cd4d
Remove unused method
yzhautouskay May 28, 2026
04efd90
Add action policy example to pipelines doc
yzhautouskay May 28, 2026
362b6eb
Adding model selection for action example doc.
atharvajoshi10 May 28, 2026
5ff2ea9
Remove redundant casts
yzhautouskay May 29, 2026
a01c1c9
Rename _pack_action_tokens to _prepare_action_segment
yzhautouskay May 29, 2026
c12e6b1
Move validation checks to check_inputs
yzhautouskay May 29, 2026
fbcd077
Add action arguments in the __call__ docstring
yzhautouskay May 29, 2026
7c4e2f4
Move action mode check to check_inputs
yzhautouskay May 29, 2026
57d3d07
Rename action to action_tokens
yzhautouskay May 29, 2026
a6e2040
Add warning for num_frames overwrite attempt
yzhautouskay May 29, 2026
913c24f
Rename action_tokens to raw_actions
yzhautouskay May 29, 2026
3bce946
Remove scheduler config override
yzhautouskay May 29, 2026
efc4a3e
Refactor action to use CosmosActionCondition
yzhautouskay Jun 1, 2026
b0aa026
Fix examples script to support flow_shift arg
yzhautouskay Jun 2, 2026
9aadc26
Apply styling fixes
yzhautouskay Jun 2, 2026
76acf39
Remove CosmosActionCondition properties, move to pipeline
yzhautouskay Jun 2, 2026
df5b239
Replace validate with post init
yzhautouskay Jun 2, 2026
ad56293
Set height/width/num_frames to None, raise error if set for action
yzhautouskay Jun 2, 2026
4fefb13
Fix action_dim default setting
yzhautouskay Jun 2, 2026
01a8c46
Remove video argument before v2v is added
yzhautouskay Jun 2, 2026
ef8208a
Fix None args
yzhautouskay Jun 2, 2026
b5fa4f7
Add _EMBODIMENT_TO_RAW_ACTION_DIM mapping
yzhautouskay Jun 2, 2026
35530b5
Remove --raw-action-dim from README.md
yzhautouskay Jun 2, 2026
6375b8d
Added prompt upsampler docs and examples
atharvajoshi10 Jun 2, 2026
449 changes: 224 additions & 225 deletions docs/source/en/api/pipelines/cosmos3.md

Large diffs are not rendered by default.

109 changes: 106 additions & 3 deletions examples/cosmos3/README.md
@@ -48,16 +48,119 @@ python examples/cosmos3/inference_cosmos3.py \
--enable-sound
```

Action forward dynamics, robot domain (predict video from an observation video and a provided action chunk):

```bash
python examples/cosmos3/inference_cosmos3.py \
--model nano \
--prompt "Put the pot to the left of the purple item. This video is captured from a first-person perspective looking at the scene." \
--vision-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/bridge_0.mp4" \
--action-mode forward_dynamics \
--action-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/bridge_0.json" \
--action-chunk-size 16 \
--domain-name bridge_orig_lerobot \
--resolution-tier 480 --fps 5 \
--num-inference-steps 30 --guidance-scale 1.0 --flow-shift 5.0 --seed 0 \
--output results/cosmos3_forward_dynamics_robot
```
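
The file passed to `--action-path` is plain JSON that the example script converts to a float tensor of shape `[T, D]` (a leading batch dimension of size 1 is squeezed away). A minimal sketch of writing such a file follows. The sizes are assumptions for illustration: `T = 16` is chosen to equal the chunk size used above, and `D = 7` is a hypothetical per-step action dimension, since the real dimension depends on the embodiment domain.

```python
import json
import pathlib
import tempfile

# Hypothetical sizes for illustration only: T is set equal to --action-chunk-size 16,
# and D = 7 is an assumed per-step action dimension, not a documented value.
T, D = 16, 7

# One row per action step, one column per action component (placeholder zeros).
actions = [[0.0] * D for _ in range(T)]

# "my_actions.json" is a made-up file name written to a temporary directory.
out_path = pathlib.Path(tempfile.mkdtemp()) / "my_actions.json"
out_path.write_text(json.dumps(actions))

# Reading it back mirrors what the script does before building the tensor.
loaded = json.loads(out_path.read_text())
shape = (len(loaded), len(loaded[0]))
print(shape)
```

The resulting path could then be supplied as `--action-path` in place of the hosted `bridge_0.json`.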

Action forward dynamics, autonomous-vehicle domain:

```bash
python examples/cosmos3/inference_cosmos3.py \
--model nano \
--prompt "You are an autonomous vehicle planning system. This video is captured from a first-person perspective looking at the scene." \
--vision-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/av_vision_25_73d01c91-51f0-46cf-9b76-5682a76fb349.mp4" \
--action-mode forward_dynamics \
--action-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/av_action_25.json" \
--action-chunk-size 60 \
--domain-name av \
--resolution-tier 480 --fps 10 \
--num-inference-steps 30 --guidance-scale 1.0 --flow-shift 5.0 --seed 0 \
--output results/cosmos3_forward_dynamics_av
```

Action inverse dynamics, robot domain (predict actions from an observed video):

```bash
python examples/cosmos3/inference_cosmos3.py \
--model nano \
--prompt "Put the pot to the left of the purple item. This video is captured from a first-person perspective looking at the scene." \
--vision-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/bridge_0.mp4" \
--action-mode inverse_dynamics \
--action-chunk-size 16 \
--domain-name bridge_orig_lerobot \
--resolution-tier 480 --fps 5 \
--num-inference-steps 30 --guidance-scale 1.0 --flow-shift 5.0 --seed 0 \
--output results/cosmos3_inverse_dynamics_robot
```
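
The predicted actions land in `sample_action.json` inside the `--output` directory. A small sketch of inspecting that file, assuming it holds a rectangular `[T, D]` nested list (the script dumps `action.tolist()`, but the exact shape is an assumption here). `summarize_actions` is a hypothetical helper, and the demo file is a stand-in so the sketch runs without the model:

```python
import json
import pathlib
import tempfile


def summarize_actions(path):
    """Return (num_steps, action_dim) for a JSON file holding a [T, D] nested list."""
    actions = json.loads(pathlib.Path(path).read_text())
    if not actions or not all(len(row) == len(actions[0]) for row in actions):
        raise ValueError("expected a non-empty rectangular [T, D] list")
    return len(actions), len(actions[0])


# Stand-in file with made-up values; a real run would point at
# results/cosmos3_inverse_dynamics_robot/sample_action.json instead.
demo = pathlib.Path(tempfile.mkdtemp()) / "sample_action.json"
demo.write_text(json.dumps([[0.1, -0.2, 0.3]] * 16))

print(summarize_actions(demo))
```

Note that the values are in model-normalized action space, so they would still need to be mapped back to the robot's own units before use.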

Action inverse dynamics, autonomous-vehicle domain:

```bash
python examples/cosmos3/inference_cosmos3.py \
--model nano \
--prompt "You are an autonomous vehicle planning system. This video is captured from a first-person perspective looking at the scene." \
--vision-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/av_vision_25_73d01c91-51f0-46cf-9b76-5682a76fb349.mp4" \
--action-mode inverse_dynamics \
--action-chunk-size 60 \
--domain-name av \
--resolution-tier 480 --fps 10 \
--num-inference-steps 30 --guidance-scale 1.0 --flow-shift 5.0 --seed 0 \
--output results/cosmos3_inverse_dynamics_av
```

Action policy, robot domain (predict both future video and actions from the first observation frame):

```bash
python examples/cosmos3/inference_cosmos3.py \
--model nano \
--prompt "Put the pot to the left of the purple item. This video is captured from a first-person perspective looking at the scene." \
--vision-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/bridge_0.mp4" \
--action-mode policy \
--action-chunk-size 16 \
--domain-name bridge_orig_lerobot \
--resolution-tier 480 --fps 5 \
--num-inference-steps 30 --guidance-scale 1.0 --flow-shift 5.0 --seed 0 \
--output results/cosmos3_policy_robot
```

Action policy, autonomous-vehicle domain:

```bash
python examples/cosmos3/inference_cosmos3.py \
--model nano \
--prompt "You are an autonomous vehicle planning system. Please go backward. This video is captured from a first-person perspective looking at the scene." \
--vision-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/av_vision_25_73d01c91-51f0-46cf-9b76-5682a76fb349.mp4" \
--action-mode policy \
--action-chunk-size 60 \
--domain-name av \
--resolution-tier 480 --fps 10 \
--num-inference-steps 30 --guidance-scale 1.0 --flow-shift 5.0 --seed 0 \
--output results/cosmos3_policy_av
```

Action modes use `action_chunk_size + 1` conditioning frames. `forward_dynamics` consumes `--action-path`; `inverse_dynamics` and `policy` write predicted actions to `sample_action.json` in model-normalized action space. This script loads `--vision-path` as a video for all action modes; `policy` and `forward_dynamics` condition only on the first frame, while `inverse_dynamics` uses the whole video.
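
The relationship between chunk size, frame count, and clip length can be sketched as simple arithmetic. `action_clip` is a hypothetical helper, and the duration assumes frames are evenly spaced at `--fps`:

```python
def action_clip(chunk_size: int, fps: float):
    """Frames and duration implied by an action chunk (chunk_size + 1 frames)."""
    num_frames = chunk_size + 1
    return num_frames, num_frames / fps


# Chunk sizes and frame rates taken from the robot and AV examples above.
for name, chunk, fps in [("robot", 16, 5.0), ("av", 60, 10.0)]:
    frames, seconds = action_clip(chunk, fps)
    print(f"{name}: {frames} frames, {seconds:.1f} s")
```

For the commands above this gives 17 frames (3.4 s) for the robot domain and 61 frames (6.1 s) for the autonomous-vehicle domain.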

`--resolution-tier` selects a resolution tier (`256`/`480`/`704`/`720`), not an output frame size. Each tier keys a table of predefined aspect-ratio canvases, and the canvas closest to the input aspect ratio becomes the padded conditioning canvas. The input is downscaled (never upscaled) and padded to fill that canvas; the padding is then cropped from the latents, so the decoded output follows the downscaled input content. `--height` / `--width` (and `--num-frames`) are ignored for action modes.

Pick the tier that matches the native resolution of your conditioning input (`480` for ~480p, `720` for ~720p). A tier below your input downscales it and discards detail. A tier above your input gains no resolution (content is never upscaled), wastes compute on padding, and creates a train/inference distribution mismatch that can degrade quality.
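
The canvas selection and downscale-then-pad behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the canvas table `CANVASES_480` is hypothetical, "closest" is taken to mean smallest absolute aspect-ratio difference, and only the total padding is reported because the padding placement is not specified here.

```python
def fit_to_canvas(in_w, in_h, canvases):
    """Pick the canvas whose aspect ratio is closest to the input, then
    downscale (never upscale) the input to fit and report the total padding."""
    aspect = in_w / in_h
    cw, ch = min(canvases, key=lambda c: abs(c[0] / c[1] - aspect))
    scale = min(1.0, cw / in_w, ch / in_h)  # capped at 1.0: no upscaling
    out_w, out_h = round(in_w * scale), round(in_h * scale)
    return (cw, ch), (out_w, out_h), (cw - out_w, ch - out_h)


# Hypothetical canvas table for one tier; the real table is not shown here.
CANVASES_480 = [(640, 480), (832, 480), (480, 480)]

print(fit_to_canvas(1280, 720, CANVASES_480))  # larger input: downscaled, small pad
print(fit_to_canvas(320, 240, CANVASES_480))   # smaller input: kept as-is, mostly pad
```

The second call illustrates why a tier above the input resolution is wasteful: the content stays at its original size and most of the canvas is padding.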

### Useful flags

| Flag | Default | Description |
|---|---|---|
| `--prompt` | (required) | Text prompt. |
| `--vision-path` | `None` | URL or local path for an image-conditioning frame (image-to-video), or the image/video conditioning for action modes. |
| `--num-frames` | `189` | `1` = image, otherwise number of video frames (`189` ≈ 7.9 s @ 24 FPS). Ignored for action modes (derived from `--action-chunk-size`). |
| `--height` / `--width` | `720` / `1280` | Output resolution (must be a multiple of the VAE spatial scale factor). Ignored for action modes; use `--resolution-tier`. |
| `--resolution-tier` | `480` | Action resolution tier (`256`/`480`/`704`/`720`): selects the aspect bin / padded conditioning canvas, not the output size. |
| `--fps` | `24.0` | Frame rate of the generated video. |
| `--enable-sound` | off | Generate a synchronized audio track. |
| `--action-mode` | `None` | Enable action conditioning/generation. One of `forward_dynamics`, `inverse_dynamics`, or `policy`. |
| `--action-path` | `None` | URL or local JSON action path for `forward_dynamics`. |
| `--action-chunk-size` | `None` | Number of action tokens. Action runs generate/use `action_chunk_size + 1` video frames. |
| `--domain-name` | `None` | Action embodiment domain, for example `bridge_orig_lerobot` or `av`. |
| `--no-duration-template` | off | Skip the duration metadata sentence appended to the prompt and negative prompt. Ignored for `--num-frames 1`. |
| `--no-resolution-template` | off | Skip the resolution metadata sentence appended to the prompt and negative prompt. |
| `--output` | `.` | Directory to write `sample.jpg` or `sample.mp4`; `inverse_dynamics` and `policy` runs also write `sample_action.json`. |
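
The action examples above pass `--flow-shift 5.0`, which overrides the scheduler's flow-matching shift. The README does not spell out the formula, so the sketch below assumes the shift commonly applied to flow-matching noise schedules, `shift * sigma / (1 + (shift - 1) * sigma)`. `shift_sigma` is a hypothetical helper for illustration only:

```python
def shift_sigma(sigma: float, flow_shift: float) -> float:
    """Assumed flow-matching shift: shift * s / (1 + (shift - 1) * s)."""
    return flow_shift * sigma / (1.0 + (flow_shift - 1.0) * sigma)


# A shift above 1 pushes intermediate noise levels toward 1,
# so more of the step budget is spent at high noise.
for sigma in (0.25, 0.5, 0.75):
    print(sigma, round(shift_sigma(sigma, 5.0), 4))
```

Under this assumption a shift of `1.0` leaves the schedule unchanged, and the endpoints `0` and `1` are fixed for any shift.
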
137 changes: 117 additions & 20 deletions examples/cosmos3/inference_cosmos3.py
@@ -23,13 +23,15 @@
"""

import argparse
import json
import pathlib
import urllib.request

import torch
from huggingface_hub import snapshot_download

from diffusers import Cosmos3OmniPipeline, CosmosActionCondition, UniPCMultistepScheduler
from diffusers.utils import encode_video, export_to_video, load_image, load_video


HF_REPOS = {
@@ -38,6 +40,22 @@
}


def _load_action(path: str | None):
    if path is None:
        raise ValueError("--action-path is required for forward_dynamics mode.")
    if path.startswith(("http://", "https://")):
        with urllib.request.urlopen(path) as response:
            action = json.loads(response.read().decode("utf-8"))
    else:
        action = json.loads(pathlib.Path(path).read_text())
    tensor = torch.as_tensor(action, dtype=torch.float32)
    if tensor.ndim == 3 and tensor.shape[0] == 1:
        tensor = tensor.squeeze(0)
    if tensor.ndim != 2:
        raise ValueError(f"Cosmos3 action must have shape [T, D], got {tuple(tensor.shape)}.")
    return tensor


def main():
    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument("--prompt", required=True, help="Text prompt.")
@@ -50,24 +68,62 @@ def main():
    parser.add_argument(
        "--vision-path",
        default=None,
        help="Optional URL or local path for an image-conditioning frame, or an action conditioning video.",
    )
    parser.add_argument("--output", default=".", help="Directory to save generated video/image/audio files.")
    parser.add_argument(
        "--height",
        type=int,
        default=None,
        help="Output height in pixels (default 720). Ignored for action modes; use --resolution-tier instead.",
    )
    parser.add_argument(
        "--width",
        type=int,
        default=None,
        help="Output width in pixels (default 1280). Ignored for action modes; use --resolution-tier instead.",
    )
    parser.add_argument(
        "--num-frames",
        type=int,
        default=189,
        help="Number of frames to generate. Use 1 for text-to-image; defaults to 189 for video (≈ 7.9s @ 24 FPS).",
    )
    parser.add_argument("--fps", type=float, default=24.0)
    parser.add_argument("--guidance-scale", type=float, default=6.0, help="Classifier-free guidance scale.")
    parser.add_argument("--num-inference-steps", type=int, default=35, help="Number of denoising steps.")
    parser.add_argument(
        "--flow-shift",
        type=float,
        default=None,
        help="Override the scheduler's flow-matching shift (UniPCMultistepScheduler.flow_shift).",
    )
    parser.add_argument("--seed", type=int, default=None, help="Random seed for latent initialization.")
    parser.add_argument(
        "--enable-sound",
        action="store_true",
        default=False,
        help="Generate sound alongside video (requires a sound-capable checkpoint).",
    )
    parser.add_argument(
        "--action-mode",
        choices=["forward_dynamics", "inverse_dynamics", "policy"],
        default=None,
        help="Enable Cosmos3 action generation with a loaded conditioning video.",
    )
    parser.add_argument("--action-path", default=None, help="JSON action path for forward_dynamics mode.")
    parser.add_argument("--action-chunk-size", type=int, default=None, help="Number of action tokens to generate/use.")
    parser.add_argument("--domain-name", default=None, help="Cosmos3 action embodiment domain name.")
    parser.add_argument(
        "--resolution-tier",
        type=int,
        default=480,
        choices=[256, 480, 704, 720],
        help=(
            "Action resolution tier (256/480/704/720). Selects the aspect bin / padded conditioning canvas, "
            "not the output frame size."
        ),
    )
    parser.add_argument(
        "--no-duration-template",
        dest="add_duration_template",
@@ -108,23 +164,57 @@ def main():
    )
    print("Pipeline loaded successfully.")

    if args.flow_shift is not None:
        pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config, flow_shift=args.flow_shift)
        print(f"Scheduler flow_shift set to {args.flow_shift}.")

    output_dir = pathlib.Path(args.output)
    output_dir.mkdir(parents=True, exist_ok=True)

    generator = torch.Generator().manual_seed(args.seed) if args.seed is not None else None

    if args.action_mode is not None:
        if args.vision_path is None:
            raise ValueError("--vision-path must point to a conditioning video for action modes.")
        if args.action_chunk_size is None:
            raise ValueError("--action-chunk-size is required for action modes.")
        video = load_video(args.vision_path)
        raw_actions = _load_action(args.action_path) if args.action_mode == "forward_dynamics" else None
        result = pipeline(
            prompt=args.prompt,
            action=CosmosActionCondition(
                mode=args.action_mode,
                chunk_size=args.action_chunk_size,
                domain_name=args.domain_name,
                resolution_tier=args.resolution_tier,
                raw_actions=raw_actions,
                video=video,
            ),
            fps=args.fps,
            num_inference_steps=args.num_inference_steps,
            guidance_scale=args.guidance_scale,
            generator=generator,
            use_system_prompt=False,
            add_resolution_template=args.add_resolution_template,
            add_duration_template=args.add_duration_template,
            enable_safety_check=not args.no_safety_check,
        )
    else:
        image = load_image(args.vision_path) if args.vision_path is not None else None
        result = pipeline(
            prompt=args.prompt,
            image=image,
            num_frames=args.num_frames,
            height=args.height,
            width=args.width,
            fps=args.fps,
            num_inference_steps=args.num_inference_steps,
            enable_sound=args.enable_sound,
            guidance_scale=args.guidance_scale,
            generator=generator,
            add_resolution_template=args.add_resolution_template,
            add_duration_template=args.add_duration_template,
            enable_safety_check=not args.no_safety_check,
        )

    if args.num_frames == 1:
        save_path = output_dir / "sample.jpg"
@@ -145,6 +235,13 @@ def main():
export_to_video(result.video, str(save_path), fps=int(args.fps), quality=10, macro_block_size=1)
print(f"Saved: {save_path}")

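    if result.action is not None:
        # Every entry is written to the same file name, so if result.action
        # holds more than one entry only the last one remains on disk.
        for action in result.action:
            action_path = output_dir / "sample_action.json"
            with open(action_path, "w") as f:
                json.dump(action.tolist(), f)
            print(f"Saved: {action_path}")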


if __name__ == "__main__":
    main()
2 changes: 2 additions & 0 deletions src/diffusers/__init__.py
@@ -546,6 +546,7 @@
"Cosmos2TextToImagePipeline",
"Cosmos2VideoToWorldPipeline",
"Cosmos3OmniPipeline",
"CosmosActionCondition",
"CosmosTextToWorldPipeline",
"CosmosVideoToWorldPipeline",
"CycleDiffusionPipeline",
@@ -1357,6 +1358,7 @@
Cosmos2TextToImagePipeline,
Cosmos2VideoToWorldPipeline,
Cosmos3OmniPipeline,
CosmosActionCondition,
CosmosTextToWorldPipeline,
CosmosVideoToWorldPipeline,
CycleDiffusionPipeline,