Thanks for the great question!

To clarify: all post-training rows in Extended Data Table 1 (except the SFT row) are closed-loop RL, including "post-training on common logs". The difference between rows is not open-loop vs. closed-loop, but rather the source of scenarios and the behaviour model used for the other agents. The reward structure is the same across all configurations: a hard-constraint gate for collision / drivable-area compliance, multiplied by a weighted average of soft objectives (progress, TTC, comfort).

Here is what each row means:

- **Supervised fine-tuning on rare logs** — the only non-RL row. Standard imitation learning on rare-event logs, with no reward signal. Raw data.
- **Post-training on common logs** — closed-loop RL, but scenarios are drawn from ordinary driving logs (not long-tail). Raw data.
- **Post-training on rare logs** — closed-loop RL on failure-prone scenarios discovered from real logs. Raw data.
- **Post-training on rare synthetic replays** — RL on rare scenarios where the other agents' behaviour is faithfully replayed from logs (non-reactive). Synthetic data (3DGS).
- **Post-training on rare rollouts w/o Behaviour WM** — RL on rare scenarios that are feasible but on which the policy previously failed. Synthetic data (3DGS) + raw data.
- **Post-training with World Engine (full)** — builds on the above by adding the Behaviour World Model, which generates diverse counterfactual traffic variations via goal conditioning and optimization guidance. Synthetic data (3DGS) + raw data.

Also, once the arXiv preprint is ready, you will find more details there. Coming soon.
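The reward structure described above can be sketched as follows. This is a minimal illustration only: the function name, weight values, and the assumption that each soft term is normalized to [0, 1] are mine, not from the paper.

```python
def gated_reward(collision: bool, off_drivable: bool,
                 progress: float, ttc_margin: float, comfort: float,
                 w_progress: float = 0.4, w_ttc: float = 0.3,
                 w_comfort: float = 0.3) -> float:
    """Hard-constraint gate multiplied by a weighted average of soft objectives.

    Soft terms (progress, TTC margin, comfort) are assumed to lie in [0, 1];
    the weights are illustrative placeholders, not the paper's values.
    """
    # Hard gate: any collision or drivable-area violation zeroes the reward,
    # so no amount of progress or comfort can compensate for a safety breach.
    gate = 0.0 if (collision or off_drivable) else 1.0

    # Weighted average of the soft objectives.
    total_w = w_progress + w_ttc + w_comfort
    soft = (w_progress * progress + w_ttc * ttc_margin
            + w_comfort * comfort) / total_w

    return gate * soft
```

The multiplicative gate (rather than an additive penalty) is what makes collision / drivable-area compliance a hard constraint: a violating rollout scores exactly zero regardless of the soft terms.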
