Thanks for the great question!

To clarify: all post-training rows in Extended Data Table 1 (except the SFT row) are closed-loop RL, including "post-training on common logs". The difference between rows is not open-loop vs. closed-loop, but rather the source of scenarios and the behaviour model used for the other agents. The reward structure is the same across all configurations: a hard-constraint gate for collision / drivable-area compliance, multiplied by a weighted average of soft objectives (progress, TTC, comfort).

Here is what each row means:

- **Supervised fine-tuning on rare logs** — the only non-RL row. Standard imitation learning on rare-event logs, with no reward signal. Raw data.
- **Post-training on common logs** — closed-loop RL, but scenarios are drawn from ordinary driving logs (not long-tail). Raw data.
- **Post-training on rare logs** — closed-loop RL on failure-prone scenarios discovered from real logs. Raw data.
- **Post-training on rare synthetic replays** — RL on rare scenarios where the other agents' behaviour is faithfully replayed from logs (non-reactive). Synthetic data (3DGS).
- **Post-training on rare rollouts w/o Behaviour WM** — RL on rare scenarios that are feasible but on which the policy previously failed. Synthetic data (3DGS) + raw data.
- **Post-training with World Engine (full)** — builds on the above by adding the Behaviour World Model, which generates diverse counterfactual traffic variations via goal conditioning and optimization guidance. Synthetic data (3DGS) + raw data.

Also, once the arXiv preprint is ready, you will find more details there. Coming soon.
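The reward structure described above can be sketched as follows. This is a minimal illustration only: the function name, weight values, and the assumption that each soft term is normalized to [0, 1] are mine, not from the paper.

```python
def gated_reward(collision: bool, off_drivable: bool,
                 progress: float, ttc_margin: float, comfort: float,
                 w_progress: float = 0.4, w_ttc: float = 0.3,
                 w_comfort: float = 0.3) -> float:
    """Hard-constraint gate multiplied by a weighted average of soft objectives.

    Soft terms (progress, TTC margin, comfort) are assumed to lie in [0, 1];
    the weights are illustrative placeholders, not the paper's values.
    """
    # Hard gate: any collision or drivable-area violation zeroes the reward,
    # so no amount of progress or comfort can compensate for a safety breach.
    gate = 0.0 if (collision or off_drivable) else 1.0

    # Weighted average of the soft objectives.
    total_w = w_progress + w_ttc + w_comfort
    soft = (w_progress * progress + w_ttc * ttc_margin
            + w_comfort * comfort) / total_w

    return gate * soft
```

The multiplicative gate (rather than an additive penalty) is what makes collision / drivable-area compliance a hard constraint: a violating rollout scores exactly zero regardless of the soft terms.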
