XiaomiAuto WorldModel — Joint World Model for Autonomous Driving

Abstract: This report presents a unified technical system addressing the two core capabilities of world models for autonomous driving: world representation and world generation. For world representation, we propose WorldRec, a feed-forward reconstruction architecture driven by sparse scene queries that initializes structured queries in 3D space, aggregates cross-view and cross-temporal features, and yields compact yet high-fidelity 3D Gaussian scene representations. For world generation, we propose WorldGen, a two-stage training framework of bidirectional pretraining followed by causal fine-tuning through three progressive stages (Teacher Forcing, ODE distillation, and DMD), enabling high-quality online causal video generation in as few as 4 denoising steps. Building on both modules, we further introduce the Joint World Model, which deeply integrates WorldRec and WorldGen to achieve synergistic gains in generation stability, cross-frame consistency, and visual fidelity, providing a solid foundation for closed-loop simulation, data synthesis, and end-to-end training in autonomous driving.

Overview

World models have emerged as a foundational paradigm for autonomous driving, enabling data synthesis, closed-loop training, and closed-loop simulation. Recent work has converged on a hybrid reconstruction-generation paradigm that first constructs a 3D scene representation from multi-view observations, then uses that geometric prior to condition video generation. Despite promising results, existing methods face three persistent bottlenecks: per-scene 3D Gaussian optimization requires multi-hour training and produces ghosting artifacts at scale; causal generation models lack strong scene priors and need too many denoising steps at inference; and reconstruction and generation remain loosely coupled, making it difficult to reconcile geometric fidelity with generative diversity.

This report presents a unified framework that addresses all three challenges. WorldRec replaces costly per-scene optimization with sparse-query aggregation, completing reconstruction in ~10 seconds while eliminating ghosting artifacts. WorldGen combines bidirectional pretraining with progressive causal fine-tuning (Teacher Forcing → ODE distillation → DMD) to achieve stable generation of up to one-minute videos at 0.19s/frame. The Joint World Model tightly couples both modules through incremental scene fusion and ego-projected rendered-prior conditioning, yielding synergistic gains in temporal stability, cross-view consistency, and visual fidelity.

Figure 1: Comparison of reconstruction-only, generation-only, and our Joint World Model.

Table 1: Comparison of different world modeling paradigms

Capability	Recon-only	Gen-only	NeoVerse	AlpaDreams	Ours
Explicit 3D Scene	✓	✗	✓	✗	✓
Generative Capability	✗	✓	✓	✓	✓
Novel View Synthesis	✓	✓	✓	✓	✓
Future Prediction	✗	✓	✓	✓	✓
Geometry Consistency	Strong	Weak	Medium	Weak	Strong
Long-horizon Stability	Static	Drift	Medium	Medium	Stable

Method

2.1 WorldRec

Most existing feed-forward methods predict pixel-aligned 3D Gaussians via a DPT head, which causes ghosting artifacts and an explosion in the number of Gaussian primitives. WorldRec instead represents scenes as compact sparse tokens: each token aggregates features from multi-view, multi-temporal observations through a visibility-aware attention mechanism, which explicitly enforces multi-view consistency and yields compact, high-fidelity 3D Gaussian representations. Reconstructing a scene from a 10-second clip takes ~10 seconds, versus ~4 hours for per-scene optimization baselines.
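
The report does not spell out how the visibility-aware attention works, so the following is only a minimal sketch under stated assumptions: each sparse query is anchored at a 3D point, each camera is a pinhole model with a world-to-camera rotation `R` and translation `t`, and a view takes part in attention only if the point projects inside its image with positive depth. The names `project`, `is_visible`, and `aggregate` are hypothetical, and for brevity each view contributes a single feature vector rather than a feature sampled at the projected pixel.

```python
import math

def project(point, cam):
    """Project a world-space point into a pinhole camera; None if behind it."""
    R, t = cam["R"], cam["t"]
    # World -> camera coordinates: x_cam = R @ x_world + t
    xc = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 1e-6:
        return None
    u = cam["fx"] * xc[0] / xc[2] + cam["cx"]
    v = cam["fy"] * xc[1] / xc[2] + cam["cy"]
    return u, v, xc[2]

def is_visible(point, cam):
    """Assumed visibility test: in front of the camera and inside the image."""
    p = project(point, cam)
    if p is None:
        return False
    u, v, _ = p
    return 0 <= u < cam["width"] and 0 <= v < cam["height"]

def aggregate(query_feat, point, cams, view_feats):
    """Softmax attention over the views that actually see the query point."""
    dim = len(query_feat)
    scores, feats = [], []
    for cam, feat in zip(cams, view_feats):
        if not is_visible(point, cam):
            continue  # masked out: this view cannot contribute
        dot = sum(q * k for q, k in zip(query_feat, feat))
        scores.append(dot / math.sqrt(dim))
        feats.append(feat)
    if not scores:
        return list(query_feat)  # no view sees the point: keep the query as is
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w / z * f[d] for w, f in zip(weights, feats)) for d in range(dim)]

# Toy rig (illustrative values): one camera looking along +z, one along -z.
front = {"R": [[1, 0, 0], [0, 1, 0], [0, 0, 1]], "t": [0, 0, 0],
         "fx": 100, "fy": 100, "cx": 50, "cy": 50, "width": 100, "height": 100}
back = dict(front, R=[[-1, 0, 0], [0, 1, 0], [0, 0, -1]])
fused = aggregate([1.0, 0.0], [0.0, 0.0, 5.0], [front, back],
                  [[2.0, 3.0], [100.0, 100.0]])
print(fused)
```

Because the point lies behind the second camera, only the first view's feature reaches the token. Masking of this kind is one plausible way to keep features from views that cannot see a surface out of the aggregate.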

Figure 2: WorldRec architecture: sparse query-driven feed-forward reconstruction pipeline.

2.2 WorldGen

WorldGen adopts a Diffusion Transformer (DiT) backbone and a two-stage training framework: bidirectional pretraining → causal fine-tuning. Stage 1 trains with full bidirectional temporal attention to learn global spatiotemporal distributions. Stage 2 introduces a causal mask and progressively refines the model in three steps: Teacher Forcing, which imposes the causal constraint; ODE distillation, which cuts sampling from 50 steps to 4 for a roughly 12× speedup; and DMD (Distribution Matching Distillation), which closes the train-inference distribution gap, also known as exposure bias. The result is high-quality real-time causal video generation.
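
The difference between the two stages comes down to the temporal attention mask. The sketch below is an illustration rather than the report's implementation: it assumes the causal mask is applied at frame granularity, so a token may attend to every token in its own frame and in earlier frames. The function names and the `tokens_per_frame` parameter are hypothetical.

```python
def bidirectional_mask(num_frames, tokens_per_frame):
    """Stage 1: every token may attend to every other token."""
    n = num_frames * tokens_per_frame
    return [[True] * n for _ in range(n)]

def block_causal_mask(num_frames, tokens_per_frame):
    """Stage 2 (assumed frame-level causality): a query token sees only
    tokens from its own frame and from earlier frames."""
    n = num_frames * tokens_per_frame
    return [
        [(k // tokens_per_frame) <= (q // tokens_per_frame) for k in range(n)]
        for q in range(n)
    ]

# 3 frames of 2 tokens each; rows are queries, columns are keys.
for row in block_causal_mask(3, 2):
    print("".join("x" if allowed else "." for allowed in row))
```

Under such a mask, frame t can be generated as soon as frames before it exist, which is what makes online rollout possible. On the distillation side, the 50 → 4 step reduction is a 12.5× cut in network evaluations per frame, in line with the roughly 12× speedup quoted above.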

Figure 3: WorldGen architecture and two-stage training framework (bidirectional pretraining → causal fine-tuning).

2.3 Joint World Model

The Joint World Model deeply integrates WorldRec and WorldGen so that each module's strengths compensate for the other's limitations. WorldRec provides reliable 4D spatial anchors that prevent geometric drift during generation; WorldGen fills unseen regions with coherent content that reconstruction alone cannot synthesize.

Figure 4: Joint World Model architecture.

To enable tight coupling, both modules are adapted. WorldRec gains incremental reconstruction via scene fusion: new observations update and extend the existing Gaussian representation, so the reconstructed scene grows as the vehicle moves. WorldGen gains rendered-RGB conditioning: scene tokens are rasterized to the target views, producing partial reference images that are injected into the DiT as additional conditioning and guide synthesis in unobserved regions.
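
The fusion rule itself is not described in the report. As a rough illustration of what "update and extend" could mean, the sketch below keys Gaussians by a voxel index: a new Gaussian that lands in an occupied voxel is merged into the existing one by a running average (update), and one that lands in an empty voxel is added (extend). The voxel hashing, the averaging rule, and the names `voxel_key` and `fuse` are all assumptions, and only position and color are tracked for brevity.

```python
import math

def voxel_key(position, voxel_size):
    """Integer voxel index containing a 3D position."""
    return tuple(math.floor(c / voxel_size) for c in position)

def fuse(scene, new_gaussians, voxel_size=0.5):
    """Merge new Gaussians into `scene` (dict: voxel key -> Gaussian)."""
    for g in new_gaussians:
        key = voxel_key(g["position"], voxel_size)
        old = scene.get(key)
        if old is None:
            # Extend: previously unobserved region.
            scene[key] = {"position": list(g["position"]),
                          "color": list(g["color"]), "count": 1}
            continue
        # Update: running average over all observations of this voxel.
        n = old["count"]
        old["position"] = [(n * a + b) / (n + 1)
                           for a, b in zip(old["position"], g["position"])]
        old["color"] = [(n * a + b) / (n + 1)
                        for a, b in zip(old["color"], g["color"])]
        old["count"] = n + 1
    return scene

scene = {}
fuse(scene, [{"position": [0.0, 0.0, 0.0], "color": [1.0, 0.0, 0.0]}])
fuse(scene, [{"position": [0.25, 0.0, 0.0], "color": [0.0, 0.0, 1.0]},  # same voxel
             {"position": [2.0, 0.0, 0.0], "color": [0.0, 1.0, 0.0]}])  # new voxel
print(len(scene), scene[(0, 0, 0)])
```

A learned fusion module would replace the hand-written averaging, but the interface is the point here: the scene state persists across calls and grows only where new regions come into view.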

Figure 5(a) JointWM-WorldRec: incremental reconstruction via scene fusion.

Figure 5(b) JointWM-WorldGen: RGB conditioning from ego-projected rendered priors.

High stability: Deterministic geometric constraints from WorldRec suppress error accumulation and content drift during long-horizon autoregressive generation.
High consistency: The 4D scene representation acts as shared cross-frame memory, keeping object positions, lighting, and textures globally consistent across time steps and viewpoints.
High fidelity: Rich conditioning signals from WorldGen, combined with strong supervision from the reconstruction module, bring synthesized content closer to real sensor observations.

On H20 GPUs, WorldRec reconstructs a 10-second clip in ~10 seconds; WorldGen generates at 0.19s/frame (single view) and 0.46s/frame (three views).
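
To put the per-frame figures in context, the short calculation below converts them into wall-clock time for a full rollout. It takes the reported 0.19 s/frame and 0.46 s/frame at face value and assumes the per-frame cost stays constant over the whole sequence, which the report does not state; `rollout_seconds` is a hypothetical helper.

```python
def rollout_seconds(duration_s, fps, sec_per_frame):
    """Wall-clock time to generate `duration_s` seconds of video at `fps`."""
    num_frames = round(duration_s * fps)
    return num_frames * sec_per_frame

SINGLE_VIEW = 0.19  # s/frame, as reported
THREE_VIEW = 0.46   # s/frame, as reported

for fps in (10, 30):
    for label, cost in (("single view", SINGLE_VIEW), ("three views", THREE_VIEW)):
        total = rollout_seconds(60, fps, cost)
        print(f"1 min @ {fps} Hz, {label}: {total:.0f} s")
```

Under these assumptions, a one-minute single-view clip takes about 114 s to generate at 10 Hz (600 frames) and about 342 s at 30 Hz (1,800 frames); the three-view setting at 10 Hz takes about 276 s.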

Results

3.1 WorldRec

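Table 2 presents quantitative comparisons on Waymo and nuScenes. Our method achieves the best PSNR and SSIM in every setting, exceeding the strongest baseline, DGGT, by 1.07 dB PSNR on Waymo, 1.23 dB on nuScenes zero-shot, and 0.87 dB on nuScenes after fine-tuning. Reconstruction of a 10-second clip completes in ~10 seconds, versus ~4 hours for per-scene optimization.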

Table 2: Quantitative results on Waymo and nuScenes (PSNR↑ / SSIM↑).

Method	Waymo		nuScenes Zero-Shot		nuScenes Fine-Tuning
Method	PSNR↑	SSIM↑	PSNR↑	SSIM↑	PSNR↑	SSIM↑
MVSplat	20.56	0.697	17.84	0.563	—	—
NoPoSplat	24.31	0.751	19.75	0.545	—	—
DepthSplat	23.26	0.696	19.52	0.601	—	—
STORM	26.38	0.794	17.77	0.669	24.54	0.784
DGGT	27.41	0.846	25.31	0.794	26.63	0.813
Ours	28.48	0.861	26.54	0.821	27.50	0.826
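
For readers less familiar with the PSNR column, the sketch below shows the standard definition, PSNR = 10 · log10(MAX² / MSE). It operates on flat lists of pixel values in [0, 1]; the evaluation protocol behind Table 2 (resolution, masking, color space) is not given in the report, so this is the textbook metric only, not a reproduction of that protocol.

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_val ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01.
print(round(psnr([0.5] * 4, [0.4] * 4), 2))
```

Because the scale is logarithmic, a gain of 1 dB corresponds to roughly 20% lower mean squared error, which helps when reading the differences between rows.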

Driving-view reconstruction quality

Video 1: Driving-view reconstruction (example 1)

Video 2: Driving-view reconstruction (example 2)

Video 3: Driving-view reconstruction (example 3)

Bird's-eye view reconstruction quality

Video 4: Bird's-eye view reconstruction (example 1)

Video 5: Bird's-eye view reconstruction (example 2)

Video 6: Bird's-eye view reconstruction (example 3)

3.2 WorldGen

Table 3 compares WorldGen against leading driving world models on nuScenes. As an autoregressive (AR) model, WorldGen achieves an FID of 7.04 and an FVD of 64.97, the lowest FVD among all listed models. It generates 81-frame sequences, far longer than the 8 to 16 frames of most baselines, at 0.19 s/frame, about 5.6× faster than the only other AR method listed (Epona, at 1.06 s/frame).

Table 3: Comparison of driving world models on nuScenes dataset.

Model	Bi / AR	Venue	FID↓	FVD↓	Frames	Infer. Time (per frame)
MagicDrive	Bi	ICLR'24	16.20	—	1	—
MagicDrive-V2	Bi	ICCV'25	20.91	94.84	16	—
Vista	Bi	NeurIPS'24	6.9	89.4	16	—
DiVE	Bi	arXiv'25	7.14	68.4	8	—
Delphi	Bi	arXiv'24	15.08	113.5	8	—
UniScene	Bi	CVPR'25	6.12	70.52	8	—
Genesis	Bi	NeurIPS'25	6.45	67.87	16	—
Epona	AR	ICCV'25	7.5	82.8	16	1.06s
WorldGen (Ours)	AR	—	7.04	64.97	81	0.19s

Long-tail animal scene generation

Video 7: Long-tail: horse entering the roadway (1)

Video 8: Long-tail: horse entering the roadway (2)

Video 9: Long-tail: tiger intruding on the road

Extreme weather scene generation

Video 10: Extreme weather (example 1)

Video 11: Extreme weather (example 2)

Video 12: Extreme weather (example 3)

Controllable long-horizon video generation (10Hz/30Hz, up to 1 minute)

Video 18: Controllable long-horizon, 10Hz, ≤1 min (1)

Video 19: Controllable long-horizon, 10Hz, ≤1 min (2)

Video 20: Controllable long-horizon, 30Hz, ≤1 min (3)

3.3 Joint World Model

We evaluate the Joint World Model along three dimensions: long-horizon temporal consistency, multi-view spatial consistency, and multi-run stability. The geometric prior from WorldRec anchors the generative process, preventing drift and hallucination while maintaining globally coherent scene content.

Long-horizon temporal consistency

Video 21: Long-horizon consistency: scene geometry and dynamic content remain stable without drift

Video 22: Long-horizon consistency: WorldRec's geometric prior preserves structural coherence throughout

Video 23: Long-horizon consistency: road structure, lighting, and dynamic actors maintained

Multi-view spatial consistency

Figure 6: Multi-view spatial consistency: the 4D scene prior from WorldRec ensures globally coherent object positions, lighting, and surface textures across the full camera rig.

Multi-run stability

Figure 7: Multi-run stability: repeated inference under identical conditions yields structurally consistent outputs.

Conclusion

This report has systematically presented the technical designs and experimental results of WorldRec, WorldGen, and the Joint World Model.

WorldRec breaks through two long-standing bottlenecks, multi-hour per-scene optimization and Gaussian primitive explosion, by adopting a sparse-query-driven feed-forward paradigm that compresses reconstruction time from hours to seconds. WorldGen achieves high-quality long-horizon video generation with only 4 denoising steps through bidirectional pretraining followed by causal fine-tuning (Teacher Forcing → ODE distillation → DMD), while setting a state-of-the-art FVD of 64.97 on nuScenes at an inference speed of 0.19 s/frame. The Joint World Model integrates both: WorldRec's deterministic geometric constraints suppress generative drift, while WorldGen's generative capacity compensates for the limits of reconstruction in unseen regions, yielding synergistic gains in stability, consistency, and fidelity.

The technical system presented in this report provides a complete solution for constructing high-quality autonomous driving world models suitable for closed-loop simulation, data synthesis, and end-to-end training.
