The Problem: Reactive vs. Proactive Planning
Standard LLM agents are fundamentally reactive. They struggle with long-horizon tasks because they lack an internal world model that can simulate outcomes before they commit to an action. Some models can mimic foresight, but they often suffer from a "format-capability gap": they produce plausible-looking plans that have no genuine predictive grounding. The authors argue that effective world modeling cannot come from fine-tuning on the output format alone and instead requires a structured, capability-first training pipeline.
A Three-Stage Training Paradigm
To bridge the gap between superficial mimicry and grounded foresight, the authors propose a three-stage training pipeline in which the model learns to verbalize both a prospective state rollout and a plan-conditioned success estimate (a text-based analogue of a Q-value). The stages are:
- World Model Agentic Mid-Training (WM-AMT): This stage focuses on injecting latent predictive capabilities into the policy, ensuring the model learns to represent future states internally.
- Format-Eliciting SFT (FE-SFT): Once the capability is present, this stage structures the output to ensure the model can consistently express its foresight in a usable, textual format.
- Foresight-Conditioned Reinforcement Learning (FC-RL): The final stage refines the model's ability to calibrate its simulations, ensuring that the generated "what-if" scenarios are both accurate and useful for decision-making.
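To make the two verbalized outputs concrete, here is a minimal Python sketch of how a rollout and a success estimate might be parsed and used. The summary does not specify the output format, so the `<rollout>` and `<success>` tags, the `Foresight` record, and the helper names are all hypothetical. The Brier score is a standard calibration measure used here only for illustration; it is not claimed to be the reward the authors use in FC-RL.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical tag names: the source does not define the textual format.
ROLLOUT_RE = re.compile(r"<rollout>(.*?)</rollout>", re.DOTALL)
SUCCESS_RE = re.compile(r"<success>\s*(\d+(?:\.\d+)?)\s*</success>")


@dataclass
class Foresight:
    plan: str       # the candidate plan the model was conditioned on
    rollout: str    # verbalized prediction of the resulting state
    success: float  # verbalized success estimate in [0, 1]


def parse_foresight(plan: str, text: str) -> Optional[Foresight]:
    """Extract rollout and success estimate; return None if malformed."""
    rollout = ROLLOUT_RE.search(text)
    success = SUCCESS_RE.search(text)
    if rollout is None or success is None:
        return None
    value = float(success.group(1))
    if not 0.0 <= value <= 1.0:
        return None  # reject estimates that are not valid probabilities
    return Foresight(plan, rollout.group(1).strip(), value)


def pick_plan(candidates: List[Foresight]) -> Foresight:
    """Choose the plan with the highest verbalized success estimate."""
    return max(candidates, key=lambda f: f.success)


def brier_score(predicted: List[float], outcomes: List[int]) -> float:
    """Mean squared error between estimates and 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(predicted, outcomes)) / len(predicted)


if __name__ == "__main__":
    a = parse_foresight(
        "search the paper title",
        "<rollout>The query returns the paper page.</rollout><success>0.8</success>",
    )
    b = parse_foresight(
        "guess the answer",
        "<rollout>The answer is likely wrong.</rollout><success>0.2</success>",
    )
    best = pick_plan([f for f in (a, b) if f is not None])
    print(best.plan)
    print(brier_score([0.8, 0.2], [1, 0]))
```

In this sketch, a well-formed output lets the agent compare candidate plans before acting, and a malformed one is simply discarded. A score such as the Brier score rewards estimates that match observed outcomes, which is one way to think about what "calibrating" the simulations could mean.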
By separating the acquisition of predictive capability from the formatting and calibration stages, the model develops a more robust internal world model. The authors report that this approach consistently outperforms standard training baselines on search and mathematical reasoning tasks, which they take as evidence that grounded foresight is achievable through a deliberate, multi-stage training process.