Internalizing Future-Aware Planning in LLM Agents

The Problem: Reactive vs. Proactive Planning

Standard LLM agents are fundamentally reactive, struggling with long-horizon tasks because they lack an internal world model to simulate outcomes before committing to an action. While some models can mimic foresight, they often suffer from a "format-capability gap," where they produce plausible-looking plans without genuine predictive grounding. The authors argue that effective world modeling requires moving beyond simple fine-tuning to a structured, capability-first training pipeline.

A Three-Stage Training Paradigm

To bridge the gap between superficial mimicry and grounded foresight, the authors propose a unified training approach that forces the model to verbalize both a prospective state rollout and a plan-conditioned success estimate (a text-based equivalent of a Q-value):

World Model Agentic Mid-Training (WM-AMT): This stage focuses on injecting latent predictive capabilities into the policy, ensuring the model learns to represent future states internally.
Format-Eliciting SFT (FE-SFT): Once the capability is present, this stage structures the output to ensure the model can consistently express its foresight in a usable, textual format.
Foresight-Conditioned Reinforcement Learning (FC-RL): The final stage refines the model's ability to calibrate its simulations, ensuring that the generated "what-if" scenarios are both accurate and useful for decision-making.
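
To make the idea of verbalized foresight concrete, here is a minimal sketch of how an agent harness might consume it. The source does not specify an output format, so the `<rollout>` and `<success>` tags, the `Foresight` class, and the `parse_foresight` and `pick_plan` helpers are all hypothetical names chosen for illustration. The sketch parses a predicted rollout and a success estimate from each candidate plan's text, then selects the plan with the highest estimate.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical tag names: the source does not define the textual format.
ROLLOUT_RE = re.compile(r"<rollout>(.*?)</rollout>", re.DOTALL)
SUCCESS_RE = re.compile(r"<success>\s*([01](?:\.\d+)?)\s*</success>")


@dataclass
class Foresight:
    plan: str       # the candidate plan the model proposed
    rollout: str    # the verbalized prediction of future states
    success: float  # plan-conditioned success estimate in [0, 1]


def parse_foresight(plan: str, text: str) -> Optional[Foresight]:
    """Extract rollout and success estimate; return None if malformed."""
    rollout = ROLLOUT_RE.search(text)
    success = SUCCESS_RE.search(text)
    if rollout is None or success is None:
        return None
    value = float(success.group(1))
    if not 0.0 <= value <= 1.0:
        return None
    return Foresight(plan, rollout.group(1).strip(), value)


def pick_plan(candidates: dict[str, str]) -> Optional[Foresight]:
    """Choose the well-formed candidate with the highest success estimate."""
    parsed = [
        f for plan, text in candidates.items()
        if (f := parse_foresight(plan, text)) is not None
    ]
    return max(parsed, key=lambda f: f.success, default=None)


candidates = {
    "search first": "<rollout>query returns relevant docs</rollout><success>0.8</success>",
    "answer now": "<rollout>answer is a guess</rollout><success>0.3</success>",
}
best = pick_plan(candidates)
```

Treating the success estimate like a Q-value means the harness only needs an argmax over candidates. Malformed outputs are dropped rather than repaired, which keeps the selection step independent of how well the format has been learned.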

By separating the acquisition of predictive capability from the formatting and calibration stages, the model develops a more robust internal world model. The authors report that this approach consistently outperforms standard training baselines on search and mathematical reasoning tasks, which they take as evidence that grounded foresight is achievable through a deliberate, multi-stage training process.
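
The calibration goal of the final stage can be illustrated with a simple scoring rule. The source does not describe the reward actually used in FC-RL, so the following is an assumption: a Brier-style reward that compares the verbalized success estimate with the observed task outcome. The function name `calibration_reward` is hypothetical.

```python
def calibration_reward(predicted_success: float, task_succeeded: bool) -> float:
    """Brier-style reward (an illustrative assumption, not the paper's reward).

    Returns 1.0 when the estimate matches the outcome exactly and 0.0 when it
    is maximally wrong, so confident but incorrect foresight is penalized most.
    """
    if not 0.0 <= predicted_success <= 1.0:
        raise ValueError("predicted_success must lie in [0, 1]")
    outcome = 1.0 if task_succeeded else 0.0
    return 1.0 - (predicted_success - outcome) ** 2


# A confident, correct estimate scores higher than a hedged one.
confident = calibration_reward(0.9, True)
hedged = calibration_reward(0.5, True)
```

Under a rule like this, a model cannot raise its reward by always reporting high confidence, because every overconfident failure costs more than an honest low estimate would.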
