Latent Policy Steering with Embodiment-Agnostic Pretrained World Models
Under review, 2026


Yiqi Wang1
Mrinal Verghese1
Jeff Schneider1


Carnegie Mellon University




Teaser figure.

Left: We propose optical flow as an embodiment-agnostic action representation, allowing robots to make use of existing or cost-efficient data across diverse embodiments. Middle: An image-based World Model (WM) is pretrained using encoded optical flow as its action representation, and then finetuned on a target embodiment with robot actions. Right: During inference, Latent Policy Steering evaluates multiple candidate plans and executes the best one.
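The key property of this pipeline is that the WM consumes any action representation mapped into a shared embedding space, so encoded optical flow (pretraining) and robot actions (finetuning) are interchangeable. The toy sketch below illustrates that interface only; all names (`encode_flow`, `encode_robot_action`, `wm_step`) and the trivial dynamics are hypothetical stand-ins, not the paper's actual architecture.

```python
# Toy sketch (hypothetical names): a world model that accepts any action
# representation projected into a shared embedding space, so optical flow
# (embodiment-agnostic pretraining) and robot actions (target-embodiment
# finetuning) can drive the same dynamics model.
EMB_DIM = 3

def encode_flow(flow_field):
    # Embodiment-agnostic action: summarize a 2D flow field by its mean
    # displacement, padded to the embedding size.
    n = len(flow_field)
    mean_dx = sum(dx for dx, _ in flow_field) / n
    mean_dy = sum(dy for _, dy in flow_field) / n
    return [mean_dx, mean_dy, 0.0]

def encode_robot_action(action):
    # Embodiment-specific action, truncated/padded to the same embedding size.
    return (action + [0.0] * EMB_DIM)[:EMB_DIM]

def wm_step(latent, action_emb):
    # Toy dynamics: the latent drifts by the action embedding.
    return [z + a for z, a in zip(latent, action_emb)]

# A pretraining-style step driven by flow, then a finetuning-style step
# driven by a robot action, through the same wm_step interface:
z = [0.0, 0.0, 0.0]
z = wm_step(z, encode_flow([(0.1, -0.2), (0.3, 0.0)]))
z = wm_step(z, encode_robot_action([0.5, 0.5, 0.5, 0.5]))
```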




Abstract

The performance of learned robot visuomotor policies is heavily dependent on the size and quality of the training dataset. Although large-scale robot and human datasets are increasingly available, embodiment gaps and mismatched action spaces make them difficult to leverage. Our main insight is that skills performed across different embodiments produce visual similarities in motions that can be captured using off-the-shelf action representations such as optical flow. Moreover, World Models (WMs) can leverage sub-optimal data since they focus on modeling dynamics.

In this work, we aim to improve visuomotor policies in low-data regimes by first pretraining a WM using optical flow as an embodiment-agnostic action representation to leverage accessible or easily collected data from multiple embodiments (robots, humans). Given a small set of demonstrations on a target embodiment, we finetune the WM on this data to better align the WM predictions, train a base policy, and learn a robust value function. Using our finetuned WM and value function, our approach evaluates action candidates from the base policy and selects the best one to improve performance. Our approach, which we term Latent Policy Steering (LPS), improves behavior-cloned policies by 10.6% on average across four Robomimic tasks, even though most of the pretraining data comes from the real world. In the real-world experiments, LPS achieves larger gains: 70% relative improvement with 30-50 target-embodiment demonstrations, and 44% relative improvement with 60-100 demonstrations, compared to a behavior-cloned baseline.
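The selection step described above (sample candidate plans from the base policy, roll each out in the latent WM, score with the value function, execute the best) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `latent_dynamics`, `value_fn`, and `base_policy` are toy stand-ins for the finetuned WM, the learned value function, and the behavior-cloned policy.

```python
import random

LATENT_DIM = 4
GOAL = [1.0] * LATENT_DIM  # toy goal latent for the stand-in value function

def latent_dynamics(z, a):
    # Stand-in for one WM rollout step in latent space.
    return [zi + ai for zi, ai in zip(z, a)]

def value_fn(z):
    # Stand-in learned value: negative squared distance to the goal latent.
    return -sum((zi - gi) ** 2 for zi, gi in zip(z, GOAL))

def base_policy(z, horizon, n_candidates, rng):
    # Stand-in for sampling candidate action sequences from the policy.
    return [[[rng.uniform(-0.5, 0.5) for _ in range(LATENT_DIM)]
             for _ in range(horizon)]
            for _ in range(n_candidates)]

def latent_policy_steering(z0, horizon=5, n_candidates=8, seed=0):
    """Roll out each candidate plan in the latent WM, score its terminal
    latent with the value function, and return the best-scoring plan."""
    rng = random.Random(seed)
    best_plan, best_value = None, float("-inf")
    for plan in base_policy(z0, horizon, n_candidates, rng):
        z = z0
        for a in plan:
            z = latent_dynamics(z, a)
        v = value_fn(z)
        if v > best_value:
            best_plan, best_value = plan, v
    return best_plan, best_value
```

The design choice this illustrates: the WM is used only to rank plans the base policy already proposes, so steering never requires actions outside the policy's own distribution.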




The Base Policy (Left) Receives Corrections From LPS (Right)



We compare the base policy's performance against LPS's performance, given similar task initializations. LPS uses the same policy as the base-policy baseline. The red box highlights the critical moment where the policy may diverge from the expert demonstrations. The green box indicates how LPS adjusts the base policy. In the video, LPS fixes the base policy's failed attempt to pick up the radish, and turns a failed salad-sweeping episode into a largely successful one (it still leaves a few leaves on the board :-( ).




Embodiment-Agnostic vs. Embodiment-Dependent Pretrained Models

We report the number of successes out of 20 trials in the real world, given 30-50 or 60-100 demonstrations on the target embodiment. We consider the following baselines:
BC: a diffusion policy learned from scratch via behavior cloning on the target-embodiment dataset.
LPS*: Our proposed method without pretraining the WM. It leverages the BC baseline as its base policy.
HPT: A policy pretrained on 20+ embodiments and finetuned on the target-embodiment dataset, including its embodiment-specific action head and low-level encoder.
LPS-mix*: Our proposed method with an embodiment-agnostic pretrained WM. The pretraining mixture includes 7 different embodiments (6 simulated/real-world robots plus human play videos). It leverages the BC baseline as its base policy.


table1 figure

Thanks to the embodiment-agnostic pretrained WM, LPS-mix* significantly improves the base policy's (BC) performance across tasks. Although HPT was pretrained on a large-scale dataset spanning more than 20 embodiments, such an embodiment-dependent pretrained policy performs poorly when only a small amount of target-embodiment data is available for finetuning.





Scaling With Embodiments

Results figure

We explore how the performance of LPS and its embodiment-agnostic pretrained WM scales with the number of embodiments in the Robomimic benchmark, varying the pretraining mixture and comparing against a base policy. We observe a promising trend: thanks to the embodiment-agnostic WM, adding embodiments yields more usable pretraining data and better performance relative to the behavior-cloned baseline. While embodiment-dependent pretrained policies such as HPT could also scale with embodiments, we do not compare against HPT here, since HPT pretrained on more than 20 embodiments performs sub-optimally given a small amount of finetuning data on the target embodiment, as shown in the real-world experiments above.

Moreover, LPS-mix* improves the base policy's average success rate by 10.6%, even though the evaluation is in simulation while the majority of the pretraining data was collected in the real world across different robots and humans.



Paper

Paper thumbnail

Latent Policy Steering with Embodiment-Agnostic Pretrained World Models

Yiqi Wang, Mrinal Verghese, Jeff Schneider

Under review.

@article{wang2025latent,
    title = {Latent policy steering with embodiment-agnostic pretrained world models},
    author = {Wang, Yiqi and Verghese, Mrinal and Schneider, Jeff},
    journal = {arXiv preprint arXiv:2507.13340},
    year = {2025},
}



Acknowledgements

This template was originally made by Phillip Isola and Richard Zhang for a colorful project, and inherits the modifications made by Jason Zhang. The code can be found here.