Training generalist policies for robotic manipulation has shown significant promise, enabling language-conditioned, multi-task behaviors. Evaluating such generalist policies, however, is challenging due to the high cost, time, and effort required for real-world testing, as well as the safety risks of deploying unproven policies on real robots. Manually creating and populating simulation environments with assets for robotic manipulation has not addressed these issues, mainly because of the engineering effort required and the often large sim-to-real gaps in both physics and rendering. In this paper, we explore the use of action-conditional video generation models as a scalable, data-driven alternative for policy evaluation. We show how to add action conditioning to existing pre-trained video generation models. This allows us to leverage in-the-wild online videos during pre-training and alleviates the need for a large dataset of paired video-action data, which is expensive to collect for robotic manipulation. We examine the scaling effects of model size, dataset diversity, and pre-training on generalization. Our experiments show that, across metrics such as policy ranking and the correlation between actual and predicted policy values, these models provide a promising avenue for evaluating policies without real-world interaction.
We propose to use action-conditional video prediction models as world simulators for scalable policy evaluation. This involves (a) deploying the policy in the action-conditional video prediction model to generate rollout videos, (b) prompting a VLM to assess their success or failure, and (c) calculating the correlation with actual policy performance to assess the quality of the policy evaluation.
[Figure panels: example generated rollouts, each labeled "VLM Annotation: Success/Failure" alongside a reference "Annotation: Success/Failure".]
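To make the pipeline concrete, the sketch below rolls a policy out inside the world model and scores each generated video with a VLM. It is a minimal illustration only: the `policy.act`, `world_model.predict`, and `vlm_judge` interfaces are assumed placeholders, not the actual APIs used in this work.

```python
def evaluate_policy_in_world_model(policy, world_model, vlm_judge,
                                   initial_frames, task_prompt,
                                   num_episodes=50, horizon=64):
    """Estimate a policy's success rate by (a) rolling it out inside an
    action-conditional video prediction model and (b) scoring each
    generated rollout with a VLM. All interfaces are illustrative."""
    successes = 0
    for ep in range(num_episodes):
        # Start each episode from a real initial observation.
        frames = [initial_frames[ep % len(initial_frames)]]
        for _ in range(horizon):
            action = policy.act(frames[-1])                          # policy acts on the predicted observation
            frames.append(world_model.predict(frames[-1], action))   # world model simulates the outcome
        # Prompt a VLM with the generated video to label success / failure.
        successes += int(vlm_judge(frames, task_prompt) == "success")
    return successes / num_episodes  # predicted policy value
```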
Correlation Plots for Policy Evaluation. We plot predicted policy performance, obtained by evaluation in the world model, against actual policy performance.
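One simple way to quantify this relationship, assuming the per-policy values have already been collected, is sketched below; the choice of Pearson and Spearman (rank) correlation is ours for illustration, and the numbers are placeholders.

```python
from scipy.stats import pearsonr, spearmanr

# Placeholder values: one entry per evaluated policy.
predicted = [0.82, 0.55, 0.40, 0.91, 0.10]   # success rates estimated in the world model
actual    = [0.78, 0.60, 0.33, 0.88, 0.05]   # success rates from real rollouts

r, _   = pearsonr(predicted, actual)   # linear correlation of policy values
rho, _ = spearmanr(predicted, actual)  # rank correlation, i.e. policy-ranking quality
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```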
Quantitative Results for Action-Conditional Video Prediction. We generate the next 64 frames in an auto-regressive manner and evaluate the results on both RoboMimic (in-distribution) and policy-rollout (out-of-distribution) trajectories.
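As a rough sketch of the auto-regressive generation procedure, each predicted frame is fed back in as the conditioning frame for the next step; the `model(frame, action)` interface below is an assumed placeholder, not the actual implementation.

```python
import torch

@torch.no_grad()
def rollout_video(model, context_frame, actions, num_frames=64):
    """Auto-regressively predict `num_frames` future frames, feeding each
    prediction back in as the next conditioning frame."""
    frames = [context_frame]                        # ground-truth conditioning frame
    for t in range(num_frames):
        next_frame = model(frames[-1], actions[t])  # condition on last (predicted) frame and current action
        frames.append(next_frame)
    return torch.stack(frames[1:])                  # the generated frames
```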
Multi-View Inconsistency: Our world model takes two different views as input and predicts both views at the next timestep; the two predicted views can be mutually inconsistent.
Replicated Prediction: When the gripper displacement is very small, the diffusion model may simply replicate the conditioning image as its prediction.
Hallucinations: The generated outputs are not always physically plausible, which can reduce the effectiveness of policy evaluation.
We assess how the number of policy rollouts used for video-model training affects policy evaluation performance on the Lift and Square tasks. In general, more policy rollouts lead to better policy evaluation, and the benefit is more pronounced for the harder task (Square) than for the simpler one (Lift).