Training generalist policies for robotic manipulation has shown significant promise, enabling language-conditioned, multi-task behaviors. Evaluating such generalist policies, however, is challenging due to the high cost, time, and effort required for real-world testing, as well as the safety risks of deploying unproven policies on real robots. Manually creating and populating simulation environments with assets for robotic manipulation has not addressed these issues, mainly because of the engineering effort required and the often large sim-to-real gaps in both physics and rendering. In this paper, we explore the use of action-conditional video generation models as a scalable, data-driven alternative for policy evaluation. We show how to add action conditioning to existing pre-trained video generation models. This allows us to leverage in-the-wild online videos during pre-training and alleviates the need for a large dataset of paired video-action data, which is expensive to collect for robotic manipulation. We examine the scaling effects of model size, dataset diversity, and pre-training on generalization. Our experiments show that, across metrics such as policy ranking and the correlation between actual and predicted policy values, these models provide a promising avenue for evaluating policies without real-world interaction.
We propose to use action-conditional video prediction models as world simulators for scalable policy evaluation. This involves (a) deploying the policy in the action-conditional video prediction model to generate rollout videos, (b) prompting a VLM to assess their success or failure, and (c) calculating the correlation with actual policy performance to assess the quality of the policy evaluation.
[Figure panels: example generated rollouts, each labeled "VLM Annotation: Success/Failure" alongside a reference "Annotation: Success/Failure".]
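To make the pipeline concrete, the sketch below rolls a policy out inside the world model and scores each generated video with a VLM. It is a minimal illustration only: the `policy.act`, `world_model.predict`, and `vlm_judge` interfaces are assumed placeholders, not the actual APIs used in this work.

```python
def evaluate_policy_in_world_model(policy, world_model, vlm_judge,
                                   initial_frames, task_prompt,
                                   num_episodes=50, horizon=64):
    """Estimate a policy's success rate by (a) rolling it out inside an
    action-conditional video prediction model and (b) scoring each
    generated rollout with a VLM. All interfaces are illustrative."""
    successes = 0
    for ep in range(num_episodes):
        # Start each episode from a real initial observation.
        frames = [initial_frames[ep % len(initial_frames)]]
        for _ in range(horizon):
            action = policy.act(frames[-1])                          # policy acts on the predicted observation
            frames.append(world_model.predict(frames[-1], action))   # world model simulates the outcome
        # Prompt a VLM with the generated video to label success / failure.
        successes += int(vlm_judge(frames, task_prompt) == "success")
    return successes / num_episodes  # predicted policy value
```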
Correlation Plots for Policy Evaluation. We plot predicted policy performance, obtained by evaluation in the world model, against actual policy performance.
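One simple way to quantify this relationship, assuming the per-policy values have already been collected, is sketched below; the choice of Pearson and Spearman (rank) correlation is ours for illustration, and the numbers are placeholders.

```python
from scipy.stats import pearsonr, spearmanr

# Placeholder values: one entry per evaluated policy.
predicted = [0.82, 0.55, 0.40, 0.91, 0.10]   # success rates estimated in the world model
actual    = [0.78, 0.60, 0.33, 0.88, 0.05]   # success rates from real rollouts

r, _   = pearsonr(predicted, actual)   # linear correlation of policy values
rho, _ = spearmanr(predicted, actual)  # rank correlation, i.e. policy-ranking quality
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```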
Quantitative Results for Action-Conditional Video Prediction. We generate the next 64 frames in an auto-regressive manner and evaluate the results on both RoboMimic (in-distribution) and policy-rollout (out-of-distribution) trajectories.
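As a rough sketch of the auto-regressive generation procedure, each predicted frame is fed back in as the conditioning frame for the next step; the `model(frame, action)` interface below is an assumed placeholder, not the actual implementation.

```python
import torch

@torch.no_grad()
def rollout_video(model, context_frame, actions, num_frames=64):
    """Auto-regressively predict `num_frames` future frames, feeding each
    prediction back in as the next conditioning frame."""
    frames = [context_frame]                        # ground-truth conditioning frame
    for t in range(num_frames):
        next_frame = model(frames[-1], actions[t])  # condition on last (predicted) frame and current action
        frames.append(next_frame)
    return torch.stack(frames[1:])                  # the generated frames
```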
Multi-View Inconsistency: Our world model takes two different views as input and predicts both views at the next timestep; the two predicted views can be mutually inconsistent.
Replicated Prediction: When the gripper displacement is very small, the diffusion model may simply replicate the conditioning image as its prediction.
Hallucinations: The generated outputs are not always physically plausible, which can reduce the effectiveness of policy evaluation.
We assess how the number of policy rollouts used for video-model training affects policy evaluation performance on the Lift and Square tasks. In general, more policy rollouts lead to better policy evaluation, and the benefit is more pronounced for the harder task (Square) than for the simpler one (Lift).