Uni-Embodied: Towards Unified Evaluation for Embodied Planning, Perception, and Execution

Lingfeng Zhang1,*, Yingbo Tang2,*, Xinyu Zheng3, Qiang Zhang4,6, Yu Liu5, Renjing Xu6, Xiaoshuai Hao7,†
1Shenzhen International Graduate School, Tsinghua University
2Institute of Automation, Chinese Academy of Sciences
3National Maglev Transportation Engineering Research and Development Center, Tongji University
4X-Humanoid
5Hefei University of Technology
6The Hong Kong University of Science and Technology (Guangzhou)
7Beijing Academy of Artificial Intelligence (BAAI)

*Co-first Authors    †Corresponding Author
[Figure: Uni-Embodied pipeline overview]

This work introduces Uni-Embodied, the first comprehensive benchmark designed to evaluate vision-language models (VLMs) across the three key dimensions of embodied intelligence: planning, perception, and execution.

Abstract

Embodied intelligence is a core challenge in the pursuit of artificial general intelligence (AGI), requiring the seamless integration of planning, perception, and execution to enable agents to perform physical tasks effectively. While recent vision-language models (VLMs) have shown strong performance in isolated capabilities, their ability to jointly exhibit all three embodied skills remains unclear, impeding the development of unified embodied systems.

In this paper, we propose Uni-Embodied, the first comprehensive benchmark designed to evaluate VLMs across the three key dimensions of embodied intelligence: planning, perception, and execution. Our benchmark comprises nine diverse tasks, spanning complex and simple embodied planning, trajectory summarization, map understanding, affordance recognition, spatial pointing, manipulation analysis, and execution in both navigation and manipulation contexts.

Extensive experiments on leading open-source and closed-source VLMs demonstrate that current models struggle to achieve balanced performance across all three dimensions. Notably, we observe that enhancing planning and perception often compromises execution, while focusing on execution significantly degrades planning and perception capabilities—revealing fundamental limitations in existing approaches.

We further explore strategies such as chain-of-thought prompting and hybrid training to selectively improve specific embodied capabilities. These findings offer valuable insights for the development of more robust and unified embodied intelligence systems, critical for advancing real-world robotic applications.

Benchmark Definition

Overview of the Uni-Embodied benchmark, which comprises nine diverse tasks spanning complex and simple embodied planning, trajectory summarization, map understanding, affordance recognition, spatial pointing, manipulation analysis, and execution in both navigation and manipulation contexts.
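The nine tasks partition naturally into the three dimensions. Below is a minimal Python sketch of this taxonomy; the task identifiers and the exact grouping into dimensions are our illustrative reading of the list above, not names from the released package:

```python
# A minimal sketch of the Uni-Embodied task taxonomy. The grouping of the
# nine tasks into the three dimensions follows the list above; the task
# identifiers are illustrative, not names from the released package.
UNI_EMBODIED_TASKS = {
    "planning": [
        "complex_embodied_planning",   # combined navigation and manipulation
        "simple_embodied_planning",    # desktop manipulation only
    ],
    "perception": [
        "trajectory_summarization",
        "map_understanding",
        "affordance_recognition",
        "spatial_pointing",
        "manipulation_analysis",
    ],
    "execution": [
        "navigation_execution",
        "manipulation_execution",
    ],
}

# Sanity check: three dimensions, nine tasks in total.
assert sum(len(tasks) for tasks in UNI_EMBODIED_TASKS.values()) == 9
```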

Experimental Results


Planning Tasks

Performance comparison of VLMs on embodied planning tasks, including complex tasks (combined navigation and manipulation) and simple tasks (desktop manipulation only).

Perception Tasks

Performance comparison of VLMs on embodied perception tasks.

Execution Tasks

Performance comparison of models on the navigation and manipulation execution benchmarks.

Ablation Study

Ablation experiments on chain-of-thought enhancement and hybrid training.
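To make the chain-of-thought (CoT) enhancement concrete, the following is a hedged sketch of how a CoT-augmented planning prompt could be constructed. The template wording and the `build_planning_prompt` helper are illustrative assumptions, not the exact prompts used in our experiments:

```python
# A hedged sketch of chain-of-thought (CoT) prompt construction for an
# embodied planning query. Template wording is illustrative only.
def build_planning_prompt(instruction: str, use_cot: bool = True) -> str:
    prompt = (
        f"You are a household robot. Task: {instruction}\n"
        "Produce an ordered list of executable steps."
    )
    if use_cot:
        # CoT enhancement: ask the model to reason about involved objects,
        # their locations, and sub-goal ordering before emitting the plan.
        prompt += (
            "\nFirst, think step by step about which objects are involved,"
            " where they are located, and in what order sub-goals must be"
            " completed. Then output the final plan."
        )
    return prompt


if __name__ == "__main__":
    print(build_planning_prompt("prepare breakfast"))
```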

Dataset Samples

Planning

Complex planning task samples.

Navigation Trajectory Summarization

Navigation trajectory summarization samples.

Navigation Map Understanding

Navigation map understanding samples.

Manipulation Affordance Prediction

Manipulation affordance prediction samples.

Manipulation Trajectory Analysis

Manipulation trajectory analysis samples.


Supplementary Materials

Potential Applications

The Uni-Embodied dataset, benchmark, and findings are important for advancing practical robotics applications across multiple domains. In home robotics, a unified evaluation framework can guide the development of robots that seamlessly integrate planning ("prepare breakfast"), perception ("recognize kitchen items and spatial layout"), and execution ("navigate to appliances and manipulate utensils"). In warehouse automation, robots must coordinate complex multi-step tasks such as inventory management, which requires sophisticated planning to optimize routes, strong perception to identify and locate items, and precise execution to navigate and manipulate. The framework's comprehensive evaluation approach can also accelerate progress in medical robotics, promoting the development of assistive robots that understand complex care instructions, perceive patient needs and environmental context, and perform sophisticated assistive tasks. The benchmark shows that current VLMs struggle to excel in all three capabilities simultaneously, highlighting a key research direction for developing more powerful embodied AI systems that are critical to these real-world applications.

Reproducibility, Licensing, and Access

To ensure reproducibility and promote future research on embodied intelligence evaluation, we have made the full Uni-Embodied benchmark publicly available on Hugging Face. The released package contains all evaluation samples for the three core components (Planning, Perception, and Execution), together with the complete evaluation code. This open-source release enables researchers to reproduce our experimental results and conduct fair comparisons with their own methods using standardized evaluation procedures. All datasets are accompanied by clear question-answer pairs. The datasets and benchmarks are released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Researchers can access the full benchmark suite, contribute improvements, and extend the evaluation framework to new embodied intelligence tasks, thus promoting collaborative development in this key research area.
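For convenience, here is a minimal sketch of how the released benchmark could be loaded with the Hugging Face `datasets` library. The repository ID, split name, and field names below are placeholders; consult the release page for the actual values:

```python
# A minimal sketch of loading the benchmark with the Hugging Face
# `datasets` library. The repository ID, split name, and field names are
# placeholders (assumptions); check the release page for the real values.
from datasets import load_dataset

bench = load_dataset("BAAI/Uni-Embodied")    # hypothetical repository ID
sample = bench["test"][0]                    # hypothetical split name
print(sample["question"], sample["answer"])  # QA fields per the release notes
```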