Vision-Language-Action Models for Robotics:
A Review Towards Real-World Applications

2025

Overview

Figure: Overview of the Vision-Language-Action Models Survey, showing the transition from separate vision, language, and action models to unified VLA architectures for robotics applications

Amid growing efforts to leverage advances in large language models (LLMs) and vision-language models (VLMs) for robotics, Vision-Language-Action (VLA) models have recently gained significant attention. By unifying at scale the vision, language, and action modalities that have traditionally been studied separately, VLA models aim to learn policies that generalise across diverse tasks, objects, embodiments, and environments. This generalisation capability is expected to enable robots to solve novel downstream tasks with minimal or no additional task-specific data, facilitating more flexible and scalable real-world deployment. To better understand the foundations and progress towards this goal, this paper provides a systematic review of VLAs, covering their strategic and architectural transitions, architectures and building blocks, modality-specific processing techniques, and learning paradigms. In addition, to support the deployment of VLAs in real-world robotic applications, we also review commonly used robot platforms, data collection strategies, publicly available datasets, data augmentation methods, and evaluation benchmarks. Through this comprehensive survey, we aim to offer practical guidance for the robotics community in applying VLAs to real-world robotic systems.

Figure: Modality, Architecture, and Learning Paradigm of VLA (modality-specific processing, architecture design, and learning paradigms, including vision encoders, language models, and action decoders)
Figure: Robot, Data Collection, Dataset, and Evaluation for VLA (robot platforms, data collection methods, datasets, and evaluation benchmarks)

Interactive Survey Table

Explore our comprehensive database of VLA models. Use the filters below to search by category, task type, modality, or robot platform. Click on column headers to sort.

Table columns: Category, Abbreviation, Title, Conference, Paper, Website, Task Domain, Robot, Training, Evaluation, Modality, Dataset, Backbone, Action Gen
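For readers who prefer to query the survey database offline, the same filtering and sorting can be reproduced programmatically. The sketch below is only an illustration under stated assumptions: it assumes a local CSV export (here hypothetically named vla_survey.csv) whose column names match the table above; the file name and the example filter values are not part of the released site.

    # Minimal sketch: querying the survey table offline with pandas.
    # Assumes a local CSV export named "vla_survey.csv" whose columns match
    # the table above; the file name and filter values are hypothetical.
    import pandas as pd

    df = pd.read_csv("vla_survey.csv")

    # Example query: manipulation-domain entries with a diffusion-based
    # action generator, sorted by conference.
    subset = df[
        df["Task Domain"].str.contains("manipulation", case=False, na=False)
        & df["Action Gen"].str.contains("diffusion", case=False, na=False)
    ].sort_values("Conference")

    print(subset[["Abbreviation", "Title", "Robot", "Backbone"]].to_string(index=False))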

BibTeX

@misc{vla-survey2025,
    title={Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications},
    author={Kento Kawaharazuka and Jihoon Oh and Jun Yamada and Ingmar Posner and Yuke Zhu},
    year={2025},
    howpublished={\url{https://vla-survey.github.io}}
}

Contact

If you have any questions or suggestions, please feel free to contact Kento Kawaharazuka.