New framework helps robots turn complex language into precise 3D actions

Over the past few decades, roboticists worldwide have introduced increasingly advanced robots that can understand human instructions, move through their surroundings and reliably complete basic manual tasks. While they perform well in some scenarios, many of these robots still struggle to translate users' instructions into the precise, executable actions needed to complete the desired tasks.

Recently, computer scientists have been trying to improve how robots respond to user commands or queries using vision-language models (VLMs), artificial intelligence (AI) systems trained to process both images and text. These models can typically interpret basic requests such as “place the bottle onto the plate,” yet they often lack the spatial reasoning capabilities required to interpret more elaborate instructions and translate them into executable actions in real-world settings.

Researchers at the Chinese University of Hong Kong, the Zhejiang Humanoid Robot Innovation Center Co. Ltd and other institutes recently introduced Retrieval-Augmented Manipulation (RAM), a framework that could improve the ability of robots to connect abstract instructions with three-dimensional (3D) representations of the space around them. The new framework, presented in a Science Robotics paper, was found to improve the spatial reasoning capabilities of robots, allowing them to reliably follow more elaborate instructions without requiring task-specific training.
