MIT develops faster, smarter way to train robots for any task

Researchers filmed multiple instances of a robotic arm feeding co-author Jialiang Zhao's adorable dog, Momo. The videos were included in datasets to train the robot. Credit: MIT.

In The Jetsons, Rosie the robot could seamlessly switch between cooking, cleaning, and taking out the trash.

But in reality, teaching robots to handle a variety of tasks has been tough. Robots are usually trained for a single task in a controlled environment, which requires extensive data collection and time.

If a robot faces an unfamiliar task or setting, it often struggles to adapt.

MIT researchers are tackling this challenge with a new method that trains robots faster and for a wider range of tasks by combining diverse data from many sources.

Rather than requiring fresh task-specific data for every new job, the new technique, called Heterogeneous Pretrained Transformers (HPT), pools data from different domains, like simulations and real robots, and from various sensor types, such as vision and motion sensors.

These data are then processed into a shared “language” for the robot, allowing it to quickly learn new tasks.
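To make the idea concrete, here is a minimal sketch, assuming a PyTorch-style setup, of how inputs from very different sensors might be mapped into one shared token space. The class names, layer sizes, and encoders are illustrative assumptions, not the team’s actual code.

```python
import torch
import torch.nn as nn

D_MODEL = 256  # width of the shared token space (illustrative)

class ImageStem(nn.Module):
    """Maps a camera image to a short sequence of tokens."""
    def __init__(self):
        super().__init__()
        # A single patchify convolution stands in for a real vision encoder.
        self.conv = nn.Conv2d(3, D_MODEL, kernel_size=16, stride=16)

    def forward(self, img):                      # img: (batch, 3, 64, 64)
        feats = self.conv(img)                   # (batch, D_MODEL, 4, 4)
        return feats.flatten(2).transpose(1, 2)  # (batch, 16, D_MODEL)

class ProprioStem(nn.Module):
    """Maps joint readings to a single token."""
    def __init__(self, num_joints):
        super().__init__()
        self.proj = nn.Linear(num_joints, D_MODEL)

    def forward(self, joints):                   # joints: (batch, num_joints)
        return self.proj(joints).unsqueeze(1)    # (batch, 1, D_MODEL)

# Different sensors, one shared "language" of tokens:
img_tokens = ImageStem()(torch.randn(2, 3, 64, 64))
prop_tokens = ProprioStem(num_joints=7)(torch.randn(2, 7))
sequence = torch.cat([img_tokens, prop_tokens], dim=1)
print(sequence.shape)  # torch.Size([2, 17, 256])
```

Once every sensor speaks in tokens of the same width, data from very different robots and cameras can be concatenated into one sequence and fed to a single model.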

By combining so much varied information, this approach significantly reduces the amount of task-specific data needed to teach a robot. In tests, HPT improved robot performance by over 20% compared to training from scratch.

Lirui Wang, an MIT graduate student and lead author of the study, explains that a big issue in robotics isn’t just a lack of data but also the diversity of data types and robot designs.

Their approach uses a machine-learning model called a transformer, the same type used in large language models like GPT-4, to align these different data forms into a unified system. This lets the model draw knowledge from a range of sources, which helps robots perform well across various tasks.

HPT starts with a vast pool of data, including videos of human demonstrations and simulated robot actions. In all, the data include more than 200,000 robot trajectories spanning many categories.
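One plausible way to pool such varied sources during pretraining is to mix domains within every training batch. The toy sketch below uses made-up dataset names purely to illustrate that idea.

```python
import random

# Made-up pools standing in for human videos, simulated rollouts,
# and real teleoperated robot trajectories.
sources = {
    "human_video": [f"human_clip_{i}" for i in range(100)],
    "simulation":  [f"sim_rollout_{i}" for i in range(100)],
    "real_robot":  [f"teleop_traj_{i}" for i in range(100)],
}

def sample_batch(batch_size=4):
    """Mix domains in each batch so every gradient step
    sees several kinds of data at once."""
    domains = random.choices(list(sources), k=batch_size)
    return [(d, random.choice(sources[d])) for d in domains]

print(sample_batch())
```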

The model combines visual inputs with proprioceptive inputs (a robot’s sense of its own position and movement) into a unified representation built from units called tokens. The transformer then processes these tokens, making it easier for robots to perform complex, dexterous tasks.
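Continuing the earlier sketch, the tokens from all modalities could flow through a shared transformer “trunk,” with a small head turning its output into motor commands. Again, the sizes and architecture below are assumptions for illustration, not the published model.

```python
import torch
import torch.nn as nn

D_MODEL, N_ACTIONS = 256, 7  # illustrative sizes

# Shared trunk: a standard transformer encoder over the mixed token sequence.
trunk = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True),
    num_layers=4,
)
# Small head mapping trunk features to motor commands for one robot.
head = nn.Linear(D_MODEL, N_ACTIONS)

tokens = torch.randn(2, 17, D_MODEL)  # e.g. 16 image tokens + 1 proprio token
features = trunk(tokens)              # (2, 17, D_MODEL)
actions = head(features.mean(dim=1))  # pool tokens, predict joint commands
print(actions.shape)                  # torch.Size([2, 7])
```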

With HPT, a user only needs to provide minimal data about their robot and the desired task. The model then transfers its vast pre-learned knowledge to tackle the new task more efficiently.
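In practice, that transfer step might look like standard fine-tuning: keep the pretrained trunk frozen and fit only a small head on the user’s handful of demonstrations. The sketch below assumes the same illustrative shapes as before and random tensors in place of real data.

```python
import torch
import torch.nn as nn

# Same illustrative shapes as the earlier sketches.
D_MODEL, N_ACTIONS = 256, 7
trunk = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True),
    num_layers=4,
)  # pretend this trunk carries pretrained weights
head = nn.Linear(D_MODEL, N_ACTIONS)  # fresh head for the new robot and task

# Freeze the trunk; only the small head adapts to the user's data.
for p in trunk.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

# A few demonstrations stand in for the user's minimal task data.
tokens = torch.randn(8, 17, D_MODEL)        # tokenized observations
target_actions = torch.randn(8, N_ACTIONS)  # demonstrated motor commands

for step in range(100):
    pred = head(trunk(tokens).mean(dim=1))
    loss = nn.functional.mse_loss(pred, target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because only the head’s relatively few parameters are trained, a small number of demonstrations can go a long way.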

The MIT team hopes this approach will eventually lead to a universal “robot brain” that could be downloaded and used on any robot without needing new training. They’re optimistic that scaling up this method, as happened with language models, could lead to a big breakthrough in robot training and flexibility.