A new machine learning model, UniSim, has been developed by Google DeepMind in collaboration with UC Berkeley, MIT, and the University of Alberta. The model is designed to generate highly realistic simulations for training diverse AI systems.
UniSim is a generative model that simulates how humans and agents interact with their environment. Given a high-level instruction such as “move the pen” or “close the door,” it can render the various visual outcomes that would follow.
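To make the idea concrete, here is a minimal sketch, in Python, of the kind of interface such an action-conditioned video model exposes. The class and method names here are illustrative assumptions, not UniSim’s released API; a real model would run a video-generation pass conditioned on the current frame and the instruction text.

```python
# Hypothetical interface for an action-conditioned video simulator.
# Names and shapes are assumptions for illustration, not UniSim's API.
from dataclasses import dataclass
import numpy as np

@dataclass
class SimulationResult:
    frames: np.ndarray  # (num_frames, height, width, 3) predicted video clip

class ActionConditionedSimulator:
    """Predicts future video frames given an observation and a text action."""

    def simulate(self, observation: np.ndarray, instruction: str,
                 num_frames: int = 16) -> SimulationResult:
        # A real model would generate frames conditioned on the observation
        # and the instruction; this placeholder just repeats the input frame.
        frames = np.repeat(observation[None], num_frames, axis=0)
        return SimulationResult(frames=frames)

# Ask the simulator what "move the pen" would look like from here.
sim = ActionConditionedSimulator()
obs = np.zeros((256, 256, 3), dtype=np.uint8)  # placeholder camera frame
clip = sim.simulate(obs, "move the pen")
print(clip.frames.shape)  # (16, 256, 256, 3)
```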
UniSim is trained on data from simulation engines, real-world robot observations, videos of human activity, and paired images and text descriptions.
Once trained, UniSim can produce diverse photorealistic videos, simulate navigation through environments, and run long-horizon simulations, such as a series of consecutive actions performed by a robot hand.
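Long-horizon behavior can be sketched as an autoregressive rollout: each action’s simulated clip ends in a frame that seeds the next action. The loop below reuses the hypothetical ActionConditionedSimulator from the earlier sketch; it is an assumption about how such chaining could work, not UniSim’s actual rollout code.

```python
# Chain single-action simulations into a long-horizon rollout.
# Reuses the hypothetical ActionConditionedSimulator sketched above.
import numpy as np

def rollout(sim, initial_obs: np.ndarray, instructions: list[str]) -> list[np.ndarray]:
    clips = []
    obs = initial_obs
    for instruction in instructions:
        result = sim.simulate(obs, instruction)
        clips.append(result.frames)
        obs = result.frames[-1]  # last predicted frame conditions the next step
    return clips

# A series of consecutive robot-hand actions:
clips = rollout(sim, obs, ["pick up the pen", "move the pen", "close the drawer"])
```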
UniSim has numerous potential uses, ranging from generating controllable content for video games and films to training embodied agents entirely in simulated environments for direct real-world deployment. UniSim could also accelerate progress in vision-language models (VLMs), such as DeepMind’s recent RT-X models.
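As a rough illustration of the embodied-agent use case, the loop below treats the learned simulator as a stand-in environment for policy training. The RandomPolicy and zeroed reward are placeholder assumptions; a real setup would use a learned policy and a reward model that scores the generated clips.

```python
# Hedged sketch: training a policy entirely inside a learned simulator.
# Policy, reward, and loop structure are illustrative assumptions.
import random
import numpy as np

class RandomPolicy:
    """Placeholder policy that picks a text action at random."""
    def __init__(self, actions: list[str]):
        self.actions = actions

    def act(self, observation: np.ndarray) -> str:
        return random.choice(self.actions)

    def update(self, observation, action, reward) -> None:
        pass  # a real policy would take a gradient step here

def train_in_simulation(sim, policy, initial_obs: np.ndarray, steps: int = 100):
    obs = initial_obs
    for _ in range(steps):
        action = policy.act(obs)
        result = sim.simulate(obs, action)
        reward = 0.0  # a learned reward model would score the clip here
        policy.update(obs, action, reward)
        obs = result.frames[-1]

train_in_simulation(sim, RandomPolicy(["move the pen", "close the door"]), obs)
```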
However, the researchers at Google DeepMind and the partner universities note that UniSim still faces a key challenge: the “sim-to-real gap,” the mismatch between what an agent learns in simulation and the conditions it encounters in the real world. The model’s exceptional visual quality can help narrow this gap, but it does not eliminate it.