Last month, Meta AI released CM3Leon, a multimodal model that performs both text-to-image and image-to-text generation. CM3Leon can also follow instructions to edit existing images.
Yesterday, Meta took to Twitter to reiterate that the model achieves state-of-the-art performance for text-to-image generation despite being trained with five times less compute than previous transformer-based methods.
As a transformer model, CM3Leon uses a mechanism called "attention" to weigh the relevance of input data such as text or images. Attention and other architectural features of transformers speed up training and make models easier to parallelize, which in turn makes it practical to train larger transformers with significant but manageable increases in compute.
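To make the idea concrete, here is a minimal sketch of scaled dot-product attention, the core operation behind the "attention" mechanism described above. This is a generic, illustrative implementation in NumPy, not Meta's actual code; the function name and toy dimensions are my own choices.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to all keys; the output is a weighted sum of
    value vectors, with weights summing to 1 for each query."""
    d_k = Q.shape[-1]
    # Similarity of each query to each key, scaled to stabilize gradients
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key dimension (numerically stable form)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy example: 2 query tokens, 3 key/value tokens, hidden dimension 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # one output vector per query: (2, 4)
```

Because every query-key pair is computed independently, the matrix multiplications above parallelize naturally on GPUs, which is the property the paragraph alludes to.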
In its blog post, the company describes CM3Leon as a first-of-its-kind multimodal model, combining the versatility and effectiveness of autoregressive models with low training cost and efficient inference.
Meta AI says it trained CM3Leon on a dataset of millions of licensed images from Shutterstock. The most capable of the several versions Meta has built has 7 billion parameters, more than twice as many as DALL-E 2.