Google AI has introduced Muse, a new text-to-image synthesis model that applies masked image modeling with generative transformers. Muse is trained on a masked modeling task in discrete token space: given the text embedding from a pre-trained large language model (LLM), it learns to predict randomly masked image tokens.
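To make the masked modeling task concrete, here is a minimal sketch of how one training example might be prepared: some image tokens are hidden, and the model would be asked to recover them. The `MASK_ID` value, the `mask_tokens` helper, and the cosine-shaped draw of the masking rate are assumptions for illustration, not the authors' code.

```python
import math
import random

MASK_ID = -1  # hypothetical id standing in for a special [MASK] token


def mask_tokens(tokens, rng):
    """Hide a random subset of image tokens and return (inputs, targets).

    The masking rate is drawn as cos(pi/2 * u) with u uniform in [0, 1),
    an assumed schedule that favors high masking rates.
    """
    rate = math.cos(math.pi / 2 * rng.random())       # value in (0, 1]
    n_mask = max(1, round(rate * len(tokens)))         # hide at least one token
    hidden = set(rng.sample(range(len(tokens)), n_mask))
    # The model sees MASK_ID at hidden positions and real tokens elsewhere.
    inputs = [MASK_ID if i in hidden else t for i, t in enumerate(tokens)]
    # The loss would be computed only on the hidden positions.
    targets = {i: tokens[i] for i in hidden}
    return inputs, targets


if __name__ == "__main__":
    image_tokens = list(range(100, 116))  # toy stand-in for 16 tokenizer ids
    inputs, targets = mask_tokens(image_tokens, random.Random(0))
    print(inputs)
    print(targets)
```

In a real system the inputs would be fed to the transformer together with the text embedding, and the targets would drive a cross-entropy loss.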
Generative image models have advanced significantly over the past few years because of novel training methods and improved deep learning architectures. As a result, many image generation models like DALL-E 2, Midjourney, and Stable Diffusion have been developed. But with Muse, Google takes the technology a step further.
Muse comprises several sub-models: a VQGAN tokenizer that encodes images into discrete tokens and decodes tokens back into images, a base masked image model that predicts the marginal distribution of masked tokens, and a super-resolution ("superres") transformer that translates low-resolution tokens into high-resolution ones. Both transformers are conditioned on T5-XXL text embeddings.
According to the researchers, Muse is more efficient than pixel-space diffusion models like Imagen and DALL-E 2 because it works with discrete tokens and needs fewer sampling iterations. By iteratively resampling image tokens conditioned on a text prompt, the model also supports zero-shot, mask-free editing without additional fine-tuning.
The researchers trained multiple Muse models ranging in size from 632M to 3B parameters. Muse uses parallel decoding, predicting many image tokens in each step instead of generating them one at a time. As a result, it is more efficient than autoregressive models such as Parti. The researchers also claim that Muse is approximately 10 times faster at inference than the Imagen 3B or Parti 3B models.
In the PartiPrompts evaluation, Muse's images were judged to be better aligned with the text prompt 2.7 times as often as those from Stable Diffusion, which the researchers attribute to its handling of nouns, adjectives, verbs, and other parts of speech in the prompt.
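The efficiency gain from parallel decoding can be illustrated with a small sketch. Every masked slot gets a proposal in each pass, the most confident proposals are committed, and the rest are re-predicted in the next pass. The `toy_predict` function is a random stand-in for the transformer, and the cosine schedule, the 256-token grid, the 8 steps, and the codebook size of 8192 are illustrative assumptions rather than details from the article.

```python
import math
import random

MASK = None          # marker for a slot that has not been decoded yet
CODEBOOK_SIZE = 8192  # assumed number of distinct image tokens


def toy_predict(tokens, rng):
    """Random stand-in for the model: (token, confidence) per masked slot."""
    return {
        i: (rng.randrange(CODEBOOK_SIZE), rng.random())
        for i, t in enumerate(tokens)
        if t is MASK
    }


def parallel_decode(num_tokens, num_steps, rng):
    """Fill all slots in num_steps passes instead of num_tokens passes."""
    tokens = [MASK] * num_tokens
    for step in range(1, num_steps + 1):
        proposals = toy_predict(tokens, rng)  # one forward pass for all slots
        # Assumed cosine schedule: how many slots stay masked after this step.
        keep_masked = math.floor(
            num_tokens * math.cos(math.pi / 2 * step / num_steps)
        )
        n_commit = len(proposals) - keep_masked
        # Commit the most confident proposals; the rest are retried next step.
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = proposals[i][0]
    return tokens


if __name__ == "__main__":
    result = parallel_decode(num_tokens=256, num_steps=8, rng=random.Random(0))
    print(sum(t is not MASK for t in result), "tokens decoded in 8 passes")
```

An autoregressive decoder would need one pass per token (256 here), whereas this loop needs only as many passes as there are steps, which is the intuition behind the reported speedup.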
For more information, refer to the paper.