Meta has decided not to release the model code, similar to its past models.

Meta AI has unveiled a multimodal model called “CM3leon” (pronounced like chameleon) that performs both text-to-image and image-to-text generation. As with its past models, Meta has decided not to release the model code, which has disappointed AI enthusiasts.

With its capabilities for text-guided image generation and editing, the model bridges the gap between text and image. It could change how consumers interact with and manipulate visual material.

Meta claims that the model generates more coherent imagery that closely matches input prompts. According to the company, CM3leon is also more efficient than earlier transformer-based techniques, needing five times less processing power and less training data.

Meta highlights CM3leon’s abilities in a range of vision-language tasks, such as visual question answering and long-form captioning. Meta’s approach replaces the diffusion method, which is frequently used in image generation.

The company’s researchers instead chose the Transformer architecture, a neural network design well known for its successful application in large language models like OpenAI’s GPT-4. Although earlier transformer-based image generators such as StyleSwin exist, Meta claims that CM3leon outperforms its rivals in terms of efficiency.

Although the model is praised as cutting-edge, the research community is miffed that it is closed source. Similarly, the tech giant debuted Voicebox last month but withheld the model from the general public due to concerns over potential abuse.

“While we believe it is important to be open with the AI community and to share our research to advance AI, it’s also necessary to strike the right balance between openness with responsibility,” Meta said upon releasing the paper.

