Advances in AI research have produced a wave of machine learning and deep learning models capable of generating images from text prompts. These text-to-image tools represent a significant cultural shift: a new, democratized form of expression that can dramatically amplify the volume of imagery humans produce. Researchers at Google and OpenAI have developed text-to-image models but have not released them to the public because, like other automated systems, they carry risks of bias and misuse that these companies have yet to solve. DALL-E 2 by OpenAI and Imagen by Google are two of the most prominent models in this space.
Today, we will discuss which text-to-image model is better and more accurate at turning text into images.
OpenAI’s DALL-E 2
OpenAI launched DALL-E 2, an AI tool that can create realistic images and art from text. You can try different variations of a prompt to see what DALL-E 2 produces. For example, given the prompt “An astronaut playing basketball with cats in space, as a children’s book illustration,” DALL-E 2 produces the following image:
You can use a similar prompt with variations such as “in watercolors” or “in a minimalistic style.” OpenAI launched this new version with improved capabilities and restrictions to prevent abuse. It can turn a text prompt into an accurate image within seconds. The new version is better at its job: the images are larger and more detailed, generation is faster, and it can spin out more variations than its predecessor. DALL-E 2 has an invite-only test environment where developers can try it out in a controlled way, and every prompt is evaluated for policy violations.
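The style variations described above are just suffixes appended to a base prompt. As a rough sketch, here is how you might generate a batch of such prompt variants before submitting them to a text-to-image model. The helper function and modifier list are illustrative, not part of any official API:

```python
# Illustrative sketch: composing prompt variants by appending style modifiers,
# as you might do when experimenting with a text-to-image model.
# The function name and style list here are hypothetical.

def prompt_variants(base_prompt, styles):
    """Return one prompt per style, appended as a comma-separated suffix."""
    return [f"{base_prompt}, {style}" for style in styles]

base = "An astronaut playing basketball with cats in space"
styles = [
    "as a children's book illustration",
    "in watercolors",
    "in a minimalistic style",
]

for prompt in prompt_variants(base, styles):
    print(prompt)
```

Each resulting string would then be sent to the model as a separate prompt, letting you compare how the same scene renders in different styles.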
Google’s Imagen
Days after OpenAI launched DALL-E 2, Google introduced Imagen, a competitor that creates images and artwork using a similar method. Given a description, Imagen generates images based on how it interprets the text, combining different attributes, concepts, and styles. For example, given the text “a photo of a dog,” Google’s system creates an image that looks like a photograph of a dog, whereas changing the description to “an oil painting of a dog” yields an image that looks more like an oil painting. Like DALL-E 2, Imagen has not been released for public use because of the risks associated with biases in large language models.
Here is an image of how Imagen works:
DALL-E 2 vs. Imagen: Who Wins?
The original DALL-E used a 1.2-billion-parameter model, while DALL-E 2 runs on a 3.5-billion-parameter model, with an additional 1.5-billion-parameter model to enhance the resolution of its images. However, Imagen has surpassed DALL-E 2 and other text-to-image models thanks to T5-XXL, Google AI’s largest text encoder, which has 4.6 billion parameters.
Imagen’s larger text encoder is a key reason it can create better images: scaling up the text encoder has been shown to improve text-image alignment far more than scaling up the diffusion model, which mainly improves sample quality. Imagen also uses a diffusion technique called noise conditioning augmentation, which leads to better FID and CLIP scores (a lower FID indicates more realistic images, while a higher CLIP score indicates closer image-text alignment).
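The idea behind noise conditioning augmentation is to corrupt the low-resolution conditioning image with Gaussian noise at a randomly sampled level, and to pass that level to the super-resolution model as an extra input. A minimal NumPy sketch of the concept, with shapes, names, and the mixing scheme chosen for illustration rather than taken from Imagen’s exact formulation:

```python
import numpy as np

def noise_condition(low_res_image, rng):
    """Sketch of noise conditioning augmentation.

    Corrupts a low-resolution conditioning image with Gaussian noise at a
    randomly sampled augmentation level, and returns both the noisy image
    and the level, which the super-resolution model would receive as an
    extra conditioning input. Illustrative only, not Imagen's exact scheme.
    """
    # Sample an augmentation level in [0, 1): 0 means no corruption at all.
    aug_level = rng.uniform(0.0, 1.0)
    noise = rng.standard_normal(low_res_image.shape)
    # Variance-preserving mix of the clean image and pure Gaussian noise.
    noisy = np.sqrt(1.0 - aug_level**2) * low_res_image + aug_level * noise
    return noisy, aug_level

rng = np.random.default_rng(0)
low_res = rng.standard_normal((64, 64, 3))  # stand-in for a 64x64 RGB input
noisy, level = noise_condition(low_res, rng)
print(noisy.shape, round(level, 3))
```

Because the model is told how corrupted its conditioning input is, it learns to rely on the text more when the image signal is noisy, which is part of why this trick helps sample quality.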
Additionally, images created by DALL-E 2 lack realism. Here is what a research scientist at Google had to say about it:
Thomas Wolf, co-founder of Hugging Face, has also written in favor of Google’s text-to-image model. However, he noted that not releasing such models for public use has hindered research in the field, and he wants the datasets made public so there can be a collective effort to improve the models.
Ethical Challenges
Jeff Dean, Senior Vice President of Google AI, says he “sees AI as having the potential to foster creativity in human-computer collaboration.” However, given the ethical challenges of preventing misuse of this technology, neither Google nor OpenAI has released its model for public use. It remains unclear how they can safeguard the technology so that it is not used for unethical purposes.