OpenAI created a multimodal generative neural model — DALL·E — that can create images from text prompts given as input. This neural network uses the 12-billion GPT-3 parameter version trained to generate images from text descriptions using a text-image pair dataset.
They put their experience of building GPT-3 and ImageGPT to show that manipulating visual concepts through language is now within reach. The researchers demonstrated that language can instruct a large neural network to perform various text generation tasks and generate high fidelity images.
The samples released for each caption are the top 32 of 512 images after reranking with CLIP. This procedure can be seen as a kind of language-guided search and dramatically impacts sample quality.
The demos that are released by the researchers showcase images of imaginary objects with both modified and preserved attributes. The model understands the three-dimensional visualization of items with their internal and external structure and can infer contextual details independently. However, the model goes beyond and also shows zero-shot visual reasoning, geographic knowledge, and temporal knowledge.
The network architecture uses transformers with simple decoder-only components. In essence, it is a language model that receives both the text and the image as a single stream of data containing up to 1280 tokens. It uses maximum likelihood training to generate all of the tokens.
They use an attention mask at each of its 64 self-attention layers allowing each image token to attend to all text tokens. OpenAI’s DALL·E uses the standard causal mask for the text tokens and sparse attention for the image tokens with either a row, column, or convolutional attention pattern, depending on the layer. The researchers will be publishing a paper soon detailing other elements of DALL·E.
There are always some trade-offs involved. Here are some granular points that were pointed out by the researchers: –
- The phrase determines the success rate
- OpenAI DALL·E confuses the associations between the objects and their colors when more items are present
- DALL·E can draw multiple copies but is unable to count past three reliably
You try creating images from text here.