Ruben Villegas and colleagues at Google unveiled Phenaki, a system that generates videos from story-like text prompts. Few datasets are suitable for text-to-video generation, but text-image pairs are plentiful, and Google has used such pairs to build text-to-image frameworks like Imagen.
Phenaki, a text-to-video generator, learns to produce short videos by treating captioned images as single-frame videos and combining them with a dataset of short captioned videos.
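The idea of treating images as single-frame videos can be sketched as follows. This is only an illustration under stated assumptions: the function names `image_as_video` and `build_training_set` are hypothetical, and the toy "frames" are plain strings rather than pixel arrays.

```python
import random

def image_as_video(image):
    """Wrap one image as a video that contains a single frame."""
    return [image]

def build_training_set(captioned_images, captioned_videos, seed=0):
    """Merge (image, caption) and (video, caption) pairs into one shuffled list.

    Every example in the result is a (list_of_frames, caption) pair, so a
    model can consume images and videos through the same interface.
    """
    examples = [(image_as_video(img), cap) for img, cap in captioned_images]
    examples += [(list(frames), cap) for frames, cap in captioned_videos]
    random.Random(seed).shuffle(examples)
    return examples

# Toy data: each "frame" is just a string label (an assumption for brevity).
images = [("img_cat", "a cat"), ("img_dog", "a dog")]
videos = [(["v0_f0", "v0_f1", "v0_f2"], "a bird flying")]
dataset = build_training_set(images, videos)
```

After merging, every example has the same shape (a list of frames plus a caption), which is what lets plentiful image data supplement scarce video data.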
Phenaki has four main components: an encoder that produces video embeddings, a language model that produces text embeddings, a MaskGIT bidirectional transformer, and a decoder.
The C-ViViT encoder/decoder is trained on a dataset of videos less than three seconds long. The encoder learns to split frames into non-overlapping patches and represent each patch as a vector. The decoder learns to convert embeddings back into pixels.
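As a rough illustration of the patch-splitting step only, the sketch below cuts a frame into non-overlapping square patches and flattens each one into a vector. It is not the C-ViViT encoder: the real model learns its embeddings, while this hypothetical `split_into_patches` helper just rearranges raw pixel values.

```python
def split_into_patches(frame, patch_size):
    """Split a 2D frame (a list of rows) into non-overlapping square patches.

    Returns a list of flat vectors, one per patch, in row-major patch order.
    Height and width must be divisible by patch_size.
    """
    height, width = len(frame), len(frame[0])
    if height % patch_size or width % patch_size:
        raise ValueError("frame dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch_size x patch_size block into one vector.
            vector = [
                frame[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(vector)
    return patches

# A 4x4 grayscale frame whose pixel values are 0..15 in row-major order.
frame = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(frame, patch_size=2)
# patches[0] is the top-left block: [0, 1, 4, 5]
```

Because the patches do not overlap, every pixel lands in exactly one vector, so the patch sequence is a lossless rearrangement of the frame.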
Phenaki uses the T5X language model to produce text embeddings. At inference, MaskGIT takes the text embeddings and a set of masked video embeddings, generates the masked embeddings, and then re-masks a portion of them to be generated again in subsequent iterations.
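The generate-then-re-mask loop can be sketched with a toy stand-in for the transformer. Everything here is an assumption made for illustration: `toy_predict` returns random tokens and confidences instead of real model outputs, the vocabulary size of 1024 is arbitrary, and the cosine schedule is one plausible choice for how many tokens stay masked after each step.

```python
import math
import random

MASK = None  # placeholder for a token that has not been generated yet

def toy_predict(tokens, text_embedding, rng):
    """Hypothetical stand-in for the transformer.

    Returns a (token, confidence) guess for every position. A real model
    would condition on the unmasked tokens and on the text embedding;
    this toy version ignores both and guesses at random.
    """
    return [(rng.randrange(1024), rng.random()) for _ in tokens]

def iterative_decode(num_tokens, text_embedding, steps=4, seed=0):
    """Fill all masked tokens over several steps, most confident first."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    masked_per_step = []
    for step in range(1, steps + 1):
        masked = [i for i, tok in enumerate(tokens) if tok is MASK]
        if not masked:
            break
        predictions = toy_predict(tokens, text_embedding, rng)
        # Assumed cosine schedule: how many tokens remain masked after this step.
        target = math.floor(num_tokens * math.cos(math.pi / 2 * step / steps))
        if step == steps:
            target = 0  # the final step must fill everything
        target = max(0, min(target, len(masked) - 1))
        # Keep the most confident predictions; the rest stay masked
        # (equivalent to re-masking them) and are predicted again next step.
        by_confidence = sorted(masked, key=lambda i: predictions[i][1], reverse=True)
        for i in by_confidence[: len(masked) - target]:
            tokens[i] = predictions[i][0]
        masked_per_step.append(target)
    return tokens, masked_per_step

tokens, history = iterative_decode(16, text_embedding=[0.0], steps=4)
# history records how many tokens were still masked after each step.
```

The point of the design is that each pass fills many positions in parallel, so the number of model calls is the number of steps rather than the number of tokens.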
To create minute-long videos, the authors applied MaskGIT and C-ViViT repeatedly. They first generated a short video from a single sentence, then encoded its final k frames. They used those video embeddings together with the next piece of text to generate further video frames.
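The extension procedure amounts to a loop that conditions each new clip on the tail of the video so far. The sketch below shows only that control flow. `generate_clip` is a hypothetical stand-in for the MaskGIT and C-ViViT steps, the clip length and value of k are arbitrary, and the toy frames are tuples rather than images.

```python
def generate_clip(prompt, context_frames, clip_length):
    """Hypothetical stand-in for MaskGIT generation plus C-ViViT decoding.

    Returns clip_length new frames. Each toy frame records the prompt it
    came from, how many context frames it was conditioned on, and its index.
    """
    return [(prompt, len(context_frames), i) for i in range(clip_length)]

def generate_long_video(prompts, clip_length=5, k=2):
    """Extend a video one prompt at a time, conditioning on the last k frames."""
    video = []
    for prompt in prompts:
        # The first clip has no context; later clips see the final k frames.
        context = video[-k:] if k > 0 else []
        video.extend(generate_clip(prompt, context, clip_length))
    return video

story = [
    "a teddy bear swims",
    "the bear walks on the beach",
    "the bear sits by a fire",
]
video = generate_long_video(story, clip_length=5, k=2)
```

In the system described above, the context frames would first be encoded into video embeddings before being passed to the generator; the toy version passes the frames directly to keep the loop visible.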
Unlike Make-A-Video, which uses several diffusion models to generate short videos and then upscale their resolution, Phenaki bootstraps its own generated frames to extend the length and narrative complexity of its output.