Ruben Villegas and fellow researchers at Google have unveiled Phenaki, a system that generates videos from story-like descriptions given as text prompts. Only a few datasets are suitable for text-to-video generation, but text-image pairs exist in abundance, and Google has already used such data to build text-to-image frameworks like Imagen.
Phenaki takes advantage of this by treating images as single-frame videos and combining them with a dataset of captioned short videos to train its text-to-video generator.
Phenaki is built from a few main components: an encoder that produces video embeddings, a language model that produces text embeddings, a MaskGIT bidirectional transformer, and a decoder.
The system trains the C-ViViT encoder/decoder on a dataset of videos shorter than three seconds. The encoder learns to split frames into non-overlapping patches and represent each patch as a vector, while the decoder learns to convert those embeddings back into pixels.
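To make the idea concrete, here is a minimal Python sketch (not the actual C-ViViT code) of how frames can be split into non-overlapping patches that each become one vector, with a decoder-like inverse mapping those vectors back to pixels. The patch size and tensor shapes are illustrative assumptions.

```python
# Illustrative sketch of patch-based video tokenization, not Phenaki's code.
import numpy as np

PATCH = 8  # hypothetical patch size in pixels

def frames_to_patch_vectors(frames: np.ndarray) -> np.ndarray:
    """Split each frame into non-overlapping PATCH x PATCH patches and
    flatten every patch into a single vector (one token per patch)."""
    t, h, w, c = frames.shape
    assert h % PATCH == 0 and w % PATCH == 0
    patches = frames.reshape(t, h // PATCH, PATCH, w // PATCH, PATCH, c)
    patches = patches.transpose(0, 1, 3, 2, 4, 5)  # (t, gh, gw, PATCH, PATCH, c)
    return patches.reshape(t, (h // PATCH) * (w // PATCH), PATCH * PATCH * c)

def patch_vectors_to_frames(tokens: np.ndarray, h: int, w: int, c: int) -> np.ndarray:
    """Inverse operation: reassemble patch vectors into full frames, standing in
    for the learned decoder that maps embeddings back to pixels."""
    t = tokens.shape[0]
    gh, gw = h // PATCH, w // PATCH
    patches = tokens.reshape(t, gh, gw, PATCH, PATCH, c)
    return patches.transpose(0, 1, 3, 2, 4, 5).reshape(t, h, w, c)

# Round-trip check on a random 2-second clip at 8 fps, 64x64 RGB.
video = np.random.rand(16, 64, 64, 3).astype(np.float32)
tokens = frames_to_patch_vectors(video)                  # (16, 64, 192)
assert np.allclose(patch_vectors_to_frames(tokens, 64, 64, 3), video)
```

In the real model these patch vectors are produced and consumed by learned transformer layers rather than a simple reshape, but the token layout is the same.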
Phenaki uses the T5X language model to produce text embeddings. At inference, MaskGIT takes a set of masked video embeddings together with the text embeddings, predicts the masked entries, and then re-masks a portion of them to be regenerated in subsequent iterations.
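The toy sketch below illustrates this kind of MaskGIT-style iterative decoding loop; the `predict_tokens` function, codebook size, token count, and schedule are stand-in assumptions rather than Phenaki's real implementation.

```python
# Hedged sketch of MaskGIT-style iterative decoding: all video token positions
# start masked, the transformer fills them in conditioned on the text embedding,
# and the least confident predictions are re-masked for the next iteration.
import numpy as np

MASK_ID = -1
VOCAB = 1024          # assumed codebook size
NUM_TOKENS = 256      # assumed number of video tokens in a short clip
STEPS = 8             # assumed number of refinement iterations

rng = np.random.default_rng(0)

def predict_tokens(tokens, text_embedding):
    """Placeholder for the bidirectional transformer: returns a predicted token
    and a confidence score for every position (random here, learned in Phenaki)."""
    preds = rng.integers(0, VOCAB, size=tokens.shape)
    confidence = rng.random(tokens.shape)
    return preds, confidence

def maskgit_decode(text_embedding):
    tokens = np.full(NUM_TOKENS, MASK_ID)                # start fully masked
    for step in range(1, STEPS + 1):
        preds, conf = predict_tokens(tokens, text_embedding)
        # Cosine schedule: keep more tokens fixed as the iterations progress.
        keep = int(np.ceil(NUM_TOKENS * (1 - np.cos(np.pi * step / (2 * STEPS)))))
        still_masked = np.where(tokens == MASK_ID)[0]
        # Accept the most confident predictions among currently masked slots...
        order = still_masked[np.argsort(-conf[still_masked])]
        accept = order[: max(keep - (NUM_TOKENS - len(still_masked)), 0)]
        tokens[accept] = preds[accept]
        # ...everything else stays masked and is re-predicted next iteration.
    return tokens

video_tokens = maskgit_decode(text_embedding=np.zeros(512))
print("unmasked tokens:", np.sum(video_tokens != MASK_ID), "of", NUM_TOKENS)
```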
To create minute-long videos, the authors apply MaskGIT and C-ViViT repeatedly. They first generate a short clip from a single sentence, then encode its final k frames and append the next text prompt after those video embeddings to generate additional frames.
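A rough sketch of that chaining loop might look like the following, where `generate_clip` and `encode_frames` are hypothetical stand-ins for the MaskGIT and C-ViViT machinery, and `K` is an assumed number of carried-over frames.

```python
# Hedged sketch of chaining prompts into a longer video: generate a short clip,
# encode its last K frames, and condition the next step on those embeddings
# plus the next sentence. The helpers below are illustrative stand-ins.
import numpy as np

K = 5  # assumed number of overlap frames carried between steps

def encode_frames(frames):
    """Stand-in for the C-ViViT encoder: one embedding value per frame here."""
    return frames.mean(axis=(1, 2, 3))

def generate_clip(prompt, context_embeddings=None, num_frames=24):
    """Stand-in for MaskGIT + C-ViViT decoding conditioned on the text prompt
    and, optionally, the embeddings of the previous clip's final frames."""
    return np.random.rand(num_frames, 64, 64, 3).astype(np.float32)

def generate_long_video(prompts):
    clips, context = [], None
    for prompt in prompts:
        clip = generate_clip(prompt, context_embeddings=context)
        clips.append(clip)
        context = encode_frames(clip[-K:])  # last K frames seed the next step
    return np.concatenate(clips, axis=0)

story = [
    "A teddy bear dives into the ocean.",
    "The teddy bear swims underwater with colorful fish.",
    "The teddy bear walks out onto the beach at sunset.",
]
long_video = generate_long_video(story)
print(long_video.shape)  # frames accumulate as each prompt extends the video
```

Because each step only sees the last few frames plus the next sentence, the story can keep evolving without the model ever holding the whole minute-long video in memory at once.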
Unlike Make-A-Video, which chains several diffusion models to generate short videos and then upscale their resolution, Phenaki bootstraps its own frames, improving throughput and narrative complexity.