Google Researchers Unveil Phenaki, a System That Generates Videos from Text

The system produces videos of a few seconds using story-like descriptions.

November 3, 2022

Ruben Villegas and a few other researchers at Google have unveiled Phenaki, a system that generates videos from story-like descriptions given as text prompts. Few datasets are available for text-to-video generation, but text-image pairs are plentiful, and Google has already used them to develop text-to-image frameworks such as Imagen.

Phenaki, the text-to-video generator, is trained by treating images as single-frame videos and combining them with a dataset of short captioned videos.

1/ From today's AI@ event: we announced our Imagen text-to-image model is coming soon to AI Test Kitchen. And for the 1st time, we shared an AI-generated super-resolution video using Phenaki to generate long, coherent videos from text prompts and Imagen Video to increase quality. pic.twitter.com/WofU5J5eZV
— Sundar Pichai (@sundarpichai) November 2, 2022

Phenaki has four main components: an encoder that produces video embeddings, a language model that produces text embeddings, a MaskGIT bidirectional transformer, and a decoder.

The C-ViViT encoder/decoder is trained on a dataset of videos less than three seconds long. The encoder learns to split frames into non-overlapping patches and represent them as vectors (embeddings). The decoder learns to convert embeddings back into pixels.
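The patch-splitting step can be illustrated with a minimal NumPy sketch. The function name, the patch size, and the tensor layout are assumptions made for illustration, and the learned projection that a real encoder would apply to each patch is omitted.

```python
import numpy as np

def frames_to_patches(video, patch_h, patch_w):
    """Split every frame of a (T, H, W, C) video into non-overlapping
    patches and flatten each patch into a single vector."""
    t, h, w, c = video.shape
    if h % patch_h or w % patch_w:
        raise ValueError("frame size must be divisible by the patch size")
    nh, nw = h // patch_h, w // patch_w
    # Expose the patch grid as its own axes.
    x = video.reshape(t, nh, patch_h, nw, patch_w, c)
    # Group the grid axes together, then the within-patch axes.
    x = x.transpose(0, 1, 3, 2, 4, 5)
    # One vector per patch: (frames, patches per frame, values per patch).
    return x.reshape(t, nh * nw, patch_h * patch_w * c)

# Illustrative input: 2 frames of 4x4 pixels with a single channel.
video = np.arange(2 * 4 * 4 * 1).reshape(2, 4, 4, 1)
patches = frames_to_patches(video, patch_h=2, patch_w=2)
print(patches.shape)  # (2, 4, 4)
```

Each frame here yields four 2x2 patches, and each patch becomes a vector of four values.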

Phenaki uses the T5X language model to produce text embeddings. At inference, MaskGIT takes a set of masked video embeddings together with the text embeddings, predicts the masked embeddings, and then re-masks a portion of them to be generated again in subsequent iterations.
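The predict-then-re-mask loop can be sketched in plain Python. This is a toy under stated assumptions, not Phenaki's implementation: `toy_predict` is a hypothetical stand-in for the transformer that returns random guesses and ignores the text embedding, the vocabulary size of 1024 is arbitrary, and the cosine schedule for how many tokens stay masked is an assumed choice.

```python
import math
import random

MASK = -1  # illustrative id for a masked video token

def toy_predict(tokens, text_embedding, rng):
    """Hypothetical stand-in for the transformer. Returns a
    (token, confidence) guess for every masked position; a real model
    would condition on the visible tokens and the text embedding."""
    return {i: (rng.randrange(1024), rng.random())
            for i, tok in enumerate(tokens) if tok == MASK}

def iterative_decode(num_tokens, text_embedding, steps=8, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(1, steps + 1):
        guesses = toy_predict(tokens, text_embedding, rng)
        # Assumed cosine schedule: number of tokens left masked after this step.
        keep_masked = math.floor(num_tokens * math.cos(math.pi / 2 * step / steps))
        # Commit the most confident guesses; the rest stay masked.
        ranked = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)
        n_commit = len(ranked) - min(keep_masked, len(ranked))
        for i in ranked[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens

tokens = iterative_decode(num_tokens=16, text_embedding=None)
print(tokens)
```

Early iterations commit only a few high-confidence tokens, and the final iteration fills in whatever is still masked.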

To create minute-long videos, the authors applied MaskGIT and C-ViViT repeatedly. They first generated a short video from a single sentence and then encoded its final k frames. They then combined those video embeddings with the text that follows to generate more video frames.

Unlike Make-A-Video, which uses several diffusion models to generate short videos and then upscale their resolution, Phenaki bootstraps its own frames to enhance throughput and narrative complexity.
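The way Phenaki extends a video from its own final frames can be sketched as a simple loop. Everything named here is hypothetical: `toy_generate`, `toy_encode`, and `toy_decode` are string-based stand-ins for MaskGIT and the C-ViViT encoder and decoder, and the choice of k=2 and three new frames per prompt is arbitrary.

```python
def extend_video(prompts, generate, encode, decode, k=2):
    """Grow a video one prompt at a time, seeding each new segment
    with the encoded final k frames of the video so far."""
    if not prompts:
        return []
    frames = decode(generate(prompts[0], context=[]))
    for prompt in prompts[1:]:
        context = encode(frames[-k:])  # embeddings of the last k frames
        new_tokens = generate(prompt, context=context)
        frames.extend(decode(new_tokens))
    return frames

# Hypothetical stand-ins so the sketch runs; they only build labels.
def toy_encode(frames):
    return [f"tok({f})" for f in frames]

def toy_generate(prompt, context, n_new=3):
    return [f"{prompt}#{len(context)}+{i}" for i in range(n_new)]

def toy_decode(tokens):
    return [f"frame[{t}]" for t in tokens]

video = extend_video(
    ["a cat", "the cat jumps", "the cat sleeps"],
    toy_generate, toy_encode, toy_decode, k=2,
)
print(len(video))  # 9
```

Each prompt after the first sees only the last k frames as context, which is what lets the story change from sentence to sentence while the video stays continuous.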