ByteDance AI researchers have introduced ‘MagicVideo,’ an efficient framework for text-to-video generation based on latent diffusion models.
MagicVideo generates videos in the latent space of a pre-trained variational autoencoder (VAE) rather than in pixel space, which significantly reduces its computational requirements.
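To give a rough sense of why working in latent space is cheaper, the sketch below compares the number of values in a pixel-space clip with the number in its latent counterpart. The 8x spatial downsampling, the 4 latent channels, and the 16-frame 256x256 clip are assumptions chosen for illustration (they are typical of latent diffusion VAEs), not figures taken from the text above, and `latent_shape` is a hypothetical helper.

```python
import math


def latent_shape(frames, height, width, downsample=8, latent_channels=4):
    """Shape of the latent clip when every frame is encoded independently.

    downsample and latent_channels are assumed values, not MagicVideo's
    documented configuration.
    """
    if height % downsample or width % downsample:
        raise ValueError("height and width must be divisible by downsample")
    return (frames, latent_channels, height // downsample, width // downsample)


# Assumed clip: 16 RGB frames at 256x256.
pixel_shape = (16, 3, 256, 256)
lat_shape = latent_shape(16, 256, 256)

# Ratio of values the model would have to process per clip.
ratio = math.prod(pixel_shape) / math.prod(lat_shape)
print(lat_shape, ratio)
```

Under these assumptions the diffusion model handles 48 times fewer values per clip than it would in pixel space.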
MagicVideo uses 2D convolutions instead of 3D convolutions to get around the difficulty of obtaining video-text paired datasets. Temporal computation operators are used alongside the 2D convolutions so that both the spatial and the temporal information in the video is processed. Using 2D convolutions also allows MagicVideo to reuse the pre-trained weights of text-to-image models.
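One way to picture the split between spatial and temporal processing is the sketch below. It is plain Python with no dependencies, single-channel, and purely illustrative: `conv2d_same`, `temporal_mix`, and the weighted-sum temporal operator are assumptions made for this example, since the text does not specify which temporal operators MagicVideo uses.

```python
def conv2d_same(frame, kernel):
    """Single-channel 2D cross-correlation with zero padding (output keeps the input size)."""
    h, w = len(frame), len(frame[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - ph, x + j - pw
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += frame[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out


def temporal_mix(frames, weights):
    """Hypothetical temporal operator: weighted sum over neighbouring frames at each pixel."""
    t, radius = len(frames), len(weights) // 2
    h, w = len(frames[0]), len(frames[0][0])
    out = []
    for k in range(t):
        mixed = [[0.0] * w for _ in range(h)]
        for offset, wt in enumerate(weights):
            src = k + offset - radius
            if 0 <= src < t:  # frames outside the clip contribute nothing
                for y in range(h):
                    for x in range(w):
                        mixed[y][x] += wt * frames[src][y][x]
        out.append(mixed)
    return out


def spatial_then_temporal(video, kernel, weights):
    """Apply the 2D convolution to each frame, then mix information across frames."""
    spatial = [conv2d_same(frame, kernel) for frame in video]
    return temporal_mix(spatial, weights)


video = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
identity_kernel = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
result = spatial_then_temporal(video, identity_kernel, [0.5, 0.0, 0.5])
```

Because the spatial step is an ordinary 2D convolution, its kernel could in principle be initialised from a text-to-image model, which is the reuse the paragraph above describes.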
Although switching from 3D to 2D convolutions reduces the computational complexity significantly, the memory cost remains high. MagicVideo therefore shares the same weights across all of the 2D convolution operations.
However, weight sharing can reduce generation quality, because it assumes that all frames are almost identical when in reality they differ over time. To compensate, MagicVideo uses a custom lightweight adaptor module to adjust the distribution of each frame.
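A minimal sketch of the idea follows, assuming the adaptor is a per-frame scale and shift applied to the output of a shared layer. That form is an assumption made for illustration; the text above says only that the adaptor modifies the frame distribution. `adapt_frames` is a hypothetical name.

```python
def adapt_frames(features, scales, shifts):
    """Apply a per-frame affine adjustment to features produced by a shared layer.

    features: one flat list of values per frame, all produced with the same weights.
    scales, shifts: one value per frame (assumed form of the adaptor parameters).
    """
    if not (len(features) == len(scales) == len(shifts)):
        raise ValueError("need exactly one scale and one shift per frame")
    return [
        [scale * value + shift for value in frame]
        for frame, scale, shift in zip(features, scales, shifts)
    ]


# Two frames with identical shared-layer output, as weight sharing implicitly assumes.
shared_output = [[1.0, 2.0], [1.0, 2.0]]
adapted = adapt_frames(shared_output, scales=[1.0, 2.0], shifts=[0.0, 1.0])
print(adapted)
```

In this form the adaptor adds only two parameters per frame, which is far cheaper than keeping a separate copy of the convolution weights for every frame.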
MagicVideo learns the inter-frame relations with a directed self-attention module: each frame is computed on the basis of the preceding ones, similar to the approach used in video encoding. Finally, the generated video clips are enhanced by a post-processing module.
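The directed attention described above can be sketched as self-attention over frames with a causal restriction, so that frame i attends only to frames 0 through i. This is a simplified illustration under stated assumptions: each frame is reduced to a single feature vector, there are no learned projections, and `directed_attention` is a hypothetical function rather than MagicVideo's actual module.

```python
import math


def directed_attention(queries, keys, values):
    """Scaled dot-product attention in which each frame sees only itself and earlier frames."""
    dim = len(queries[0])
    out_dim = len(values[0])
    outputs = []
    for i, query in enumerate(queries):
        # Scores against frames 0..i only; later frames are never consulted.
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible frames.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * values[j][c] for j, w in enumerate(weights)) for c in range(out_dim)]
        )
    return outputs


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = directed_attention(frames, frames, frames)

# Changing the last frame leaves the outputs for earlier frames untouched.
changed = [[1.0, 0.0], [0.0, 1.0], [9.0, -9.0]]
attended_changed = directed_attention(changed, changed, changed)
```

The first frame has nothing earlier to attend to, so its output is simply its own value vector. Later frames blend in information from their predecessors.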