AtomoVideo: High Fidelity Image-to-Video Generation

Litong Gong*, Yiran Zhu*, Weijie Li*, Xiaoyang Kang*, Biao Wang, Tiezheng Ge, Bo Zheng

Alibaba Inc.

*Equal Contribution

One-minute Video

AtomoVideo is a novel image-to-video (I2V) generation framework that produces high-fidelity video from an input image, achieves greater motion intensity and better consistency than existing work, and is compatible with various personalized T2I models without specific tuning.

Image-to-Video Examples

Comparisons with Other Methods

[Video comparison grid: four examples, each showing the input image alongside results from AtomoVideo, Gen-2, and Pika 1.0.]

Abstract

Recently, video generation has developed rapidly, building on superior text-to-image generation techniques. In this work, we propose a high-fidelity framework for image-to-video generation, named AtomoVideo. Based on multi-granularity image injection, we achieve higher fidelity of the generated video to the given image. In addition, thanks to high-quality datasets and training strategies, we achieve greater motion intensity while maintaining superior temporal consistency and stability. Our architecture extends flexibly to the video frame prediction task, enabling long-sequence prediction through iterative generation. Furthermore, owing to the adapter-training design, our approach combines well with existing personalized models and controllable modules. In both quantitative and qualitative evaluations, AtomoVideo achieves superior results compared to popular methods.
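The abstract says only that long sequences are produced "through iterative generation". One plausible reading, sketched below under that assumption, is to generate a clip, then condition the next clip on the last frame(s) of the video so far. The names `generate_long_video`, `generate_clip`, and `toy_clip` are hypothetical, and frames are plain integers standing in for images so the loop can be run without a model.

```python
from typing import Callable, List


def generate_long_video(
    image: int,
    generate_clip: Callable[[List[int]], List[int]],
    num_rounds: int,
    context: int = 1,
) -> List[int]:
    """Extend a video round by round (an assumed scheme, not the paper's code).

    Each round conditions on the last `context` frames generated so far
    and appends the newly predicted frames.
    """
    video = [image]
    for _ in range(num_rounds):
        conditioning = video[-context:]
        video.extend(generate_clip(conditioning))
    return video


def toy_clip(conditioning: List[int], new_frames: int = 3) -> List[int]:
    """Stand-in for a video model: 'predicts' the next integers."""
    last = conditioning[-1]
    return [last + i for i in range(1, new_frames + 1)]


print(generate_long_video(0, toy_clip, num_rounds=2))  # [0, 1, 2, 3, 4, 5, 6]
```

In a real system `generate_clip` would be the I2V model itself, and the choice of how many trailing frames to condition on would trade off continuity against error accumulation.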

Framework

The framework of our image-to-video method. We start from a pre-trained T2I model and add new 1D temporal convolution and temporal attention modules after every spatial convolution and attention layer; the T2I model parameters are frozen and only the added temporal layers are trained. To inject the image information, we expand the input to 9 channels by adding the image condition latent and a binary mask. Because the concatenated image information is encoded only by the VAE, it represents low-level information, which improves the fidelity of the video with respect to the given image. We also inject high-level image semantics through cross-attention to achieve more semantic image controllability.
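The 9-channel input described above can be illustrated with a small NumPy sketch. It assumes a 4-channel VAE latent, so that 4 (noisy latent) + 4 (image condition latent) + 1 (binary mask) = 9. It also assumes the given image occupies the first frame of the condition latent with zeros elsewhere, and that the mask marks that frame. Neither the channel split nor the placement is stated in the text, and `build_unet_input` is a hypothetical name.

```python
import numpy as np


def build_unet_input(noisy_latents, image_latent, num_cond_frames=1):
    """Concatenate noisy latents, image condition latents, and a binary mask.

    noisy_latents: (frames, 4, H, W) noisy video latents
    image_latent:  (4, H, W) VAE latent of the given image
    returns:       (frames, 9, H, W)
    """
    frames, _, height, width = noisy_latents.shape

    # Assumption: the image latent fills the conditioned frame(s), zeros elsewhere.
    cond = np.zeros_like(noisy_latents)
    cond[:num_cond_frames] = image_latent

    # Binary mask: 1 where a frame carries the given image, 0 otherwise.
    mask = np.zeros((frames, 1, height, width), dtype=noisy_latents.dtype)
    mask[:num_cond_frames] = 1.0

    # Channel-wise concatenation: 4 + 4 + 1 = 9 channels.
    return np.concatenate([noisy_latents, cond, mask], axis=1)


rng = np.random.default_rng(0)
noisy = rng.standard_normal((16, 4, 32, 32)).astype(np.float32)
image = rng.standard_normal((4, 32, 32)).astype(np.float32)
x = build_unet_input(noisy, image)
print(x.shape)  # (16, 9, 32, 32)
```

The high-level semantic path (image features injected through cross-attention) is separate from this concatenation and is not shown here.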

BibTeX

@misc{atomovideo,
      title={AtomoVideo: High Fidelity Image-to-Video Generation},
      author={Gong, Litong and Zhu, Yiran and Li, Weijie and Kang, Xiaoyang and Wang, Biao and Ge, Tiezheng and Zheng, Bo},
      year={2024},
      eprint={2403.01800},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}