Text-to-video (T2V) synthesis has gained increasing attention in the community, where the recently emerged diffusion models (DMs) have shown promisingly stronger performance than past approaches. While existing state-of-the-art DMs are competent at high-resolution video generation, they can suffer heavily from key limitations (e.g., action occurrence disorder, crude video motion) with respect to modeling intricate temporal dynamics, one of the cruxes of video synthesis. In this work, we investigate strengthening the awareness of video dynamics in DMs for high-quality T2V generation. Inspired by human intuition, we design an innovative dynamic scene manager (dubbed Dysen) module, which (step 1) extracts from the input text the key actions with a proper time-order arrangement, (step 2) transforms the action schedule into a dynamic scene graph (DSG) representation, and (step 3) enriches the scenes in the DSG with sufficient and reasonable details. By taking advantage of powerful existing LLMs (e.g., ChatGPT) via in-context learning, Dysen realizes (nearly) human-level temporal dynamics understanding. Finally, the resulting video DSG with rich action-scene details is encoded as fine-grained spatio-temporal features and integrated into the backbone T2V DM for video generation. Experiments on popular T2V datasets show that our Dysen-VDM consistently outperforms prior arts by significant margins, especially in scenarios with complex actions.
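To make the DSG representation in step 2 concrete, here is a minimal sketch of how a dynamic scene graph could be organized: an ordered sequence of scenes, each holding a scheduled action plus enriched (subject, relation, object) triplets. The class and field names below are illustrative only and are not the repository's actual data structures.

```python
# Illustrative sketch of a dynamic scene graph (DSG); names are hypothetical.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Triplet:
    """A single (subject, relation, object) fact inside one scene."""
    subject: str
    relation: str
    object: str

@dataclass
class Scene:
    """One temporal step of the DSG: a scheduled action plus its enriched details."""
    action: str                                   # key action planned for this step
    triplets: List[Triplet] = field(default_factory=list)

@dataclass
class DynamicSceneGraph:
    """Ordered scenes covering the whole video clip."""
    scenes: List[Scene] = field(default_factory=list)

# Example: "a man pours coffee, then drinks it"
dsg = DynamicSceneGraph(scenes=[
    Scene(action="pour coffee", triplets=[
        Triplet("man", "holding", "kettle"),
        Triplet("kettle", "pouring into", "cup"),
    ]),
    Scene(action="drink coffee", triplets=[
        Triplet("man", "lifting", "cup"),
        Triplet("man", "drinking from", "cup"),
    ]),
])
```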
Current DM-based T2V still faces several common yet non-negligible challenges. As summarized in Fig. 1, four typical issues can be found in a diffusion-based T2V model: low frame resolution, unsmooth video transitions, crude video motion, and action occurrence disorder. While the latest DM-based T2V explorations have put much effort into enhancing the quality of video frames, i.e., generating high-resolution images, they largely overlook the modeling of intricate video temporal dynamics, the real crux of high-quality video synthesis and the key to relieving the last three issues above. According to our observation, the key bottleneck is rooted in the nature of video-text modality heterogeneity: language can describe complex actions with a few succinct and abstract words (e.g., predicates and modifiers), whereas video requires specific and often redundant frames to render the same actions.
We propose a dynamics-aware T2V diffusion model, namely Dysen-VDM, as shown in Fig. 2. We employ an existing SoTA latent DM as the backbone for T2V synthesis, and meanwhile devise an innovative dynamic scene manager (namely Dysen) module for video dynamics modeling. A rough integration sketch is given below.
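The sketch below shows, at a high level, how a Dysen-style module could condition a latent video diffusion backbone on both the text prompt and the encoded DSG features. All module names (`DysenVDMSketch`, `scene_encoder`, etc.) are placeholders for illustration, not the repository's actual classes.

```python
# High-level integration sketch; module names are assumptions, not the real API.
import torch
import torch.nn as nn

class DysenVDMSketch(nn.Module):
    def __init__(self, dysen, scene_encoder, text_encoder, latent_dm):
        super().__init__()
        self.dysen = dysen                  # LLM-driven action planning + scene imagination
        self.scene_encoder = scene_encoder  # encodes the enriched DSG into features
        self.text_encoder = text_encoder    # e.g., a CLIP-style text encoder
        self.latent_dm = latent_dm          # backbone latent video diffusion model

    def forward(self, prompt: str, noisy_latents: torch.Tensor, timestep: torch.Tensor):
        # 1) Build the enriched dynamic scene graph from the prompt (via the LLM).
        dsg = self.dysen(prompt)
        # 2) Encode the DSG into fine-grained spatio-temporal conditioning features.
        dsg_feats = self.scene_encoder(dsg)
        # 3) Encode the raw prompt as the usual text condition.
        text_feats = self.text_encoder(prompt)
        # 4) Denoise the video latents conditioned on both signals.
        context = torch.cat([text_feats, dsg_feats], dim=1)
        return self.latent_dm(noisy_latents, timestep, context=context)
```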
With Dysen, we aim to realize human-level temporal dynamics understanding of video. We take advantage of one of the most powerful current LLMs, ChatGPT, treating it as a consultant for action planning and scene imagination in Dysen.
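As a rough illustration of the in-context-learning step, the snippet below prompts an OpenAI chat model to extract key actions in temporal order from a caption. The prompt wording, the few-shot demonstration, and the model choice are assumptions for the sketch; the actual instructions and demonstrations used by Dysen may differ.

```python
# Hedged sketch of LLM-based action planning via in-context learning.
# Assumes the openai>=1.0 client and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical few-shot demonstration; not the paper's actual prompt.
FEW_SHOT = (
    "Caption: a man pours coffee and then drinks it\n"
    "Actions: 1. (man, pour, coffee)  2. (man, drink, coffee)\n"
)

def plan_actions(caption: str) -> str:
    """Ask the LLM to extract the key actions and arrange them in time order."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Extract the key actions from the caption and arrange "
                        "them in the order they should occur in the video."},
            {"role": "user", "content": FEW_SHOT + f"Caption: {caption}\nActions:"},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content

# Example usage:
# print(plan_actions("a dog catches a frisbee and brings it back to its owner"))
```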
Generated video comparison: CogVideo, VDM, LVDM, and Dysen-VDM (Ours).
You may refer to previous works such as CogVideo, VDM, LVDM, Stable Diffusion, and GPT-3.5, which serve as foundations for our Dysen-VDM framework and code repository.
@inproceedings{fei2024dysen,
title={Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs},
author={Fei, Hao and Wu, Shengqiong and Ji, Wei and Zhang, Hanwang and Chua, Tat-Seng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
pages={961--970},
year={2024}
}