Dysen-VDM:

Empowering Dynamics-aware Text-to-Video Diffusion with LLMs

1. National University of Singapore
2. Skywork AI, Singapore
3. Nanyang Technological University
(CVPR 2024)

Abstract

Text-to-video (T2V) synthesis has gained increasing attention in the community, where the recently emerged diffusion models (DMs) have shown promisingly stronger performance than past approaches. While existing state-of-the-art DMs are capable of high-resolution video generation, they may largely suffer from key limitations (e.g., action occurrence disorder, crude video motion) with respect to modeling intricate temporal dynamics, one of the cruxes of video synthesis. In this work, we investigate strengthening the awareness of video dynamics in DMs for high-quality T2V generation. Inspired by human intuition, we design an innovative dynamic scene manager (dubbed Dysen) module, which includes (step-1) extracting from the input text the key actions with proper time-order arrangement, (step-2) transforming the action schedules into dynamic scene graph (DSG) representations, and (step-3) enriching the scenes in the DSG with sufficient and reasonable details. By taking advantage of existing powerful LLMs (e.g., ChatGPT) via in-context learning, Dysen realizes (nearly) human-level temporal dynamics understanding. Finally, the resulting video DSG with rich action scene details is encoded as fine-grained spatio-temporal features and integrated into the backbone T2V DM for video generation. Experiments on popular T2V datasets suggest that our Dysen-VDM consistently outperforms prior art by significant margins, especially in scenarios with complex actions.

Motivation

Fig. 1: Common issues in existing text-to-video (T2V) synthesis.

Current DM-based T2V still faces several common yet non-negligible challenges. As summarized in Fig. 1, four typical issues can be found in diffusion-based T2V models: low frame resolution, unsmooth video transitions, crude video motion, and action occurrence disorder. While the latest DM-based T2V explorations have devoted much effort to enhancing the quality of video frames, i.e., generating high-resolution images, they largely overlook the modeling of intricate video temporal dynamics, the real crux of high-quality video synthesis and the key to relieving the last three of the aforementioned issues. According to our observation, the key bottleneck is rooted in the nature of video-text modality heterogeneity: language can describe complex actions with a few succinct and abstract words (e.g., predicates and modifiers), whereas video requires specific and often redundant frames to render the same actions.


Framework

We propose a dynamics-aware T2V diffusion model, namely Dysen-VDM, as shown in Fig. 2. We employ an existing SoTA latent DM as the backbone for T2V synthesis, and meanwhile devise an innovative dynamic scene manager (namely Dysen) module for video dynamics modeling.


Fig. 2: Our dynamics-aware T2V diffusion framework (Dysen-VDM). The dynamic scene manager (Dysen) module operates over the input text prompt and produces the enriched dynamic scene graph (DSG), which is encoded by the recurrent graph Transformer (RGTrm); the resulting fine-grained spatio-temporal scene features are integrated into the video generation (denoising) process.

With Dysen, we realize (nearly) human-level temporal dynamics understanding of video. We take advantage of the currently most powerful LLM, ChatGPT, treating it as a consultant for action planning and scene imagination in Dysen. Dysen works in three steps:

  • Step-I: extracting the key actions from the input text, properly arranged in their order of occurrence.
  • Step-II: converting the ordered actions into sequential dynamic scene graph (DSG) representations.
  • Step-III: enriching the scenes in the initial DSG with sufficient and reasonable details elicited from ChatGPT via in-context learning.
Finally, the resulting DSG with well-enriched scene details is encoded with a recurrent graph Transformer (RGTrm), and the learned fine-grained spatio-temporal feature representations are integrated into the backbone T2V DM for generating high-quality, fluent video.
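
For intuition, below is a minimal PyTorch sketch of what such a recurrent graph Transformer could look like: per-frame self-attention restricted to scene-graph edges, with a GRU cell carrying scene state across DSG frames. The dimensions, the adjacency-masked attention, and the mean-pooling readout are our own simplifying assumptions for illustration, not the paper's exact RGTrm design.

import torch
import torch.nn as nn


class RecurrentGraphTransformer(nn.Module):
    """Minimal sketch of an RGTrm-style encoder: per-frame self-attention
    over scene-graph nodes (masked to graph edges), with a GRU cell
    propagating scene state across DSG frames."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        """node_feats: (T, N, D) node embeddings for T DSG frames;
        adj: (T, N, N) boolean adjacency per frame (True = edge).
        Returns (T, D) spatio-temporal scene features."""
        T, N, D = node_feats.shape
        state = node_feats.new_zeros(1, D)          # recurrent scene state
        outputs = []
        for t in range(T):
            x = node_feats[t].unsqueeze(0)          # (1, N, D)
            mask = ~adj[t]                          # True = attention blocked
            mask.fill_diagonal_(False)              # always allow self-attention
            h, _ = self.attn(x, x, x, attn_mask=mask)
            x = self.norm1(x + h)
            x = self.norm2(x + self.ffn(x))
            pooled = x.mean(dim=1)                  # graph readout: (1, D)
            state = self.gru(pooled, state)         # carry scene state forward
            outputs.append(state.squeeze(0))
        return torch.stack(outputs)                 # (T, D)

In the full framework, these per-frame features would condition the denoising network, e.g., through cross-attention alongside the usual text embedding.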


Fig. 3: Based on the given text, the Dysen module carries out three steps to obtain the enriched DSG: 1) action planning, 2) event-to-DSG conversion, and 3) scene imagination, taking advantage of ChatGPT with in-context learning.
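
To make the three-step procedure more concrete, here is a minimal Python sketch of how such an LLM-consulted pipeline could be wired together. The prompt templates, the (subject, predicate, object) triplet format, and the generic llm callable are illustrative assumptions rather than the authors' exact prompts or schema.

from dataclasses import dataclass
from typing import Callable, List

# A DSG frame here is a list of (subject, predicate, object) triplets;
# this flat format is an illustrative assumption, not the paper's schema.
@dataclass
class Triplet:
    subject: str
    predicate: str
    obj: str

# One worked demonstration, used as an in-context learning example.
PLAN_DEMO = (
    "Text: A dog runs into the kitchen and eats from a bowl.\n"
    "Actions: 1. (dog, run into, kitchen) 2. (dog, eat from, bowl)\n"
)

def plan_actions(text: str, llm: Callable[[str], str]) -> str:
    """Step I: ask the LLM for the key actions, ordered by occurrence."""
    prompt = (
        "Extract the key actions from the text and arrange them in the "
        "order they occur.\n\n" + PLAN_DEMO + f"\nText: {text}\nActions:"
    )
    return llm(prompt)

def actions_to_dsg(action_plan: str) -> List[List[Triplet]]:
    """Step II: parse '(subject, predicate, object)' actions into
    per-step scene-graph fragments (one triplet list per scene)."""
    scenes: List[List[Triplet]] = []
    for chunk in action_plan.split(")"):
        if "(" not in chunk:
            continue
        parts = [p.strip() for p in chunk.split("(")[1].split(",")]
        if len(parts) == 3:
            scenes.append([Triplet(*parts)])
    return scenes

def enrich_scene(scene: List[Triplet], llm: Callable[[str], str]) -> str:
    """Step III: ask the LLM to imagine plausible extra details
    (attributes, background objects) for one sparse scene."""
    triplets = "; ".join(f"({t.subject}, {t.predicate}, {t.obj})" for t in scene)
    prompt = (
        "Enrich this sparse video scene graph with reasonable visual "
        f"details, keeping the original action intact:\n{triplets}\nEnriched:"
    )
    return llm(prompt)

In practice, llm would wrap a ChatGPT API call carrying several in-context demonstrations per step, and the enriched scene descriptions would be parsed back into triplets before the DSG sequence is handed to the RGTrm encoder.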


Demo Examples

▶ High-quality demos (576 x 320)

  • A lady holds an umbrella, walking in the park with her friend.
  • A bustling morning of a market with crowd.
  • In the kitchen, the woman assists her dad in cooking dinner by trimming the vegetables.
  • Friends dance to music at the party.
  • Students listen to the teacher in the classroom, with some raising hands.
  • In the gym, a man and woman are running on treadmills.

▶ Comparisons against baselines (320 x 320)

Each comparison shows four generations for the same prompt, from left to right: CogVideo, VDM, LVDM, and Dysen-VDM (Ours).

  • A woman is looking after the plant in her garden, and then she raises her head to observe the weather.
  • A man with his dog walks on the countryside road.
  • A man dressed as Santa Claus is riding a motorcycle on a big city road.
  • A clownfish swimming with elegance through coral reefs, presenting the beautiful scenery under the sea.
  • A cowboy who wears a cowboy hat, a t-shirt and jeans is riding a horse to confidently compete in a rodeo competition.
  • A young violin player in a neat shirt with a collar, having a headphone on, is playing the violin.
  • A person in a jacket, facing away from the camera, is walking along a countryside road heading into the distance.
  • A man and another man are standing together in the middle of a tennis court, and speaking to the camera.
  • A thick blanket of snow covers the ground, where an elderly man walks in front of the house.
  • A person in a jacket riding a horse, is walking along the countryside road.
  • A cat eating food out of a bowl while looking around, then the camera moves away to a scene where another cat eats food.
  • On a stage, a woman is rotating and waving her arms to show her belly dance.
  • A band composed of a group of young people is performing live music.
  • A horse in a blue cover walks at a fast pace, and then begins to slow down, taking a walk in the paddock.
  • Two women sit on a park bench, reading books while chatting to each other.
  • A woman hikes up the green mountain reaches the summit, and takes photos of the breathtaking view.
  • A girl is playing on a swing in the park.
  • A group of boys are playing football on the green field, with one passing the ball over another.
  • A woman in a raincoat holding an umbrella steps into the pouring rain, and tries to pass through the traffic.
  • Two girls are holding hands together and joyfully going down the mountain.
  • A boy is riding a skateboard on the city street, and then falling down.

Related Links

You may refer to previous works such as CogVideo, VDM, LVDM, Stable Diffusion, and GPT-3.5, which serve as foundations for our Dysen-VDM framework and code repository.

BibTeX


@inproceedings{fei2024dysen,
  title={Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs},
  author={Hao Fei and Shengqiong Wu and Wei Ji and Hanwang Zhang and Tat-Seng Chua},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={961--970},
  year={2024}
}