Existing research on video understanding still struggles to achieve in-depth comprehension and reasoning over complex videos, primarily due to the under-exploration of two key bottlenecks: fine-grained spatial-temporal perceptive understanding and cognitive-level video scene comprehension. This paper bridges the gap by presenting a novel solution. We first introduce a novel video Multimodal Large Language Model (MLLM), MotionEpic, which achieves fine-grained pixel-level spatial-temporal video grounding by integrating a video spatial-temporal scene graph (STSG) representation. Building upon MotionEpic, we then develop a Video-of-Thought (VoT) reasoning framework. VoT inherits the Chain-of-Thought (CoT) core, breaking down a complex task into simpler, manageable sub-problems and addressing them step by step, from low-level pixel perception to high-level cognitive interpretation. Extensive experiments across various complex video QA benchmarks demonstrate that our overall framework significantly boosts the existing state-of-the-art. To our knowledge, this is the first successful attempt at implementing the CoT technique for human-level video reasoning, and we show its great potential for extension to a wider range of video understanding scenarios.
Given an input video and a question, VoT first identifies the possible target(s) in the question that need to be observed. After this step, all possible target(s) involved in the question are confirmed.
The system then grounds the spatial-temporal tracklet(s) of the target(s) in the STSG. The grounded target tracklet(s) serve as low-level evidence (i.e., supporting rationale) for the next step of behavior analysis.
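To make these first two steps concrete, the sketch below shows how target identification and tracklet grounding could be wired up as prompts to the model. The `ask_motionepic` callable, the prompt wording, and the return formats are illustrative assumptions, not the paper's actual interface.

```python
# A minimal sketch of Steps 1-2 (target identification and tracklet grounding).
# ask_motionepic stands in for a call to the MotionEpic model; its exact API is
# not specified here, so treat the prompts and signatures as illustrative.
from typing import Callable, List

def identify_targets(question: str, ask_motionepic: Callable[[str], str]) -> List[str]:
    """Step 1: ask the model which object(s) the question is about."""
    prompt = (
        f"Question: {question}\n"
        "Which target object(s) in the video must be observed to answer this question? "
        "List them, separated by commas."
    )
    return [t.strip() for t in ask_motionepic(prompt).split(",") if t.strip()]

def ground_tracklet(target: str, ask_motionepic: Callable[[str], str]) -> str:
    """Step 2: ask the model for the target's spatial-temporal tracklet in the STSG."""
    prompt = (
        f"Target: {target}\n"
        "Ground this target in the video and return its spatial-temporal tracklet "
        "(frame-by-frame regions) from the STSG."
    )
    return ask_motionepic(prompt)

# Toy usage with a canned responder in place of the real model.
fake_model = lambda p: "the child" if "Which target" in p else "frames 12-40: child tracklet"
targets = identify_targets("Why does the child suddenly stop running?", fake_model)
tracklets = {t: ground_tracklet(t, fake_model) for t in targets}
```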
Drawing on factual commonsense, VoT next interprets the target's trajectory and its interactions with the neighboring scene to understand the action dynamics and semantics in depth. This step yields an Observation and an Implication of the target action.
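A corresponding sketch for the action-analysis step is shown below, where the grounded tracklet is fed back as low-level evidence. The prompt text and the Observation/Implication output format are again assumptions for illustration.

```python
# A minimal sketch of Step 3 (action analysis), assuming the same hypothetical
# ask_motionepic callable as above.
from typing import Callable, Tuple

def analyze_action(target: str, tracklet: str,
                   ask_motionepic: Callable[[str], str]) -> Tuple[str, str]:
    """Return (observation, implication) of the target's behavior."""
    prompt = (
        f"Target: {target}\nGrounded tracklet: {tracklet}\n"
        "Combining factual commonsense, describe (1) what the target is observed "
        "doing and how it interacts with the neighboring scene, and (2) what this "
        "behavior implies. Answer as 'Observation: ... Implication: ...'"
    )
    reply = ask_motionepic(prompt)
    observation, _, implication = reply.partition("Implication:")
    return observation.replace("Observation:", "").strip(), implication.strip()
```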
With an in-depth understanding of the target actions in the video, VoT then examines each candidate answer against commonsense knowledge and scores it. The scores of all options are ranked, and the highest-ranked candidate is selected as the Answer.
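One simple way to realize this ranking step is to score each candidate independently and take the argmax, as in the hypothetical sketch below; the 1-10 rating scale and the prompt wording are illustrative choices, not the paper's exact protocol.

```python
# A minimal sketch of Step 4 (candidate ranking). ask_motionepic again stands in
# for a model call; scoring each option and picking the argmax is an assumption.
from typing import Callable, Dict, List

def rank_answers(question: str, observation: str, implication: str,
                 options: List[str], ask_motionepic: Callable[[str], str]) -> str:
    scores: Dict[str, int] = {}
    for option in options:
        prompt = (
            f"Question: {question}\n"
            f"Observation: {observation}\nImplication: {implication}\n"
            f"Candidate answer: {option}\n"
            "Using commonsense, rate how likely this candidate is correct on a "
            "scale of 1-10. Reply with a single integer."
        )
        scores[option] = int(ask_motionepic(prompt).strip())
    # Rank all options by score and return the highest-ranked one.
    return max(scores, key=scores.get)
```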
Finally, VoT verifies the answer from both the pixel-grounding (perception) and commonsense (cognition) perspectives, ensuring the most factually accurate result. If any inconsistency is found from either perspective, the corresponding rationale is recorded and the fourth step is re-executed to reselect the answer.
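The verification-and-retry loop described above might look roughly as follows; the two yes/no checks, the `rerank` callback, and the `max_retries` guard are assumptions added for the sketch.

```python
# A minimal sketch of Step 5 (answer verification) under the same hypothetical
# ask_motionepic interface. The retry limit is an illustrative safeguard.
from typing import Callable, List

def verify_answer(question: str, answer: str, options: List[str],
                  rerank: Callable[[str, List[str], List[str]], str],
                  ask_motionepic: Callable[[str], str],
                  max_retries: int = 2) -> str:
    rationales: List[str] = []
    for _ in range(max_retries + 1):
        # Check 1: pixel-grounding (perception) consistency.
        perception = ask_motionepic(
            f"Does the grounded video evidence support the answer '{answer}' "
            f"to the question '{question}'? Reply 'yes' or explain the inconsistency.")
        # Check 2: commonsense (cognition) consistency.
        cognition = ask_motionepic(
            f"Is the answer '{answer}' to '{question}' consistent with commonsense? "
            "Reply 'yes' or explain the inconsistency.")
        if perception.lower().startswith("yes") and cognition.lower().startswith("yes"):
            return answer
        # Record the rationale and re-run the ranking step (Step 4) with it.
        rationales.extend([perception, cognition])
        answer = rerank(question, options, rationales)
    return answer
```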
We introduce a novel video LLM, MotionEpic, which builds on an architecture similar to that of existing popular video MLLMs and supports not only video input but also the encoding, understanding, and generation of STSGs.
We propose integrating an STSG representation: MotionEpic models both the input video and its STSG, so that fine-grained spatial-temporal features are jointly captured.
To endow MotionEpic with fine-grained pixel-level spatial-temporal grounding between videos and STSGs, we also investigate several distinct video-STSG training objectives. STSG annotations are used only during the grounding-aware tuning phase; in the subsequent stage, the system learns to parse STSGs autonomously, and thus supports STSG-free inference and reasoning on downstream tasks.
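For intuition, the sketch below shows one plausible in-code representation of an STSG: per-frame object nodes plus subject-predicate-object relation edges, with a text serialization that an LLM could consume or emit. The field names and serialization format are assumptions; no concrete schema is prescribed here.

```python
# A minimal, assumed STSG data structure: frames of object nodes and relation
# edges, flattened into text for LLM-side encoding or generation.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FrameGraph:
    frame_idx: int
    objects: List[str]                     # e.g. ["child#1", "ball#2"]
    relations: List[Tuple[str, str, str]]  # e.g. ("child#1", "kicks", "ball#2")

@dataclass
class STSG:
    frames: List[FrameGraph] = field(default_factory=list)

    def to_text(self) -> str:
        """Serialize the graph so it can be fed to or parsed from the LLM."""
        lines = []
        for f in self.frames:
            rels = "; ".join(f"{s} {p} {o}" for s, p, o in f.relations)
            lines.append(
                f"frame {f.frame_idx}: objects=[{', '.join(f.objects)}] relations=[{rels}]"
            )
        return "\n".join(lines)

# Example: a two-frame STSG tracking one subject-object interaction.
g = STSG([
    FrameGraph(0, ["child#1", "ball#2"], [("child#1", "runs_toward", "ball#2")]),
    FrameGraph(8, ["child#1", "ball#2"], [("child#1", "kicks", "ball#2")]),
])
print(g.to_text())
```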
▶ Main results on 6 video QA datasets
▶ Few-shot results on 4 video QA datasets
▶ Visualization
@inproceedings{VoT24Hao,
  author    = {Hao Fei and Shengqiong Wu and Wei Ji and Hanwang Zhang and Meishan Zhang and Mong{-}Li Lee and Wynne Hsu},
  title     = {Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2024},
}