Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition

Abstract

Existing research on video understanding still struggles to achieve in-depth comprehension and reasoning over complex videos, primarily because two key bottlenecks remain under-explored: fine-grained spatial-temporal perceptive understanding and cognitive-level video scene comprehension. This paper addresses both. We first introduce a novel video Multimodal Large Language Model (MLLM), MotionEpic, which achieves fine-grained pixel-level spatial-temporal video grounding by integrating a video spatial-temporal scene graph (STSG) representation. Building upon MotionEpic, we then develop a Video-of-Thought (VoT) reasoning framework. VoT inherits the core of Chain-of-Thought (CoT): it breaks a complex task down into simpler, manageable sub-problems and addresses them step by step, from low-level pixel perception to high-level cognitive interpretation. Extensive experiments across various complex video QA benchmarks demonstrate that our overall framework substantially improves on the existing state of the art. To our knowledge, this is the first successful attempt at implementing the CoT technique for human-level video reasoning, and we show great potential for extending it to a wider range of video understanding scenarios.

Video-of-Thought (VoT)

Video Reasoning Framework

▶ Framework Overview

▶ Step-1: Task Definition and Target Identification

Given an input video and a question, VoT identifies the possible target(s) mentioned in the question that should be observed.

After this step, all possible Targets involved in the question are confirmed.

▶ Step-2: Object Tracking

The system then grounds the temporal tracklet(s) of the target(s), which serve as supporting evidence/rationale for content perception in subsequent analysis.

The resulting grounded Target Tracklet, a partial STSG, serves as low-level evidence (i.e., supporting rationale) for the next step of behavior analysis.
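
To make the idea of a grounded tracklet concrete, the sketch below models an STSG as a list of per-frame graphs and collects the boxes of one target across frames. The class names (`ObjectNode`, `FrameGraph`), the field layout, and the `extract_tracklet` helper are hypothetical and chosen only for this example; they are not MotionEpic's actual representation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical, simplified STSG layout for illustration only.
BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class ObjectNode:
    object_id: int
    label: str
    bbox: BBox


@dataclass
class FrameGraph:
    frame_index: int
    nodes: List[ObjectNode] = field(default_factory=list)
    # (subject_id, predicate, object_id) relations within this frame
    edges: List[Tuple[int, str, int]] = field(default_factory=list)


def extract_tracklet(stsg: List[FrameGraph], object_id: int) -> List[Tuple[int, BBox]]:
    """Collect (frame_index, bbox) for one object over the frames it appears in."""
    tracklet = []
    for frame in stsg:
        for node in frame.nodes:
            if node.object_id == object_id:
                tracklet.append((frame.frame_index, node.bbox))
    return tracklet


# Toy three-frame graph with made-up boxes.
stsg = [
    FrameGraph(
        0,
        [ObjectNode(1, "person", (10, 20, 50, 120)), ObjectNode(2, "ball", (60, 100, 70, 110))],
        [(1, "near", 2)],
    ),
    FrameGraph(1, [ObjectNode(1, "person", (14, 20, 54, 120))]),
    FrameGraph(2, [ObjectNode(2, "ball", (90, 80, 100, 90))]),
]
ball_tracklet = extract_tracklet(stsg, object_id=2)
```

In this toy layout the tracklet is simply the subset of STSG nodes that belong to one object, which is why it can be read as a partial STSG.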

▶ Step-3: Action Analyzing

Combined with factual commonsense, VoT next interprets the target object's trajectory and its interactions with neighboring scenes to thoroughly understand the action dynamics and semantics.

This step yields the target action's Observation and Implication.

▶ Step-4: Question Answering via Ranking

With an in-depth understanding of the target actions in the video, we then carefully examine each candidate answer against commonsense knowledge and score it.

We then rank the scores of all candidates and select the top-ranked one as the final Answer.
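
The ranking itself is a plain sort over per-candidate scores. The snippet below is a minimal sketch that assumes each candidate has already been assigned a numeric score by the model; the `rank_candidates` helper, the score scale, and the example values are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_candidates(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort candidate answers by score, highest first; ties keep input order."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up scores standing in for the model's per-candidate ratings.
candidate_scores = {"A": 3.0, "B": 8.5, "C": 6.0}
ranking = rank_candidates(candidate_scores)
answer = ranking[0][0]  # top-ranked candidate
```

Keeping the full ranking, rather than only the top entry, leaves the runner-up candidates available if the answer is later rejected and has to be reselected.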

▶ Step-5: Answer Verification

Finally, VoT performs verification for the answer from both pixel grounding perception and commonsense cognition perspectives, ensuring the most factually accurate result.

If an inconsistency is found from either the perception or the cognition perspective, we record the corresponding rationale and re-execute the 4th step to reselect the answer, so that the final outcome is the most factually accurate one.
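
The control flow of this verify-and-retry step can be sketched as a small loop. Everything here is an assumption made for illustration: the `rank_step` and `verify_step` callables stand in for model calls, and the `max_rounds` cap is added only so the sketch always terminates; the source does not specify a retry limit.

```python
from typing import Callable, List, Tuple


def answer_with_verification(
    rank_step: Callable[[List[str]], str],
    verify_step: Callable[[str], List[str]],
    max_rounds: int = 3,  # hypothetical cap, not from the source
) -> Tuple[str, List[str]]:
    """Run Step-4, check the answer in Step-5, and redo Step-4 while issues remain."""
    rationales: List[str] = []
    answer = rank_step(rationales)
    for _ in range(max_rounds):
        issues = verify_step(answer)  # empty list means the answer passed
        if not issues:
            break
        rationales.extend(issues)  # record why the answer was rejected
        answer = rank_step(rationales)  # reselect with the extra rationale
    # If the cap is reached, the last answer is returned without a final check.
    return answer, rationales


# Toy stand-ins for the model: "A" is picked first, then rejected.
def toy_rank(rationales: List[str]) -> str:
    return "A" if not rationales else "B"


def toy_verify(answer: str) -> List[str]:
    return ["A contradicts the grounded tracklet"] if answer == "A" else []


final_answer, recorded = answer_with_verification(toy_rank, toy_verify)
```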

MotionEpic

Fine-grained Spatiotemporal-grounded Video MLLM

We introduce a novel video MLLM, MotionEpic, which is built on an architecture similar to that of existing popular video MLLMs and supports not only video input but also the encoding, understanding, and generation of STSGs.

▶ STSG Representation Integration

We propose integrating an STSG representation: MotionEpic models both the input video and its STSG, so that fine-grained spatial-temporal features are carefully captured.

▶ Fine-grained Video-Scene Grounding-aware Tuning

To equip MotionEpic with fine-grained pixel-level spatial-temporal grounding between videos and STSGs, we also investigate several distinct video-STSG training objectives. STSG annotations are used during the grounding-aware tuning phase; in the subsequent stage, the system learns to parse STSGs autonomously and thus supports STSG-free inference and reasoning for downstream tasks.

Enhancing coarse-grained correspondence
- L₁: predicting whether the overall input video and STSG are paired.
- L₂: given a video, generating the whole STSG (expression) of the video.
Enhancing fine-grained correspondence
- L₃: given a video and action description(s), outputting the corresponding object tracklet(s), i.e., a partial STSG.
- L₄: given a video and key object(s), describing the corresponding temporal action(s) in textual response, and outputting the corresponding object tracklet(s).
- L₅: given a video and the bounding box (bbox) of an object in a certain frame, outputting the object label, as well as the corresponding tracklet.
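
As one illustration of how training data for the coarse-grained objective L₁ might be assembled, the sketch below builds paired and mismatched video-STSG examples. The negative-sampling scheme (rotating the STSG list by a fixed `shift`), the `build_pairing_samples` helper, and the identifiers are assumptions for this example; the source only states that the model predicts whether a video and an STSG are paired.

```python
from typing import List, Tuple


def build_pairing_samples(
    video_ids: List[str], stsg_ids: List[str], shift: int = 1
) -> List[Tuple[str, str, int]]:
    """Return (video, stsg, label) triples: label 1 if paired, 0 if mismatched.

    Assumes video_ids[i] and stsg_ids[i] belong together. Negatives are made
    by pairing each video with the STSG `shift` positions further along.
    """
    n = len(video_ids)
    if n != len(stsg_ids):
        raise ValueError("video_ids and stsg_ids must have the same length")
    if n < 2 or shift % n == 0:
        raise ValueError("need at least two items and a shift that changes the pairing")
    samples = []
    for i, video in enumerate(video_ids):
        samples.append((video, stsg_ids[i], 1))  # true pair
        samples.append((video, stsg_ids[(i + shift) % n], 0))  # mismatched pair
    return samples


samples = build_pairing_samples(["v0", "v1", "v2"], ["g0", "g1", "g2"])
```

A fixed rotation keeps the example deterministic; a real pipeline would more likely sample negatives at random.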

Experiment

▶ Main results on 6 video QA datasets

▶ Few-shot results on 4 video QA datasets

▶ Visualization

**Case 2:** Qualitative examples of perception-level reasoning. The correct answer is marked with a green checkmark, and the wrong answer is marked with a red cross.

**Case 3:** Qualitative examples of cognitive-level reasoning.

BibTeX

@inproceedings{VoT24Hao,
  author    = {Hao Fei and Shengqiong Wu and Wei Ji and Hanwang Zhang and Meishan Zhang and Mong-Li Lee and Wynne Hsu},
  title     = {Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2024},
}