Research Statement

On Multimodal Generalist towards Human-level Capacity and Cognition

The emergence of large language models (LLMs) has bestowed unprecedented levels of intelligence upon human society. We human beings live in a world where various sensory modalities of signals coexist, which suggests that integrating multimodal capabilities may well be the most promising path toward the ultimate goal of AGI. The human-level AI we envision should be a multimodal generalist that embodies human-like behavioral patterns: it should not only perceive the semantic content of various modalities and scenarios but also generate and output signals across different modalities to interact with the external world. This implies universal modality and task capacities with strong synergistic generalizability. To achieve true human-level AI, it should, beyond perception, also encompass complex reasoning, knowledge application, and empathy, akin to human beings. With these beliefs in mind, my research topic is: studying and building a unified multimodal generalist towards human-level capacity (Modality, Task, Knowledge) and cognition (Reasoning, Affection).

My research can be divided into several blocks; the figure below illustrates the overall logical architecture.

Each research topic is shown below, together with representative publications [View complete publications]:

▶ Foundation-level: Multimodal LLMs and Generalists

  Unified Multimodal LLMs with universal capabilities of comprehension and generation, along with synergistic ability

  • NExT-GPT:    The 1st unified any-to-any multimodal LLM
  • Vitron:    The 1st unified pixel-level vision LLM for understanding, generating, segmenting, and editing images and videos
  • OMG-LLaVA:    A pioneering vision LLM for pixel-level, object-level, and image-level understanding and reasoning
  • General-Level:    Pioneer the path of MLLM evaluations towards multimodal generalists

  MLLMs for Image/Video/3D/etc

  • Setok:    Enhance Vision LLMs with a dynamic semantic-equivalent vision tokenizer
  • VPGTrans:    For the first time investigate the transferability of vision encoders across LLMs
  • LL3DA:    A pioneering 3D-LLM (3D point cloud)
  • Momentor:    A pioneering Video-LLM for fine-grained comprehension and localization in videos
  • Molca:    A pioneering Protein LLM

  Multimodal Agents for addressing a wide range of downstream applications, with embodied intelligence

▶ Capacity-level: Cross-modal Information Comprehension, Generation and Acquisition

  Multimodal Perception: low-/high-level audio/speech/image/video/3D modeling, cross-modal captioning/retrieval, scene graph parsing

  • Finsta:    Enhance video-language models with fine-grained structural spatio-temporal alignment learning
  • GO3D-SG:    Enhance visual spatial understanding with holistic 3D spatial scene graph
  • HostSG:    Explore a novel holistic spatio-temporal scene graph for video event analysis
  • Cross2StrA:    Explore the visual scene graph and linguistic structural representation for cross-lingual image captioning
  • RIS-CQ:    Present a novel referring image understanding benchmark with complex queries

  Multimodal Generation: text/vision/audio synthesis, text-to-vision generation, joint multimodal generation

  • Dysen-VDM:    Enhance the temporal dynamics of text-to-video diffusion with LLMs
  • LayoutLLM-T2I:    Enhance fidelity of text-to-image diffusion with layout from LLMs
  • Salad:    Improve text-to-image synthesis under the abstract-to-intricate setting with scene graph representations

  Knowledge Acquisition: cross-modal information extraction and translation

  • MUIE:    The 1st benchmark for grounded multimodal universal information extraction
  • SpeechEE:    The 1st benchmark for extracting events from speech
  • W2NER:    Unify flat, overlapped and discontinuous NER as word-word relation classification
  • LasUIE:    Pioneer universal information extraction with a latent adaptive structure-aware generative LM
  • MMRE:    Enhance multimodal relation extraction via internal-information screening and external-information exploiting
  • UMMT:    Pioneer inference-time image-free unsupervised multimodal machine translation with a visual scene hallucination mechanism

▶ Cognition-level: Multimodal Human-centric Reasoning and Affection

  Reasoning: complex reasoning, neuro/symbolic reasoning, cross-modal reasoning

  • Video-of-Thought:    The 1st video chain-of-thought reasoning framework
  • SymbCoT:    The 1st fully LLM-based logical reasoning framework built on chain-of-thought

  Affective Computing: cross-modal, fine-grained affection and opinion analysis in social media

  • PanoSent:    The 1st cognitive-level benchmark for multimodal conversational aspect-based sentiment analysis
  • THOR-ISA:    For the first time address implicit sentiment reasoning with a chain-of-thought framework
  • DiaASQ:    The 1st benchmark for conversational aspect-based sentiment analysis
  • UABSA:    A multiplex cascade framework for unified aspect-based sentiment analysis
  • RobustABSA:    Comprehensively rethink the robustness of models, data, and training in aspect-based sentiment analysis

Previously, I focused in particular on Structure-aware Intelligence Learning (SAIL) and worked on the following topics:

  • NLP
    • Text Generation
    • Dialogue/Document Analysis
    • Semantic Parsing (XSRL)
    • Syntax Parsing and Grammar Induction
  • Multimodal Learning
    • Image/Video Captioning
    • Multimodal Grammar Induction (VAT-GI)
  • Language Modeling
    • Structure-aware Language Modeling (StructLM)
    • KG-empowered Language Modeling (BioLM)
  • Machine Learning