Research Statement

On Multimodal Generalist towards Human-level Capacity and Cognition

The emergence of large language models (LLMs) has bestowed unprecedented levels of intelligence upon human society. We human beings live in a world where various sensory modalities of signals coexist, which suggests that integrating multimodal capabilities may well be the most promising path toward the ultimate goal of AGI. The human-level AI we envision should be a multimodal generalist that embodies human-like behavioral patterns: it should not only perceive the semantic content of various modalities and scenarios but also generate and output signals across different modalities to interact with the external world. This implies universal modality and task capacities with strong synergistic generalizability. To achieve true human-level AI, it should, beyond perception, also encompass complex reasoning, knowledge application, and empathy, akin to human beings. With these beliefs in mind, my research topic is: studying and building a unified multimodal generalist towards human-level capacity (Modality, Task, Knowledge) and cognition (Reasoning, Affection).

My research can be divided into several blocks; the figure below illustrates the overall logical architecture.

Each research topic is shown below, together with representative publications [View complete publications]:

▶ Foundation-level: Multimodal LLMs and Generalists

  Unified Multimodal LLMs with universal capabilities of comprehension and generation, along with synergistic ability

  • NExT-GPT:    The 1st unified any-to-any multimodal LLM
  • Vitron:    The 1st unified pixel-level vision LLM for understanding, generating, segmenting, and editing images and videos
  • OMG-LLaVA:    A pioneering vision LLM for pixel-level, object-level, and image-level understanding and reasoning
  • General-Level:    Pioneer the path of MLLM evaluations towards multimodal generalists

  MLLMs for Image/Video/3D/etc

  • Setok:    Enhance Vision LLMs with a dynamic semantic-equivalent vision tokenizer
  • VPGTrans:    For the first time investigate the transferability of vision encoders across LLMs
  • LL3DA:    A pioneering 3D-LLM (3D point cloud)
  • Momentor:    A pioneering Video-LLM for fine-grained comprehension and localization in videos
  • Molca:    A pioneering Protein LLM

  Multimodal Agents for addressing a wide range of downstream applications, with embodied intelligence

▶ Capacity-level: Cross-modal Information Comprehension, Generation and Acquisition

  Multimodal Perception: low-/high-level audio/speech/image/video/3D modeling, cross-modal captioning/retrieval, scene graph parsing

  • Finsta:    Enhance video-language models with fine-grained structural spatio-temporal alignment learning
  • GO3D-SG:    Enhance visual spatial understanding with holistic 3D spatial scene graph
  • HostSG:    Explore a novel holistic spatio-temporal scene graph for video event analysis
  • Cross2StrA:    Explore the visual scene graph and linguistic structural representation for cross-lingual image captioning
  • RIS-CQ:    Present a novel referring image understanding benchmark with complex queries

  Multimodal Generation: text/vision/audio synthesis, text-to-vision generation, joint multimodal generation

  • Dysen-VDM:    Enhance the temporal dynamics of text-to-video diffusion with LLMs
  • LayoutLLM-T2I:    Enhance fidelity of text-to-image diffusion with layout from LLMs
  • Salad:    Improve text-to-image synthesis under the abstract-to-intricate setting with scene graph representations

  Knowledge Acquisition: cross-modal information extraction and translation

  • MUIE:    The 1st benchmark for grounded multimodal universal information extraction
  • SpeechEE:    The 1st benchmark for extracting events from speech
  • W2NER:    Unify flat, overlapped and discontinuous NER as word-word relation classification
  • LasUIE:    Pioneer universal information extraction with a latent adaptive structure-aware generative LM
  • MMRE:    Enhance multimodal relation extraction via internal-information screening and external-information exploiting
  • UMMT:    Pioneer inference-time image-free unsupervised multimodal machine translation with a visual scene hallucination mechanism

▶ Cognition-level: Multimodal Human-centric Reasoning and Affection

  Reasoning: complex reasoning, neuro/symbolic reasoning, cross-modal reasoning

  • Video-of-Thought:    The 1st video chain-of-thought reasoning framework
  • SymbCoT:    The 1st fully LLM-based logical reasoning framework built on chain-of-thought

  Affective Computing: cross-modal, fine-grained affection and opinion analysis in social media

  • PanoSent:    The 1st cognitive-level benchmark for multimodal conversational aspect-based sentiment analysis
  • THOR-ISA:    For the first time address implicit sentiment reasoning with a chain-of-thought framework
  • DiaASQ:    The 1st benchmark for conversational aspect-based sentiment analysis
  • UABSA:    A multiplex cascade framework for unified aspect-based sentiment analysis
  • RobustABSA:    Comprehensively rethink the robustness of models, data, and training in aspect-based sentiment analysis

Previously, I focused in particular on Structure-aware Intelligence Learning (SAIL) and worked on the following topics:

  • NLP
    • Text Generation
    • Dialogue/Document Analysis
    • Semantic Parsing (XSRL)
    • Syntax Parsing and Grammar Induction
  • Multimodal Learning
    • Image/Video Captioning
    • Multimodal Grammar Induction (VAT-GI)
  • Language Modeling
    • Structure-aware Language Modeling (StructLM)
    • KG-empowered Language Modeling (BioLM)
  • Machine Learning