Previously, I focused on a slightly different research theme, Multimodal Generalist towards Human-level Capacity and Cognition. Even earlier, I focused on Structure-aware Intelligence Learning (SAIL).
Future human-level AI should bridge physical-world grounding with mental-world intelligence, combining unified multimodal capability with cognition, affect, mind, behavior, and social understanding. My research aims to build multimodal AI systems that do not merely interact with the external world mechanically, but perceive, reason, generate, and act with increasingly advanced cognitive foundations.
Build cross-modal/multimodal systems that can understand, reason over, and generate rich signals across text, speech, audio, image, video, 3D, and 4D modalities.
Move AI beyond mechanical interaction by modeling cognition, mind, emotion, empathy, mental behavior, and collective intelligence.
A unified multimodal foundation model serves as the common substrate that supports both physical-world grounding and mental-world intelligence, enabling generalist perception, reasoning, generation, interaction, and agentic coordination across diverse modalities, paradigms, and tasks.
Unified and advanced multimodal foundational systems (LLMs and agents) that coordinate perception, reasoning, generation, and interaction across modalities. Below, I highlight several representative research series that define this direction.
Unified any-to-any multimodal LLMs that move beyond perception-only systems toward fully interleaved multimodal input-output intelligence.
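As a conceptual sketch only (the class and module names below are illustrative placeholders, not the actual implementation of any system in this line), an any-to-any multimodal LLM can be pictured as modality-specific encoders feeding a shared LLM backbone, whose output is routed to modality-specific decoders:

```python
class AnyToAnyMLLM:
    """Conceptual skeleton of an any-to-any multimodal LLM: encode any input
    modality into a shared token space, reason with one backbone, and decode
    the result into any requested output modality."""

    def __init__(self, encoders: dict, backbone, decoders: dict):
        self.encoders = encoders  # modality name -> encoder callable
        self.backbone = backbone  # shared LLM over the unified token space
        self.decoders = decoders  # modality name -> decoder callable

    def respond(self, inputs: list, target_modality: str):
        # Project each (modality, payload) pair into the backbone's token space.
        tokens = [self.encoders[modality](payload) for modality, payload in inputs]
        hidden = self.backbone(tokens)  # interleaved multimodal reasoning
        return self.decoders[target_modality](hidden)


# Toy stand-ins; real systems use neural encoders/decoders and an actual LLM.
model = AnyToAnyMLLM(
    encoders={"text": lambda s: f"<txt:{s}>", "image": lambda p: f"<img:{p}>"},
    backbone=lambda tokens: " ".join(tokens),
    decoders={"text": lambda h: f"answer({h})"},
)
print(model.respond([("image", "cat.png"), ("text", "What is shown?")], "text"))
```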
Pixel-level large vision foundation models for understanding, generation, segmentation, editing, and controllable interaction with visual content.
A universal video agent that unifies video understanding, editing, tracking, planning, and creation in an automated agentic workflow.
A research line on building multimodal generalist AI that unifies modalities, tasks, and paradigms, enabling synergistic perception, reasoning, and generation.
Model the physical world through multimodal understanding, generation, and structured reasoning across text, speech, image, video, 3D, and 4D signals. Below, I highlight several representative research series that define this direction.
Large generative foundation models for synchronized/joint audio-video generation.
Video generation research embedding multimodal intent, causal reasoning, and physics for high-quality, faithful, physically consistent synthesis.
Scene graphs provide structured representations enabling multimodal reasoning, generation, controllability, explainability, and generalization across modalities.
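For illustration, here is a minimal sketch (hypothetical types, not tied to any specific system above) of a scene graph as a structured representation, with objects as nodes and relations as labeled directed edges:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    """An object node in the scene graph, e.g. a detected entity."""
    name: str
    attributes: tuple = ()  # e.g. ("brown", "running")

@dataclass
class SceneGraph:
    """Objects as nodes; pairwise relations as (subject, predicate, object) edges."""
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_relation(self, subject: Node, predicate: str, obj: Node) -> None:
        for node in (subject, obj):
            if node not in self.nodes:
                self.nodes.append(node)
        self.edges.append((subject, predicate, obj))

    def triples(self):
        """Name-level triples that downstream reasoning/generation can consume."""
        return [(s.name, p, o.name) for s, p, o in self.edges]

# Toy graph for the scene "a person rides a horse on a beach".
g = SceneGraph()
person, horse, beach = Node("person"), Node("horse"), Node("beach")
g.add_relation(person, "rides", horse)
g.add_relation(horse, "on", beach)
print(g.triples())  # [('person', 'rides', 'horse'), ('horse', 'on', 'beach')]
```

Such explicit triples give models a symbolic handle on scene content, which is what makes controllability and explanation tractable across modalities.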
Line of work around multimodal chain-of-thought reasoning and other projects that connect perception with higher-level cognition.
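As a hedged illustration of the two-stage rationale-then-answer pattern common in multimodal chain-of-thought work (the `call_mllm` function below is a hypothetical placeholder, not the interface of these projects):

```python
def call_mllm(image_path: str, prompt: str) -> str:
    """Hypothetical placeholder for a multimodal LLM call; swap in a real API."""
    return f"[model output for {image_path!r} given a {len(prompt)}-char prompt]"

def multimodal_cot(image_path: str, question: str) -> str:
    # Stage 1: elicit an intermediate rationale grounded in the visual input.
    rationale = call_mllm(
        image_path,
        f"Question: {question}\nDescribe the relevant visual evidence step by step.",
    )
    # Stage 2: condition the final answer on the question plus the rationale,
    # so perception feeds explicitly into the higher-level inference step.
    return call_mllm(
        image_path,
        f"Question: {question}\nReasoning: {rationale}\nTherefore, the answer is:",
    )

print(multimodal_cot("kitchen.png", "Why is the pot steaming?"))
```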
Symbolic structures and logic guide LLM reasoning toward more faithful, controllable, and interpretable decision processes.
Model how AI individuals can feel, infer, empathize, and behave with richer cognitive and social awareness rather than interacting with the world in a purely mechanical manner. Below, I highlight several representative research series that define this direction.
UniCAE unifies multimodal affective understanding and generation, enabling empathetic AI across language, speech, vision, and 3D.
Mental-world modeling integrates physical dynamics with human cognitive and social states toward comprehensive, interactive intelligence.
Cognition-driven affective computing integrates perception, Theory-of-Mind, social context, and mental states to reason about implicit emotions and hidden intentions.
Large-scale modeling of cognition, mind, behavior, and social intelligence, including simulations of individuals and groups.