Hao Fei

Senior Postdoctoral Researcher

Department of Computer Science, University of Oxford
Wolfson Building, Parks Road, Oxford OX1 3QD, UK

Profile

I am a senior postdoctoral researcher at the University of Oxford, jointly working with Prof. Yarin Gal in the OATML Group and Prof. Chris Holmes in the Big Data Institute. Previously, I was a senior research fellow at the National University of Singapore, where I worked with Prof. Mong-Li Lee, Prof. Wynne Hsu, Prof. Tat-Seng Chua and Prof. Shuicheng Yan. I also worked as a visiting researcher at Microsoft Research Asia, as an associate researcher at Skywork AI Singapore, and at SEA AI Lab. I received my Ph.D. from Wuhan University.

My research has been published in top-tier ML/NLP/CV/MM venues, e.g., ICML, NeurIPS, ACL, CVPR, AAAI, WWW, SIGIR, IJCAI, EMNLP, ACM-MM, TPAMI, TKDE, TOIS, TNNLS, TASLP. I was awarded the World AI Conference (WAIC) Rising Star in 2023 and was ranked among the Top 2% Scientists Worldwide 2024 (Single Year) by Stanford University. My papers have been selected as Most Influential Papers by Paper Digest and as ESI Highly Influential Papers, and have received the 2024 WAIC Outstanding Paper Award. I regularly serve as (Senior) Area Chair or Senior Program Committee member for top-tier conferences, and I served on the organizing committees of WSDM 2022, EMNLP 2023, ACL 2024, and ACM MM 2025. I am an Associate Editor of several journals, including TALLIP and Neurocomputing, and a regularly invited reviewer for many journals, including TPAMI, IJCV, TNNLS, TKDE, and TOIS. My Ph.D. thesis received the Excellent Doctoral Thesis award of the Chinese Information Processing Society (CIPS), and I won more than ten honors and awards during my Ph.D. studies.

Research

My research interests lie in NLP, CV, and their intersection (i.e., Multimodal/Vision-Language Learning). My long-term goal is to achieve human-level AI centered around multimodal LLMs & generalists. While I previously worked extensively on Structural Modeling of Language & Vision, my most recent focus is on unified multimodal generalists with human-level capacity (Modality, Task, Knowledge) and cognition (Reasoning, Affection), with the following key topics and representative works (detailed in my research statement):

▶  Multimodal Foundation Models: Unified multimodal LLMs and generalists.

  • NExT-GPT:      The 1st unified any-to-any multimodal LLM
  • Vitron:      The 1st unified pixel-level vision LLM for understanding, generating, segmenting, editing of image and video
  • General-Level:      Pioneer the path of MLLM evaluations towards multimodal generalists
  • MLLM tutorial:      A pioneering & comprehensive tutorial series for MLLM techniques

▶  Capacity: Comprehension/generation of modalities/tasks, knowledge acquisition.

  • JavisDiT:    A novel Diffusion Transformer for synchronized audio-video generation
  • Any2Caption:    A SoTA video generation framework from any input conditions
  • Dysen-VDM:      Enhance temporal dynamics of text-to-video diffusion from LLMs
  • LayoutLLM-T2I:      Enhance fidelity of text-to-image diffusion with layout from LLMs
  • MUIE:      The 1st benchmark for grounded multimodal universal information extraction

▶  Cognition: Cross-modal neuro-symbolic reasoning, human-centric affective computing.

  • MCoT-Survey:    The 1st systematic survey of MCoT reasoning
  • Video-of-Thought:      The 1st video chain-of-thought reasoning framework
  • SymbCoT:      The 1st fully LLM-based logical reasoning framework based on chain-of-thought
  • THOR-ISA:      The 1st chain-of-thought reasoning framework for implicit sentiment analysis
  • PanoSent:      The 1st cognitive-level benchmark for multimodal conversational aspect-based sentiment analysis
  • AvaMERG:    The 1st avatar-based multimodal empathetic conversation benchmark

I also extensively explore AI for science, including 1) clinical psychology & social studies, 2) bio-/medicine & healthcare, and 3) materials science, by integrating advanced LLM/agent methodologies.

Advertising

I am always looking for collaborations on the above topics, and remote collaboration is welcome. For promising students, I will provide sufficient GPUs. Hit me up if you are a Ph.D./master's/bachelor's student interested in what I am doing now (there are potential vacancies for research interns, RAs, and visiting positions). For students at the University of Oxford, I’m particularly looking for collaborations on world modeling + AI scientist. Please describe your research status and attach your resume & statement.

News

  26 Jan 2026

Five papers are accepted by ICLR 2026, 1) JavisDiT++, 2) JavisDiT, 3) LogicReward, 4) Interleaved Reasoning, and 5) Cognitive Emotion Reasoning. Congrats to all my co-authors!

  14 Nov 2025

We are thrilled to release UniVA: Universal Video Agent, an open-source next-generation video generalist! UniVA features: 1) 🤖 Unified Agentic System: a one-stop, omnipotent, highly automated, interactive and proactive video creation station, with deep memory and planner–executor synergy. 2) 🎬 Powerful Creation: MCP-native modular system covering understanding, editing, tracking, and any-conditioned video generation with industrial-grade cinematic quality. Try the online demo now! Check the paper.

  8 Nov 2025

Two papers are accepted by AAAI 2026, 1) 4D Generation and 2) DragNeXt. Congrats to all my co-authors!

  25 Sep 2025

Four papers are accepted by NeurIPS 2025, 1) JavisGPT, 2) MuSLR, 3) VimoRAG and 4) Visual Thoughts. Congrats to all my co-authors!

  21 Aug 2025

Three papers are accepted by EMNLP 2025, 1) 3D Emotional Facial Generation, 2) Financial QA, and 3) Legal LLM. Congrats to all my co-authors!

  16 Jul 2025

Four papers are accepted by ACM MM 2025, 1) SSM for Salient Object Detection, 2) FormFactory, 3) ViTCoT and 4) MCM-DPO. Congrats to all my co-authors!

  26 Jun 2025

Four papers are accepted by ICCV 2025, 1) PhysSplat, 2) Explainable Driving, 3) Derm1M: Clinical Ontology Knowledge and 4) Iris: Self-Refining for GUI Agent. Congrats to all my co-authors!
