In the field of information extraction (IE), tasks across a wide range of modalities and their combinations have traditionally been studied in isolation, leaving a gap in deeply recognizing and analyzing cross-modal information. To address this, this work introduces, for the first time, the concept of grounded Multimodal Universal Information Extraction (MUIE), providing a unified task framework for analyzing any IE task over various modalities, along with their fine-grained groundings. To tackle MUIE, we tailor a multimodal large language model (MLLM), REAMO, capable of extracting and grounding information from all modalities, i.e., recognizing everything from all modalities at once. REAMO is updated via varied tuning strategies, equipping it with powerful capabilities for information recognition and fine-grained multimodal grounding. To address the absence of a suitable benchmark for grounded MUIE, we curate a high-quality, diverse, and challenging test set, which encompasses IE tasks across 9 common modality combinations with the corresponding multimodal groundings. An extensive comparison of REAMO with existing MLLMs integrated into pipeline approaches demonstrates its advantages across all evaluation dimensions, establishing a strong benchmark for follow-up research.
▶ Grounded MUIE Task Definition

Grounded MUIE covers three core IE tasks over text \( T \), image \( I \), audio \( A \), and video \( V \); an illustrative output schema is sketched after the definitions.

NER: identifying all possible entities \( \{E^{\text{ner}}\} \) and predicting their pre-defined entity types \( C^{\text{ner}}\in \mathcal{C}^{\text{ner}} \) (e.g., person, location, and organization), where each \( E \) may correspond to a span within \( T \), a visual region within \( I \), or a speech segment within \( A \). We denote the visual grounding mask as \( M_{img} \) and the speech segment as \( M_{aud} \).
RE: first identifying all possible entities \( \{E^{\text{re}}\} \) as in the NER step, and then determining a pre-defined relation label \( R^{\text{re}}\in \mathcal{R}^{\text{re}} \) for each pair of entities \( \langle E^{\text{re}}_i, E^{\text{re}}_j \rangle \) that should be linked. As in NER, each \( E^{\text{re}} \) may correspond to \( T \), \( I \), or \( A \).
EE: detecting all possible structured event records, each consisting of an event trigger \( E^{\text{et}} \), an event type \( C^{\text{et}}\in \mathcal{C}^{\text{et}} \), event arguments \( E^{\text{ea}} \), and event argument roles \( C^{\text{er}} \in \mathcal{C}^{\text{er}} \), where \( \mathcal{C}^{\text{et}} \) and \( \mathcal{C}^{\text{er}} \) are pre-defined label sets. Here \( E^{\text{et}} \) and \( E^{\text{ea}} \) correspond to a continuous span within \( T \) or a speech segment within \( A \); \( E^{\text{ea}} \) may also refer to a visual region within \( I \) or a temporally dynamic tracklet in video \( V \) (i.e., object tracking). We denote the video tracking mask as \( M_{vid} \).
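To make the expected outputs concrete, the sketch below shows one way the grounded NER/RE/EE records defined above could be represented in Python. The schema and all names in it (Grounding, Entity, Relation, Event, and their fields) are our own illustration, not a format released with REAMO or the benchmark.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

import numpy as np


@dataclass
class Grounding:
    """Where a mention is grounded across modalities (all fields optional)."""
    text_span: Optional[Tuple[int, int]] = None           # character span within T
    image_mask: Optional[np.ndarray] = None                # M_img: binary mask over I
    video_masks: Optional[List[np.ndarray]] = None         # M_vid: per-frame tracklet in V
    audio_segment: Optional[Tuple[float, float]] = None    # M_aud: (start_s, end_s) within A


@dataclass
class Entity:  # grounded NER record
    mention: str
    entity_type: str                                       # C^ner, e.g. "person"
    grounding: Grounding = field(default_factory=Grounding)


@dataclass
class Relation:  # grounded RE record
    head: Entity
    tail: Entity
    label: str                                             # R^re from the pre-defined set


@dataclass
class EventArgument:
    mention: str                                           # E^ea
    role: str                                              # C^er
    grounding: Grounding = field(default_factory=Grounding)


@dataclass
class Event:  # grounded EE record
    trigger: str                                           # E^et
    event_type: str                                        # C^et
    arguments: List[EventArgument] = field(default_factory=list)
```

Under this illustrative schema, a full MUIE prediction for one instance is simply three lists (entities, relations, and events), each item carrying its own cross-modal grounding.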
▶ The REAMO Model

To solve MUIE, we take advantage of existing generative LLMs with in-context instructions. We develop a novel multimodal LLM, REAMO, achieving Recognizing Everything from All Modalities at Once. REAMO not only outputs all possible textual IE labels but also identifies the corresponding groundings across other modalities: 1) statically, by segmenting visual objects and speech segments, and 2) dynamically, by tracking textual or vocal events in videos. Technically, REAMO employs a Vicuna LLM as its core semantic reasoner and uses ImageBind as a multimodal encoder to project image, video, and audio inputs into LLM-understandable signals. At the decoding end, we integrate SEEM for visual grounding and tracking and SHAS for audio segmentation, with messages passed from the LLM to the decoders through structured meta-responses. Given multimodal inputs, REAMO recurrently outputs UIE label tokens along with their fine-grained groundings.
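The sketch below illustrates this encode-reason-ground flow as we read it from the description above. Every name in it (GroundingRequest, parse_meta_response, the bracketed meta-response convention, and the injected encoder/projector/llm/seem/shas objects) is a hypothetical placeholder, not the released REAMO implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GroundingRequest:
    mention: str   # the entity/trigger/argument text to ground
    target: str    # "image", "video", or "audio"


def parse_meta_response(response: str) -> Tuple[List[str], List[GroundingRequest]]:
    """Split the LLM output into UIE label lines and grounding requests.

    We assume a toy convention where grounding requests appear as
    '[GROUND mention -> modality]' markers; the real meta-response
    format is defined by the authors' implementation.
    """
    labels, requests = [], []
    for line in response.splitlines():
        line = line.strip()
        if line.startswith("[GROUND"):
            body = line.strip("[]").removeprefix("GROUND").strip()
            mention, _, target = body.partition("->")
            requests.append(GroundingRequest(mention.strip(), target.strip()))
        elif line:
            labels.append(line)
    return labels, requests


def reamo_infer(text, image=None, video=None, audio=None, *,
                encoder, projector, llm, seem, shas):
    """End-to-end MUIE inference: encode, reason, then ground."""
    # 1) An ImageBind-style encoder maps non-text inputs to embeddings,
    #    which a projector turns into LLM-understandable signals.
    multimodal_tokens = [projector(encoder(m))
                         for m in (image, video, audio) if m is not None]

    # 2) The Vicuna core reasons over text plus projected signals and emits
    #    UIE label tokens together with a structured meta-response.
    labels, requests = parse_meta_response(llm.generate(text, multimodal_tokens))

    # 3) Route grounding requests to modality-specific decoders: SEEM for
    #    visual segmentation/tracking, SHAS for speech segmentation.
    groundings = {}
    for req in requests:
        if req.target in ("image", "video"):
            groundings[req.mention] = seem.ground(req.mention, image if image is not None else video)
        elif req.target == "audio":
            groundings[req.mention] = shas.segment(req.mention, audio)
    return labels, groundings
```

The point this sketch is meant to highlight is the division of labor described above: the LLM itself never produces masks or timestamps; it emits label tokens plus a structured meta-response, and the decoders (SEEM, SHAS) turn those requests into concrete groundings.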
▶ Visualization of REAMO on MUIE
▶ MUIE Benchmark Construction

To evaluate the performance of our grounded MUIE system, we develop a benchmark test set. We select 9 existing datasets covering different modalities (and combinations thereof) for IE/MIE tasks; the following table summarizes the raw sources of these datasets. We then further process these datasets (e.g., via Text\( \leftrightarrow \)Speech conversion) to create 6 new datasets covering new multimodal combination scenarios. Before annotation, we carefully select 200 instances from the corresponding test sets, ensuring that each instance contains as much IE information as possible.
The benchmark data is currently being enriched and scaled; stay tuned!
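As a purely illustrative example of this kind of processing, the sketch below derives a Text\( \rightarrow \)Speech instance from a text IE dataset by synthesizing the sentence with a TTS engine and mapping gold character spans onto audio segments via forced alignment. The tts and aligner objects and derive_speech_instance are hypothetical stand-ins for whatever tools one might use; the benchmark itself may have been constructed differently.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpeechEntity:
    mention: str
    entity_type: str
    segment: Tuple[float, float]   # M_aud: (start_s, end_s) within the synthesized audio


def derive_speech_instance(sentence: str,
                           gold_entities: List[Tuple[str, str, Tuple[int, int]]],
                           tts, aligner, out_path: str):
    """gold_entities holds (mention, entity_type, (char_start, char_end)) triples."""
    tts.synthesize(sentence, out_path)               # write the audio file
    word_times = aligner.align(out_path, sentence)   # per-word (start_s, end_s) list

    # character offset at which each word begins (assumes single-space tokenization)
    words, starts, pos = sentence.split(), [], 0
    for w in words:
        starts.append(pos)
        pos += len(w) + 1                            # +1 for the separating space

    grounded = []
    for mention, etype, (cs, ce) in gold_entities:
        # indices of the words that overlap the gold character span
        idxs = [i for i, w in enumerate(words) if starts[i] < ce and starts[i] + len(w) > cs]
        segment = (word_times[idxs[0]][0], word_times[idxs[-1]][1])
        grounded.append(SpeechEntity(mention, etype, segment))
    return out_path, grounded
```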
▶ Evaluation Methods
The grounded MUIE benchmark covers the three IE tasks above (NER, RE, and EE), and evaluation involves two types of predictions: UIE label prediction and multimodal grounding prediction.
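Below is a minimal sketch of how these two dimensions could be scored, assuming standard metric choices: micro F1 over exact-match label tuples, mask IoU for image/video groundings, and temporal IoU for speech segments. The function names are ours, and the benchmark's actual scoring protocol may differ.

```python
from typing import Set, Tuple

import numpy as np


def label_f1(pred: Set[Tuple], gold: Set[Tuple]) -> float:
    """Micro F1 over exact-match tuples, e.g. (mention, type) for NER
    or (head, label, tail) for RE."""
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def mask_iou(pred_mask: np.ndarray, gold_mask: np.ndarray) -> float:
    """IoU between binary masks, usable for image regions (M_img) and,
    frame by frame, for video tracklets (M_vid)."""
    inter = np.logical_and(pred_mask, gold_mask).sum()
    union = np.logical_or(pred_mask, gold_mask).sum()
    return float(inter) / float(union) if union else 0.0


def segment_iou(pred: Tuple[float, float], gold: Tuple[float, float]) -> float:
    """Temporal IoU between a predicted and a gold speech segment (M_aud)."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0
```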
▶ Zero-shot performance on text+image or standalone image input.
▶ Zero-shot performance on text+audio or standalone audio input.
▶ Zero-shot performance on text+video or standalone video input.
▶ Zero-shot performance in more complex modality-hybrid scenarios.
▶ Citation

@inproceedings{ACL24MUIE,
title = {Recognizing Everything from All Modalities at Once: Grounded Multimodal Universal Information Extraction},
author = {Zhang, Meishan and Fei, Hao and Wang, Bin and Wu, Shengqiong and Cao, Yixin and Li, Fei and Zhang, Min},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
year = {2024},
}