In the field of information extraction (IE), tasks across a wide range of modalities and their combinations have traditionally been studied in isolation, leaving a gap in deeply recognizing and analyzing cross-modal information. To address this, this work introduces, for the first time, the concept of grounded Multimodal Universal Information Extraction (MUIE), providing a unified task framework for analyzing any IE task over various modalities, along with their fine-grained groundings. To tackle MUIE, we tailor a multimodal large language model (MLLM), REAMO, capable of extracting and grounding information from all modalities, i.e., recognizing everything from all modalities at once. REAMO is updated via varied tuning strategies, equipping it with powerful capabilities for information recognition and fine-grained multimodal grounding. To address the absence of a suitable benchmark for grounded MUIE, we curate a high-quality, diverse, and challenging test set, which encompasses IE tasks across 9 common modality combinations with the corresponding multimodal groundings. An extensive comparison of REAMO with existing MLLMs integrated into pipeline approaches demonstrates its advantages across all evaluation dimensions, establishing a strong benchmark for follow-up research.
▶ Grounded MUIE Task Definition

Grounded MUIE covers three core IE tasks over text \( T \), image \( I \), audio \( A \), and video \( V \); an illustrative output schema is sketched after the definitions.

NER: identifying all possible entities \( \{E^{\text{ner}}\} \) and predicting their pre-defined entity types \( C^{\text{ner}}\in \mathcal{C}^{\text{ner}} \) (e.g., person, location, and organization), where each \( E \) may correspond to a span within \( T \), a visual region within \( I \), or a speech segment within \( A \). We denote the visual grounding mask as \( M_{img} \) and the speech segment as \( M_{aud} \).
RE: first identifying all possible entities \( \{E^{\text{re}}\} \) as in the NER step, and then determining a pre-defined relation label \( R^{\text{re}}\in \mathcal{R}^{\text{re}} \) for each pair of entities \( \langle E^{\text{re}}_i, E^{\text{re}}_j \rangle \) that should be linked. As in NER, each \( E^{\text{re}} \) may correspond to \( T \), \( I \), or \( A \).
EE: detecting all possible structured event records, each consisting of an event trigger \( E^{\text{et}} \), an event type \( C^{\text{et}}\in \mathcal{C}^{\text{et}} \), event arguments \( E^{\text{ea}} \), and event argument roles \( C^{\text{er}} \in \mathcal{C}^{\text{er}} \), where \( \mathcal{C}^{\text{et}} \) and \( \mathcal{C}^{\text{er}} \) are pre-defined label sets. Here \( E^{\text{et}} \) and \( E^{\text{ea}} \) correspond to a continuous span within \( T \) or a speech segment within \( A \); \( E^{\text{ea}} \) may also refer to a visual region within \( I \) or a temporally dynamic tracklet in video \( V \) (i.e., object tracking). We denote the video tracking mask as \( M_{vid} \).
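To make the expected outputs concrete, the sketch below shows one way the grounded NER/RE/EE records defined above could be represented in Python. The schema and all names in it (Grounding, Entity, Relation, Event, and their fields) are our own illustration, not a format released with REAMO or the benchmark.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

import numpy as np


@dataclass
class Grounding:
    """Where a mention is grounded across modalities (all fields optional)."""
    text_span: Optional[Tuple[int, int]] = None           # character span within T
    image_mask: Optional[np.ndarray] = None                # M_img: binary mask over I
    video_masks: Optional[List[np.ndarray]] = None         # M_vid: per-frame tracklet in V
    audio_segment: Optional[Tuple[float, float]] = None    # M_aud: (start_s, end_s) within A


@dataclass
class Entity:  # grounded NER record
    mention: str
    entity_type: str                                       # C^ner, e.g. "person"
    grounding: Grounding = field(default_factory=Grounding)


@dataclass
class Relation:  # grounded RE record
    head: Entity
    tail: Entity
    label: str                                             # R^re from the pre-defined set


@dataclass
class EventArgument:
    mention: str                                           # E^ea
    role: str                                              # C^er
    grounding: Grounding = field(default_factory=Grounding)


@dataclass
class Event:  # grounded EE record
    trigger: str                                           # E^et
    event_type: str                                        # C^et
    arguments: List[EventArgument] = field(default_factory=list)
```

Under this illustrative schema, a full MUIE prediction for one instance is simply three lists (entities, relations, and events), each item carrying its own cross-modal grounding.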
▶ The REAMO Model

To solve MUIE, we take advantage of existing generative LLMs with in-context instructions. We develop a novel multimodal LLM, REAMO, achieving Recognizing Everything from All Modalities at Once. REAMO not only outputs all possible textual IE labels but also identifies the corresponding groundings across other modalities: 1) statically, by segmenting visual objects and speech segments, and 2) dynamically, by tracking textual or vocal events in videos. Technically, REAMO employs a Vicuna LLM as its core semantic reasoner and uses ImageBind as a multimodal encoder to project image, video, and audio inputs into LLM-understandable signals. At the decoding end, we integrate SEEM for visual grounding and tracking and SHAS for audio segmentation, with messages passed from the LLM to the decoders through structured meta-responses. Given multimodal inputs, REAMO recurrently outputs UIE label tokens along with their fine-grained groundings.
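The sketch below illustrates this encode-reason-ground flow as we read it from the description above. Every name in it (GroundingRequest, parse_meta_response, the bracketed meta-response convention, and the injected encoder/projector/llm/seem/shas objects) is a hypothetical placeholder, not the released REAMO implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GroundingRequest:
    mention: str   # the entity/trigger/argument text to ground
    target: str    # "image", "video", or "audio"


def parse_meta_response(response: str) -> Tuple[List[str], List[GroundingRequest]]:
    """Split the LLM output into UIE label lines and grounding requests.

    We assume a toy convention where grounding requests appear as
    '[GROUND mention -> modality]' markers; the real meta-response
    format is defined by the authors' implementation.
    """
    labels, requests = [], []
    for line in response.splitlines():
        line = line.strip()
        if line.startswith("[GROUND"):
            body = line.strip("[]").removeprefix("GROUND").strip()
            mention, _, target = body.partition("->")
            requests.append(GroundingRequest(mention.strip(), target.strip()))
        elif line:
            labels.append(line)
    return labels, requests


def reamo_infer(text, image=None, video=None, audio=None, *,
                encoder, projector, llm, seem, shas):
    """End-to-end MUIE inference: encode, reason, then ground."""
    # 1) An ImageBind-style encoder maps non-text inputs to embeddings,
    #    which a projector turns into LLM-understandable signals.
    multimodal_tokens = [projector(encoder(m))
                         for m in (image, video, audio) if m is not None]

    # 2) The Vicuna core reasons over text plus projected signals and emits
    #    UIE label tokens together with a structured meta-response.
    labels, requests = parse_meta_response(llm.generate(text, multimodal_tokens))

    # 3) Route grounding requests to modality-specific decoders: SEEM for
    #    visual segmentation/tracking, SHAS for speech segmentation.
    groundings = {}
    for req in requests:
        if req.target in ("image", "video"):
            groundings[req.mention] = seem.ground(req.mention, image if image is not None else video)
        elif req.target == "audio":
            groundings[req.mention] = shas.segment(req.mention, audio)
    return labels, groundings
```

The point this sketch is meant to highlight is the division of labor described above: the LLM itself never produces masks or timestamps; it emits label tokens plus a structured meta-response, and the decoders (SEEM, SHAS) turn those requests into concrete groundings.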
▶ Visualization of REAMO on MUIE
▶ MUIE Benchmark Construction

To evaluate the performance of our grounded MUIE system, we develop a benchmark test set. We select 9 existing datasets covering different modalities (and combinations thereof) for IE/MIE tasks; the following table summarizes the raw sources of these datasets. We then further process these datasets (e.g., via Text\( \leftrightarrow \)Speech conversion) to create 6 new datasets covering new multimodal combination scenarios. Before annotation, we carefully select 200 instances from the corresponding test sets, ensuring that each instance contains as much IE information as possible.
The benchmark data is currently being enriched and scaled; stay tuned!
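As a purely illustrative example of this kind of processing, the sketch below derives a Text\( \rightarrow \)Speech instance from a text IE dataset by synthesizing the sentence with a TTS engine and mapping gold character spans onto audio segments via forced alignment. The tts and aligner objects and derive_speech_instance are hypothetical stand-ins for whatever tools one might use; the benchmark itself may have been constructed differently.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpeechEntity:
    mention: str
    entity_type: str
    segment: Tuple[float, float]   # M_aud: (start_s, end_s) within the synthesized audio


def derive_speech_instance(sentence: str,
                           gold_entities: List[Tuple[str, str, Tuple[int, int]]],
                           tts, aligner, out_path: str):
    """gold_entities holds (mention, entity_type, (char_start, char_end)) triples."""
    tts.synthesize(sentence, out_path)               # write the audio file
    word_times = aligner.align(out_path, sentence)   # per-word (start_s, end_s) list

    # character offset at which each word begins (assumes single-space tokenization)
    words, starts, pos = sentence.split(), [], 0
    for w in words:
        starts.append(pos)
        pos += len(w) + 1                            # +1 for the separating space

    grounded = []
    for mention, etype, (cs, ce) in gold_entities:
        # indices of the words that overlap the gold character span
        idxs = [i for i, w in enumerate(words) if starts[i] < ce and starts[i] + len(w) > cs]
        segment = (word_times[idxs[0]][0], word_times[idxs[-1]][1])
        grounded.append(SpeechEntity(mention, etype, segment))
    return out_path, grounded
```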
▶ Evaluation Methods
The grounded MUIE benchmark covers the three IE tasks above (NER, RE, and EE), and evaluation involves two types of predictions: UIE label prediction and multimodal grounding prediction.
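Below is a minimal sketch of how these two dimensions could be scored, assuming standard metric choices: micro F1 over exact-match label tuples, mask IoU for image/video groundings, and temporal IoU for speech segments. The function names are ours, and the benchmark's actual scoring protocol may differ.

```python
from typing import Set, Tuple

import numpy as np


def label_f1(pred: Set[Tuple], gold: Set[Tuple]) -> float:
    """Micro F1 over exact-match tuples, e.g. (mention, type) for NER
    or (head, label, tail) for RE."""
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def mask_iou(pred_mask: np.ndarray, gold_mask: np.ndarray) -> float:
    """IoU between binary masks, usable for image regions (M_img) and,
    frame by frame, for video tracklets (M_vid)."""
    inter = np.logical_and(pred_mask, gold_mask).sum()
    union = np.logical_or(pred_mask, gold_mask).sum()
    return float(inter) / float(union) if union else 0.0


def segment_iou(pred: Tuple[float, float], gold: Tuple[float, float]) -> float:
    """Temporal IoU between a predicted and a gold speech segment (M_aud)."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0
```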
▶ Zero-shot performance on text+image or standalone image input.
▶ Zero-shot performance on text+audio or standalone audio input.
▶ Zero-shot performance on text+video or standalone video input.
▶ Zero-shot performance in more complex modality-hybrid scenarios.
▶ Citation

@inproceedings{ACL24MUIE,
title = {Recognizing Everything from All Modalities at Once: Grounded Multimodal Universal Information Extraction},
author = {Zhang, Meishan and Fei, Hao and Wang, Bin and Wu, Shengqiong and Cao, Yixin and Li, Fei and Zhang, Min},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
year = {2024},
}