
MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

Papi, Sara; Züfle, Maike 1; Gaido, Marco; Savoldi, Beatrice; Liu, Danni 1; Douros, Ioannis; Bentivogli, Luisa; Niehues, Jan
1 Institut für Anthropomatik und Robotik (IAR), Karlsruher Institut für Technologie (KIT)

Abstract:

Recent advances in large language models have laid the foundation for multimodal LLMs (MLLMs), which unify text, speech, and vision within a single framework. As these models are rapidly evolving toward general-purpose instruction following across diverse and complex tasks, a key frontier is evaluating their crosslingual and multimodal capabilities over both short- and long-form inputs. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on a single modality at a time, rely on short-form inputs, or lack human annotations, hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first crosslingual human-annotated benchmark based on scientific talks on NLP and beyond. MCIF evaluates instruction following in crosslingual, multimodal settings over different input lengths and spans four macro-tasks: recognition, translation, question answering, and summarization. It covers three core modalities (speech, vision, and text) and four diverse languages (English, German, Italian, and Chinese), fully aligned across all dimensions. ...


Publisher's version
DOI: 10.5445/IR/1000192828
Published on 30.04.2026
Affiliated institution(s) at KIT: Institut für Anthropomatik und Robotik (IAR)
Publication type: Proceedings contribution
Publication year: 2026
Language: English
Identifier: KITopen-ID: 1000192828
Published in: The Fourteenth International Conference on Learning Representations
Event: 14th International Conference on Learning Representations (2026), Rio de Janeiro, Brazil, 23.04.2026 – 27.04.2026
Publisher: OpenReview.net
Published online in advance on 26.01.2026
Keywords: benchmark, crosslingual, multimodal, instruction-following, speech, video