MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

Papi, Sara; Züfle, Maike¹; Gaido, Marco; Savoldi, Beatrice; Liu, Danni¹; Douros, Ioannis; Bentivogli, Luisa; Niehues, Jan¹

¹ Institute for Anthropomatics and Robotics (IAR), Karlsruhe Institute of Technology (KIT)

Abstract:

Recent advances in large language models have laid the foundation for multimodal LLMs (MLLMs), which unify text, speech, and vision within a single framework. As these models are rapidly evolving toward general-purpose instruction following across diverse and complex tasks, a key frontier is evaluating their crosslingual and multimodal capabilities over both short- and long-form inputs. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on a single modality at a time, rely on short-form inputs, or lack human annotations, which hinders comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first crosslingual human-annotated benchmark based on scientific talks on NLP and beyond. MCIF evaluates instruction following in crosslingual, multimodal settings over different input lengths and spans four macro-tasks: recognition, translation, question answering, and summarization. It covers three core modalities (speech, vision, and text) and four diverse languages (English, German, Italian, and Chinese), fully aligned across all dimensions. ...

Publisher's version

DOI: 10.5445/IR/1000192828

Published on 30.04.2026

Affiliated institution(s) at KIT	Institute for Anthropomatics and Robotics (IAR)
Publication type	Proceedings paper
Publication year	2026
Language	English
Identifier	KITopen-ID: 1000192828
Published in	The Fourteenth International Conference on Learning Representations
Event	14th International Conference on Learning Representations (2026), Rio de Janeiro, Brazil, 23.04.2026 – 27.04.2026
Publisher	OpenReview.net
Published online in advance on	26.01.2026
External relations	See also
Keywords	benchmark, crosslingual, multimodal, instruction-following, speech, video
