
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

Dinh, Tu Anh 1; Mullov, Carlos; Bärmann, Leonard 1; Li, Zhaolin 1; Liu, Danni 1; Reiß, Simon 1; Lee, Jueun; Lerzer, Nathan; Ternava, Fabian; Gao, Jianfeng; Röddiger, Tobias 2; Waibel, Alexander; Asfour, Tamim 1; Beigl, Michael 2; Stiefelhagen, Rainer 1; Dachsbacher, Carsten; Böhm, Klemens 3; Niehues, Jan
1 Institut für Anthropomatik und Robotik (IAR), Karlsruher Institut für Technologie (KIT)
2 Institut für Telematik (TM), Karlsruher Institut für Technologie (KIT)
3 Institut für Programmstrukturen und Datenorganisation (IPD), Karlsruher Institut für Technologie (KIT)

Abstract:

With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks that can evaluate the ability of LLMs in different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases, or giving mathematical proofs. Inspired by the way university students are evaluated on such tasks, in this paper we propose SciEx, a benchmark consisting of university computer science exam questions, to evaluate LLMs' ability to solve scientific tasks. SciEx is (1) multilingual, containing both English and German exams; (2) multi-modal, containing questions that involve images; and (3) varied, containing different types of free-form questions with different difficulty levels, due to the nature of university exams. We evaluate the performance of various state-of-the-art LLMs on our new benchmark. Since SciEx questions are free-form, it is not straightforward to evaluate LLM performance. Therefore, we provide human expert grading of the LLM outputs on SciEx. We show that the free-form exams in SciEx remain challenging for current LLMs: the best LLM achieves an exam grade of only 59.4% on average. ...


DOI: 10.5445/IR/1000176792
Published on: 2 December 2024
Affiliated institution(s) at KIT: Institut für Anthropomatik und Robotik (IAR); Institut für Programmstrukturen und Datenorganisation (IPD); Institut für Telematik (TM)
Publication type: Research report/Preprint
Publication year: 2024
Language: English
Identifier: KITopen-ID: 1000176792
Publisher: arXiv
Keywords: Computation and Language (cs.CL), I.2.7
Indexed in: Dimensions