
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

Dinh, Tu Anh 1; Mullov, Carlos; Bärmann, Leonard 1; Li, Zhaolin 1; Liu, Danni 1; Reiß, Simon 1; Lee, Jueun; Lerzer, Nathan; Gao, Jianfeng; Peller-Konrad, Fabian; Röddiger, Tobias 2; Waibel, Alexander; Asfour, Tamim 1; Beigl, Michael 2; Stiefelhagen, Rainer 1; Dachsbacher, Carsten; Böhm, Klemens 3; Niehues, Jan; Al-Onaizan, Yaser [Ed.]; ...

Abstract:

With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks that can evaluate the ability of LLMs in different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases, or giving mathematical proofs. Inspired by the way university students are evaluated on such tasks, in this paper we propose SciEx, a benchmark consisting of university computer science exam questions, to evaluate LLMs’ ability to solve scientific tasks. SciEx (1) is multilingual, containing both English and German exams, (2) is multi-modal, containing questions that involve images, and (3) contains various types of free-form questions with different difficulty levels, due to the nature of university exams. We evaluate the performance of various state-of-the-art LLMs on our new benchmark. Since SciEx questions are free-form, it is not straightforward to evaluate LLM performance. Therefore, we provide human expert grading of the LLM outputs on SciEx. We show that the free-form exams in SciEx remain challenging for current LLMs: the best LLM achieves an exam grade of only 59.4% on average. ...


Publisher's version
DOI: 10.5445/IR/1000176197
Published on: 12 November 2024
Affiliated institution(s) at KIT: Institut für Anthropomatik und Robotik (IAR)
Publication type: Proceedings contribution
Publication month/year: 11/2024
Language: English
Identifier: ISBN 979-8-89176-167-4
KITopen-ID: 1000176197
HGF program: 46.24.01 (POF IV, LK 01) Applied TA: Digitalizat. & Automat. Socio-Technical Change
Published in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, 12th-16th November 2024, Eds.: Al-Onaizan, Y., Bansal, M., Chen, Y.-N.
Event: Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), Miami, FL, USA, 12 November 2024 – 16 November 2024
Publisher: Association for Computational Linguistics (ACL)
Pages: 11592–11610