
(Towards) Scalable Reliable Automated Evaluation with Large Language Models

Braun, Bertil; Forell, Martin 1
1 Institut für Angewandte Informatik und Formale Beschreibungsverfahren (AIFB), Karlsruher Institut für Technologie (KIT)

Abstract:

Evaluating the quality and relevance of textual outputs from Large Language Models (LLMs) remains challenging and resource-intensive. Existing automated metrics often fail to capture the complexity and variability inherent in LLM-generated outputs. Moreover, these metrics typically rely on explicit reference standards, limiting their use mostly to domains with objective benchmarks. This work introduces a novel evaluation framework designed to approximate expert-level assessments of LLM-generated content. The proposed method employs pairwise comparisons of outputs by multiple LLMs, reducing biases from individual models. An Elo rating system is used to generate stable and interpretable rankings. Adjustable agreement thresholds, ranging from full unanimity to majority voting, allow flexible control over evaluation confidence and coverage. The method’s effectiveness is demonstrated through evaluating competency profiles extracted from scientific abstracts. Preliminary results show that automatically derived rankings correlate well with expert judgments, significantly reducing the need for extensive human intervention. By offering a scalable, consistent, and domain-agnostic evaluation layer, the framework supports more efficient and reliable quality assessments of LLM outputs across diverse applications.
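
The abstract names the building blocks (pairwise comparisons judged by several LLMs, an agreement threshold, and Elo ratings) but gives no update rule or parameters. The following is a minimal sketch of how such a pipeline might be wired together, not the authors' implementation. It assumes the conventional Elo formula with a scale of 400, a K-factor of 32, and a starting rating of 1000; the function names (`consensus`, `elo_update`, `rank_outputs`) and the vote format are hypothetical, and the judge votes are taken as given rather than obtained from real LLM calls.

```python
from collections import Counter


def consensus(votes, threshold):
    """Return the winning id if its vote share reaches `threshold`, else None.

    `threshold` is a share in (0.5, 1.0]: 1.0 means full unanimity,
    values just above 0.5 mean simple majority (assumed convention).
    """
    if not 0.5 < threshold <= 1.0:
        raise ValueError("threshold must be in (0.5, 1.0]")
    winner, count = Counter(votes).most_common(1)[0]
    return winner if count / len(votes) >= threshold else None


def expected_score(r_a, r_b):
    """Standard Elo expected score of a player rated r_a against r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(ratings, winner, loser, k=32.0):
    """Shift rating points from loser to winner (zero-sum update)."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def rank_outputs(comparisons, threshold=1.0, k=32.0, initial=1000.0):
    """Rank outputs from (id_a, id_b, votes) triples.

    `votes` holds one id per judge. Comparisons whose agreement falls
    below `threshold` are skipped, trading coverage for confidence.
    """
    ratings = {}
    skipped = 0
    for a, b, votes in comparisons:
        ratings.setdefault(a, initial)
        ratings.setdefault(b, initial)
        winner = consensus(votes, threshold)
        if winner is None:
            skipped += 1
            continue
        loser = b if winner == a else a
        elo_update(ratings, winner, loser, k)
    ranking = sorted(ratings, key=ratings.get, reverse=True)
    return ranking, ratings, skipped


if __name__ == "__main__":
    # Hypothetical votes from three judges on three candidate profiles.
    demo = [
        ("p1", "p2", ["p1", "p1", "p1"]),
        ("p2", "p3", ["p2", "p2", "p3"]),
        ("p1", "p3", ["p1", "p1", "p1"]),
    ]
    for t in (1.0, 0.6):
        ranking, _, skipped = rank_outputs(demo, threshold=t)
        print(f"threshold={t}: ranking={ranking}, skipped={skipped}")
```

In this sketch, raising `threshold` toward 1.0 discards more comparisons and so covers fewer pairs, while lowering it toward a simple majority uses more comparisons at lower per-comparison confidence, which mirrors the confidence and coverage trade-off described in the abstract.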


Affiliated institution(s) at KIT: Fakultät für Wirtschaftswissenschaften (WIWI)
Publication type: Proceedings paper
Publication year: 2025
Language: English
Identifier: ISBN: 979-8-89176-261-9
KITopen-ID: 1000183931
Published in: Proceedings of the 4th Workshop on Generation, Evaluation and Metrics (GEM) / The 63rd Annual Meeting of the Association for Computational Linguistics. Ed.: K. Dhole
Event: 4th Workshop on Generation, Evaluation and Metrics / 63rd Annual Meeting of the Association for Computational Linguistics (GEM / ACL 2025), Vienna, Austria, 31 July 2025 – 1 August 2025
Publisher: Association for Computational Linguistics (ACL)
Pages: 320–336