(Towards) Scalable Reliable Automated Evaluation with Large Language Models

Braun, Bertil; Forell, Martin

¹ Institut für Angewandte Informatik und Formale Beschreibungsverfahren (AIFB), Karlsruher Institut für Technologie (KIT)

Abstract:

Evaluating the quality and relevance of textual outputs from Large Language Models (LLMs) remains challenging and resource-intensive. Existing automated metrics often fail to capture the complexity and variability inherent in LLM-generated outputs. Moreover, these metrics typically rely on explicit reference standards, limiting their use mostly to domains with objective benchmarks. This work introduces a novel evaluation framework designed to approximate expert-level assessments of LLM-generated content. The proposed method employs pairwise comparisons of outputs by multiple LLMs, reducing biases from individual models. An Elo rating system is used to generate stable and interpretable rankings. Adjustable agreement thresholds, ranging from full unanimity to majority voting, allow flexible control over evaluation confidence and coverage. The method’s effectiveness is demonstrated through evaluating competency profiles extracted from scientific abstracts. Preliminary results show that automatically derived rankings correlate well with expert judgments, significantly reducing the need for extensive human intervention. By offering a scalable, consistent, and domain-agnostic evaluation layer, the framework supports more efficient and reliable quality assessments of LLM outputs across diverse applications.
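
The pipeline described in the abstract (pairwise comparisons judged by several LLMs, an agreement threshold, and Elo ratings) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the K-factor of 32, the initial rating of 1000, and the sample votes are all assumptions made for illustration, and the update rule is the standard Elo formula rather than anything confirmed by the paper.

```python
from collections import Counter


def expected_score(r_a, r_b):
    """Standard Elo expectation that item A beats item B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def consensus(votes, threshold):
    """Return the winning item if its vote share reaches the threshold, else None.

    threshold=1.0 means full unanimity; a value such as 2/3 accepts a
    two-out-of-three majority. Both are illustrative settings.
    """
    winner, count = Counter(votes).most_common(1)[0]
    return winner if count / len(votes) >= threshold else None


def elo_rank(comparisons, threshold=1.0, k=32.0, initial=1000.0):
    """Update Elo ratings from judged pairs; k and initial are assumed values.

    Each comparison is (item_a, item_b, votes), where votes holds one
    entry per judge model and each entry is either item_a or item_b.
    Returns the ratings and the number of comparisons that were used.
    """
    ratings = {}
    used = 0
    for a, b, votes in comparisons:
        ratings.setdefault(a, initial)
        ratings.setdefault(b, initial)
        winner = consensus(votes, threshold)
        if winner is None:
            continue  # judges disagree too much: skip this pair (lower coverage)
        score_a = 1.0 if winner == a else 0.0
        delta = k * (score_a - expected_score(ratings[a], ratings[b]))
        ratings[a] += delta
        ratings[b] -= delta
        used += 1
    return ratings, used


# Hypothetical votes from three judge models on three competency profiles.
comparisons = [
    ("profile_A", "profile_B", ["profile_A", "profile_A", "profile_A"]),
    ("profile_B", "profile_C", ["profile_B", "profile_B", "profile_C"]),
    ("profile_A", "profile_C", ["profile_A", "profile_A", "profile_A"]),
]

strict_ratings, strict_used = elo_rank(comparisons, threshold=1.0)
majority_ratings, majority_used = elo_rank(comparisons, threshold=2 / 3)
majority_order = sorted(majority_ratings, key=majority_ratings.get, reverse=True)
```

With the unanimity setting, the split vote on the second pair is discarded, so only two of the three comparisons contribute to the ratings; lowering the threshold to a two-thirds majority uses all three. This mirrors the trade-off between evaluation confidence and coverage that the abstract mentions.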


Affiliated institution(s) at KIT	Fakultät für Wirtschaftswissenschaften (WIWI)
Publication type	Proceedings contribution
Publication year	2025
Language	English
Identifier	ISBN: 979-8-89176-261-9; KITopen-ID: 1000183931
Published in	Proceedings of the 4th Workshop on Generation, Evaluation and Metrics (GEM) / The 63rd Annual Meeting of the Association for Computational Linguistics. Ed.: K. Dhole
Event	4th Workshop on Generation, Evaluation and Metrics / 63rd Annual Meeting of the Association for Computational Linguistics (GEM / ACL 2025), Vienna, Austria, 31 July 2025 – 1 August 2025
Publisher	Association for Computational Linguistics (ACL)
Pages	320–336

Repository KITopen
