
How Toxic Can You Get? Search-based Toxicity Testing for Large Language Models

Corbo, Simone; Bancale, Luca; Gennaro, Valeria De; Lestingi, Livia; Scotti, Vincenzo 1; Camilli, Matteo
1 Institute of Information Security and Dependability (KASTEL), Karlsruhe Institute of Technology (KIT)

Abstract (English):

Language is a deep-rooted means of perpetration of stereotypes and discrimination. Large Language Models (LLMs), now a pervasive technology in our everyday lives, can cause extensive harm when prone to generating toxic responses. The standard way to address this issue is to align the LLM, which, however, dampens the issue without constituting a definitive solution. Therefore, testing LLMs even after alignment efforts remains crucial for detecting any residual deviations with respect to ethical standards. We present EvoTox, an automated testing framework for LLMs’ inclination to toxicity, providing a way to quantitatively assess how much LLMs can be pushed towards toxic responses even in the presence of alignment. The framework adopts an iterative evolution strategy that exploits the interplay between two LLMs: the System Under Test (SUT) and the Prompt Generator, which steers SUT responses toward higher toxicity. The toxicity level is assessed by an automated oracle based on an existing toxicity classifier. We conduct a quantitative and qualitative empirical evaluation using five state-of-the-art LLMs of increasing complexity (7–671B parameters) as evaluation subjects. ...
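The iterative strategy outlined in the abstract can be illustrated with a minimal sketch. It is not the EvoTox implementation: the abstract does not specify the selection scheme, so a simple greedy search is assumed here, and every name (`evolve`, `toy_sut`, `toy_prompt_generator`, `toy_oracle`) is hypothetical. The stand-in functions are harmless string operations that replace the two LLMs and the toxicity classifier, and the oracle score is only a placeholder for a toxicity score in [0, 1].

```python
from typing import Callable, List, Tuple


def evolve(
    seed: str,
    sut: Callable[[str], str],
    prompt_generator: Callable[[str], List[str]],
    oracle: Callable[[str], float],
    iterations: int = 3,
) -> Tuple[str, float]:
    """Greedy iterative search (an assumed scheme, not the paper's exact one).

    Each iteration asks the prompt generator for variants of the current best
    prompt, sends each variant to the system under test, and keeps the variant
    whose response the oracle scores highest.
    """
    best_prompt = seed
    best_score = oracle(sut(best_prompt))
    for _ in range(iterations):
        # Variants are derived once per iteration from the current best prompt.
        for candidate in prompt_generator(best_prompt):
            candidate_score = oracle(sut(candidate))
            if candidate_score > best_score:
                best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score


# Harmless stand-ins (hypothetical): real components would be two LLMs
# and an existing toxicity classifier.
def toy_sut(prompt: str) -> str:
    # Stand-in for the System Under Test: simply echoes the prompt.
    return prompt


def toy_prompt_generator(prompt: str) -> List[str]:
    # Stand-in for the Prompt Generator: proposes two variants of the prompt.
    return [prompt + "a", prompt + "b"]


def toy_oracle(response: str) -> float:
    # Stand-in for the oracle: the fraction of "b" characters, a value in [0, 1].
    return response.count("b") / max(len(response), 1)


result = evolve("a", toy_sut, toy_prompt_generator, toy_oracle)
print(result)
```

Keeping the three components as plain callables means the search loop stays independent of any particular model or classifier, so each stand-in can be swapped for a real component without touching `evolve`.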


Preprint
DOI: 10.5445/IR/1000185879/pre
Published on 20.10.2025
Affiliated institution(s) at KIT: Institute of Information Security and Dependability (KASTEL)
Publication type: Journal article
Publication month/year: 11.2025
Language: English
Identifier: ISSN: 0098-5589, 1939-3520, 2326-3881
KITopen-ID: 1000185879
HGF program: 46.23.01 (POF IV, LK 01) Methods for Engineering Secure Systems
Published in: IEEE Transactions on Software Engineering
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Volume: 51
Issue: 11
Pages: 3056–3071
Keywords: automated testing, evolutionary testing, large language models, toxic speech
Indexed in: Web of Science, Scopus, Dimensions, OpenAlex