
Mechanistic Interpretability of Large Language Models using Kolmogorov-Arnold Networks

Adler, Erik 1; Loevenich, Johannes F.; Kluge, Silja; Holz, Laurin; Hürten, Tobias; Spelter, Florian; Lopes, Roberto Rigolin F.
1 Department of Informatics (INFORMATIK), Karlsruhe Institute of Technology (KIT)

Abstract:

This paper introduces a methodology for mechanistic interpretability in language models that combines Kolmogorov-Arnold Networks (KANs) with Sparse Autoencoders (SAEs). Our approach replaces the feed-forward blocks typically found in Transformer architectures with KAN layers, whose interpretable polynomial functions provide symbolic insight into Large Language Models (LLMs). We extract sparse latent features from model activations using SAEs and trace them back to their respective KAN polynomials, which allows for human-understandable interpretations of linguistic representations. Our experiments demonstrate that KAN-based Transformers perform similarly to conventional Transformers built on Multi-Layer Perceptrons (MLPs) while offering greater transparency. On the TinyStories dataset, KAN Transformers achieve a validation loss of 2.89, nearly identical to the baseline GPT-2 model despite having fewer parameters. Additionally, SAE analysis reveals a tunable trade-off between sparsity and reconstruction accuracy, confirming the feasibility of structured feature extraction. Our case study identifies the polynomials responsible for sentiment processing and proposes a selective manipulation technique that modifies edge polynomials to change alignment-related outputs.
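To make the abstract's core ideas concrete, the following is a minimal sketch of a KAN-style layer in which every edge from an input to an output carries its own univariate polynomial, together with a generic SAE objective of the kind that produces a sparsity versus reconstruction trade-off. It is not the authors' implementation: the abstract does not specify the polynomial basis, the SAE architecture, or the manipulation procedure, so the names `PolyKANLayer`, `edge_contributions`, `scale_edge`, and `sae_loss`, the plain monomial basis, and the L1 penalty are all assumptions chosen for illustration.

```python
from typing import List

# Assumption: each edge function is a plain polynomial c0 + c1*x + c2*x^2 + ...
Poly = List[float]


def eval_poly(coeffs: Poly, x: float) -> float:
    """Evaluate a polynomial at x using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


class PolyKANLayer:
    """Hypothetical KAN-style layer: output_j = sum_i phi_ij(x_i)."""

    def __init__(self, edges: List[List[Poly]]):
        # edges[j][i] holds the coefficients of the edge from input i to output j.
        self.edges = edges

    def forward(self, x: List[float]) -> List[float]:
        return [
            sum(eval_poly(poly, xi) for poly, xi in zip(row, x))
            for row in self.edges
        ]

    def edge_contributions(self, x: List[float], j: int) -> List[float]:
        """Per-input contributions to output j, one value per edge polynomial."""
        return [eval_poly(poly, xi) for poly, xi in zip(self.edges[j], x)]

    def scale_edge(self, j: int, i: int, factor: float) -> None:
        """Illustrative selective manipulation: rescale a single edge polynomial."""
        self.edges[j][i] = [c * factor for c in self.edges[j][i]]


def sae_loss(x: List[float], x_hat: List[float],
             latent: List[float], l1_weight: float) -> float:
    """Generic SAE objective: squared reconstruction error plus an L1 penalty."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(z) for z in latent)
    return recon + l1_weight * sparsity


layer = PolyKANLayer([
    [[0.0, 1.0], [0.0, 0.0, 1.0]],  # output 0: x0 + x1^2
    [[1.0], [0.0, -2.0]],           # output 1: 1 - 2*x1
])
x = [2.0, 3.0]
print(layer.forward(x))                # [11.0, -5.0]
print(layer.edge_contributions(x, 0))  # [2.0, 9.0]
layer.scale_edge(0, 1, 0.0)            # ablate the x1^2 edge only
print(layer.forward(x))                # [2.0, -5.0]
```

Because each output is a sum of readable univariate polynomials, the contribution of a single edge can be inspected or rescaled without touching the others, which is the property the selective manipulation idea relies on. In `sae_loss`, raising the hypothetical `l1_weight` favors sparser latents at the cost of reconstruction error, which is one common way such a trade-off becomes tunable.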


Original publication
DOI: 10.1109/CAI68641.2026.11536293
Affiliated institution(s) at KIT: Department of Informatics (INFORMATIK)
Publication type: Proceedings paper
Publication date: 8 May 2026
Language: English
Identifier: ISBN: 979-8-3315-6039-3
KITopen-ID: 1000194760
Published in: 2026 IEEE Conference on Artificial Intelligence (CAI)
Event: IEEE Conference on Artificial Intelligence (CAI 2026), Granada, Spain, 8 to 10 May 2026
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 1239-1244
External relations: See also
Indexed in: OpenAlex, Scopus