Mechanistic Interpretability of Large Language Models using Kolmogorov-Arnold Networks

Adler, Erik; Loevenich, Johannes F.; Kluge, Silja; Holz, Laurin; Hürten, Tobias; Spelter, Florian; Lopes, Roberto Rigolin F.

doi:10.1109/CAI68641.2026.11536293

Affiliation (Adler, Erik): Department of Informatics (INFORMATIK), Karlsruhe Institute of Technology (KIT)

Abstract:

This paper introduces a methodology for mechanistic interpretability in language models that combines Kolmogorov-Arnold Networks (KANs) with Sparse Autoencoders (SAEs). Our approach replaces the feed-forward blocks of the Transformer architecture with KAN layers, whose learned functions are interpretable polynomials, to provide symbolic insight into Large Language Models (LLMs). We extract sparse latent features from model activations using SAEs and trace them back to their respective KAN polynomials, which allows for human-understandable interpretations of linguistic representations. Our experiments show that KAN-based Transformers perform similarly to conventional Transformers with Multi-Layer Perceptron (MLP) blocks while offering greater transparency. On the TinyStories dataset, KAN Transformers achieve a validation loss of 2.89, nearly identical to that of the baseline GPT-2 model despite having fewer parameters. Additionally, SAE analysis reveals a tunable trade-off between sparsity and reconstruction accuracy, confirming the feasibility of structured feature extraction. Our case study identifies the polynomials responsible for sentiment processing and proposes a selective manipulation technique that modifies edge polynomials to change alignment-related outputs.
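
As a rough illustration of the two building blocks named in the abstract, the following sketch shows a KAN-style layer whose edges are polynomials and a sparse autoencoder with an L1 sparsity penalty. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`poly_edge`, `kan_layer`, `sae_forward`), the coefficient layout, the ReLU encoder, and the L1-weighted loss are all hypothetical choices made for this example, and the record does not specify how the paper parameterizes either component.

```python
from typing import List, Tuple

Vector = List[float]


def poly_edge(coeffs: Vector, x: float) -> float:
    """Evaluate one edge polynomial c0 + c1*x + c2*x^2 + ... (Horner's rule)."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def kan_layer(coeffs: List[List[Vector]], x: Vector) -> Vector:
    """KAN-style layer: output j is the sum over inputs i of phi_ji(x_i).

    coeffs[j][i] holds the coefficients of the edge polynomial from input i
    to output j. This layout is an assumption made for the sketch.
    """
    return [
        sum(poly_edge(edge, xi) for edge, xi in zip(row, x))
        for row in coeffs
    ]


def sae_forward(
    w_enc: List[Vector],
    b_enc: Vector,
    w_dec: List[Vector],
    x: Vector,
    l1_weight: float,
) -> Tuple[Vector, Vector, float]:
    """Sparse autoencoder pass: ReLU encoder, linear decoder.

    Returns (latents, reconstruction, loss), where the loss is the mean
    squared reconstruction error plus l1_weight times the L1 norm of the
    latents. Raising l1_weight favors sparsity over reconstruction accuracy.
    """
    z = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + bias)
        for row, bias in zip(w_enc, b_enc)
    ]
    x_hat = [sum(w * zk for w, zk in zip(row, z)) for row in w_dec]
    mse = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat)) / len(x)
    l1 = sum(abs(zk) for zk in z)
    return z, x_hat, mse + l1_weight * l1


# One output fed by two edges: phi_00(x) = 1 + 2x and phi_01(x) = x^2.
edge_coeffs = [[[1.0, 2.0], [0.0, 0.0, 1.0]]]
kan_out = kan_layer(edge_coeffs, [3.0, 2.0])

# Toy SAE with two latents; the second latent is switched off by the ReLU.
latents, recon, loss = sae_forward(
    w_enc=[[1.0, 0.0], [0.0, -1.0]],
    b_enc=[0.0, 0.0],
    w_dec=[[1.0, 0.0], [0.0, 1.0]],
    x=[2.0, 1.0],
    l1_weight=0.1,
)
```

Because each edge is an explicit coefficient list, a single edge can be read off as a symbolic expression or edited in isolation, which is the property that tracing an SAE feature back to specific polynomials relies on. The `l1_weight` argument stands in for whatever knob controls the sparsity and reconstruction trade-off mentioned in the abstract.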

Affiliated institution(s) at KIT	Department of Informatics (INFORMATIK)
Publication type	Conference proceedings paper
Publication date	8 May 2026
Language	English
Identifier	ISBN: 979-8-3315-6039-3; KITopen-ID: 1000194760
Published in	2026 IEEE Conference on Artificial Intelligence (CAI)
Event	IEEE Conference on Artificial Intelligence (CAI 2026), Granada, Spain, 8–10 May 2026
Publisher	Institute of Electrical and Electronics Engineers (IEEE)
Pages	1239-1244
Indexed in	OpenAlex, Scopus

Repository KITopen
