Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals

Reuss, Moritz 1; Yağmurlu, Ömer Erdinç; Wenzel, Fabian; Lioutikov, Rudolf
1 Institut für Anthropomatik und Robotik (IAR), Karlsruher Institut für Technologie (KIT)

Abstract (English):

This work introduces the Multimodal Diffusion Transformer (MDT), a novel diffusion policy framework that excels at learning versatile behavior from multimodal goal specifications with few language annotations. MDT leverages a diffusion-based multimodal transformer backbone and two self-supervised auxiliary objectives to master long-horizon manipulation tasks based on multimodal goals. The vast majority of imitation learning methods only learn from individual goal modalities, e.g., either language or goal images. However, existing large-scale imitation learning datasets are only partially labeled with language annotations, which prevents current methods from learning language-conditioned behavior from these datasets. MDT addresses this challenge by introducing a latent goal-conditioned state representation that is simultaneously trained on multimodal goal instructions. This state representation aligns image- and language-based goal embeddings and encodes sufficient information to predict future states. The representation is trained via two self-supervised auxiliary objectives that enhance the performance of the presented transformer backbone. ...


Postprint
DOI: 10.5445/IR/1000174157
Published on 13.09.2024
Affiliated institution(s) at KIT: Institut für Anthropomatik und Robotik (IAR)
Publication type: Proceedings paper
Publication year: 2024
Language: English
Identifier: ISBN: 979-8-9902848-0-7
KITopen-ID: 1000174157
Published in: Robotics: Science and Systems XX. Ed.: D. Kulic
Event: 20th Robotics: Science and Systems (2024), Delft, Netherlands, 15.07.2024 – 19.07.2024
Publisher: Robotics: Science and Systems Foundation
Published online in advance on 08.07.2024
Keywords: Imitation Learning, Diffusion Models, Self-Supervised Learning, Robotics
Indexed in: arXiv