Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals

Reuss, Moritz ¹; Yağmurlu, Ömer Erdinç; Wenzel, Fabian; Lioutikov, Rudolf
¹ Institut für Anthropomatik und Robotik (IAR), Karlsruher Institut für Technologie (KIT)

Abstract (English):

This work introduces the Multimodal Diffusion Transformer (MDT), a novel diffusion policy framework that excels at learning versatile behavior from multimodal goal specifications with few language annotations. MDT leverages a diffusion-based multimodal transformer backbone and two self-supervised auxiliary objectives to master long-horizon manipulation tasks based on multimodal goals. The vast majority of imitation learning methods learn from only a single goal modality, e.g. either language or goal images. However, existing large-scale imitation learning datasets are only partially labeled with language annotations, which prevents current methods from learning language-conditioned behavior from these datasets. MDT addresses this challenge by introducing a latent goal-conditioned state representation that is simultaneously trained on multimodal goal instructions. This state representation aligns image- and language-based goal embeddings and encodes sufficient information to predict future states. The representation is trained via two self-supervised auxiliary objectives that enhance the performance of the presented transformer backbone. ...

Affiliated institution(s) at KIT	Institut für Anthropomatik und Robotik (IAR)
Publication type	Proceedings contribution
Publication year	2024
Language	English
Identifier	ISBN: 979-8-9902848-0-7 KITopen-ID: 1000174157
Published in	Robotics: Science and Systems XX. Ed.: D. Kulic
Event	20th Robotics: Science and Systems (2024), Delft, Netherlands, 15 July 2024 – 19 July 2024
Publisher	Robotics: Science and Systems Foundation
Published online in advance on	8 July 2024
Keywords	Imitation Learning, Diffusion Models, Self-Supervised Learning, Robotics
Indexed in	arXiv

KITopen-Download

Postprint

DOI: 10.5445/IR/1000174157

Published on 13 September 2024
