
Scaffolding Dexterous Manipulation with Vision-Language Models

Bakker, Vincent de; Hejna, Joey; Lum, Tyler Ga Wei; Celik, Onur; Taranovic, Aleksandar 1; Blessing, Denis; Neumann, Gerhard 1; Bohg, Jeannette; Sadigh, Dorsa
1 Institut für Anthropomatik und Robotik (IAR), Karlsruher Institut für Technologie (KIT)

Abstract (English):

Dexterous robotic hands are essential for performing complex manipulation tasks, yet remain difficult to train due to the challenges of demonstration collection and high-dimensional control. Thus, contemporary works in dexterous manipulation have often bootstrapped from reference trajectories. These trajectories specify target hand poses that guide the exploration of RL policies and object poses that enable dense, task-agnostic rewards. However, sourcing suitable trajectories, particularly for dexterous hands, remains a significant challenge. Our key insight is that modern vision-language models (VLMs) already encode the commonsense spatial and semantic knowledge needed to specify tasks and guide exploration effectively. Given a task description (e.g., “open the cabinet”) and a visual scene, our method uses an off-the-shelf VLM to first identify task-relevant keypoints (e.g., handles, buttons) and then synthesize 3D trajectories for hand motion and object motion. Subsequently, we train a low-level residual RL policy in simulation to track these coarse trajectories, or “scaffolds”, with high fidelity.
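The idea of tracking a coarse trajectory with a residual policy can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the linear interpolation, the residual scale, and the distance-based reward are all assumptions chosen for illustration, and the waypoints are hand-written stand-ins for what a VLM might propose.

```python
import math
from typing import List, Sequence

Vec3 = Sequence[float]


def interpolate_waypoints(waypoints: List[Vec3], t: float) -> List[float]:
    """Linearly interpolate a coarse 3D trajectory at normalized time t in [0, 1].

    Hypothetical helper: the waypoints stand in for a VLM-proposed scaffold.
    """
    if len(waypoints) == 1:
        return list(waypoints[0])
    t = min(max(t, 0.0), 1.0)
    scaled = t * (len(waypoints) - 1)
    i = min(int(scaled), len(waypoints) - 2)  # index of the segment start
    frac = scaled - i
    return [a + frac * (b - a) for a, b in zip(waypoints[i], waypoints[i + 1])]


def residual_action(base_target: Vec3, residual: Vec3, scale: float = 0.05) -> List[float]:
    """Add a bounded learned correction to the scaffold target.

    The residual is clipped to [-1, 1] and scaled, so the policy can only
    make small adjustments around the coarse trajectory (assumed design).
    """
    clipped = [min(max(r, -1.0), 1.0) for r in residual]
    return [b + scale * r for b, r in zip(base_target, clipped)]


def tracking_reward(object_pos: Vec3, target_pos: Vec3) -> float:
    """Dense, task-agnostic reward: negative Euclidean distance to the target."""
    return -math.sqrt(sum((o - t) ** 2 for o, t in zip(object_pos, target_pos)))


# Hand-written stand-in for a coarse hand trajectory.
scaffold = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
target = interpolate_waypoints(scaffold, 0.25)
action = residual_action(target, [2.0, -0.5, 0.0], scale=0.1)
```

In this sketch the scaffold supplies the bulk of the motion, and the learned residual only corrects it locally, which is one common way to keep exploration close to a reference trajectory.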


Publisher's version
DOI: 10.5445/IR/1000189664
Published on 15 January 2026
Affiliated institution(s) at KIT: Institut für Anthropomatik und Robotik (IAR)
Publication type: Proceedings paper
Publication date: 25 June 2025
Language: English
Identifier: KITopen-ID: 1000189664
Published in: The 39th Annual Conference on Neural Information Processing Systems; San Diego, USA, 2-7 December 2025
Event: 39th Annual Conference on Neural Information Processing Systems (NeurIPS 2025), San Diego, CA, USA, 2 December 2025 to 7 December 2025
Pages: 16 pp.
Series: Advances in Neural Information Processing Systems; 38
Keywords: Dexterous manipulation, residual RL, VLM