Observational learning is a fundamental mechanism through which humans acquire new skills by watching others and understanding the consequences of their actions. This capability enables skill acquisition from demonstration alone, reducing the need for costly trial-and-error. Cognitive development research has shown that infants can learn complex skills and make inductive generalizations from sparse samples by observing caregivers and peers; they leverage statistical evidence about the covariation of task features, all without direct physical interaction or explicit linguistic instruction. By identifying invariant task features -- such as keypoints associated with an object's functional parts -- in high-dimensional visual inputs, it is possible to derive effective and transferable task representations. These insights have motivated extensive research in robotics on Visual Imitation Learning (VIL) systems that emulate human observational learning. Nevertheless, acquiring generalizable task representations solely from sparse human demonstration videos remains a significant challenge.
In this thesis, we adopt a bottom-up approach that extracts essential invariant task features from demonstrations without relying on ground-truth labels, direct physical interaction, or the linguistic bootstrapping commonly employed in top-down methodologies. We address the computational challenges of learning complex manipulation tasks by decomposing them into tractable sub-problems in both the spatial and temporal domains. In the spatial domain, our approach extracts object-centric, keypoint-based geometric constraints that capture the functional aspects of objects and the spatial coordination between arms. We then leverage neural object descriptors to facilitate the transfer of learned tasks to novel object instances or categories. In the temporal domain, we segment human demonstrations hierarchically and learn temporal coordination and action primitives. Throughout this work, we employ variance-based statistical analyses to extract invariant task features -- including keypoints, common viewpoints, object and hand dominance, and spatio-temporal constraints -- from sparse human demonstrations. This research addresses the following key questions: 1) How can keypoint-based subsymbolic task representations be modeled to enable intra-category generalization? 2) How can these representations be effectively detected and extracted from visual input? 3) How can coordination strategies for various unimanual and bimanual manipulation tasks be learned and incorporated into bimanual compliance controllers? 4) How can demonstration videos be segmented at a consistent granularity to facilitate learning of spatio-temporal coordination?
1. Learning Keypoint-based Task Representation
We first focus on modeling generalizable subsymbolic keypoint-based task representations and learning them from sparse human demonstration videos of unimanual manipulation tasks. To this end, we propose a neural descriptor-based object representation along with a perception pipeline for VIL that detects humans and objects, tracks their states, proposes keypoint candidates, and establishes dense correspondences between object instances to address viewpoint mismatches and object variations. Our Principal Constraint Estimation (PCE) algorithm extracts sparse yet semantically meaningful keypoints -- associated with object functional parts -- from the candidate set by analyzing the statistical variance of their spatial distribution across multiple demonstrations. PCE simultaneously derives keypoint-based geometric constraints on principal manifolds, their corresponding local frames, and movement primitives as subsymbolic task representations. In contrast to most existing approaches that learn only a subset of these representations, our method provides a comprehensive understanding of task constraints. The resulting task representations are interpretable, transferable, viewpoint invariant, and embodiment-independent. Consequently, the learned tasks can be robustly generalized to novel object categories and reproduced by a novel keypoint-based admittance controller on a humanoid robot. Our key insight is that a sparse, object-centric representation combined with dense correspondence detection greatly enhances intra-category generalization, enabling the learning of various daily tasks from as few as 10 demonstration videos in cluttered scenes.
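The variance-based keypoint selection underlying PCE can be illustrated with a minimal sketch. This is a deliberate simplification under stated assumptions, not the thesis implementation: the function name, array shapes, and threshold below are hypothetical, and the idea shown is only the core statistical test -- candidates whose positions, expressed in an object-centric local frame, cluster tightly across demonstrations are treated as task-relevant constraints.

```python
import numpy as np

def select_invariant_keypoints(final_positions, var_threshold=1e-3):
    """Rank keypoint candidates by the spatial variance of their
    positions across demonstrations, expressed in an object-centric
    local frame; low-variance candidates act as task constraints.

    final_positions: (n_demos, n_candidates, 3) array of candidate
                     positions at the end of each demonstration.
    Returns (indices of invariant candidates, per-candidate variance).
    """
    # Variance over demonstrations, summed over x, y, z per candidate.
    variance = final_positions.var(axis=0).sum(axis=-1)
    return np.flatnonzero(variance < var_threshold), variance
```

For example, a keypoint on a cup's handle that ends up at nearly the same pose relative to a hook across all demonstrations would pass the threshold, while a keypoint on the freely varying cup body would not.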
2. Learning Bimanual Coordination Strategies
Bimanual manipulation introduces additional complexity due to intricate object relationships, fine-grained motion details, and diverse coordination strategies between the arms. As with unimanual tasks, bimanual manipulation tasks exhibit invariant features across multiple demonstrations. We extend our unimanual task representations to the bimanual domain by introducing a novel hybrid master-slave object relationship, which encapsulates the roles of both objects and hands in a task by exploiting statistical invariances in their spatial distributions. This formulation enables the derivation of coordination strategies that cover a complete taxonomy of bimanual manipulation tasks, thereby unifying uni- and bimanual task representations. Fine-grained, keypoint-based geometric constraints allow our approach to capture detailed motion styles from demonstrations, paving the way for modeling personalized task representations in future work. Based on the extracted bimanual coordination categories, we develop real-time compliance controllers that handle bimanual coupling, motion synchronization, and hybrid master-slave relationships.
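The role-assignment idea -- inferring coordination and object dominance from statistical invariances in spatial distributions -- can be sketched as a toy heuristic over final object positions. This is a hypothetical simplification with made-up names and thresholds; the thesis formulation operates on richer keypoint distributions and poses rather than single points.

```python
import numpy as np

def classify_coordination(final_a, final_b, rel_tol=1e-3):
    """Toy role assignment between two manipulated objects A and B.

    final_a, final_b: (n_demos, 3) final positions of each object
    across demonstrations. If their relative position is invariant
    across demos, the objects are coordinated; the object whose
    world-frame position varies less is labeled the master.
    """
    # Variance of the relative displacement, summed over x, y, z.
    rel_var = (final_a - final_b).var(axis=0).sum()
    if rel_var > rel_tol:
        return "uncoordinated", None
    var_a = final_a.var(axis=0).sum()
    var_b = final_b.var(axis=0).sum()
    return "coordinated", ("A" if var_a <= var_b else "B")
```

Intuitively, a pouring task would show a bottle (slave) whose goal pose is invariant only relative to the cup (master), whereas two independently placed objects show no such relative invariance.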
3. Keypoint-based Segmentation, Bimanual Coordination and Grasping
Human demonstrations typically consist of a sequence of actions, making the detection of common motion segments across demonstrations crucial for learning comprehensive task representations. To address this, we propose a keypoint-based hierarchical motion segmentation algorithm for VIL that leverages the motion characteristics of keypoints within object-centric local frames. By exploiting dense keypoint information, our algorithm accurately identifies changes in the static and dynamic spatial relationships of objects at the finest granularity. These fine-grained segments are then merged into semantically meaningful action primitives using derived contextual information. This bottom-up approach yields motion segments at consistent granularity levels across different layers, facilitating precise semantic and temporal alignment of segments across multiple demonstrations and enabling the learning of spatio-temporal bimanual coordination strategies. As an application, we present a task-oriented grasp learning and generation framework that models task-specific grasp poses in the grasping segments of human demonstrations as pose constraints relative to object functional parts and transfers them to novel categorical objects at inference time.
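The finest segmentation layer -- detecting where the spatial relationship between objects switches between static and dynamic -- can be sketched as a simple changepoint test on relative keypoint motion. The function and threshold below are illustrative assumptions; the actual algorithm is hierarchical and merges such segments into action primitives using contextual information.

```python
import numpy as np

def segment_by_relative_motion(rel_traj, vel_threshold=0.01):
    """Split a keypoint trajectory, expressed in another object's
    local frame, into 'static'/'dynamic' segments at the frames
    where the spatial relation starts or stops changing.

    rel_traj: (T, 3) relative keypoint positions over one demo.
    Returns a list of (start_frame, end_frame, label) tuples.
    """
    # Frame-to-frame displacement magnitude of the relative position.
    speed = np.linalg.norm(np.diff(rel_traj, axis=0), axis=-1)
    moving = speed > vel_threshold
    segments, start = [], 0
    for t in range(1, len(moving)):
        if moving[t] != moving[t - 1]:
            label = "dynamic" if moving[t - 1] else "static"
            segments.append((start, t, label))
            start = t
    segments.append((start, len(moving),
                     "dynamic" if moving[-1] else "static"))
    return segments
```

For a pick-and-place demonstration, such a test would mark the approach and transport phases as "dynamic" relative to the target object and the held or resting phases as "static", giving candidate boundaries for higher-level action primitives.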
Together, these three interconnected components constitute our bottom-up keypoint-based visual imitation learning (KVIL) framework, which derives subsymbolic spatio-temporal task representations from sparse human demonstration videos for both unimanual and bimanual manipulation tasks. The main objective of this thesis is to model invariant task features based on keypoints, geometric constraints, task roles of objects and hands, and spatio-temporal coordination -- while also developing effective mathematical methods to extract these features from video inputs. The developed framework is evaluated across various daily tasks on humanoid robots, demonstrating its effectiveness and its ability to robustly generalize learned tasks to novel objects and environments, thereby significantly advancing the state of the art in visual imitation learning. Furthermore, this thesis opens numerous avenues for future research, including the exploration of more complex spatio-temporal task models, application to articulated and soft objects, and the development of more sophisticated learning algorithms to enhance the robustness and generalization of visual imitation learning systems in real-world scenarios.