
Learning Robust Aligned Representations Across Multiple Visual Modalities in Human Action Recognition

Lerch, David J. (1); Rothenburger, Bastian; Zhong, Zeyun (2); Martin, Manuel (2); Diederichs, Frederik; Stiefelhagen, Rainer (2)
(1) Fraunhofer-Institut für Optronik, Systemtechnik und Bildauswertung (IOSB)
(2) Institut für Anthropomatik und Robotik (IAR), Karlsruher Institut für Technologie (KIT)

Abstract:

We propose Cross-Modal Video Representation Alignment (CMVRA), a novel framework for human action recognition that leverages multiple visual modalities, namely RGB, infrared (IR), depth, and skeleton data, to learn robust, generalizable representations with reduced reliance on annotated data. Using contrastive learning, CMVRA aligns these modalities in a unified multi-modal embedding space, enhancing robustness and feature diversity and letting the model integrate complementary information and capture richer representations across domains. This multi-modal alignment is crucial for improving recognition performance in diverse and challenging contexts. We also advance alignment techniques by demonstrating that fully integrated multi-modal alignment outperforms traditional pairwise strategies. Extensive experiments on the NTU and Drive&Act datasets confirm the effectiveness of our approach. CMVRA achieves a 3.01% improvement in 3D skeleton-based activity recognition on Drive&Act, outperforming state-of-the-art methods. Experiments on NTU show that CMVRA closes the gap between self-supervised and supervised learning methods.
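As a rough illustration of what contrastive alignment across several modalities can look like, the following is a minimal sketch and not the authors' implementation. It assumes a symmetric InfoNCE loss averaged over all modality pairs, which is only one possible reading of "fully integrated" alignment; this record does not specify the actual objective. The function names (`info_nce`, `joint_alignment_loss`), the temperature value, and the random arrays standing in for encoder outputs are all hypothetical.

```python
import itertools
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(z):
    """Numerically stable row-wise log-softmax."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE between two (batch, dim) embedding matrices.

    Row i of `a` and row i of `b` are treated as a positive pair
    (same clip, different modality); all other rows are negatives.
    """
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)


def joint_alignment_loss(embeddings, temperature=0.1):
    """Average the pairwise loss over every pair of modalities (assumption)."""
    pairs = list(itertools.combinations(sorted(embeddings), 2))
    total = sum(info_nce(embeddings[m], embeddings[n], temperature) for m, n in pairs)
    return total / len(pairs)


# Hypothetical stand-ins for per-modality encoder outputs: 8 clips, 16-dim embeddings.
rng = np.random.default_rng(0)
modalities = ("rgb", "ir", "depth", "skeleton")
shared = rng.normal(size=(8, 16))
aligned = {m: shared + 0.01 * rng.normal(size=shared.shape) for m in modalities}
unaligned = {m: rng.normal(size=(8, 16)) for m in modalities}

loss_aligned = joint_alignment_loss(aligned)
loss_unaligned = joint_alignment_loss(unaligned)
print(f"aligned: {loss_aligned:.3f}  unaligned: {loss_unaligned:.3f}")
```

With four modalities the sketch sums over six pairs, so every modality is pulled toward every other one rather than toward a single anchor such as RGB. Embeddings that agree across modalities yield a lower loss than unrelated ones.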


Original publication
DOI: 10.1109/ICCVW69036.2025.00281
Affiliated institution(s) at KIT: Institut für Anthropomatik und Robotik (IAR)
Publication type: Proceedings paper
Publication date: 19 October 2025
Language: English
Identifier: ISBN: 979-8-3315-8988-2
KITopen-ID: 1000192428
Published in: 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
Event: IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2025), Honolulu, HI, USA, 19-20 October 2025
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 2700-2710
Keywords: self-supervised learning, multi-modal learning, representation learning, activity recognition
Indexed in: Scopus, OpenAlex