Learning Robust Aligned Representations Across Multiple Visual Modalities in Human Action Recognition

Lerch, David J.; Rothenburger, Bastian; Zhong, Zeyun; Martin, Manuel; Diederichs, Frederik; Stiefelhagen, Rainer

doi:10.1109/ICCVW69036.2025.00281

Lerch, David J.¹; Rothenburger, Bastian; Zhong, Zeyun²; Martin, Manuel²; Diederichs, Frederik; Stiefelhagen, Rainer²

¹ Fraunhofer-Institut für Optronik, Systemtechnik und Bildauswertung (IOSB)
² Institut für Anthropomatik und Robotik (IAR), Karlsruher Institut für Technologie (KIT)

Abstract:

We propose Cross-Modal Video Representation Alignment (CMVRA), a novel framework for human action recognition that leverages multiple visual modalities, namely RGB, infrared (IR), depth, and skeleton data, to learn robust, generalizable representations with reduced reliance on annotated data. CMVRA uses contrastive learning to align these modalities, which improves the model's ability to integrate complementary information and capture richer representations across domains. This multi-modal alignment is crucial for improving recognition performance in diverse and challenging contexts. Our contributions are a unified multi-modal embedding framework that aligns RGB, depth, infrared, and skeleton data, enhancing robustness and feature diversity, and a demonstration that fully integrated multi-modal alignment outperforms traditional pairwise strategies. Extensive experiments on the NTU and Drive&Act datasets confirm the effectiveness of our approach. CMVRA achieves a 3.01% improvement in 3D skeleton-based activity recognition on Drive&Act, outperforming state-of-the-art methods. Experiments on NTU show that CMVRA closes the gap between self-supervised and supervised learning methods.
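
The abstract does not state the loss that CMVRA optimizes. As a rough illustration of what contrastive alignment of several modalities in one shared embedding space can look like, the sketch below averages a symmetric InfoNCE loss over every pair of modalities. This is a generic technique, not the authors' implementation; the function names, the temperature value, and the choice to weight all pairs equally are assumptions made for this example only.

```python
import math
from itertools import combinations


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(anchors, targets, temperature=0.1):
    """One-directional InfoNCE: the i-th anchor should match the i-th target.

    Both arguments are lists of embedding vectors for the same clips,
    in the same order. All other targets in the batch act as negatives.
    """
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    loss = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity to every target, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(a_i, t_j)) / temperature for t_j in t]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(a)


def joint_alignment_loss(embeddings, temperature=0.1):
    """Average symmetric InfoNCE over all pairs of modalities.

    `embeddings` maps a modality name to its list of clip embeddings.
    (Hypothetical formulation; the paper's objective may differ.)
    """
    pairs = list(combinations(sorted(embeddings), 2))
    total = 0.0
    for m1, m2 in pairs:
        forward = info_nce(embeddings[m1], embeddings[m2], temperature)
        backward = info_nce(embeddings[m2], embeddings[m1], temperature)
        total += 0.5 * (forward + backward)
    return total / len(pairs)


if __name__ == "__main__":
    # Toy batch of two clips with 2-D embeddings per modality.
    batch = {
        "rgb": [[1.0, 0.1], [0.0, 1.0]],
        "ir": [[0.9, 0.0], [0.1, 1.0]],
        "depth": [[1.0, 0.0], [0.0, 0.8]],
        "skeleton": [[0.8, 0.2], [0.2, 0.9]],
    }
    print(joint_alignment_loss(batch))
```

With four modalities the loop visits six pairs. The loss is small when embeddings of the same clip agree across modalities and grows when one modality's embeddings are matched to the wrong clips.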

Affiliated institution(s) at KIT	Institut für Anthropomatik und Robotik (IAR)
Publication type	Proceedings paper
Publication date	19 October 2025
Language	English
Identifier	ISBN: 979-8-3315-8988-2; KITopen-ID: 1000192428
Published in	2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
Event	IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2025), Honolulu, HI, USA, 19-20 October 2025
Publisher	Institute of Electrical and Electronics Engineers (IEEE)
Pages	2700-2710
Keywords	self-supervised learning, multi-modal learning, representation learning, activity recognition
Indexed in	Scopus, OpenAlex
