Scalable Video Action Anticipation with Cross Linear Attentive Memory

Zhong, Zeyun; Martin, Manuel; Schneider, David; Lerch, David J.; Wu, Chengzhi; Diederichs, Frederik; Gall, Juergen; Beyerer, Jürgen

doi:10.1109/WACV61042.2026.00783

Zhong, Zeyun ¹; Martin, Manuel ¹; Schneider, David ¹; Lerch, David J. ²; Wu, Chengzhi ¹; Diederichs, Frederik; Gall, Juergen; Beyerer, Jürgen ¹
¹ Institut für Anthropomatik und Robotik (IAR), Karlsruher Institut für Technologie (KIT)
² Lichttechnisches Institut (LTI), Karlsruher Institut für Technologie (KIT)

Abstract:

Recent advances in action anticipation rely heavily on Transformer architectures to learn discriminative representations of the past observation, incurring high computational and memory overhead that limits their applicability to long videos. While temporal processors with linear complexity like RNNs and state-space models offer efficient alternatives, their sequential nature risks overlooking subtle cues in observed frames that could enhance future anticipation. We address this limitation with Cross Linear Attentive Memory (CLAM), a memory module that selectively retrieves complementary context cues from frame features. By reformulating linear attention to replace traditional cross-attention, CLAM achieves linear computation complexity and constant memory usage relative to input length. Finally, by fusing the outputs of the temporal processor and CLAM, a non-autoregressive Transformer decoder generates future actions in one shot with high accuracy. Experiments on egocentric (EpicKitchens100 and Ego4D) and third-person (Thumos14) benchmarks demonstrate our model’s superior anticipation accuracy and scalability, processing longer sequences with significantly less latency growth than alternatives.
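
The abstract does not state CLAM's equations, so the following is only a minimal NumPy sketch of the general idea it refers to: generic kernelized linear attention used in a cross-attention setting, where a fixed set of query vectors reads from a stream of frame features. The elu(x) + 1 feature map, the class name `LinearCrossAttentionState`, and all shapes are assumptions made for illustration, not the authors' implementation. The sketch shows why the running state has a size independent of the number of frames, and checks the streaming result against an explicit query-by-frame weight matrix.

```python
import numpy as np

def feature_map(x):
    # Assumed positive feature map: elu(x) + 1 (a common choice for linear attention).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

class LinearCrossAttentionState:
    """Hypothetical running state; its size depends only on d_k and d_v."""

    def __init__(self, d_k, d_v):
        self.S = np.zeros((d_k, d_v))  # running sum of phi(k_j) v_j^T
        self.z = np.zeros(d_k)         # running sum of phi(k_j)

    def update(self, k, v):
        # Absorb one frame feature (key k, value v) in O(d_k * d_v) time.
        phi_k = feature_map(k)
        self.S += np.outer(phi_k, v)
        self.z += phi_k

    def read(self, queries, eps=1e-6):
        # Each query reads the accumulated context without revisiting past frames.
        phi_q = feature_map(queries)       # (M, d_k)
        numerator = phi_q @ self.S         # (M, d_v)
        denominator = phi_q @ self.z + eps # (M,)
        return numerator / denominator[:, None]

def quadratic_reference(queries, keys, values, eps=1e-6):
    # Same kernel attention computed with an explicit (M, N) weight matrix.
    weights = feature_map(queries) @ feature_map(keys).T
    return (weights @ values) / (weights.sum(axis=1, keepdims=True) + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    queries = rng.normal(size=(4, 8))   # 4 hypothetical memory queries
    keys = rng.normal(size=(50, 8))     # 50 frame features
    values = rng.normal(size=(50, 16))
    state = LinearCrossAttentionState(d_k=8, d_v=16)
    for k, v in zip(keys, values):
        state.update(k, v)
    print(np.allclose(state.read(queries), quadratic_reference(queries, keys, values)))
```

In this sketch the cost of processing N frames grows linearly with N, while `S` and `z` keep the same shape however many frames have been absorbed, which matches the linear-computation and constant-memory behaviour the abstract describes at a high level.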


Affiliated institution(s) at KIT	Institut für Anthropomatik und Robotik (IAR); Lichttechnisches Institut (LTI)
Publication type	Proceedings contribution
Publication date	6 March 2026
Language	English
Identifier	ISBN: 979-8-3315-5511-5; KITopen-ID: 1000194426
Published in	2026 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
Event	IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2026), Tucson, AZ, USA, 6–10 March 2026
Publisher	Institute of Electrical and Electronics Engineers (IEEE)
Pages	8113–8123
Indexed in	Scopus; OpenAlex
