@BENCH: Benchmarking Vision-Language Models for Human-centered Assistive Technology

Jiang, Xin 1; Zheng, Junwei 1; Liu, Ruiping 1; Li, Jiahang 1; Zhang, Jiaming 2; Matthiesen, Sven 3; Stiefelhagen, Rainer 2
1 Karlsruher Institut für Technologie (KIT)
2 Institut für Anthropomatik und Robotik (IAR), Karlsruher Institut für Technologie (KIT)
3 Institut für Produktentwicklung (IPEK), Karlsruher Institut für Technologie (KIT)

Abstract:

As Vision-Language Models (VLMs) advance, human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists, capable of performing multiple tasks simultaneously. However, benchmarking VLMs for ATs remains under-explored. To bridge this gap, we first create a novel AT benchmark (@BENCH). Guided by a pre-design user study with PVIs, our benchmark includes the five most crucial vision-language tasks: Panoptic Segmentation, Depth Estimation, Optical Character Recognition (OCR), Image Captioning, and Visual Question Answering (VQA). In addition, we propose a novel AT model (@MODEL) that addresses all tasks simultaneously and can be expanded to more assistive functions for helping PVIs. Our framework exhibits outstanding performance across tasks by integrating multi-modal information, and it offers PVIs more comprehensive assistance. Extensive experiments demonstrate the effectiveness and generalizability of our framework.


Affiliated institution(s) at KIT: Institut für Anthropomatik und Robotik (IAR); Institut für Produktentwicklung (IPEK)
Publication type: Proceedings contribution
Publication date: 26 February 2025
Language: English
Identifier: ISBN: 979-83-315-1083-1
KITopen-ID: 1000182093
Published in: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
Event: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025), Tucson, AZ, USA, 28 February 2025 – 4 March 2025
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 3934–3943
Keywords: vlm; assistive technology; panoptic segmentation; depth estimation; ocr; image captioning; vqa
Indexed in: OpenAlex; Scopus; Dimensions