Assistive technology has witnessed significant evolution in recent decades, particularly with the advancement of artificial intelligence and multimodal machine learning. Among these, Vision-Language Models (VLMs) have emerged as promising tools that combine visual and linguistic understanding to facilitate human-computer interaction.
With over 285 million visually impaired people globally—39 million of whom are blind—according to the World Health Organization, there is a pressing need for inclusive solutions that enable equitable participation in society.
Figure 1: Architecture of CogVLM and Vision-Language Models
VLMs have demonstrated potential in enhancing accessibility by enabling machines to interpret visual input and respond in natural language. Li et al. [2] introduced BLIP-2, a pre-trained vision-language model that achieves efficient multimodal understanding by bridging a frozen image encoder and a frozen large language model (LLM) with a lightweight querying transformer.
CogVLM [3] goes further by adding a trainable visual-expert module to the attention and feed-forward layers of the language model, improving the alignment between visual features and generated text.
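The frozen-encoder design described above can be exercised directly through the Hugging Face transformers library. The following is a minimal sketch assuming the publicly released Salesforce/blip2-opt-2.7b checkpoint; the image path and prompt are illustrative placeholders, not part of the Beyond Vision pipeline.

```python
# Minimal BLIP-2 visual question answering sketch (assumes Hugging Face
# transformers and the public blip2-opt-2.7b checkpoint; paths and prompt
# below are placeholders for illustration only).
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

image = Image.open("scene.jpg").convert("RGB")  # hypothetical camera frame
prompt = "Question: what obstacles are in front of me? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)
generated_ids = model.generate(**inputs, max_new_tokens=40)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```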
The YOLO (You Only Look Once) framework, particularly YOLOv8, has become a cornerstone of real-time object detection; studies [8] confirm YOLOv8's strong performance in both speed and accuracy.
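To illustrate why YOLOv8 suits the real-time path of such a system, the Ultralytics API reduces detection on a single frame to a few lines. This is a minimal sketch assuming the off-the-shelf yolov8n.pt weights; the image path and confidence threshold are placeholders rather than project settings.

```python
# Single-frame detection sketch with Ultralytics YOLOv8 (off-the-shelf nano
# weights; image path and confidence threshold are placeholders).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")               # nano variant: smallest, fastest
results = model("frame.jpg", conf=0.5)   # run inference on one frame

for box in results[0].boxes:
    cls_name = model.names[int(box.cls)]   # class label for this detection
    confidence = float(box.conf)           # detection confidence
    print(f"{cls_name}: {confidence:.2f} at {box.xyxy.tolist()}")
```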
This project is guided by the hypothesis that a hybrid system combining task-specific computer vision models (YOLOv8, EasyOCR) with general-purpose Vision-Language Models (CogVLM, BLIP-2) can outperform standalone approaches in providing real-time, context-aware assistance for visually impaired individuals.
This hypothesis raises three research questions:
RQ1: Can VLMs be deployed effectively in real-time systems for assistive purposes without compromising reliability?
RQ2: What is the optimal architectural balance between general-purpose VLMs and specialized object-detection models in assistive scenarios?
RQ3: How can such a system be designed modularly so that it adapts to multiple tasks (e.g., navigation, ATM usage) without performance degradation?
Beyond Vision is a modular, multimodal system that integrates real-time computer vision, OCR, and VLMs to assist visually impaired individuals with campus navigation and ATM usage.
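A sketch of how this modular routing might look in code is shown below. It is illustrative only, under the assumption that a fast YOLOv8/EasyOCR path handles latency-critical queries while a VLM handles open-ended scene questions; the class name BeyondVisionPipeline and the injected describe-scene function are hypothetical and do not reflect the project's actual API.

```python
# Illustrative modular pipeline: fast detectors for latency-critical tasks,
# a VLM fallback for open-ended questions. Class and method names here are
# hypothetical and do not reflect the project's actual codebase.
from ultralytics import YOLO
import easyocr


class BeyondVisionPipeline:
    def __init__(self, vlm_describe_fn):
        self.detector = YOLO("yolov8n.pt")       # real-time object/obstacle detection
        self.reader = easyocr.Reader(["en"])     # OCR for ATM screens and signage
        self.describe_scene = vlm_describe_fn    # slower VLM path (e.g., BLIP-2/CogVLM)

    def handle(self, frame_path, task):
        if task == "navigate":
            boxes = self.detector(frame_path, conf=0.5)[0].boxes
            return [self.detector.names[int(b.cls)] for b in boxes]
        if task == "read_text":
            # readtext returns (bounding box, text, confidence) tuples
            return [text for _, text, _ in self.reader.readtext(frame_path)]
        # Anything open-ended falls through to the VLM.
        return self.describe_scene(frame_path)
```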