Assistive technology has witnessed significant evolution in recent decades, particularly with the advancement of artificial intelligence and multimodal machine learning. Among these, Vision-Language Models (VLMs) have emerged as promising tools that combine visual and linguistic understanding to facilitate human-computer interaction.
With over 285 million visually impaired people globally—39 million of whom are blind—according to the World Health Organization, there is a pressing need for inclusive solutions that enable equitable participation in society.
Figure 1: Architecture of CogVLM and Vision-Language Models
VLMs have demonstrated potential in enhancing accessibility by enabling machines to interpret visual input and respond in natural language. Li et al. [2] introduced BLIP-2, a pre-trained vision-language model that achieves efficient multimodal understanding by bridging a frozen image encoder and a frozen large language model (LLM) with a lightweight querying transformer.
CogVLM [3] goes further by adding a trainable visual-expert module to the attention and feed-forward layers of the language model, improving the alignment between visual features and generated text.
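The frozen-encoder design described above can be exercised directly through the Hugging Face transformers library. The following is a minimal sketch assuming the publicly released Salesforce/blip2-opt-2.7b checkpoint; the image path and prompt are illustrative placeholders, not part of the Beyond Vision pipeline.

```python
# Minimal BLIP-2 visual question answering sketch (assumes Hugging Face
# transformers and the public blip2-opt-2.7b checkpoint; paths and prompt
# below are placeholders for illustration only).
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

image = Image.open("scene.jpg").convert("RGB")  # hypothetical camera frame
prompt = "Question: what obstacles are in front of me? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)
generated_ids = model.generate(**inputs, max_new_tokens=40)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```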
The YOLO (You Only Look Once) framework, particularly YOLOv8, has become a cornerstone of real-time object detection; studies [8] confirm YOLOv8's strong performance in both speed and accuracy.
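To illustrate why YOLOv8 suits the real-time path of such a system, the Ultralytics API reduces detection on a single frame to a few lines. This is a minimal sketch assuming the off-the-shelf yolov8n.pt weights; the image path and confidence threshold are placeholders rather than project settings.

```python
# Single-frame detection sketch with Ultralytics YOLOv8 (off-the-shelf nano
# weights; image path and confidence threshold are placeholders).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")               # nano variant: smallest, fastest
results = model("frame.jpg", conf=0.5)   # run inference on one frame

for box in results[0].boxes:
    cls_name = model.names[int(box.cls)]   # class label for this detection
    confidence = float(box.conf)           # detection confidence
    print(f"{cls_name}: {confidence:.2f} at {box.xyxy.tolist()}")
```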
This project is guided by the hypothesis that a hybrid system combining task-specific computer vision models (YOLOv8, EasyOCR) with general-purpose Vision-Language Models (CogVLM, BLIP-2) can outperform standalone approaches in providing real-time, context-aware assistance for visually impaired individuals.
This hypothesis raises three research questions:
RQ1: Can VLMs be deployed effectively in real-time systems for assistive purposes without compromising reliability?
RQ2: What is the optimal architectural balance between general-purpose VLMs and specialized object-detection models in assistive scenarios?
RQ3: How can such a system be designed modularly so that it adapts to multiple tasks (e.g., navigation, ATM usage) without performance degradation?
Beyond Vision is a modular, multimodal system that integrates real-time computer vision, OCR, and VLMs to assist visually impaired individuals with campus navigation and ATM usage.
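A sketch of how this modular routing might look in code is shown below. It is illustrative only, under the assumption that a fast YOLOv8/EasyOCR path handles latency-critical queries while a VLM handles open-ended scene questions; the class name BeyondVisionPipeline and the injected describe-scene function are hypothetical and do not reflect the project's actual API.

```python
# Illustrative modular pipeline: fast detectors for latency-critical tasks,
# a VLM fallback for open-ended questions. Class and method names here are
# hypothetical and do not reflect the project's actual codebase.
from ultralytics import YOLO
import easyocr


class BeyondVisionPipeline:
    def __init__(self, vlm_describe_fn):
        self.detector = YOLO("yolov8n.pt")       # real-time object/obstacle detection
        self.reader = easyocr.Reader(["en"])     # OCR for ATM screens and signage
        self.describe_scene = vlm_describe_fn    # slower VLM path (e.g., BLIP-2/CogVLM)

    def handle(self, frame_path, task):
        if task == "navigate":
            boxes = self.detector(frame_path, conf=0.5)[0].boxes
            return [self.detector.names[int(b.cls)] for b in boxes]
        if task == "read_text":
            # readtext returns (bounding box, text, confidence) tuples
            return [text for _, text, _ in self.reader.readtext(frame_path)]
        # Anything open-ended falls through to the VLM.
        return self.describe_scene(frame_path)
```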