Beyond Vision: Empowering Visually Impaired Students

This project focuses on improving the quality of life of visually impaired students in two ways: helping them navigate their campus safely and independently, and helping them complete ATM operations without outside help. The field of assistive technology is developing rapidly; however, due to poor real-world performance and reliability concerns, systems and projects that integrate Vision-Language Models (VLMs) remain limited. This project addresses these concerns and aims to show that a system using VLMs can perceive the world around its users and give them reliable feedback.

In this study, we present a system designed to assist visually impaired individuals by responding to their needs through voice interaction and enabling seamless switching between two main modules: the ATM module and the Campus Navigation module. Additionally, the system allows users to utilize a Vision-Language Model (VLM) in a manner similar to a Large Language Model (LLM), enabling them to ask questions and receive informative responses.
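The sketch below illustrates the kind of voice-driven dispatcher this switching implies. The library choices (SpeechRecognition, pyttsx3) and the module entry-point names are illustrative assumptions, not the project's confirmed implementation.

```python
# Minimal sketch of voice-driven module switching, assuming SpeechRecognition
# for input and pyttsx3 for spoken feedback (library choices are assumptions).
import speech_recognition as sr
import pyttsx3

tts = pyttsx3.init()

def speak(text: str) -> None:
    """Read feedback aloud so the user never needs the screen."""
    tts.say(text)
    tts.runAndWait()

def listen() -> str:
    """Capture one spoken command and return it as lowercase text."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio).lower()

def dispatch(command: str) -> None:
    """Route a spoken command to the matching module (hypothetical entry points)."""
    if "atm" in command:
        speak("Starting the ATM module.")
        # run_atm_module()          # placeholder for the ATM pipeline
    elif "campus" in command or "navigate" in command:
        speak("Starting campus navigation.")
        # run_navigation_module()   # placeholder for the navigation pipeline
    else:
        speak("Passing your question to the vision language model.")
        # run_vlm_module(command)   # placeholder for VLM question answering
```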

ATM Module

Within the ATM module, users can interact with the ATM interface by moving their finger across the screen while receiving feedback from a trained model that detects finger position. We employed the YOLOv8 model both to detect the user's finger and to identify buttons on the ATM screen. Text on the screen is detected using EasyOCR. This setup provides an environment where users can independently complete ATM transactions without external assistance.
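A minimal sketch of the finger-to-button matching idea is shown below, assuming a custom YOLOv8 weight file whose classes include "finger" and "button"; the weight file name, class labels, and the centre-point matching rule are assumptions for illustration.

```python
# Sketch: locate the fingertip, find the button box that contains it,
# then OCR that button's label so it can be read back to the user.
import cv2
import easyocr
from ultralytics import YOLO

model = YOLO("atm_finger_buttons.pt")   # hypothetical fine-tuned weights
reader = easyocr.Reader(["en"])

def button_under_finger(frame):
    """Return the OCR'd label of the button the fingertip is currently over."""
    result = model(frame)[0]
    finger, buttons = None, []
    for box, cls in zip(result.boxes.xyxy.tolist(), result.boxes.cls.tolist()):
        name = result.names[int(cls)]
        if name == "finger":
            finger = box
        elif name == "button":
            buttons.append(box)
    if finger is None:
        return None
    fx = (finger[0] + finger[2]) / 2           # fingertip centre (x, y)
    fy = (finger[1] + finger[3]) / 2
    for x1, y1, x2, y2 in buttons:
        if x1 <= fx <= x2 and y1 <= fy <= y2:  # finger centre inside button box
            crop = frame[int(y1):int(y2), int(x1):int(x2)]
            texts = reader.readtext(crop, detail=0)
            return " ".join(texts) if texts else "unlabelled button"
    return None

if __name__ == "__main__":
    frame = cv2.imread("atm_screen.jpg")       # hypothetical test image
    print(button_under_finger(frame))
```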

Campus Navigation Module

In the Campus Navigation module, we again utilized the YOLOv8 model, training it with data collected from our campus to classify various obstacles. A dataset of 4,202 images was used to improve the model's accuracy. Furthermore, we developed a mechanism to alert users in real time based on the location and proximity of potential hazards in the environment.
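The following sketch shows one way the proximity-based alerting could work, approximating distance from the size of an obstacle's bounding box and its horizontal position in the frame. The model name, the thresholds, and the area-based distance proxy are assumptions rather than the exact deployed logic.

```python
# Sketch of the real-time hazard alert logic on top of the campus-trained detector.
import cv2
from ultralytics import YOLO

model = YOLO("campus_obstacles.pt")   # hypothetical weights from the 4,202-image dataset

def describe_hazards(frame, area_alert_ratio=0.15):
    """Yield (label, direction, urgent) tuples for detected obstacles."""
    h, w = frame.shape[:2]
    result = model(frame)[0]
    for box, cls in zip(result.boxes.xyxy.tolist(), result.boxes.cls.tolist()):
        x1, y1, x2, y2 = box
        label = result.names[int(cls)]
        cx = (x1 + x2) / 2
        direction = "left" if cx < w / 3 else "right" if cx > 2 * w / 3 else "ahead"
        # A box covering a large share of the frame is treated as close -> urgent.
        urgent = ((x2 - x1) * (y2 - y1)) / (w * h) > area_alert_ratio
        yield label, direction, urgent

if __name__ == "__main__":
    cap = cv2.VideoCapture(0)         # live camera feed
    ok, frame = cap.read()
    if ok:
        for label, direction, urgent in describe_hazards(frame):
            prefix = "Warning" if urgent else "Notice"
            print(f"{prefix}: {label} {direction}")
    cap.release()
```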

Vision-Language Model (VLM) Module

In the Vision-Language Model (VLM) module, users can ask context-aware questions about their surroundings, such as "What is in front of me?" or "Describe this scene," using voice commands. The module processes the visual input alongside the spoken query to generate informative, natural language responses. This functionality was evaluated using several state-of-the-art VLMs, including CogVLM and BLIP-2, both known for their visual reasoning capabilities. Additionally, we tested LLaVA (Large Language and Vision Assistant) for its fine-grained scene understanding and ChatGPT with vision support for its general-purpose reasoning and accessibility features. These models were assessed based on their response relevance, coherence, and latency, ensuring that the module can assist users effectively in dynamic environments through multimodal interaction.
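As a rough illustration of this flow, the sketch below pairs a camera frame with the transcribed question and queries BLIP-2, one of the models we evaluated; the specific checkpoint and prompt format here are illustrative assumptions, not the deployed configuration.

```python
# Sketch: answer a spoken question about the current camera frame with BLIP-2.
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
vlm = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def answer_question(image_path: str, question: str) -> str:
    """Combine the current frame with the user's spoken question and decode a reply."""
    image = Image.open(image_path).convert("RGB")
    prompt = f"Question: {question} Answer:"
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    output_ids = vlm.generate(**inputs, max_new_tokens=50)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()

# Example: answer_question("frame.jpg", "What is in front of me?")
```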

To enhance usability, the system supports voice commands and provides audio feedback, allowing users to interact with it comfortably and intuitively. Our approach demonstrates that VLM-based systems can be built with sufficient performance for visually impaired users. This project serves as a proof of concept: while real-time operation still poses challenges, significant advances can be achieved in the future through further research and investment.