Our Methodology

1. Data Acquisition and Preprocessing

1.1 Campus Navigation Dataset

For the campus navigation module, a custom dataset was developed to reflect the unique environmental challenges of university campuses. The data collection involved video recordings across multiple campus locations at Hacettepe University.

  • Video Format: 1080p, 30fps
  • Frame Extraction: Extracted at 2fps using FFmpeg, resulting in ~14,000 frames
  • Final Dataset Size: 4,202 manually labeled images after filtering
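
A minimal sketch of the frame-extraction step, assuming the recordings are stored as local .mp4 files (folder names and output layout are illustrative, not the exact project paths):

```python
import subprocess
from pathlib import Path

def extract_frames(video_path: Path, out_dir: Path, fps: int = 2) -> None:
    """Sample frames from one recording at the given rate using FFmpeg."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", str(video_path),
            "-vf", f"fps={fps}",                              # 2 fps sampling
            str(out_dir / f"{video_path.stem}_%05d.jpg"),
        ],
        check=True,
    )

# Hypothetical layout: raw recordings in recordings/, frames written to frames/<video name>/
for video in Path("recordings").glob("*.mp4"):
    extract_frames(video, Path("frames") / video.stem)
```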

1.2 ATM Interaction Dataset

The ATM dataset consists of:

  • Annotated images of ATM interfaces under real-world lighting
  • Finger and button positions labeled using Roboflow
  • Synthetic augmentations (rotation, blur, noise, occlusion) to increase robustness

1.3 Preprocessing Pipeline

  • Resizing: Images resized to 640×640 pixels
  • Normalization: Pixel values scaled to [0, 1]
  • Augmentation: Random flip, rotation, Gaussian noise, contrast adjustment
  • Annotation Format: YOLO format (class x_center y_center width height)
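
These steps map to a straightforward per-image routine. The sketch below uses OpenCV and NumPy and is illustrative rather than the exact pipeline code (parameter ranges are assumptions):

```python
import cv2
import numpy as np

def preprocess(image_path: str) -> np.ndarray:
    """Resize to 640x640 and scale pixel values to [0, 1]."""
    img = cv2.imread(image_path)
    img = cv2.resize(img, (640, 640))
    return img.astype(np.float32) / 255.0

def augment(img: np.ndarray) -> np.ndarray:
    """Random flip, rotation, Gaussian noise, and contrast adjustment."""
    if np.random.rand() < 0.5:
        img = cv2.flip(img, 1)                                 # horizontal flip
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), np.random.uniform(-15, 15), 1.0)
    img = cv2.warpAffine(img, M, (w, h))                       # small random rotation
    img = img + np.random.normal(0, 0.02, img.shape).astype(np.float32)  # Gaussian noise
    return np.clip(img * np.random.uniform(0.8, 1.2), 0.0, 1.0)          # contrast jitter

# One YOLO-format label line per object: class x_center y_center width height (all normalized)
example_label = "0 0.512 0.431 0.120 0.250"
```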

2. Computational Models and Algorithms

2.1 YOLOv8 for Object Detection

YOLOv8 was used for real-time object detection tasks. It optimizes the following loss function:

$$\mathcal{L} = \lambda_{box} \cdot \mathcal{L}_{CIoU} + \lambda_{cls} \cdot \mathcal{L}_{cls} + \lambda_{obj} \cdot \mathcal{L}_{obj}$$

where $\mathcal{L}_{CIoU}$ is the bounding-box regression loss, $\mathcal{L}_{cls}$ the classification loss, and $\mathcal{L}_{obj}$ the objectness loss, each weighted by its corresponding $\lambda$ coefficient.

Training Parameters:

  • Epochs: 50
  • Batch size: 16
  • Optimizer: SGD with momentum 0.937
  • Learning rate: 0.01 with cosine decay
  • Evaluation Metric: mAP@0.5 and mAP@0.5:0.95
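
The same hyperparameters expressed through the Ultralytics Python API (model size and dataset YAML path are placeholders, not the project's exact configuration):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # pretrained checkpoint; model size is a placeholder

model.train(
    data="campus.yaml",             # placeholder dataset config in YOLO format
    imgsz=640,
    epochs=50,
    batch=16,
    optimizer="SGD",
    momentum=0.937,
    lr0=0.01,
    cos_lr=True,                    # cosine learning-rate decay
)

metrics = model.val()               # reports mAP@0.5 and mAP@0.5:0.95 on the validation split
```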

2.2 Visual Question Answering (VQA) with VLMs

Two VLMs were explored:

  • CogVLM: adds a trainable visual expert module to the language model for deep multimodal alignment
  • BLIP-2: bridges frozen image encoders and a frozen LLM through a lightweight querying transformer (Q-Former)
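
As a concrete reference point, the snippet below shows minimal BLIP-2 inference with Hugging Face Transformers, assuming the public Salesforce/blip2-opt-2.7b checkpoint; the deployed model and prompt format may differ:

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("frame.jpg")                      # placeholder camera frame
prompt = "Question: What is in front of me? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=30)
answer = processor.decode(output_ids[0], skip_special_tokens=True)
```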

2.3 OCR with EasyOCR

  • Text Detection: CRAFT algorithm
  • Text Recognition: CRNN-based decoder
  • Post-Processing: Confidence threshold filtering and normalization
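
A sketch of this OCR step with the EasyOCR API; the language list and confidence threshold are assumptions:

```python
import easyocr

reader = easyocr.Reader(["en", "tr"])                # English + Turkish, assumed language set

def read_screen_text(image_path: str, min_conf: float = 0.4) -> list[str]:
    """CRAFT detection + CRNN recognition, followed by confidence filtering."""
    results = reader.readtext(image_path)            # list of (bbox, text, confidence)
    return [text.strip() for _, text, conf in results if conf >= min_conf]
```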

3. System Architecture

3.1 Modular Pipeline Overview

  • Input Layer: Smart glasses or mobile camera
  • Processing Layer: YOLOv8 + VLMs on cloud (Paperspace)
  • Interaction Layer: Mobile app with voice input/output

3.2 Module Flow

Camera Input → YOLOv8 → Task Decision → ATM/Navigation/VQA → OCR/Text Detection → Voice Output
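
A simplified routing sketch of this flow; the stub functions stand in for the real modules and are illustrative only:

```python
def detect(frame): return []                      # stub for YOLOv8 detection
def read_text(frame): return ""                   # stub for OCR
def answer_question(frame, q): return "unknown"   # stub for VQA

def process_frame(frame, task: str) -> str:
    """Route one camera frame through the pipeline and return text for voice output."""
    detections = detect(frame)
    if task == "atm":
        return f"{len(detections)} buttons detected. Screen reads: {read_text(frame)}"
    if task == "navigation":
        return f"{len(detections)} obstacles detected ahead."
    return answer_question(frame, "What is in front of me?")
```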

3.3 Optimization Techniques

  • Multithreading for parallel module execution
  • FIFO queues for processing decoupling
  • Trigger mode to reduce processing load
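
A minimal sketch of the decoupling pattern using Python's standard threading and queue modules; capture and detection are reduced to stubs, so queue sizes and pacing are assumptions:

```python
import queue
import threading
import time

frame_queue: queue.Queue = queue.Queue(maxsize=8)   # FIFO buffer between capture and inference
trigger = threading.Event()                         # "trigger mode": process only while set

def grab_frame():
    """Stub for camera capture."""
    return object()

def run_detection(frame):
    """Stub for the YOLOv8 call."""
    time.sleep(0.05)

def capture_loop() -> None:
    """Producer: push frames only while trigger mode is active, dropping frames when full."""
    while True:
        trigger.wait()                               # idle until the user triggers processing
        try:
            frame_queue.put(grab_frame(), timeout=0.1)
        except queue.Full:
            pass
        time.sleep(1 / 30)                           # pace roughly at camera frame rate

def inference_loop() -> None:
    """Consumer: run detection on queued frames in a separate thread."""
    while True:
        frame = frame_queue.get()                    # blocks until a frame is available
        run_detection(frame)
        frame_queue.task_done()

threading.Thread(target=capture_loop, daemon=True).start()
threading.Thread(target=inference_loop, daemon=True).start()
```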

4. Training and Testing

4.1 Training

Models were trained with the Ultralytics YOLOv8 CLI on the custom datasets described above, using the following splits (a split sketch is given after the list):

  • Campus dataset: 70% train / 15% val / 15% test
  • ATM dataset: 60% train / 20% val / 20% test
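
A sketch of how the splits can be produced with a seeded shuffled partition (folder layout and file extension are assumptions):

```python
import random
from pathlib import Path

def split_dataset(image_dir: str, ratios=(0.70, 0.15, 0.15), seed: int = 42):
    """Shuffle image paths and partition them into train/val/test lists."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    n_train = int(ratios[0] * len(images))
    n_val = int(ratios[1] * len(images))
    return images[:n_train], images[n_train:n_train + n_val], images[n_train + n_val:]

train, val, test = split_dataset("campus/images")                               # 70/15/15
atm_train, atm_val, atm_test = split_dataset("atm/images", (0.60, 0.20, 0.20))  # 60/20/20
```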

4.2 Testing and Validation

  • Frame processing latency
  • mAP on held-out test sets
  • OCR accuracy (edit distance)
  • VQA answer relevance (BLEU, human scoring)

5. Evaluation Methodology

5.1 Detection Accuracy

Detection accuracy was measured with mean Average Precision (mAP):

$$\text{mAP@0.5} = \frac{1}{|C|} \sum_{c \in C} AP_c$$

where $AP_c$ is the area under the precision–recall curve for class $c$ at an IoU threshold of 0.5 and $C$ is the set of object classes.
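
As an illustrative worked example (values are invented, not measured results): with three classes whose per-class APs at IoU 0.5 are 0.82, 0.74, and 0.90, mAP@0.5 = (0.82 + 0.74 + 0.90) / 3 ≈ 0.82.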

5.2 VQA Evaluation

VQA answers were scored with BLEU and METEOR against reference answers, complemented by human grading of answer relevance.

5.3 OCR Evaluation

Character Error Rate (CER):

CER = (S + D + I) / N

where S, D, and I are the numbers of character substitutions, deletions, and insertions, and N is the number of characters in the ground-truth text.
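
A self-contained sketch of this computation via character-level edit distance; the reference/hypothesis strings are illustrative:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance counting substitutions, deletions, and insertions."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,            # deletion
                dp[j - 1] + 1,        # insertion
                prev + (r != h),      # substitution (zero cost on a match)
            )
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    """CER = (S + D + I) / N, with N the number of reference characters."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(cer("WITHDRAW CASH", "W1THDRAW CASH"))   # one substitution over 13 characters ≈ 0.077
```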

5.4 Latency

Per-frame processing time was measured for detection, OCR, VQA, and text-to-speech (TTS), with a target threshold of under 500 ms per task.
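
A sketch of how per-stage latency can be logged against the 500 ms budget; the stage callables in the usage comment are placeholders:

```python
import time

def timed(stage_name: str, fn, *args, budget_ms: float = 500.0):
    """Run one pipeline stage, report its latency, and flag budget violations."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    status = "OK" if elapsed_ms < budget_ms else "OVER BUDGET"
    print(f"{stage_name}: {elapsed_ms:.1f} ms [{status}]")
    return result

# Usage (run_detection, run_ocr, etc. are placeholders for the real modules):
# detections = timed("detection", run_detection, frame)
# screen_text = timed("ocr", run_ocr, frame)
```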

Summary

The Beyond Vision system was developed through a multi-phase methodology that emphasizes real-world performance. By combining YOLOv8, vision-language models, OCR, and a modular design, we created a scalable and adaptive assistive solution for visually impaired users in academic environments.