Our Methodology

1. Data Acquisition and Preprocessing

1.1 Campus Navigation Dataset

For the campus navigation module, a custom dataset was developed to reflect the unique environmental challenges of university campuses. The data collection involved video recordings across multiple campus locations at Hacettepe University.

  • Video Format: 1080p, 30fps
  • Frame Extraction: Extracted at 2fps using FFmpeg, resulting in ~14,000 frames
  • Final Dataset Size: 4,202 manually labeled images after filtering
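
A minimal sketch of the frame-extraction step, assuming the recordings are stored as local .mp4 files (folder names and output layout are illustrative, not the exact project paths):

```python
import subprocess
from pathlib import Path

def extract_frames(video_path: Path, out_dir: Path, fps: int = 2) -> None:
    """Sample frames from one recording at the given rate using FFmpeg."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", str(video_path),
            "-vf", f"fps={fps}",                              # 2 fps sampling
            str(out_dir / f"{video_path.stem}_%05d.jpg"),
        ],
        check=True,
    )

# Hypothetical layout: raw recordings in recordings/, frames written to frames/<video name>/
for video in Path("recordings").glob("*.mp4"):
    extract_frames(video, Path("frames") / video.stem)
```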

1.2 ATM Interaction Dataset

The ATM dataset consists of:

  • Annotated images of ATM interfaces under real-world lighting
  • Finger and button positions labeled using Roboflow
  • Synthetic augmentations (rotation, blur, noise, occlusion) to increase robustness

1.3 Preprocessing Pipeline

  • Resizing: Images resized to 640×640 pixels
  • Normalization: Pixel values scaled to [0, 1]
  • Augmentation: Random flip, rotation, Gaussian noise, contrast adjustment
  • Annotation Format: YOLO format (class x_center y_center width height)
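
These steps map to a straightforward per-image routine. The sketch below uses OpenCV and NumPy and is illustrative rather than the exact pipeline code (parameter ranges are assumptions):

```python
import cv2
import numpy as np

def preprocess(image_path: str) -> np.ndarray:
    """Resize to 640x640 and scale pixel values to [0, 1]."""
    img = cv2.imread(image_path)
    img = cv2.resize(img, (640, 640))
    return img.astype(np.float32) / 255.0

def augment(img: np.ndarray) -> np.ndarray:
    """Random flip, rotation, Gaussian noise, and contrast adjustment."""
    if np.random.rand() < 0.5:
        img = cv2.flip(img, 1)                                 # horizontal flip
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), np.random.uniform(-15, 15), 1.0)
    img = cv2.warpAffine(img, M, (w, h))                       # small random rotation
    img = img + np.random.normal(0, 0.02, img.shape).astype(np.float32)  # Gaussian noise
    return np.clip(img * np.random.uniform(0.8, 1.2), 0.0, 1.0)          # contrast jitter

# One YOLO-format label line per object: class x_center y_center width height (all normalized)
example_label = "0 0.512 0.431 0.120 0.250"
```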

2. Computational Models and Algorithms

2.1 YOLOv8 for Object Detection

YOLOv8 was used for real-time object detection tasks. It optimizes the following loss function:

$$\mathcal{L} = \lambda_{box} \cdot \mathcal{L}_{CIoU} + \lambda_{cls} \cdot \mathcal{L}_{cls} + \lambda_{obj} \cdot \mathcal{L}_{obj}$$

where $\mathcal{L}_{CIoU}$ is the bounding-box regression loss, $\mathcal{L}_{cls}$ the classification loss, and $\mathcal{L}_{obj}$ the objectness loss, each weighted by its corresponding $\lambda$ coefficient.

Training Parameters:

  • Epochs: 50
  • Batch size: 16
  • Optimizer: SGD with momentum 0.937
  • Learning rate: 0.01 with cosine decay
  • Evaluation Metric: mAP@0.5 and mAP@0.5:0.95
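
The same hyperparameters expressed through the Ultralytics Python API (model size and dataset YAML path are placeholders, not the project's exact configuration):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # pretrained checkpoint; model size is a placeholder

model.train(
    data="campus.yaml",             # placeholder dataset config in YOLO format
    imgsz=640,
    epochs=50,
    batch=16,
    optimizer="SGD",
    momentum=0.937,
    lr0=0.01,
    cos_lr=True,                    # cosine learning-rate decay
)

metrics = model.val()               # reports mAP@0.5 and mAP@0.5:0.95 on the validation split
```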

2.2 Visual Question Answering (VQA) with VLMs

Two VLMs were explored:

  • CogVLM: adds a trainable visual expert module to the language model for deep multimodal alignment
  • BLIP-2: bridges frozen image encoders and a frozen LLM through a lightweight querying transformer (Q-Former)
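
As a concrete reference point, the snippet below shows minimal BLIP-2 inference with Hugging Face Transformers, assuming the public Salesforce/blip2-opt-2.7b checkpoint; the deployed model and prompt format may differ:

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("frame.jpg")                      # placeholder camera frame
prompt = "Question: What is in front of me? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=30)
answer = processor.decode(output_ids[0], skip_special_tokens=True)
```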

2.3 OCR with EasyOCR

  • Text Detection: CRAFT algorithm
  • Text Recognition: CRNN-based decoder
  • Post-Processing: Confidence threshold filtering and normalization
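
A sketch of this OCR step with the EasyOCR API; the language list and confidence threshold are assumptions:

```python
import easyocr

reader = easyocr.Reader(["en", "tr"])                # English + Turkish, assumed language set

def read_screen_text(image_path: str, min_conf: float = 0.4) -> list[str]:
    """CRAFT detection + CRNN recognition, followed by confidence filtering."""
    results = reader.readtext(image_path)            # list of (bbox, text, confidence)
    return [text.strip() for _, text, conf in results if conf >= min_conf]
```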

3. System Architecture

3.1 Modular Pipeline Overview

  • Input Layer: Smart glasses or mobile camera
  • Processing Layer: YOLOv8 + VLMs on cloud (Paperspace)
  • Interaction Layer: Mobile app with voice input/output

3.2 Module Flow

Camera Input → YOLOv8 → Task Decision → ATM/Navigation/VQA → OCR/Text Detection → Voice Output
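
A simplified routing sketch of this flow; the stub functions stand in for the real modules and are illustrative only:

```python
def detect(frame): return []                      # stub for YOLOv8 detection
def read_text(frame): return ""                   # stub for OCR
def answer_question(frame, q): return "unknown"   # stub for VQA

def process_frame(frame, task: str) -> str:
    """Route one camera frame through the pipeline and return text for voice output."""
    detections = detect(frame)
    if task == "atm":
        return f"{len(detections)} buttons detected. Screen reads: {read_text(frame)}"
    if task == "navigation":
        return f"{len(detections)} obstacles detected ahead."
    return answer_question(frame, "What is in front of me?")
```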

3.3 Optimization Techniques

  • Multithreading for parallel module execution
  • FIFO queues for processing decoupling
  • Trigger mode to reduce processing load
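
A minimal sketch of the decoupling pattern using Python's standard threading and queue modules; capture and detection are reduced to stubs, so queue sizes and pacing are assumptions:

```python
import queue
import threading
import time

frame_queue: queue.Queue = queue.Queue(maxsize=8)   # FIFO buffer between capture and inference
trigger = threading.Event()                         # "trigger mode": process only while set

def grab_frame():
    """Stub for camera capture."""
    return object()

def run_detection(frame):
    """Stub for the YOLOv8 call."""
    time.sleep(0.05)

def capture_loop() -> None:
    """Producer: push frames only while trigger mode is active, dropping frames when full."""
    while True:
        trigger.wait()                               # idle until the user triggers processing
        try:
            frame_queue.put(grab_frame(), timeout=0.1)
        except queue.Full:
            pass
        time.sleep(1 / 30)                           # pace roughly at camera frame rate

def inference_loop() -> None:
    """Consumer: run detection on queued frames in a separate thread."""
    while True:
        frame = frame_queue.get()                    # blocks until a frame is available
        run_detection(frame)
        frame_queue.task_done()

threading.Thread(target=capture_loop, daemon=True).start()
threading.Thread(target=inference_loop, daemon=True).start()
```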

4. Training and Testing

4.1 Training

Models were trained with the Ultralytics YOLOv8 CLI on the custom datasets described above, using the following splits (a split sketch is given after the list):

  • Campus dataset: 70% train / 15% val / 15% test
  • ATM dataset: 60% train / 20% val / 20% test
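
A sketch of how the splits can be produced with a seeded shuffled partition (folder layout and file extension are assumptions):

```python
import random
from pathlib import Path

def split_dataset(image_dir: str, ratios=(0.70, 0.15, 0.15), seed: int = 42):
    """Shuffle image paths and partition them into train/val/test lists."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    n_train = int(ratios[0] * len(images))
    n_val = int(ratios[1] * len(images))
    return images[:n_train], images[n_train:n_train + n_val], images[n_train + n_val:]

train, val, test = split_dataset("campus/images")                               # 70/15/15
atm_train, atm_val, atm_test = split_dataset("atm/images", (0.60, 0.20, 0.20))  # 60/20/20
```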

4.2 Testing and Validation

  • Frame processing latency
  • mAP on held-out test sets
  • OCR accuracy (edit distance)
  • VQA answer relevance (BLEU, human scoring)

5. Evaluation Methodology

5.1 Detection Accuracy

Detection accuracy was measured with mean Average Precision (mAP):

$$\text{mAP@0.5} = \frac{1}{|C|} \sum_{c \in C} AP_c$$

where $AP_c$ is the area under the precision–recall curve for class $c$ at an IoU threshold of 0.5 and $C$ is the set of object classes.
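
As an illustrative worked example (values are invented, not measured results): with three classes whose per-class APs at IoU 0.5 are 0.82, 0.74, and 0.90, mAP@0.5 = (0.82 + 0.74 + 0.90) / 3 ≈ 0.82.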

5.2 VQA Evaluation

VQA answers were scored with BLEU and METEOR against reference answers, complemented by human grading of answer relevance.

5.3 OCR Evaluation

Character Error Rate (CER):

CER = (S + D + I) / N

where S, D, and I are the numbers of character substitutions, deletions, and insertions, and N is the number of characters in the ground-truth text.
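
A self-contained sketch of this computation via character-level edit distance; the reference/hypothesis strings are illustrative:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance counting substitutions, deletions, and insertions."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,            # deletion
                dp[j - 1] + 1,        # insertion
                prev + (r != h),      # substitution (zero cost on a match)
            )
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    """CER = (S + D + I) / N, with N the number of reference characters."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(cer("WITHDRAW CASH", "W1THDRAW CASH"))   # one substitution over 13 characters ≈ 0.077
```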

5.4 Latency

Per-frame processing time was measured for detection, OCR, VQA, and text-to-speech (TTS), with a target threshold of under 500 ms per task.
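
A sketch of how per-stage latency can be logged against the 500 ms budget; the stage callables in the usage comment are placeholders:

```python
import time

def timed(stage_name: str, fn, *args, budget_ms: float = 500.0):
    """Run one pipeline stage, report its latency, and flag budget violations."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    status = "OK" if elapsed_ms < budget_ms else "OVER BUDGET"
    print(f"{stage_name}: {elapsed_ms:.1f} ms [{status}]")
    return result

# Usage (run_detection, run_ocr, etc. are placeholders for the real modules):
# detections = timed("detection", run_detection, frame)
# screen_text = timed("ocr", run_ocr, frame)
```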

Summary

The Beyond Vision system was developed through a multi-phase methodology that emphasizes real-world performance. By combining YOLOv8, vision-language models, OCR, and a modular design, we created a scalable and adaptive assistive solution for visually impaired users in academic environments.