Case study
Sight2Speech
Real-time on-device YOLOv8 classification with spoken feedback from the live camera feed.
Overview
Sight2Speech runs YOLOv8 image classification entirely on-device, takes frames from the live camera feed, and speaks the top-1 label aloud in real time.
Problem
Sight2Speech identifies what the camera sees and speaks the result quickly enough for hands-free use. It runs fully on-device to avoid network dependency and protect privacy, balancing responsiveness with limited mobile compute.
Constraints
- On-device only: no cloud inference.
- Camera frames arrive as YUV420 and must be converted to RGB.
- Processing must be throttled to avoid overheating and UI lag.
- Low-end processors introduce noticeable latency.
- YOLOv8 classification returns top-1 labels, not bounding boxes.
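The YUV420-to-RGB constraint can be sketched as a per-pixel conversion. This is a minimal illustration using BT.601 full-range coefficients, which is an assumption; the app's actual converter and coefficient choice are not shown here.

```python
# Sketch of the YUV420 -> RGB step (BT.601 full-range coefficients; an
# assumption, not the app's confirmed conversion).
def yuv_to_rgb(y: int, u: int, v: int) -> tuple[int, int, int]:
    """Convert one YUV pixel (0-255 channels, U/V centered at 128) to RGB."""
    d, e = u - 128, v - 128
    clamp = lambda x: max(0, min(255, int(round(x))))
    r = clamp(y + 1.402 * e)
    g = clamp(y - 0.344136 * d - 0.714136 * e)
    b = clamp(y + 1.772 * d)
    return r, g, b

# Neutral chroma (U = V = 128) leaves luma unchanged: grayscale in, grayscale out.
print(yuv_to_rgb(128, 128, 128))  # (128, 128, 128)
```

In a real camera pipeline this runs over three planes (Y at full resolution, U and V subsampled 2x2), but the per-pixel math is the same.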
Decisions
- Use YOLOv8 classification (yolov8n-cls_float32.tflite) for compact on-device inference.
- Convert YUV420 camera frames to RGB and resize to the model input size.
- Throttle inference to roughly every 400ms to reduce CPU load and UI jank.
- Gate speech output with confidence >= 0.4 and a 2s cooldown to avoid chatter.
- Load labels from assets/bhaudata.txt.txt and map the top-1 index to a label.
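The throttle and speech-gating decisions above can be sketched as a small state holder. The class name, constants, and injectable clock are illustrative; only the numbers (400 ms throttle, 0.4 confidence floor, 2 s cooldown) come from the decisions themselves.

```python
import time

THROTTLE_S = 0.4      # run inference at most every ~400 ms
CONF_THRESHOLD = 0.4  # minimum top-1 confidence before speaking
COOLDOWN_S = 2.0      # minimum gap between spoken labels

class SpeechGate:
    """Decides when a frame is classified and when a result is spoken aloud."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.last_infer = float("-inf")
        self.last_speak = float("-inf")

    def should_infer(self) -> bool:
        """True if enough time has passed since the last inference."""
        now = self.clock()
        if now - self.last_infer >= THROTTLE_S:
            self.last_infer = now
            return True
        return False

    def should_speak(self, confidence: float) -> bool:
        """True if confidence clears the floor and the cooldown has elapsed."""
        now = self.clock()
        if confidence >= CONF_THRESHOLD and now - self.last_speak >= COOLDOWN_S:
            self.last_speak = now
            return True
        return False
```

Gating inference and speech separately means the UI overlay can still refresh every cycle while the text-to-speech engine stays quiet between cooldowns.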
Metrics
Architecture
Camera (YUV420 stream)
Frame converter (YUV420 → RGB)
Resizer (model input size)
YOLOv8 TFLite interpreter
Top-1 selector
UI overlay
Text-to-speech
Connections
camera → converter
converter → resizer
resizer → tflite
tflite → top1
top1 → ui
top1 → tts
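The connections above amount to a linear pipeline that fans out after top-1 selection. A minimal sketch, with the stage callables left as placeholders (their names and signatures are assumptions, not the app's actual interfaces):

```python
def top1(scores: list[float], labels: list[str]) -> tuple[str, float]:
    """Pick the highest-scoring class index and map it to its label."""
    idx = max(range(len(scores)), key=scores.__getitem__)
    return labels[idx], scores[idx]

def run_frame(frame, convert, resize, infer, labels, on_ui, on_tts):
    """One pass through the pipeline: camera frame in, UI and speech out."""
    scores = infer(resize(convert(frame)))  # converter -> resizer -> tflite
    label, conf = top1(scores, labels)      # tflite -> top1
    on_ui(label, conf)                      # top1 -> ui
    on_tts(label, conf)                     # top1 -> tts
```

Fanning out from the top-1 selector keeps the UI overlay and the text-to-speech path independent, so speech gating never blocks the on-screen label.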