Case study

Sight2Speech

Real-time on-device YOLOv8 classification with spoken feedback from the live camera feed.

mobile · accessibility · computer-vision · on-device-ml · tts · yolo · Flutter · Dart · Camera · tflite_flutter · flutter_tts · TensorFlow Lite · YOLOv8 (TFLite)

Overview

Sight2Speech is a Flutter app that runs a YOLOv8 classification model on-device with TensorFlow Lite: live camera frames are converted from YUV420 to RGB, classified, and the top-1 label is spoken aloud via text-to-speech.

Problem

Sight2Speech identifies what the camera sees and speaks the result quickly enough for hands-free use. It runs fully on-device to avoid network dependency and protect privacy, balancing responsiveness with limited mobile compute.

Constraints

  • On-device only; no cloud inference.
  • YUV420 camera frames must be converted to RGB before inference.
  • Processing is throttled to avoid overheating and UI lag.
  • Low-end processors introduce noticeable latency.
  • YOLOv8 classification returns top-1 labels, not bounding boxes.

Decisions

  • Use YOLOv8 classification (yolov8n-cls_float32.tflite) for compact on-device inference.
  • Convert YUV420 camera frames to RGB and resize to the model input size.
  • Throttle inference to roughly every 400ms to reduce CPU load and UI jank.
  • Gate speech output with confidence >= 0.4 and a 2s cooldown to avoid chatter.
  • Load labels from assets/bhaudata.txt.txt and map the top-1 index to a label.
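The throttle and speech-gating decisions above can be sketched as a small Dart class. The name `SpeechGate` and its method names are illustrative; the defaults mirror the documented values (400 ms inference interval, 0.4 score threshold, 2 s cooldown):

```dart
/// Gates inference and speech output. Illustrative sketch; in the app this
/// would sit inside the camera image-stream callback.
class SpeechGate {
  final double scoreThreshold;
  final Duration inferenceInterval;
  final Duration ttsCooldown;

  DateTime _lastInference = DateTime.fromMillisecondsSinceEpoch(0);
  DateTime _lastSpoken = DateTime.fromMillisecondsSinceEpoch(0);

  SpeechGate({
    this.scoreThreshold = 0.4,
    this.inferenceInterval = const Duration(milliseconds: 400),
    this.ttsCooldown = const Duration(seconds: 2),
  });

  /// True if enough time has passed since the last inference; the frame
  /// should be dropped otherwise, keeping CPU load and UI jank down.
  bool shouldInfer(DateTime now) {
    if (now.difference(_lastInference) < inferenceInterval) return false;
    _lastInference = now;
    return true;
  }

  /// True if the top-1 score clears the threshold and the cooldown has
  /// elapsed, so low-confidence or rapid-fire labels stay silent.
  bool shouldSpeak(double score, DateTime now) {
    if (score < scoreThreshold) return false;
    if (now.difference(_lastSpoken) < ttsCooldown) return false;
    _lastSpoken = now;
    return true;
  }
}
```

In the camera stream callback this would skip frames when `shouldInfer` is false, and call `FlutterTts().speak(label)` only when `shouldSpeak` passes.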

Metrics

model_asset: assets/yolov8n-cls_float32.tflite
labels_source: assets/bhaudata.txt.txt
tts_cooldown_s: 2
score_threshold: 0.4
camera_resolution: medium
inference_interval_ms: 400

Architecture

  • Camera (YUV420 stream)
  • Frame converter (YUV420 -> RGB)
  • Resizer (model input size)
  • YOLOv8 TFLite interpreter
  • Top-1 selector
  • UI overlay
  • Text-to-speech
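The frame-converter stage is the standard BT.601 YUV -> RGB transform. A minimal sketch, assuming the Y plane's row stride equals the image width (the real implementation would read `bytesPerRow` and `bytesPerPixel` from the `CameraImage` planes):

```dart
import 'dart:typed_data';

/// Clamps an integer into the valid 8-bit channel range.
int clamp8(int v) => v < 0 ? 0 : (v > 255 ? 255 : v);

/// Converts planar YUV420 data to interleaved RGB bytes.
/// Sketch only: parameter layout is an assumption about how the camera
/// plugin exposes its planes.
Uint8List yuv420ToRgb(
  Uint8List yPlane,
  Uint8List uPlane,
  Uint8List vPlane,
  int width,
  int height,
  int uvRowStride,
  int uvPixelStride,
) {
  final rgb = Uint8List(width * height * 3);
  var o = 0;
  for (var row = 0; row < height; row++) {
    for (var col = 0; col < width; col++) {
      // U and V are subsampled 2x2, so two pixels share one chroma sample.
      final uvIndex = uvPixelStride * (col ~/ 2) + uvRowStride * (row ~/ 2);
      final y = yPlane[row * width + col];
      final u = uPlane[uvIndex] - 128;
      final v = vPlane[uvIndex] - 128;
      // BT.601 conversion coefficients.
      rgb[o++] = clamp8((y + 1.402 * v).round());
      rgb[o++] = clamp8((y - 0.344 * u - 0.714 * v).round());
      rgb[o++] = clamp8((y + 1.772 * u).round());
    }
  }
  return rgb;
}
```

The resulting RGB buffer would then be resized to the model's input size before being handed to the interpreter.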

Connections

  • camera -> converter
  • converter -> resizer
  • resizer -> tflite
  • tflite -> top1
  • top1 -> ui
  • top1 -> tts
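The top1 stage reduces to an argmax over the interpreter's output scores, mapped through the label list loaded from assets. A sketch, with `selectTop1` as an illustrative name:

```dart
/// Picks the highest-scoring class and pairs it with its label.
/// The scores list is the classification head's output (one probability
/// per class); labels come from the asset file, one label per line.
({String label, double score}) selectTop1(
  List<double> scores,
  List<String> labels,
) {
  var bestIdx = 0;
  for (var i = 1; i < scores.length; i++) {
    if (scores[i] > scores[bestIdx]) bestIdx = i;
  }
  return (label: labels[bestIdx], score: scores[bestIdx]);
}
```

The returned score is what the speech gate compares against the 0.4 threshold before the label reaches the UI overlay and the TTS engine.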
