Case study
Sight2Speech
Real-time on-device YOLOv8 classification with spoken feedback from the live camera feed.
Overview
Sight2Speech runs YOLOv8 image classification entirely on-device, takes frames from the live camera feed, and speaks the top-1 label aloud in real time.
Problem
Sight2Speech identifies what the camera sees and speaks the result quickly enough for hands-free use. It runs fully on-device to avoid network dependency and protect privacy, balancing responsiveness with limited mobile compute.
Constraints
- On-device only: no cloud inference.
- Camera frames arrive as YUV420 and must be converted to RGB.
- Processing must be throttled to avoid overheating and UI lag.
- Low-end processors introduce noticeable latency.
- YOLOv8 classification returns top-1 labels, not bounding boxes.
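The YUV420-to-RGB constraint can be sketched as a per-pixel conversion. This is a minimal illustration using BT.601 full-range coefficients, which is an assumption; the app's actual converter and coefficient choice are not shown here.

```python
# Sketch of the YUV420 -> RGB step (BT.601 full-range coefficients; an
# assumption, not the app's confirmed conversion).
def yuv_to_rgb(y: int, u: int, v: int) -> tuple[int, int, int]:
    """Convert one YUV pixel (0-255 channels, U/V centered at 128) to RGB."""
    d, e = u - 128, v - 128
    clamp = lambda x: max(0, min(255, int(round(x))))
    r = clamp(y + 1.402 * e)
    g = clamp(y - 0.344136 * d - 0.714136 * e)
    b = clamp(y + 1.772 * d)
    return r, g, b

# Neutral chroma (U = V = 128) leaves luma unchanged: grayscale in, grayscale out.
print(yuv_to_rgb(128, 128, 128))  # (128, 128, 128)
```

In a real camera pipeline this runs over three planes (Y at full resolution, U and V subsampled 2x2), but the per-pixel math is the same.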
Decisions
- Use YOLOv8 classification (yolov8n-cls_float32.tflite) for compact on-device inference.
- Convert YUV420 camera frames to RGB and resize to the model input size.
- Throttle inference to roughly every 400ms to reduce CPU load and UI jank.
- Gate speech output with confidence >= 0.4 and a 2s cooldown to avoid chatter.
- Load labels from assets/bhaudata.txt.txt and map the top-1 index to a label.
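The throttle and speech-gating decisions above can be sketched as a small state holder. The class name, constants, and injectable clock are illustrative; only the numbers (400 ms throttle, 0.4 confidence floor, 2 s cooldown) come from the decisions themselves.

```python
import time

THROTTLE_S = 0.4      # run inference at most every ~400 ms
CONF_THRESHOLD = 0.4  # minimum top-1 confidence before speaking
COOLDOWN_S = 2.0      # minimum gap between spoken labels

class SpeechGate:
    """Decides when a frame is classified and when a result is spoken aloud."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.last_infer = float("-inf")
        self.last_speak = float("-inf")

    def should_infer(self) -> bool:
        """True if enough time has passed since the last inference."""
        now = self.clock()
        if now - self.last_infer >= THROTTLE_S:
            self.last_infer = now
            return True
        return False

    def should_speak(self, confidence: float) -> bool:
        """True if confidence clears the floor and the cooldown has elapsed."""
        now = self.clock()
        if confidence >= CONF_THRESHOLD and now - self.last_speak >= COOLDOWN_S:
            self.last_speak = now
            return True
        return False
```

Gating inference and speech separately means the UI overlay can still refresh every cycle while the text-to-speech engine stays quiet between cooldowns.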
Metrics
Architecture
Camera (YUV420 stream)
Frame converter (YUV420 → RGB)
Resizer (model input size)
YOLOv8 TFLite interpreter
Top-1 selector
UI overlay
Text-to-speech
Connections
camera → converter
converter → resizer
resizer → tflite
tflite → top1
top1 → ui
top1 → tts
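The connections above amount to a linear pipeline that fans out after top-1 selection. A minimal sketch, with the stage callables left as placeholders (their names and signatures are assumptions, not the app's actual interfaces):

```python
def top1(scores: list[float], labels: list[str]) -> tuple[str, float]:
    """Pick the highest-scoring class index and map it to its label."""
    idx = max(range(len(scores)), key=scores.__getitem__)
    return labels[idx], scores[idx]

def run_frame(frame, convert, resize, infer, labels, on_ui, on_tts):
    """One pass through the pipeline: camera frame in, UI and speech out."""
    scores = infer(resize(convert(frame)))  # converter -> resizer -> tflite
    label, conf = top1(scores, labels)      # tflite -> top1
    on_ui(label, conf)                      # top1 -> ui
    on_tts(label, conf)                     # top1 -> tts
```

Fanning out from the top-1 selector keeps the UI overlay and the text-to-speech path independent, so speech gating never blocks the on-screen label.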