School of Science and Technology 科技學院
Computing Programmes 電腦學系

Subtitle Glasses for the Hearing Impaired and Tourists

Chui Tsz Ching, Lee Chak Yin, Wu Long Fung, Yip On Tik

Programme: Bachelor of Computing with Honours in Internet Technology; Bachelor of Science with Honours in Computer Science
Supervisor: Dr. Jeff Au Yeung Siu Kei
Areas: Intelligent Applications
Year of Completion: 2025

Objectives

Project Aim

The aim of the project has remained unchanged: our target audiences experience significant inconvenience from communication barriers and network limitations. To address these challenges, we aim to develop an application for smart glasses. The application will work offline, focusing on Cantonese speech-to-text conversion and text-to-text translation of spoken Cantonese into multiple languages, with everything processed in real time.

Project Objectives

The objectives of the project have remained unchanged. To achieve the project aim, we need to attain the following objectives:

  1. Research offline speech-to-text and text-to-text models available on the internet and evaluate their performance (accuracy and response time).
  2. Set up a server hosting the speech-to-text and text-to-text models in order to lower the development cost.
  3. Test and evaluate the system using the server.
  4. Investigate technical problems, such as filtering out the wearer's own speech so that only other speakers' speech is output, and the effective distance at which the microphone can pick up speech.
  5. Design the user interface and implement user-friendly features, such as choosing the output language (if applicable) and resizing the font.

Videos

Demonstration Video

Presentation Video

Methodologies and Technologies Used

Overview of the Solution 

Early Prototype (Cloud-Based) 

  • Leveraged Azure Speech-to-Text + Google Cloud Translation for real-time captions 
  • Proved real-time transcription and multilingual translation feasibility 
  • Faced issues: third-party latency, black-box model behavior, and escalating API costs 
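
A minimal sketch of how such a cloud pipeline can be wired together, assuming the official Python client libraries for both services; the credential environment variables and the single-utterance recognize_once flow are illustrative choices, not details taken from the report:

import os

import azure.cognitiveservices.speech as speechsdk  # pip install azure-cognitiveservices-speech
from google.cloud import translate_v2 as translate  # pip install google-cloud-translate

# Hypothetical environment variables; the report does not list credentials.
speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["SPEECH_KEY"], region=os.environ["SPEECH_REGION"]
)
speech_config.speech_recognition_language = "zh-HK"  # Cantonese (Hong Kong)

# Uses the default microphone; blocks until one utterance is recognized.
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    client = translate.Client()  # reads GOOGLE_APPLICATION_CREDENTIALS
    translated = client.translate(result.text, target_language="en")
    print(result.text, "->", translated["translatedText"])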

Shift to Local Server Architecture 

  • Migrated STT workloads from cloud APIs to self-hosted, lightweight models 
  • Goals: lower & predictable latency, full control over models, better privacy, and cost savings 

Parallel Development Streams 

  • UI & Display Team: Built smart-glasses interface, customization for tourists vs. hearing-impaired users 
  • STT & Backend Team: Evaluated/deployed on-premise speech models; designed server with: 
    • Asynchronous task queues 
    • WebSocket for real-time audio/text exchange 
    • Secure data handling 
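
A condensed sketch of this server design, assuming Python with asyncio and the websockets package (version 10.1 or later, which accepts single-argument handlers); transcribe_chunk is a hypothetical stand-in for the self-hosted STT model, which is not named here:

import asyncio

import websockets  # pip install websockets


async def transcribe_chunk(audio: bytes) -> str:
    # Hypothetical stand-in for the self-hosted STT model; kept off the
    # event loop so inference cannot block other connections.
    return await asyncio.to_thread(lambda: "<transcript of %d bytes>" % len(audio))


async def handle_client(ws):
    queue = asyncio.Queue()  # per-connection asynchronous task queue

    async def worker():
        while True:
            chunk = await queue.get()
            text = await transcribe_chunk(chunk)
            await ws.send(text)  # push the transcript back in real time

    task = asyncio.create_task(worker())
    try:
        async for message in ws:  # binary frames carry raw audio chunks
            await queue.put(message)
    finally:
        task.cancel()  # connection closed; stop this client's worker


async def main():
    # Serve on the LAN; TLS/auth (the secure data handling above) omitted.
    async with websockets.serve(handle_client, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever


if __name__ == "__main__":
    asyncio.run(main())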

Final System Design 

  • Client-Server Model: 
    • Client (Glasses): Captures audio, streams to server, renders transcriptions/translations instantly 
    • Server: Processes audio with local STT (and optional translation), returns text via WebSocket 
  • Outcome: A self-contained real-time captioning solution offering predictable performance, privacy, and cost efficiency.
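
For illustration, a desktop stand-in for the glasses client, assuming the websockets and sounddevice packages; the real client is the Android app running on the glasses, and the server address below is hypothetical:

import asyncio

import sounddevice as sd  # pip install sounddevice
import websockets

SERVER_URI = "ws://192.168.0.10:8765"  # hypothetical local-server address
SAMPLE_RATE = 16_000  # 16 kHz mono PCM is a common STT input format


async def stream_microphone(ws):
    loop = asyncio.get_running_loop()
    audio_q = asyncio.Queue()

    def on_audio(indata, frames, time_info, status):
        # Runs on the audio thread; hand each chunk to the event loop.
        loop.call_soon_threadsafe(audio_q.put_nowait, bytes(indata))

    with sd.RawInputStream(samplerate=SAMPLE_RATE, channels=1,
                           dtype="int16", callback=on_audio):
        while True:
            await ws.send(await audio_q.get())  # one binary frame per chunk


async def receive_captions(ws):
    async for text in ws:  # the server pushes transcripts/translations
        print("CAPTION:", text)


async def main():
    async with websockets.connect(SERVER_URI) as ws:
        await asyncio.gather(stream_microphone(ws), receive_captions(ws))


if __name__ == "__main__":
    asyncio.run(main())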

Hardware Components (Smart Glasses) 

INMO Air2 glasses 

  • Battery: 500 mAh
  • Processor and Memory: RAM: 2GB, ROM: 32GB, Chip: ZiGuang ZhanRui AI Chip (quad-core, 1.8GHz)
  • Connectivity Options: Wi-Fi with 2.4GHz/5GHz
  • Audio: 2 microphones
  • Display: Micro-OLED; FOV: 26°; sRGB 100%; Resolution: 640×400
  • Operating System: IMOS 2.0 (similar to Android)
  • Controller: A ring controller and touch pads built into the glasses

Figure 1: System Block Diagram 1  

Figure 2: System Block Diagram 2

Results (Prototype System Design)

Prototype Architecture & Cloud Service Selection 

Cloud-Based Proof-of-Concept 

  • Deployed both STT and T2T in the cloud to rapidly validate real-time captioning on smart glasses 
  • Audio streamed from the glasses' mic → cloud APIs → text returned to OLED display 

Comparative STT Testing 

  • Services: Azure Speech Service vs. Google Speech-to-Text 
  • Test conditions: scripted/unscripted dialog, indoor quiet, outdoor noise 
  • Findings: Azure outperformed Google on mixed Cantonese-English input and in noisy settings
  • Glasses' built-in microphone proved reliable across environments 
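
The report does not state which accuracy metric these comparisons used; a common choice for STT evaluation is the word error rate (character error rate for Cantonese), which can be computed from the Levenshtein edit distance, as in this sketch:

def error_rate(reference, hypothesis):
    # Levenshtein edit distance between token sequences, normalised by the
    # reference length. Pass lists of words for WER or of characters for CER.
    m, n = len(reference), len(hypothesis)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n] / max(m, 1)


# Character error rate for a short Cantonese utterance (one missing character):
print(error_rate(list("今日天氣好好"), list("今日天氣好")))  # 1/6 ≈ 0.167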

Comparative Translation Testing 

  • Services: Google Translate API vs. Azure Translation 
  • Test material: informal Cantonese utterances 
  • Findings: Google produced more natural, idiomatic translations 

Final Prototype Stack 

  • Transcription: Azure Speech-to-Text API 
  • Translation: Google Translate API 
  • Client: Android mobile app streaming audio/text to/from the cloud 
  • Display: Instant captions on smart glasses' OLED 

Core Features Implemented 

Real-Time STT 

  • Glasses capture audio and stream it to Azure Speech-to-Text. 
  • Supports mixed Cantonese/English input with 1–2.5 s latency. 
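
The 1–2.5 s figure is an end-to-end number; one simple way to measure it is to timestamp each utterance when its audio is sent and compare against the arrival time of the matching caption. The probe below is a hypothetical sketch; the report does not describe its measurement method:

import time


class LatencyProbe:
    # Collects end-to-end delays between sending audio and receiving text.
    def __init__(self):
        self._sent_at = None
        self.samples = []

    def mark_sent(self):
        # Call when an utterance's audio has been streamed to the service.
        self._sent_at = time.monotonic()

    def mark_received(self):
        # Call when the matching caption text arrives back.
        if self._sent_at is not None:
            self.samples.append(time.monotonic() - self._sent_at)
            self._sent_at = None


probe = LatencyProbe()
probe.mark_sent()
time.sleep(0.1)  # stand-in for the round trip
probe.mark_received()
print(f"latency: {probe.samples[0]:.2f} s")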

On-the-Fly Translation 

  • Transcribed text is fed into Google Translate. 
  • Delivers immediate translations in the user's chosen language—ideal for tourists. 

Wearable Subtitle Display

  • OLED shows two lines: original transcript + translated text.
  • Minimal UI keeps captions legible without blocking vision.
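
One way such a two-line layout might be enforced is to trim each line to a character budget derived from the chosen font size. The helper below is a hypothetical sketch, not taken from the app's code, and the 24-character budget is an assumption:

def fit_line(text, max_chars):
    # Keep the tail: in live captioning the newest words matter most.
    return text if len(text) <= max_chars else "…" + text[-(max_chars - 1):]


def two_line_caption(original, translation, max_chars=24):
    # Line 1: original transcript; line 2: its translation.
    return fit_line(original, max_chars) + "\n" + fit_line(translation, max_chars)


print(two_line_caption("今日我哋去參觀香港都會大學", "Today we are visiting HKMU"))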

Basic Controls

  • Start/Stop toggle for speech capture.
  • Dual-line display for quick performance checks.

End-to-End Cloud Pipeline

  • Audio sent over HTTPS to the cloud back end.
  • Validated the seamless flow: capture → transcribe → translate → display.

Figure 3: Prototype System Architecture


Figure 4: Design Evolution Overview

Version 4 (Current Design): 

Our final prototype UI refinement focused on intuitive iconography and improved visual hierarchy: 

  • Text-based buttons were replaced with universally recognizable icons 
  • Clear visual instruction (“Tap to begin listening”) with a prominent play button 
  • Simplified main screen showing only essential information during conversations 
  • Enhanced contrast and optimized text size for readability in various lighting conditions 

The interface consisted of four basic buttons: 

  • Start Recording – Begins capturing and transmitting audio to the local server. 
  • Stop Recording – Ends the current audio session. 
  • Connect to Server – Establishes a connection to the server.
  • Disconnect – Closes the connection to the server.

Figure 5: Developer UI deployed on INMO Air2 for Server-based testing 

Implementation

Deployment Hardware 

  • Device: INMO Air2 Smart Glasses 
  • OS: IMOS 2.0 (Android-based) 
  • Processor: ZiGuang ZhanRui AI Chip (Quad-core, 1.8GHz) 
  • Memory: 2GB RAM, 32GB ROM 
  • Display: Micro-OLED, 640×400 resolution, sRGB 100%, FOV 26° 
  • Input: 2 built-in microphones 
  • Connectivity: Dual-band Wi-Fi (2.4GHz / 5GHz) 
  • Control: Touch pad or ring controller 

Prototype System Workflow 

  • Wi-Fi Connection: Connect to a Wi-Fi network or hotspot so the app can reach the cloud services.
  • Audio Input: User speaks directly to the glasses. 
  • Cloud Transcription: Audio is streamed to Azure Speech-to-Text API for real-time transcription. 
  • Translation (Optional): Transcribed text is sent to Google Translate API. 
  • Subtitle Display: Output is displayed on the glasses’ OLED screen with adjustable font size. 

Figure 6: Script 1 (simulating a tour guide at HKMU)

Figure 8: Script 2 (simulating food ordering in a restaurant)



