School of Science and Technology 科技學院
Computing Programmes 電腦學系

Cantonese to Written Chinese Machine Translation 

Tsang Tsz Yui, Ng Man Kwan, Lau Lok Lam

Programme: Bachelor of Computing with Honours in Internet Technology / Bachelor of Science with Honours in Computer Science
Supervisor: Dr. Jeff Au Yeung Siu Kei
Areas: Intelligent Applications
Year of Completion: 2024

Objectives

Project Aim

This project aims to enhance the accessibility of Cantonese content, and to provide convenience to both content creators and audiences, by creating a platform that automates transcription and Cantonese-to-written-Chinese translation from video input.

Project Objectives 

Our objective is to develop a Cantonese-to-Chinese auto-subtitling tool that provides written Chinese subtitles for Cantonese movies and videos. The tool will be delivered as a web application, and its functionality divides mainly into transcription and translation.

  1. Transcription:
  • Use an existing tool to extract audio from the user's video and unify its format.
  • Use the speech-to-text service provided by Microsoft Azure to transcribe the audio content.
  2. Translation:
  • Build our own Cantonese-to-Chinese dataset to ensure translation accuracy.
  • Develop a Cantonese-to-Chinese translation model by fine-tuning a pre-trained AI model on our dataset.
  3. Develop a user-friendly user interface for our web application.

Videos

Demonstration Video

Presentation Video

Methodologies and Technologies used

The technologies used in this project are listed below; many of them are third-party Python libraries.

  • Python: The main language of our application development.
  • FFmpeg: Video/audio processing tool, driven from Python, used to extract audio from the input video.
  • Pre-trained models: AI models released by major companies (e.g. BERT from Google). Used as the basis of our application's translation model and later fine-tuned on our dataset.
  • Werkzeug: Secures input files so they can be stored safely on the file system; any potentially problematic filename is converted to an ASCII-only string.
  • PyTorch: Used to implement the AI models.
  • Cloud server: Cloud hosting services provided by major vendors, used to host our website.
  • Flask: Framework used for web application development and for implementing our APIs.
  • MoviePy: Creates the subtitle-embedded video for final product preview and download.
  • Auditok: Splits the input audio at silent moments into segments of suitable sentence length for subtitles, and generates accurate timestamps for each segment (see the sketch after this list).
  • Transformers: A Hugging Face library for downloading, training, and running transformer AI models.
  • Datasets: A Hugging Face library for accessing datasets in the Hugging Face hub.
  • Evaluate: A Hugging Face library offering multiple evaluation metrics for measuring model performance.
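
As a concrete illustration of how two of these components might fit together, the sketch below extracts a mono audio track with FFmpeg and splits it into subtitle-sized segments with Auditok. This is a minimal sketch: the file names, sample rate, and threshold values are illustrative assumptions, not the project's actual settings.

    import subprocess
    import auditok

    # Extract a 16 kHz mono WAV track from the uploaded video with FFmpeg
    # (paths and sample rate are illustrative assumptions).
    subprocess.run(
        ["ffmpeg", "-y", "-i", "input_video.mp4",
         "-vn", "-ac", "1", "-ar", "16000", "audio.wav"],
        check=True,
    )

    # Split the audio at silent moments into sentence-length regions;
    # each region carries the start/end timestamps needed for subtitles.
    regions = auditok.split(
        "audio.wav",
        min_dur=0.3,      # ignore fragments shorter than 0.3 s
        max_dur=8.0,      # cap a subtitle segment at 8 s
        max_silence=0.4,  # a silence gap this long ends a segment
        energy_threshold=50,
    )

    for i, region in enumerate(regions):
        print(f"segment {i}: {region.meta.start:.2f}s -> {region.meta.end:.2f}s")
        region.save(f"segment_{i}.wav")  # each file goes to speech-to-text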

Figure 1: Use case diagram

Single-page website design used for clarity and ease of use.

Caters to both standard and experienced users:

  • Simple UI for standard users.
  • Advanced options hidden but available for experienced users.

Advanced settings have default values:

  • Users can ignore or customize them without interrupting the process.

Uploaded video file is the main data source:

  • Temporarily stored only during processing.
  • Automatically deleted after process completion and user disconnect.

System data flow is represented with a supporting diagram.

Figure 2: User flow diagram representation of our system.


Data flow process:

  1. User uploads video.
  2. Audio is extracted from the video.
  3. Video is saved for final embedding.
  4. Audio is segmented into smaller parts with time offsets.
  5. Each segment is sent to Azure API for processing.
  6. Results are used to translate and timestamp the audio.
  7. A subtitle file (.srt) is generated (see the sketch after this list).
  8. User can either receive the subtitle file or request the burned-in video output.
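
To make step 7 concrete, here is a minimal sketch of turning timestamped, translated segments into an SRT file. The (start, end, text) tuple format and the helper names are hypothetical, not the application's actual interfaces.

    def srt_timestamp(seconds: float) -> str:
        # Format seconds as the HH:MM:SS,mmm timestamp SRT expects.
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    def write_srt(segments, path="output.srt"):
        # segments: iterable of (start_sec, end_sec, translated_text) tuples.
        with open(path, "w", encoding="utf-8") as f:
            for i, (start, end, text) in enumerate(segments, start=1):
                f.write(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n\n")

    # Example with written-Chinese subtitle text (placeholder content).
    write_srt([(0.0, 2.5, "你好，歡迎收看。"), (2.8, 5.1, "今天我們談談翻譯。")])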

Figure 3: Data flow diagram representation of the system.

Training methodology

Dataset:

  • Used 110k training pairs and 5.6k validation pairs from sources like Cantonese Wikipedia and Hong Kong forums.
  • Manual validation ensured clean grammar, punctuation, and meaning.
  • Cantonese sentences were hand-translated to written Chinese for accuracy.

Training Process:

  • Dataset downloaded from Hugging Face and tokenized.
  • Pre-trained models fine-tuned on the tokenized training set (see the sketch after this list).
  • Multiple training iterations and evaluations using the validation set.
  • Final output is an optimized translation model.
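
The sketch below shows what this fine-tuning loop might look like with the Hugging Face stack named earlier (Transformers and Datasets). The checkpoint id, data files, column names, and hyperparameters are all assumptions for illustration; the team's actual configuration is not shown in this report.

    from datasets import load_dataset
    from transformers import (
        AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
        Seq2SeqTrainer, Seq2SeqTrainingArguments,
    )

    CHECKPOINT = "Ayaka/bart-base-cantonese"  # assumed id for the "Ayaka" model
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)

    # Hypothetical JSON files with "yue" (Cantonese) and "zh" (written Chinese) fields.
    raw = load_dataset("json", data_files={"train": "train.json", "validation": "valid.json"})

    def tokenize(batch):
        enc = tokenizer(batch["yue"], max_length=128, truncation=True)
        enc["labels"] = tokenizer(text_target=batch["zh"], max_length=128, truncation=True)["input_ids"]
        return enc

    tokenized = raw.map(tokenize, batched=True, remove_columns=raw["train"].column_names)

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(
            output_dir="yue2zh",
            save_strategy="epoch",   # keep every checkpoint for external evaluation
            learning_rate=2e-5,
            num_train_epochs=3,
            predict_with_generate=True,
        ),
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["validation"],
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()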

Model Evaluation:

  • Used BLEU (word n-grams) and CHRF (character n-grams) as metrics (see the sketch after this list).
  • Validation set split into high- and low-similarity samples to assess performance on varied translation difficulty.
  • Training and inference used different tokenization and padding setups, causing internal evaluation inconsistency.
  • To ensure accurate results, all checkpoints were saved and evaluated externally—time-intensive but more reliable.
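
A minimal sketch of that external scoring with the Evaluate library follows; the prediction and reference strings are placeholders.

    import evaluate

    bleu = evaluate.load("sacrebleu")  # word n-gram metric
    chrf = evaluate.load("chrf")       # character n-gram metric

    predictions = ["今天天氣很好"]        # model output (placeholder)
    references = [["今天的天氣很好"]]     # one or more references per prediction

    print("BLEU:", bleu.compute(predictions=predictions, references=references)["score"])
    print("CHRF:", chrf.compute(predictions=predictions, references=references)["score"])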

Evaluation

Objective: Assess overall application performance (transcription + translation).

Method:

  • Use a mix of short (<10 min) and long (>10 min) Cantonese videos.
  • Measure processing time from video upload to SRT subtitle file generation.
  • Repeat measurement for generating video previews with burnt-in subtitles.

User Evaluation

Objective: Evaluate user experience.

Method:

  • Recruit end users (Mandarin speakers or Cantonese content creators).
  • Collect feedback via a questionnaire.
  • Focus areas: UI/UX satisfaction, subtitle quality, clarity of instructions, and overall ease of use.

Experiment

Objective: Assess machine translation performance.

Model Used: Fine-tuned BART-Cantonese model.

Method:

  • Input pre-subtitled Cantonese videos into the application.
  • Use BLEU and CHRF scores to compare generated translations with original subtitles.
  • Conduct manual review for human judgment of translation quality.

Processing Time Evaluation: 
The system’s total processing time is tested using a range of Cantonese videos, both short (under 10 minutes) and long (over 10 minutes). Two key outputs are timed (see the timing sketch after this list):

  • Generating the subtitle (SRT) file. 
  • Creating the video preview with embedded subtitles. 
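
A simple way to take these timings is to wrap each stage in a wall-clock timer, as in the sketch below; generate_srt and burn_in_subtitles are hypothetical stand-ins for the application's actual pipeline steps.

    import time

    def generate_srt(video_path):            # stub standing in for the real pipeline
        time.sleep(0.1)
        return "output.srt"

    def burn_in_subtitles(video_path, srt):  # stub standing in for MoviePy embedding
        time.sleep(0.1)
        return "preview.mp4"

    def timed(label, fn, *args):
        # Run fn(*args), report elapsed wall-clock time, and return the result.
        start = time.perf_counter()
        result = fn(*args)
        print(f"{label}: {time.perf_counter() - start:.1f}s")
        return result

    srt_path = timed("SRT generation", generate_srt, "input_video.mp4")
    timed("Subtitled preview", burn_in_subtitles, "input_video.mp4", srt_path)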

User Evaluation: 
Selected users—primarily native Mandarin speakers or Cantonese content creators—test the application. Their feedback is gathered via a questionnaire focusing on: 

  • UI/UX satisfaction 
  • Subtitle quality 
  • Instruction clarity 
  • Overall convenience of use 

Translation Model Evaluation: 
The BART-Cantonese model, fine-tuned with the team's dataset, is tested on subtitled videos. Evaluation includes: 

  • Automated metrics: BLEU and CHRF scores, using original subtitles as reference. 
  • Manual review: Human evaluators assess translation quality to capture nuances missed by automated metrics. 

Together, these assessments ensure the application is thoroughly tested for speed, usability, and translation accuracy. 

Implementation

The application is structured around five key functions, providing an intuitive user experience from video input to post-processing: 

  • Configuration Settings: Users can filter unwanted words and set a silence threshold to improve subtitle segmentation before translation begins. 
  • Content Transcription: Uploaded video/audio is transcribed into Cantonese sentences using Microsoft Azure's speech-to-text service. 
  • Content Translation: The transcribed text is translated into Written Chinese using the team’s fine-tuned machine translation model. 
  • Subtitled Video Preview and Formatting: Users can customize subtitle appearance—font, size, and color—and preview the results live. 
  • Subtitle Review and Download: Final subtitles are displayed in a scrollable table for review, with options to download the subtitled video and subtitle file. 

The design emphasizes ease of use and customization, supporting a seamless subtitling workflow for users. 
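
To show how the upload step might be wired together with the Flask and Werkzeug pieces listed earlier, here is a minimal sketch; the route, directory, and form-field names are assumptions, not the project's actual code.

    import os
    from flask import Flask, jsonify, request
    from werkzeug.utils import secure_filename

    app = Flask(__name__)
    UPLOAD_DIR = "uploads"  # hypothetical temp dir; files are deleted after processing
    os.makedirs(UPLOAD_DIR, exist_ok=True)

    @app.route("/upload", methods=["POST"])
    def upload_video():
        video = request.files["video"]
        # Werkzeug converts risky filenames into a safe ASCII-only form.
        path = os.path.join(UPLOAD_DIR, secure_filename(video.filename))
        video.save(path)
        return jsonify({"status": "uploaded", "path": path})

    if __name__ == "__main__":
        app.run(debug=True)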

 

System Processing Time 

The team benchmarked the total time required to generate subtitles and preview videos across different setups: 

  • Using the custom model with GPU yielded the fastest processing times, outperforming Azure for all video lengths tested (from under 1 minute up to 10 minutes).
  • The GPU-enhanced model surpassed Azure by up to 8 seconds, showing that hardware upgrades can further boost performance, whereas Azure is limited by the fixed speed of its API service.

Translation Model Performance 

Three translation models were evaluated: Ayaka, indiejoseph, and fnlp, with results split between high and low similarity sample types. 

No single model excelled in both; models trained longer (like Ayaka) handled low similarity (localized/slang-heavy) content better, while others fared better with high similarity (formal) content. 

When compared to Azure, the final Ayaka model: 

  • Scored slightly lower on high similarity (formal) content. 
  • Scored higher on low similarity, which aligns better with the subtitling goal of capturing nuanced, spoken Cantonese. 

Ultimately, Ayaka was adopted as the final model due to its superior handling of informal and localized content, fulfilling the project's objective of high-quality Cantonese-to-written-Chinese subtitling. 

Conclusion

The project successfully delivered a web-based platform that automates transcription and Cantonese-to-written-Chinese translation for video input. Key achievements included:

  • Building a specialized Cantonese-to-written-Chinese dataset (110k training pairs and 5.6k validation pairs).
  • Fine-tuning pre-trained translation models and adopting the Ayaka model for its handling of informal and localized content.
  • Integrating Azure speech-to-text transcription, silence-based audio segmentation, and subtitle generation into a single workflow.
  • Establishing and meeting evaluation benchmarks for processing speed, usability, and translation accuracy.

Limitations 

  • No single translation model excelled on both formal (high-similarity) and slang-heavy (low-similarity) content.
  • Differences in tokenization and padding between training and inference made internal evaluation inconsistent, requiring time-intensive external evaluation of every checkpoint.
  • The final model scored slightly below Azure on formal, high-similarity content.

Future Development

  • Align tokenization and padding between training and inference so checkpoints can be evaluated reliably in place.
  • Expand and refine the dataset to strengthen translation of formal, high-similarity content.
  • Extend the processing-time gains by supporting further hardware acceleration.
