School of Science and Technology 科技學院
Computing Programmes 電腦學系

Cantonese to Written Chinese Machine Translation 

Tsang Tsz Yui, Ng Man Kwan, Lau Lok Lam

Programme: Bachelor of Computing with Honours in Internet Technology

Bachelor of Science with Honours in Computer Science
Supervisor: Dr. Jeff Au Yeung Siu Kei
Areas: Intelligent Applications
Year of Completion: 2024

Objectives

Project Aim

This project aims to enhance the accessibility of Cantonese content, while providing convenience to both content creators and audiences, by creating a platform that automates transcription and Cantonese-to-written-Chinese translation from video input.

Project Objectives 

Our objective is to develop a Cantonese-to-Chinese auto-subtitling tool that provides written Chinese subtitles for Cantonese movies and videos. The application will be delivered as a web application and is divided into two main stages: transcription and translation.

  1. Transcription:
  • Use an existing tool to extract audio from the user's video and unify its format.
  • Use the speech-to-text service provided by Microsoft Azure to obtain the spoken content from the audio.
  2. Translation:
  • Build our own Cantonese-to-Chinese dataset to ensure translation accuracy.
  • Develop a Cantonese-to-Chinese translation AI model by fine-tuning a pre-trained model on our dataset.
  3. Develop a user-friendly user interface for our web application.

Videos

Demonstration Video

Presentation Video

Methodologies and Technologies used

The technologies used in this project are listed below; many of them are third-party Python libraries. A brief usage sketch of some of these libraries follows the list.

  • Python: The main language of our application development. 
  • FFmpeg: Video/audio processing tool, used from Python to extract audio from the input video. 
  • Pre-trained models: AI models released by major companies (e.g. BERT from Google). Used as the basis of our translation model and later fine-tuned with our dataset. 
  • Werkzeug: Secures uploaded files before they are stored on the file system; any potentially problematic filename is converted into an ASCII-only string. 
  • PyTorch: Used to implement the AI models. 
  • Cloud server: Cloud hosting services provided by major companies, used to host the website. 
  • Flask: Framework used for web application development and implementation of APIs. 
  • MoviePy: Creates the subtitle-embedded video for final product preview and download. 
  • Auditok: Splits the input audio into segments of sentence length suitable for subtitles by detecting silent moments in the audio; accurate timestamps are also generated for each segment. 
  • Transformers: a Hugging Face library for downloading, training, and running transformer AI models. 
  • Datasets: a Hugging Face library for accessing datasets hosted on the Hugging Face Hub. 
  • Evaluate: a Hugging Face library providing multiple evaluation metrics to measure model performance.
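
To show how two of the items above fit together at the upload stage, here is a minimal sketch that sanitizes the uploaded filename with Werkzeug and extracts mono 16 kHz audio with the FFmpeg command-line tool. Paths and parameter values are illustrative assumptions, not the project's actual code.

```python
import subprocess
from pathlib import Path
from werkzeug.utils import secure_filename

UPLOAD_DIR = Path("uploads")  # illustrative storage location

def save_upload(file_storage):
    """Store an uploaded video under a sanitized, ASCII-only filename."""
    UPLOAD_DIR.mkdir(exist_ok=True)
    safe_name = secure_filename(file_storage.filename)
    video_path = UPLOAD_DIR / safe_name
    file_storage.save(str(video_path))
    return video_path

def extract_audio(video_path: Path) -> Path:
    """Extract a mono 16 kHz WAV track from the video using the ffmpeg CLI."""
    audio_path = video_path.with_suffix(".wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video_path), "-vn", "-ac", "1",
         "-ar", "16000", str(audio_path)],
        check=True,
    )
    return audio_path
```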

Figure 1: Use case diagram

Single-page website design used for clarity and ease of use.

Caters to both standard and experienced users:

  • Simple UI for standard users.
  • Advanced options hidden but available for experienced users.

Advanced settings have default values:

  • Users can ignore or customize them without interrupting the process.

Uploaded video file is the main data source:

  • Temporarily stored only during processing.
  • Automatically deleted after process completion and user disconnect.

System data flow is represented with a supporting diagram.

Figure 2: User flow diagram representation of our system.

Data flow process (steps 4 to 7 are sketched in code after the list):

  1. User uploads video.
  2. Audio is extracted from the video.
  3. Video is saved for final embedding.
  4. Audio is segmented into smaller parts with time offsets.
  5. Each segment is sent to the Azure speech-to-text API for transcription.
  6. The transcribed text is translated and aligned with the segment timestamps.
  7. A subtitle file (.srt) is generated.
  8. User can either receive the subtitle file or request the burned-in video output.
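
Steps 4 to 7 can be sketched roughly as follows, using Auditok for silence-based segmentation and the Azure Speech SDK for recognition. The helper names, segmentation thresholds, and file paths are illustrative, not the project's actual code.

```python
import auditok
import azure.cognitiveservices.speech as speechsdk

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def transcribe_segments(audio_path, azure_key, azure_region):
    """Split audio on silence, send each segment to Azure, keep time offsets."""
    speech_config = speechsdk.SpeechConfig(subscription=azure_key, region=azure_region)
    speech_config.speech_recognition_language = "zh-HK"  # Cantonese (Hong Kong)

    results = []
    # Silence-based segmentation; duration/silence limits are illustrative defaults.
    for region in auditok.split(audio_path, min_dur=0.5, max_dur=8, max_silence=0.4):
        segment_path = region.save("segment_{meta.start:.3f}-{meta.end:.3f}.wav")
        audio_config = speechsdk.audio.AudioConfig(filename=segment_path)
        recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                                audio_config=audio_config)
        recognized = recognizer.recognize_once()
        if recognized.reason == speechsdk.ResultReason.RecognizedSpeech:
            # In the full pipeline, the recognized Cantonese text would be passed
            # to the translation model before the subtitle file is written.
            results.append((region.meta.start, region.meta.end, recognized.text))
    return results

def write_srt(entries, out_path="output.srt"):
    """Write (start, end, text) entries as an .srt subtitle file."""
    with open(out_path, "w", encoding="utf-8") as f:
        for i, (start, end, text) in enumerate(entries, start=1):
            f.write(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n\n")
```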

Figure 3: Data flow diagram representation of the system.

Training methodology

Dataset:

  • Used 110k training pairs and 5.6k validation pairs from sources like Cantonese Wikipedia and Hong Kong forums.
  • Manual validation ensured clean grammar, punctuation, and meaning.
  • Cantonese sentences were hand-translated to written Chinese for accuracy; an illustrative example pair is shown below.
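
For illustration only, a single training pair could be stored as a JSON-lines record like the one below; the field names and file layout are assumptions, since the dataset's actual schema (hosted on Hugging Face) is not shown here.

```python
import json

# Hypothetical record layout; field names are assumptions, not the project's schema.
pair = {
    "yue": "佢哋聽日先會去睇戲。",   # Cantonese source sentence
    "zh": "他們明天才會去看電影。",  # written Chinese target sentence
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```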

Training Process:

  • Dataset downloaded from Hugging Face and tokenized.
  • Pretrained models fine-tuned on the tokenized training set, as sketched after this list.
  • Multiple training iterations and evaluations using the validation set.
  • Final output is an optimized translation model.
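
A minimal fine-tuning sketch with the Transformers and Datasets libraries, assuming a seq2seq checkpoint (the BART-Cantonese model adopted later) and the hypothetical JSON-lines fields above; the checkpoint id, hyperparameters, and file names are illustrative:

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "Ayaka/bart-base-cantonese"  # assumed checkpoint id, for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical local copies of the dataset with "yue"/"zh" fields.
raw = load_dataset("json", data_files={"train": "train.jsonl",
                                       "validation": "valid.jsonl"})

def tokenize(batch):
    """Tokenize Cantonese sources and written-Chinese targets for seq2seq training."""
    model_inputs = tokenizer(batch["yue"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["zh"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(tokenize, batched=True,
                    remove_columns=raw["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="cantonese-to-zh",
    num_train_epochs=3,             # illustrative values
    per_device_train_batch_size=16,
    save_strategy="epoch",          # keep checkpoints for external evaluation
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.evaluate()
```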

Model Evaluation:

  • Used BLEU (word n-grams) and CHRF (character n-grams) as metrics.
  • Validation set split into high- and low-similarity samples to assess performance on varied translation difficulty.
  • Training and inference used different tokenization and padding setups, causing internal evaluation inconsistency.
  • To ensure accurate results, all checkpoints were saved and evaluated externally, which was time-intensive but more reliable; a metric computation sketch follows.
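
A sketch of how the BLEU and CHRF scores can be computed for a checkpoint's outputs with the Evaluate library; the example sentences and function name are illustrative:

```python
import evaluate

bleu = evaluate.load("sacrebleu")  # word n-gram BLEU
chrf = evaluate.load("chrf")       # character n-gram F-score

def score(predictions, references):
    """Compute BLEU and CHRF for model outputs against reference translations."""
    refs = [[r] for r in references]  # both metrics expect a list of reference lists
    return {
        "bleu": bleu.compute(predictions=predictions, references=refs)["score"],
        "chrf": chrf.compute(predictions=predictions, references=refs)["score"],
    }

# Example usage with a single hypothetical prediction/reference pair.
print(score(["他們明天才會去看電影。"], ["他們明天才會去看電影。"]))
```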

Evaluation

Objective: Assess overall application performance (transcription + translation).

Method:

  • Use a mix of short (<10 min) and long (>10 min) Cantonese videos.
  • Measure processing time from video upload to SRT subtitle file generation.
  • Repeat measurement for generating video previews with burnt-in subtitles.

User Evaluation

Objective: Evaluate user experience.

Method:

  • Recruit end users (Mandarin speakers or Cantonese content creators).
  • Collect feedback via a questionnaire.
  • Focus areas: UI/UX satisfaction, subtitle quality, clarity of instructions, and overall ease of use.

Experiment

Objective: Assess machine translation performance.

Model Used: Fine-tuned BART-Cantonese model.

Method:

  • Input pre-subtitled Cantonese videos into the application.
  • Use BLEU and CHRF scores to compare generated translations with original subtitles.
  • Conduct manual review for human judgment of translation quality.

Processing Time Evaluation: 
The system’s total processing time is tested using a range of Cantonese videos, both short (under 10 minutes) and long (over 10 minutes). Two key outputs are timed (the preview step is sketched in code after the list): 

  • Generating the subtitle (SRT) file. 
  • Creating the video preview with embedded subtitles. 
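
The second timed output, the preview with burned-in subtitles, can be sketched with MoviePy roughly as follows, assuming the MoviePy 1.x TextClip/CompositeVideoClip API (TextClip requires ImageMagick) and illustrative styling values:

```python
from moviepy.editor import CompositeVideoClip, TextClip, VideoFileClip

def burn_in_subtitles(video_path, entries, out_path="preview.mp4"):
    """Overlay (start, end, text) subtitle entries onto the video and render it."""
    video = VideoFileClip(video_path)
    clips = [video]
    for start, end, text in entries:
        subtitle = (TextClip(text, fontsize=36, color="white", font="Arial")
                    .set_start(start)
                    .set_duration(end - start)
                    .set_position(("center", "bottom")))
        clips.append(subtitle)
    CompositeVideoClip(clips).write_videofile(out_path)
```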

User Evaluation: 
Selected users—primarily native Mandarin speakers or Cantonese content creators—test the application. Their feedback is gathered via a questionnaire focusing on: 

  • UI/UX satisfaction 
  • Subtitle quality 
  • Instruction clarity 
  • Overall convenience of use 

Translation Model Evaluation: 
The BART-Cantonese model, fine-tuned with the team's dataset, is tested on subtitled videos. Evaluation includes: 

  • Automated metrics: BLEU and CHRF scores, using original subtitles as reference. 
  • Manual review: Human evaluators assess translation quality to capture nuances missed by automated metrics. 

Together, these assessments ensure the application is thoroughly tested for speed, usability, and translation accuracy. 

Implementation

The application is structured around five key functions, providing an intuitive user experience from video input to post-processing: 

  • Configuration Settings: Users can filter unwanted words and set a silence threshold to improve subtitle segmentation before translation begins. 
  • Content Transcription: Uploaded video/audio is transcribed into Cantonese sentences using Microsoft Azure's speech-to-text service. 
  • Content Translation: The transcribed text is translated into Written Chinese using the team’s fine-tuned machine translation model. 
  • Subtitled Video Preview and Formatting: Users can customize subtitle appearance—font, size, and color—and preview the results live. 
  • Subtitle Review and Download: Final subtitles are displayed in a scrollable table for review, with options to download the subtitled video and subtitle file. 

The design emphasizes ease of use and customization, supporting a seamless subtitling workflow for users. 
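
As a minimal sketch of how such a workflow could be exposed through Flask, the endpoint below accepts a video upload plus the optional configuration settings; the route name, form fields, and downstream helper calls are assumptions for illustration, not the project's actual API:

```python
from pathlib import Path

from flask import Flask, jsonify, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
UPLOAD_DIR = Path("uploads")
UPLOAD_DIR.mkdir(exist_ok=True)

@app.route("/subtitle", methods=["POST"])
def subtitle():
    """Accept a video plus optional settings and start subtitle generation."""
    video = request.files["video"]
    # Optional advanced settings with defaults, per the configuration step.
    silence_threshold = float(request.form.get("silence_threshold", 0.4))
    filtered_words = request.form.get("filtered_words", "").split(",")

    video_path = UPLOAD_DIR / secure_filename(video.filename)
    video.save(str(video_path))

    # The downstream pipeline (transcribe -> translate -> build SRT -> preview)
    # is assumed to be provided by helpers like those sketched earlier on this page.
    # entries = transcribe_and_translate(video_path, silence_threshold, filtered_words)
    # srt_path = write_srt(entries)
    return jsonify({"status": "processing started"})

if __name__ == "__main__":
    app.run(debug=True)
```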

 

System Processing Time 

The team benchmarked the total time required to generate subtitles and preview videos across different setups: 

  • Using the custom model with GPU yielded the fastest processing times, outperforming Azure for all video lengths tested (under 1 minute to 10 minutes). 
  • The GPU-enhanced model even surpassed Azure by up to 8 seconds, demonstrating that hardware upgrades can further boost performance, whereas Azure is limited by the fixed speed of its API service. 

Translation Model Performance 

Three translation models were evaluated: Ayaka, indiejoseph, and fnlp, with results split between high and low similarity sample types. 

No single model excelled in both; models trained longer (like Ayaka) handled low similarity (localized/slang-heavy) content better, while others fared better with high similarity (formal) content. 

When compared to Azure, the final Ayaka model: 

  • Scored slightly lower on high similarity (formal) content. 
  • Scored higher on low similarity, which aligns better with the subtitling goal of capturing nuanced, spoken Cantonese. 

Ultimately, Ayaka was adopted as the final model due to its superior handling of informal and localized content, fulfilling the project's objective of high-quality Cantonese-to-written-Chinese subtitling. 

Conclusion

The project delivered a web application that automatically transcribes Cantonese video content with Microsoft Azure's speech-to-text service and translates the transcripts into written Chinese subtitles using a translation model fine-tuned on the team's own dataset. Key achievements included: 

  • Building a specialized Cantonese-to-written-Chinese parallel dataset of around 110k training pairs. 
  • Fine-tuning a pre-trained BART-Cantonese (Ayaka) model for the translation task. 
  • Integrating transcription, translation, subtitle formatting, preview, and download into a single user-friendly web workflow. 
  • Establishing evaluation benchmarks for processing time, usability, and translation quality, and meeting the project objectives. 

Limitations 

  • No single candidate model excelled on both formal (high similarity) and slang-heavy (low similarity) content; the adopted Ayaka model scored slightly below Azure on formal samples. 
  • Differences in tokenization and padding between training and inference made internal evaluation inconsistent, so all checkpoints had to be saved and evaluated externally, which was time-intensive.