School of Science and Technology
Computing Programmes

Job Interview Simulator

CHAN Ho, TONG Kin Lik, WONG Kwun Hei, WU Shing Yan

Programme: Bachelor of Computing with Honours in Internet Technology
Bachelor of Science with Honours in Computer Science
Supervisor: Dr. Keith Lee
Areas: Virtual Reality Applications
Year of Completion: 2024

Objectives

Project Aim  

This report presents a VR prototype, the Job Interview Simulator, built to help fresh graduates perform better in their first job interviews. 

Project Objectives  

The project aims to support fresh graduates in improving their job interview performance through a VR-based application. The key goals include: 

  1. Identifying challenges that graduates commonly face during interviews. 
  2. Exploring VR technologies as tools to address these challenges effectively. 
  3. Analyzing existing VR solutions to adopt useful features for the new application. 
  4. Developing an interactive VR interview simulator featuring: 
  • Reading and speaking comprehension sessions 
  • A virtual interviewer powered by Text-to-Speech (TTS) and Speech-to-Text (STT) 
  • Real-time speech recognition and pronunciation scoring 
  • A remote database for Q&A content delivery and response analysis 

The system is designed to provide feedback and track user progress, enhancing both language skills and interview readiness. 

Videos

Demonstration Video

Presentation Video

Methodologies and Technologies Used

Overview of Solution

Our solution leverages existing hardware; the innovation lies in the software, spanning backend scripts and the user-facing application.

The project has entered its final development phase using PICO 4 Pro VR headsets. Key components include: 

VR Application Development: 

  • Built using Unity, with the application exported as an APK and deployed to the headset. 
  • Realistic 3D scenes and human avatars simulate immersive interview experiences. 

Server-Side Processing: 

  • Python servers handle backend functions such as question generation, answer retrieval, and speech scoring using connected libraries and APIs. 

Sample Q&A Module: 

  • Utilizes JSON files and Firebase for data storage, allowing synchronized access between the VR app and the backend (a minimal upload sketch follows). 
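
As a rough illustration, the backend can push an updated question file to Firebase Cloud Storage with the firebase_admin SDK. The credential path, bucket name, and file paths below are placeholders rather than the project's actual configuration:

import firebase_admin
from firebase_admin import credentials, storage

# Authenticate with a service-account key (path is a placeholder).
cred = credentials.Certificate("serviceAccountKey.json")
firebase_admin.initialize_app(cred, {"storageBucket": "interview-sim.appspot.com"})

# Upload the local Q&A file so the VR app and backend share one copy.
bucket = storage.bucket()
bucket.blob("qa/questions.json").upload_from_filename("questions.json")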

Chatbot Interviewer: 

  • Custom AI chatbots from POE simulate real interviewers, generating and processing questions dynamically through a dedicated Python server. 

Voice Interaction: 

  • Users respond via the VR headset's microphone. 
  • Microsoft Azure APIs handle speech-to-text and text-to-speech (see the transcription sketch after this list). 
  • The my-voice-analysis library evaluates pronunciation and provides scoring feedback. 
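
A minimal transcription sketch with the Azure Speech SDK for Python; the subscription key, region, and file name are placeholders, and error handling is omitted:

import azure.cognitiveservices.speech as speechsdk

# Key and region are placeholders, not the project's credentials.
speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="eastasia")
speech_config.speech_recognition_language = "en-US"  # zh-HK / zh-CN for Cantonese / Mandarin

# Transcribe one recorded answer from a WAV file.
audio_config = speechsdk.audio.AudioConfig(filename="answer.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)
print(recognizer.recognize_once().text)
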
Architecture and System Design

Use Case Diagram and Function List:

The application has a single user type: the end user. The user first chooses a VR scene and a question type before starting a job interview practice session. They can then practice their interview responses in three ways: picking an option in multiple-choice questions, reading a suggested answer aloud, and answering questions in their own words. The details are shown below.

Figure 1: Use Case Diagram and Function List

System Design: 

Figure 2: Application Hierarchy 

Figure 3: Design Chart of Sample Q&A/Pronunciation Training 

Once a user opens the Q&A module to start learning, the Unity application sends an HTTP request to the Python server. The server then calls a function that collects all the data stored in a JSON file; as mentioned above, this JSON file stores and updates every question, answer, and explanation. The server relays the data back to Unity so that the user can view the questions, answers, and explanations one by one. A minimal endpoint sketch follows the figure.

Figure 4: JSON file for the Q&A Reading module
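
A sketch of such an endpoint, assuming Flask; the route name and file path are illustrative, not the project's actual ones:

import json
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/qa")
def get_qa():
    # Load every question, answer, and explanation from the JSON store.
    with open("questions.json", encoding="utf-8") as f:
        return jsonify(json.load(f))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

On the Unity side, a web request to this URL would retrieve the same payload for display in the headset.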

The system enables real-time pronunciation feedback in the VR interview app using a multi-layered architecture: 

  • Audio Capture: The Unity VR application records the user’s spoken response and uploads the audio file to Firebase Cloud Storage. 
  • Backend Processing: Simultaneously, the app sends a request to a Python Flask server, which initiates a scoring program. 
  • Speech Evaluation: The program retrieves the audio file from Firebase and processes it using Azure’s Speech Assessment API, which evaluates pronunciation, fluency, accuracy, and overall proficiency. 
  • Results Delivery: Once scoring is completed, the results are returned through the server back to the VR application for display to the user. 

This setup combines cloud storage, real-time processing, and speech analytics to deliver detailed, automated feedback on spoken performance. A sketch of the scoring step follows the figure. 

Figure 5: Design Chart of Speech Scoring System 
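
A sketch of the scoring step using Azure's pronunciation assessment; the key, region, reference text, and file name are placeholders, and the Firebase download is assumed to have already happened:

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="eastasia")
audio_config = speechsdk.audio.AudioConfig(filename="answer.wav")  # fetched from Firebase

# Score the recording against the expected answer text, word by word.
assessment = speechsdk.PronunciationAssessmentConfig(
    reference_text="I am a recent computing graduate.",
    grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
    granularity=speechsdk.PronunciationAssessmentGranularity.Word)

recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)
assessment.apply_to(recognizer)

scores = speechsdk.PronunciationAssessmentResult(recognizer.recognize_once())
print(scores.accuracy_score, scores.fluency_score, scores.pronunciation_score)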

3D Models

Figure 6: Office Meeting Room Scene (Unity 3D) 

Figure 7: Office Reception Area Scene (Unity 3D) 

Implementation

Question and Answer Reading 

This module helps users improve speaking and presentation skills by allowing them to practice reading aloud sample job interview questions and answers, followed by detailed speech performance analysis. 

Key Features: 

Language Options: Users can choose to practice in English, Cantonese, or Mandarin, with the system adjusting accordingly. 

Figure 8: Language selection options

Interactive Q&A: Users can view sample questions, answers, and explanations on a virtual panel and navigate through different Q&A sets.

Figure 9: Sample questions, answers, and explanations on the virtual panel

Users record their spoken responses using a VR interface.

Figure 10: Recording a spoken response in VR

A color-coded word accuracy list helps identify specific pronunciation errors (green = excellent, red = poor); a small mapping sketch follows the figure.

Figure 11: Color-coded word accuracy list
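
The color bands can be produced by thresholding each word's accuracy score. The cut-offs below (and the intermediate yellow band) are illustrative assumptions, not the project's exact values:

# Map a per-word accuracy score (0-100) to a display color.
def color_for(score):
    if score >= 80:
        return "green"   # excellent
    if score >= 60:
        return "yellow"  # fair (assumed intermediate band)
    return "red"         # poor

for word, score in [("graduate", 92.0), ("opportunity", 58.5)]:
    print(f"{word}: {score:.1f} -> {color_for(score)}")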

Virtual Interview

This module blends interactivity, real-time feedback, and multilingual support to enhance users' job interview readiness and spoken language proficiency. 

Key Features: 

Start Interface: Users begin the session by clicking “Start” to enter the interview scene.

Figure 12: Start interface

Language Options: Users can choose between English, Cantonese, or Mandarin before proceeding.

Figure 13: Language options

A microphone button appears after language selection.

Users record their responses by clicking the mic; input is disabled while speaking. 

Figure 14: Microphone button for recording a response

After recording, users can view their spoken text, delete it, or resend it.

Figure 15: Reviewing, deleting, or resending the transcribed response

Clicking “Send” transmits the response via ZeroMQ (ZMQ) to a backend server; a minimal server-side sketch follows the figure.

Figure 16: Sending the response to the backend server
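
A minimal server-side sketch with pyzmq; the port and the ask_poe_chatbot helper are hypothetical stand-ins for the project's actual setup:

import zmq

def ask_poe_chatbot(text):
    # Placeholder: the real server forwards the text to a custom POE chatbot.
    return f"Interesting. Could you tell me more about: {text}?"

context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind("tcp://*:5555")  # port is a placeholder

while True:
    user_text = socket.recv_string()                # transcript sent from Unity
    socket.send_string(ask_poe_chatbot(user_text))  # reply returned for TTS playback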

The POE chatbot processes user input and returns a reply. 

Figure 17

Figure 17

The response is converted to speech for playback. 
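
A synthesis sketch with the Azure Speech SDK, writing the reply to a WAV file that Unity can play back; the key, region, voice, and file name are placeholders:

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="eastasia")
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"  # illustrative voice

# Write the interviewer's reply to a file for playback in the headset.
audio_config = speechsdk.audio.AudioOutputConfig(filename="reply.wav")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config,
                                          audio_config=audio_config)
synthesizer.speak_text_async("Tell me about a project you are proud of.").get()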

Users can click “Replay” to hear the interviewer's response again.

Figure 18: Replay button for the interviewer's response

If users do not look at the interviewer, a “Look at the interviewer!” prompt appears as a reminder; the underlying gaze check is sketched after the figure.

Figure 19: “Look at the interviewer!” reminder
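
The reminder can be driven by a simple gaze-angle test. The project implements this in Unity; the Python sketch below only illustrates the geometry, and the 30-degree threshold is an assumption:

import math

def is_looking_at(forward, to_interviewer, threshold_deg=30.0):
    # Angle between the headset's forward vector and the direction
    # to the interviewer; below the threshold counts as eye contact.
    dot = sum(f * t for f, t in zip(forward, to_interviewer))
    norm = (math.sqrt(sum(f * f for f in forward))
            * math.sqrt(sum(t * t for t in to_interviewer)))
    angle = math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
    return angle <= threshold_deg

# Headset facing +z, interviewer slightly to the right (about 17 degrees off).
print(is_looking_at((0, 0, 1), (0.3, 0, 1)))  # True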

Conclusion

Key Achievements 

  • Built a complete VR interview training application with immersive, realistic interaction. 
  • Integrated Azure STT and TTS APIs for seamless speech-based interaction. 
  • Adopted a POE chatbot to act as the interviewer, saving development time. 
  • Added pronunciation training with scoring and visual feedback. 
  • Created a comprehensive set of sample questions and answers in general and technical domains. 


Limitations 

  • Hardware dependency: Requires VR headsets, which may limit user accessibility. 
  • AI hallucinations: Potential for inaccurate or irrelevant responses from the chatbot. 
  • VR sickness: Risks of disorientation or discomfort during prolonged use. 
  • Content scope: Technical questions may not suit users from non-IT fields. 
  • Limited interactivity: Interviewer lacks animations, making the experience feel less lifelike. 

Future Development

  • Expand Q&A content to cover more industries; add interviewer animations for better immersion. 
  • Improve the user-chatbot interaction flow, eliminating the need to press “Send”; add facial expressions for realism. 
  • Introduce group discussion simulations to prepare for multi-candidate interview formats. 
  • Implement resume analysis and personalized question generation. 
  • Add body language detection for non-verbal feedback. 
  • Extend platform compatibility to PCs, smartphones, and consoles to broaden accessibility.