AI-Powered Multimodal Depression Detection: Harnessing Visual Cues For Early Screening

School of Science and Technology 科技學院
Computing Programmes 電腦學系

Anthony Jesu Ashok Savitha Maria Dharshini

Programme	Bachelor of Computing with Honours in Computer Science
Supervisor	Dr. Kevin Hung
Areas	Intelligent Applications
Year of Completion	2026

Objectives

Project Aim

The aim of this project is to design and implement an AI-based system for depression detection. By focusing on visual behavioural cues and multimodal analysis, the system seeks to move beyond traditional single-modality approaches. The goal is to provide a reliable, explainable, and accessible screening tool that can support early identification of depressive symptoms and enhance mental health assessment practices.

Project Objectives

The objectives of this project include:

Review existing literature on AI-based depression detection, with emphasis on visual behavioural cues and multimodal analysis.
Collect and preprocess relevant datasets, extracting key visual features such as eye gaze, head pose, facial action units (FAUs), and eye blinks for model development.
Engineer quaternion-based features from extracted cues to robustly encode spatial orientation and relationships.
Develop, train, and optimize machine learning models for binary depression classification, applying techniques to address data imbalance.
Evaluate and compare model performance using established metrics including accuracy, precision, recall, F1-score, Cohen's Kappa, and ROC curve analysis.
Design and implement a minimal functional prototype integrating the best-performing model into a web application for screening via pre-extracted features.

Methodologies and Technologies used

Model Overview

Developed quaternion-based depression detection model
Pipeline includes data collection, feature extraction, multimodal fusion, model training and evaluation
Illustrated in Figure 1: Proposed Model Development Pipeline

OpenFace 2.0 Feature Extraction

Used OpenFace 2.0 features derived from the DAIC-WOZ clinical interview videos
Extracted behavioural cues:
- Head pose: Euler angles (pose_Rx, pose_Ry, pose_Rz)
- Eye gaze: 3D gaze vectors (gaze_0_x/y/z, gaze_1_x/y/z)
Excluded Facial Action Units (FAUs) to focus on geometric cues
Aggregated signals into fixed-length feature vectors for quaternion encoding and ML models
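
The aggregation step can be sketched as follows. This is a minimal illustration, assuming per-frame OpenFace output held in a pandas DataFrame and assuming mean and standard deviation as the summary statistics; the statistics actually used in the project are not stated, and `aggregate_openface` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Column names as listed above (OpenFace 2.0 output).
POSE_COLS = ["pose_Rx", "pose_Ry", "pose_Rz"]
GAZE_COLS = [f"gaze_{eye}_{axis}" for eye in (0, 1) for axis in "xyz"]


def aggregate_openface(frames: pd.DataFrame) -> pd.Series:
    """Collapse per-frame signals into one fixed-length vector (hypothetical helper).

    Assumption: mean and standard deviation are the summary statistics.
    """
    # OpenFace CSV headers often carry leading spaces, so normalise them first.
    frames = frames.rename(columns=lambda c: c.strip())
    signals = frames[POSE_COLS + GAZE_COLS]
    stats = signals.agg(["mean", "std"])  # rows: mean, std; columns: signals
    # Flatten to names such as "pose_Rx_mean" and "pose_Rx_std".
    return pd.Series(
        {f"{col}_{stat}": stats.loc[stat, col]
         for col in signals.columns for stat in stats.index}
    )


# Usage with synthetic frames (random values, not DAIC-WOZ data).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(size=(100, 9)),
    columns=[" " + c for c in POSE_COLS + GAZE_COLS],
)
vector = aggregate_openface(demo)
print(vector.shape)  # (18,)
```

Whatever the interview length, each participant ends up with a vector of the same size (here 9 signals x 2 statistics), which is what the downstream classifiers require.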

Data Source and Description

Dataset: Distress Analysis Interview Corpus – Wizard of Oz (DAIC-WOZ)
189 semi-structured clinical interviews with virtual agent “Ellie”
Depression labels derived from PHQ-8 scores (≥10 = depressed, <10 = non-depressed)
Partitioned into Training (107), Development (35), Test (47) participants
Pre-extracted OpenFace features used: head pose, eye gaze, blink dynamics
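
The labelling rule above is simple enough to state directly in code. A minimal sketch, assuming integer PHQ-8 totals; `phq8_to_label` is a hypothetical helper name.

```python
PHQ8_THRESHOLD = 10  # scores >= 10 are labelled depressed, as stated above


def phq8_to_label(score: int) -> int:
    """Return 1 (depressed) or 0 (non-depressed) for a PHQ-8 total score."""
    if not 0 <= score <= 24:  # PHQ-8 has eight items, each scored 0-3
        raise ValueError(f"PHQ-8 score out of range: {score}")
    return int(score >= PHQ8_THRESHOLD)


print([phq8_to_label(s) for s in (0, 9, 10, 24)])  # [0, 0, 1, 1]
```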

Quaternion-Based Feature Fusion

Transformed 3D visual features into 4D quaternion representations
Head pose quaternion: $q_{\text{head}} = 0 + h_x i + h_y j + h_z k$
Eye gaze quaternion: $q_{\text{gaze}} = 0 + g_x i + g_y j + g_z k$
Encoding designed to capture spatial orientation and interrelationships between cues
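
The pure-quaternion encoding can be sketched in code. The source defines only the encoding itself; the Hamilton product shown here is one standard way to combine two quaternions and is included as an assumption about how interrelationships between cues might be captured, not as the project's confirmed method. The function names and input values are illustrative.

```python
import numpy as np


def pure_quaternion(v) -> np.ndarray:
    """Embed a 3D vector (x, y, z) as the pure quaternion 0 + xi + yj + zk."""
    x, y, z = v
    return np.array([0.0, x, y, z])


def hamilton_product(p: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Hamilton product of two quaternions stored as [w, x, y, z]."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return np.array([
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    ])


q_head = pure_quaternion([1.0, 0.0, 0.0])  # illustrative values, not real data
q_gaze = pure_quaternion([0.0, 1.0, 0.0])
print(hamilton_product(q_head, q_gaze))  # [0. 0. 0. 1.]  (i * j = k)
```

For two pure quaternions the product's scalar part is the negated dot product and its vector part is the cross product, so a single operation carries both the alignment and the relative orientation of the two cues.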

Model Development and Training

Selected classical ML models: SVC, Logistic Regression, Random Forest, kNN, Decision Tree, XGBoost, CatBoost
Rationale: small dataset size, interpretability, robustness to class imbalance
Excluded deep learning models due to limited samples and pre-extracted features

Hyperparameter Tuning and Validation

Used 5-fold Stratified Cross-Validation with GridSearchCV
Preserved class distribution across folds to handle imbalance
Optimized parameters for XGBoost and other models to prevent overfitting
Evaluation metrics: Accuracy, F1-score, Precision, Recall, Cohen's Kappa, ROC-AUC
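
A minimal sketch of the tuning setup described above, using scikit-learn's `GridSearchCV` with `StratifiedKFold`. The data is synthetic, and the estimator, grid values, and the `f1` scoring choice are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (not DAIC-WOZ features).
X, y = make_classification(
    n_samples=120, n_features=8, weights=[0.7, 0.3], random_state=42
)

# Stratification keeps the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5]}  # illustrative

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=cv,
    scoring="f1",  # assumption: the optimised metric is not stated in the source
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```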

Results and Analysis

Head pose quaternion features achieved highest performance (F1=0.80)
Quaternion encoding improved recall for depressed individuals compared to standard 3D features
Eye gaze quaternion features showed high recall but lower precision, useful for screening
Blink dynamics provided stable signals with moderate accuracy (~68–71%)
Decision Tree and XGBoost emerged as strong learners for quaternion features

Figure 1: Proposed Model Development Pipeline

Figure 2: Data Preprocessing and Model Development Workflow

Results (Prototype System Design)

Head Pose Analysis

Quaternion (4D) features outperformed standard 3D features in validation (Accuracy 82.86%, Recall 0.67 vs. 0.33).
Decision Tree proved robust, with a Cohen's Kappa of 0.60, on the boundary between “moderate” and “substantial” agreement.
On unseen test data, accuracy remained >60%, suggesting some generalization despite the small sample size.

Eye Gaze Analysis

Standard models achieved high accuracy (86.36%) but limited recall (0.50).
Quaternion features boosted recall to 1.00, identifying all depressed cases in the evaluated set, though at lower precision.
Demonstrates the trade-off: quaternion features improved sensitivity here at the cost of more false positives.

Blink Eye Analysis

Stable accuracy across validation and test (~68–71%).
Logistic Regression balanced performance best, achieving Recall 0.50 on test data.
Suggests blink behaviour is a consistent signal, even without complex quaternion encoding.

Blink Head Analysis

Decision Tree achieved balanced accuracy (72.73%) and Recall 0.50 on test data.
kNN models consistently flagged depressed cases (Recall up to 1.00) but struggled with healthy individuals.
Indicates “Blink Head” is a viable secondary biomarker, especially when fused with Blink Eye features.

Overall Findings

Quaternion-based features capture subtle “micro-gestures” and gaze fixations linked to depression.
Models achieved strong validation performance, with consistent >60% accuracy on unseen test data.
Trade-off observed: higher recall often came at the expense of precision, but in medical screening, sensitivity is preferred to avoid missing cases.
Combined signals (head pose, eye gaze, blink behaviour) provide a multi-modal foundation for reliable depression detection.

Figure 3. Landing Page

Figure 4. Non-depressed person dashboard result

Figure 5. Depressed person dashboard result

Implementation

Modular Design

Function-based architecture ensures reproducibility and clear separation between exploratory analysis and official evaluation.
Each visual cue (Head, Eye, Blink+Head, Blink+Eye) has two scripts:
- *.py (e.g., H.py): Exploratory analysis with internal train/validation split.
- *2.py (e.g., H2.py): Official DAIC-WOZ protocol using fixed development set.

Package Dependencies

Data Handling: pandas, numpy
Preprocessing: StandardScaler (z-score normalization), SMOTE (class imbalance handling)
Models: SVC, Logistic Regression, Random Forest, kNN, Decision Tree, XGBoost, CatBoost
Evaluation: Accuracy, Precision, Recall, F1, Cohen's Kappa, ROC-AUC
Visualization: seaborn (heatmaps), matplotlib (ROC curves)

Data Loading & Preprocessing

load_data(): Loads Excel feature matrices and labels with strict row alignment.
preprocess_data(): Applies z-score normalization, fitted only on training data to prevent leakage.
SMOTE: Applied after standardization, generating synthetic samples only in training sets.
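
The ordering described above (fit the scaler on training data, reuse it for evaluation data, then oversample the training set only) can be sketched as follows. The function name comes from the source, but its signature, the `apply_smote` flag, and the seed value are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def preprocess_data(X_train, y_train, X_eval, apply_smote=True, seed=42):
    """Scale with training statistics only, then optionally oversample training data."""
    scaler = StandardScaler().fit(X_train)  # statistics come from training data only
    X_train_s = scaler.transform(X_train)
    X_eval_s = scaler.transform(X_eval)  # reuse the fitted scaler; never refit

    if apply_smote:
        # Imported lazily so the scaling path works without imbalanced-learn.
        from imblearn.over_sampling import SMOTE
        X_train_s, y_train = SMOTE(random_state=seed).fit_resample(X_train_s, y_train)

    return X_train_s, y_train, X_eval_s, scaler


# Usage on synthetic data (SMOTE disabled to keep the demo dependency-free).
rng = np.random.default_rng(0)
X_tr = rng.normal(5, 2, size=(40, 3))
X_ev = rng.normal(5, 2, size=(10, 3))
y_tr = np.array([0] * 30 + [1] * 10)
X_tr_s, y_out, X_ev_s, _ = preprocess_data(X_tr, y_tr, X_ev, apply_smote=False)
print(np.allclose(X_tr_s.mean(axis=0), 0.0))  # True
```

Evaluation rows are never passed to `fit` or to SMOTE, so no synthetic or evaluation-derived information reaches the training statistics.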

Model Training & Evaluation

train_and_evaluate(): Encapsulates training, prediction, and metric computation.
Metrics include Accuracy, Cohen's Kappa, Precision/Recall/F1, Confusion Matrix, ROC-AUC.
Visual outputs (heatmaps, ROC curves) saved for report inclusion.
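
A sketch of what such a function might look like, built from scikit-learn's metric functions. The function name comes from the source; the signature, the returned dictionary, and the synthetic data are assumptions, and plotting is omitted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    cohen_kappa_score,
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split


def train_and_evaluate(model, X_train, y_train, X_eval, y_eval):
    """Fit the model and return the metrics listed above as a dict (sketch)."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_eval)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_eval, y_pred, average="binary", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_eval, y_pred),
        "kappa": cohen_kappa_score(y_eval, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion_matrix": confusion_matrix(y_eval, y_pred),
        # Assumes the model exposes predict_proba.
        "roc_auc": roc_auc_score(y_eval, model.predict_proba(X_eval)[:, 1]),
    }


# Synthetic stand-in data (not DAIC-WOZ features).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_ev, y_tr, y_ev = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
results = train_and_evaluate(LogisticRegression(max_iter=1000), X_tr, y_tr, X_ev, y_ev)
print({k: v for k, v in results.items() if k != "confusion_matrix"})
```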

Model Selection & Hyperparameter Tuning

Seven classical ML models tested under consistent protocols.
XGBoost tuned via exhaustive grid search with 5-fold StratifiedKFold cross-validation.
Parameters optimized: n_estimators, max_depth, learning_rate, subsample, colsample_bytree, gamma, min_child_weight.
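
Exhaustive search over seven parameters grows quickly, which the following sketch makes concrete. The parameter names are those listed above; the candidate values are illustrative assumptions, since the actual grid is not given.

```python
from sklearn.model_selection import ParameterGrid

# Parameter names as listed above; candidate values are illustrative assumptions.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "gamma": [0, 1],
    "min_child_weight": [1, 5],
}

n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 5  # one fit per candidate per fold, excluding the final refit
print(n_candidates, n_fits)  # 128 640
```

Even with only two values per parameter, 5-fold cross-validation requires 640 model fits, which is why parallel execution matters for this step.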

Execution Flow

H.py: Exploratory, uses train_test_split for internal validation.
H2.py: Official protocol, loads DAIC-WOZ development set directly for fair comparison with literature.

Computational Environment

Isolated Python virtual environment (myenv/) with fixed package versions (scikit-learn==1.3.0, xgboost==2.0.0, catboost==1.2.0, imbalanced-learn==0.11.0).
Reproducibility enforced via fixed random seeds across SMOTE, StratifiedKFold, train_test_split, and model initialization.
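
The effect of fixing seeds can be checked directly. A minimal sketch on toy data; the `SEED` value is hypothetical, as the project's actual seed is not stated.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

SEED = 42  # hypothetical value; the project's actual seed is not stated

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 14 + [1] * 6)


def split_rows(seed):
    """Return the evaluation rows chosen by a seeded, stratified split."""
    _, X_eval, _, _ = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    return X_eval[:, 0].tolist()


def fold_indices(seed):
    """Return the test indices of each fold for a seeded StratifiedKFold."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return [test.tolist() for _, test in cv.split(X, y)]


# The same seed reproduces the same partition and the same folds on every run.
print(split_rows(SEED) == split_rows(SEED))      # True
print(fold_indices(SEED) == fold_indices(SEED))  # True
```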

Output Management

Console logs: Accuracy, Kappa, Precision/Recall/F1 per iteration.
Visual outputs: Confusion matrices, ROC curves.
Serialized models: Exported via joblib/pickle (e.g., catboost_final_model.pkl).
Diagnostics: CatBoost convergence logs stored for loss curve analysis.
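
Model serialization with joblib follows a dump/load round trip. A minimal sketch with a toy classifier; the file name `demo_model.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic model standing in for the trained classifier.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = DecisionTreeClassifier(random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
print(np.array_equal(model.predict(X), restored.predict(X)))  # True
```

Pickle-based files should only be loaded from trusted sources and with the same library versions used to create them, which is one reason the pinned environment above matters.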

Cross-Feature Consistency

Template-driven pipeline ensures identical workflow across Head, Eye, Blink+Head, Blink+Eye.
Only file paths and dataset identifiers change, enabling systematic ablation studies.

Robustness & Edge Case Handling

zero_division=0 prevents metric errors when minority classes receive zero predictions.
Safe ROC-AUC handling for classifiers without probability outputs.
Strict scaler reuse prevents feature leakage.
Multi-core processing enabled via n_jobs=-1 for efficient grid search.
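
One way to handle ROC-AUC safely for classifiers without probability outputs is to fall back to `decision_function`. This is a sketch under that assumption, not the project's confirmed implementation; `safe_roc_auc` is a hypothetical helper name.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC


def safe_roc_auc(model, X, y_true):
    """Return ROC-AUC, or None when it cannot be computed (hypothetical helper)."""
    if len(np.unique(y_true)) < 2:
        return None  # ROC-AUC is undefined when only one class is present
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X)[:, 1]
    elif hasattr(model, "decision_function"):
        scores = model.decision_function(X)  # margins rank samples equally well
    else:
        return None
    return roc_auc_score(y_true, scores)


X = np.array([[0.0], [0.2], [0.8], [1.0]])
y = np.array([0, 0, 1, 1])
svc = SVC(kernel="linear").fit(X, y)  # probability=False, so no predict_proba
print(safe_roc_auc(svc, X, y))  # 1.0
```

ROC-AUC depends only on how samples are ranked, so signed margins from `decision_function` are a valid substitute for probabilities.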

Conclusion

This project demonstrates that quaternion-based feature encoding substantially improves depression detection on DAIC-WOZ compared to traditional 3D concatenation methods (F1 = 0.80 vs. F1 = 0.63). Applying quaternion algebra to the most informative visual channel, head pose, achieved competitive performance with minimal complexity.

The system design emphasizes privacy-preserving principles, ensuring that sensitive behavioural data can be processed securely without reliance on external APIs. The resulting prototype provides a viable pre-clinical screening tool that lowers barriers to mental health assessment, particularly for individuals hesitant to seek professional help due to stigma or accessibility constraints.

Overall, the project contributes to the field of AI-driven mental health screening by showing that quaternion encoding and multimodal behavioural cues can enhance sensitivity, reliability, and explainability. This lays the foundation for future work in scalable, accessible depression detection systems.

Future Development

Dataset Limitations

Small sample size (n=189) with only 22 participants in the held-out test set reduces statistical power and increases risk of overfitting.
Class imbalance (~1:2.7 ratio) required aggressive balancing (SMOTE), which may not fully capture natural variability in depressive behaviours.
Controlled laboratory conditions (consistent lighting, fixed protocol, high-quality recordings) differ from real-world home environments with variable lighting, noise, and distractions.
Demographic homogeneity (Los Angeles cohort) limits cultural and socioeconomic diversity, reducing generalizability.

Methodological Limitations

Quaternion encoding improved head pose features (F1: 0.63 → 0.80) but showed limited benefits for eye gaze and blink dynamics.
Combined feature sets (Blink+Head, Blink+Eye) underperformed compared to head pose alone, suggesting quaternion algebra is most effective for rotational cues.
Exclusion of Facial Action Units (FAUs) may have limited performance, as FAUs are established depression indicators.

Technical Limitations

Analysis relied on pre-extracted features; real-time extraction introduces latency, tracking failures, and quality degradation under suboptimal conditions.
Classical ML models provided interpretability but may lack the representational capacity of deep learning approaches.
Exclusion of audio and text modalities preserved privacy but limited performance compared to multimodal systems.

Future Work

Validation on Larger, Diverse Datasets: Replicate findings on independent datasets (e.g., AVEC 2019), expand participant diversity, and increase sample size for stronger statistical power.
Real-World Deployment Studies: Test robustness in uncontrolled home environments with variable lighting, camera quality, and background noise; conduct longitudinal studies to assess temporal stability.
Advanced Feature Engineering: Explore quaternion encoding for additional rotational cues (body posture, gait), investigate hybrid fusion strategies, and incorporate temporal dynamics via sequence modeling.
Clinical Translation: Conduct usability testing with target populations, define clinical decision thresholds balancing sensitivity and specificity, integrate with mental health services, and address regulatory requirements.