School of Science and Technology 科技學院
Computing Programmes 電腦學系

PetVision: Advanced System for Multi-Level Pet Classification

Wong Ho Yin, Yu Yiu Pang, Chan Tsz Ho, Huang Liming

Programme: Bachelor of Computing with Honours in Computer Science
Supervisor: Dr. Kayley Xiaoxue Ma
Areas: Intelligent Applications
Year of Completion: 2026

Objectives

Project Aim

The aim of this project is to design and deploy PetVision, a hybrid intelligence system that overcomes current limitations of automated computer vision for pet care. By integrating Multi-Task Learning (MTL) and Large Language Models (LLMs), PetVision classifies species, breed, size, and coat hierarchically while also inferring complex attributes such as physical condition and mood context. The goal is to provide pet owners with transparent, immediate, and human-grade health and care insights.

Project Objectives
The objectives of this project include:
  1. Develop a Hierarchical MTL Model: Construct a high-efficiency MTL model based on a MobileNetV2 backbone to classify species, breeds, and physical attributes simultaneously, while overcoming severe class imbalances in training datasets.
  2. Create Deterministic Logic Guardrails: Implement a post-processing layer that cross-checks AI predictions against structured biological data (e.g., mapping breed to species) to eliminate logical hallucinations and classification errors.
  3. Ensure Explainable AI (XAI): Integrate Grad-CAM (Gradient-weighted Class Activation Mapping) to generate visual heatmaps, allowing users to validate AI decision-making with clear physical evidence.
  4. Enable Advanced Cognitive Inference via LLM: Leverage the Pixtral-12B-2409 multimodal model to synthesize CNN outputs with objective image data, bridging raw classification with actionable veterinary-grade advice.
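
The Grad-CAM computation named in objective 3 can be sketched in a few lines. This is a minimal NumPy illustration, not the project's implementation: it assumes the activations and gradients of one convolutional layer have already been extracted (the source does not say which layer is used), and the toy arrays are hypothetical values.

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Compute a Grad-CAM heatmap for one convolutional layer.

    activations: (H, W, K) feature maps of the chosen layer.
    gradients:   (H, W, K) gradient of the target class score w.r.t. those maps.
    Returns an (H, W) heatmap scaled to [0, 1].
    """
    # Channel importance: global-average-pool the gradients over space.
    weights = gradients.mean(axis=(0, 1))  # shape (K,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence only.
    cam = np.maximum((activations * weights).sum(axis=-1), 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Hypothetical toy input: 2x2 spatial grid, 2 channels.
acts = np.array([[[1.0, 0.0], [0.0, 1.0]],
                 [[2.0, 0.0], [0.0, 0.0]]])
# Channel 0 supports the class (+1), channel 1 opposes it (-1).
grads = np.ones((2, 2, 2)) * np.array([1.0, -1.0])
heatmap = grad_cam(acts, grads)
```

In a deployed system the heatmap would typically be upsampled to the input resolution and overlaid on the uploaded image.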

Methodologies and Technologies used

Data Engineering
  • Aggregated 159,935 images from Oxford-IIIT Pet, Kaggle Dogs, Cat Breeds, and Reptiles/Amphibians datasets.
  • Initial challenge: severe long-tail distribution (e.g., Domestic Shorthair ~33% of data vs. rare breeds <10 samples).
  • Implemented Heuristic Sampling Strategy with a Global Breed Cap of 200 images per class.
  • Applied Iterative Offline Augmentation (rotations, brightness shifts, flips) to balance rare categories.
  • Produced a balanced dataset of ~50,000 high-variance images, enabling focus on morphological features rather than statistical frequency.
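
The capping and balancing steps above can be illustrated with a short sketch. It assumes samples are `(image_path, breed)` pairs and that offline augmentation tops each rare breed up to the same 200-image cap; the source states the cap but not the augmentation target, so that part is an assumption, and the file names are hypothetical.

```python
import random
from collections import defaultdict

BREED_CAP = 200  # Global Breed Cap stated in the text

def cap_per_breed(samples, cap=BREED_CAP, seed=0):
    """Keep at most `cap` randomly chosen images per breed.

    samples: iterable of (image_path, breed) pairs.
    Returns a dict mapping breed -> list of kept image paths.
    """
    rng = random.Random(seed)
    by_breed = defaultdict(list)
    for path, breed in samples:
        by_breed[breed].append(path)
    return {
        breed: rng.sample(paths, cap) if len(paths) > cap else list(paths)
        for breed, paths in by_breed.items()
    }

def augmentation_deficit(capped, target=BREED_CAP):
    """Augmented copies each breed would need to reach `target` (assumed target)."""
    return {breed: max(0, target - len(paths)) for breed, paths in capped.items()}

# Hypothetical toy data: one dominant breed and one rare breed.
samples = [(f"dsh_{i}.jpg", "Domestic Shorthair") for i in range(1000)]
samples += [(f"rare_{i}.jpg", "Rare Breed") for i in range(8)]

capped = cap_per_breed(samples)
deficit = augmentation_deficit(capped)
```

Here the dominant breed is cut down to the cap, while `deficit` reports how many rotated, flipped, or brightness-shifted copies the rare breed would need.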
Multi-Task Learning Architecture
  • Phase 1: Over-engineered design with 13 dense layers (2048 units) caused vanishing gradients and poor validation scores.
  • Phase 2: Optimized design reduced breed branch to 5 layers (512→128 units), improving convergence speed and accuracy by 20%.
  • Introduced Channel Attention Mechanism to weigh feature maps dynamically.
  • Switched from Standard Cross-Entropy to Weighted Focal Loss to prioritize hard-to-classify breeds.
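
To show why the switch from cross-entropy matters, here is a minimal single-sample sketch of a class-weighted focal loss. The focusing parameter `gamma=2.0` and the class weights are assumptions for illustration; the source does not report the values used.

```python
import math

def weighted_focal_loss(probs, true_idx, class_weights, gamma=2.0):
    """Class-weighted focal loss for one sample.

    probs:         predicted class probabilities (summing to 1).
    true_idx:      index of the ground-truth class.
    class_weights: per-class weights, e.g. larger for rare breeds.
    gamma:         focusing parameter (assumed value); gamma=0 recovers
                   weighted cross-entropy.
    """
    p_t = max(probs[true_idx], 1e-7)  # clamp to avoid log(0)
    return -class_weights[true_idx] * (1.0 - p_t) ** gamma * math.log(p_t)

uniform = [1.0, 1.0, 1.0]
easy = weighted_focal_loss([0.9, 0.05, 0.05], 0, uniform)  # confident and correct
hard = weighted_focal_loss([0.2, 0.5, 0.3], 0, uniform)    # true class under-predicted
plain_ce = weighted_focal_loss([0.9, 0.05, 0.05], 0, uniform, gamma=0.0)
```

The `(1 - p_t) ** gamma` factor shrinks the loss of well-classified samples, so gradient updates are dominated by hard or rare breeds.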
Training Optimization & Loss Functions
  • Online Data Augmentation applied random transformations (rotations, flips, brightness adjustments) per batch to reduce overfitting.
  • Final task weights: [0.8 species, 1.2 breed, 1.0 size, 1.0 coat] after resolving dataset imbalance.
  • Composite loss strategy: Sparse Categorical Crossentropy for simple tasks, Weighted Focal Loss for breed classification.
  • Two-phase learning rate scheduling: warm-up + cosine decay (Phase 1), fine-tuning with cosine annealing restarts (Phase 2).
  • Adam optimizer chosen for robustness and rapid convergence.
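
Two items from the list above lend themselves to a short sketch: the task-weighted composite loss and the Phase 1 warm-up plus cosine decay schedule. The task weights are taken from the text; the step counts, base learning rate, and linear warm-up shape are assumptions, since the source does not give them.

```python
import math

# Final task weights reported in the text.
TASK_WEIGHTS = {"species": 0.8, "breed": 1.2, "size": 1.0, "coat": 1.0}

def composite_loss(task_losses, weights=TASK_WEIGHTS):
    """Weighted sum of the per-task losses."""
    return sum(weights[task] * loss for task, loss in task_losses.items())

def warmup_cosine_lr(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warm-up to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical settings for illustration only.
total = composite_loss({"species": 1.0, "breed": 1.0, "size": 1.0, "coat": 1.0})
schedule = [warmup_cosine_lr(s, total_steps=100, warmup_steps=10, base_lr=1e-3)
            for s in range(101)]
```

The Phase 2 variant with restarts would repeat the cosine segment over shorter cycles rather than decaying once.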
Regularization & Generalization
  • Dropout: aggressive rates (0.65–0.50) in breed branch to counter overfitting.
  • L2 regularization applied to fully connected layers (coefficients 0.0005→0.0003).
  • Batch Normalization added after convolutional and dense layers to stabilize training.
  • Residual connections introduced to improve gradient flow and prevent degradation in deep stacks.
Deterministic Logic Guardrails
  • Implemented a safety layer to cross-check CNN outputs against structured biological data.
  • Overrides biologically impossible predictions (e.g., “Beagle” classified as “Cat”).
  • Ensures user-facing results remain consistent with zoological reality.
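
A guardrail of this kind can be as simple as a lookup table. The sketch below assumes one possible policy, in which the predicted breed determines the species and a conflicting species label is overridden (as in the "Beagle" classified as "Cat" example); the mapping excerpt and the function name are hypothetical.

```python
# Hypothetical excerpt of the structured biological data (breed -> species).
BREED_TO_SPECIES = {
    "Beagle": "Dog",
    "Pug": "Dog",
    "Birman": "Cat",
    "Chartreux": "Cat",
}

def apply_guardrail(species, breed):
    """Override a species label that is impossible for the predicted breed.

    Assumed policy: trust the breed-to-species mapping when the two heads
    disagree. Unknown breeds pass through unchanged.
    """
    expected = BREED_TO_SPECIES.get(breed)
    if expected is None or expected == species:
        return {"species": species, "breed": breed, "overridden": False}
    return {"species": expected, "breed": breed, "overridden": True}

fixed = apply_guardrail("Cat", "Beagle")
untouched = apply_guardrail("Dog", "Pug")
```

An alternative policy would trust the more accurate species head and re-rank breeds within that species; the source does not say which direction is used in every case.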
Cognitive Inference & Prompt Engineering
  • Final LLM: Pixtral-12B-2409 (Mistral AI Cloud API) chosen for multimodal inference.
  • Inputs: structured text prompt, CNN outputs (species, breed, size, coat), and user-uploaded image.
  • Configured with low temperature (0.25) and strict JSON output for reproducibility and front-end parsing.
  • Transitioned from local Ollama deployment (slow, hardware stress) to cloud API (latency <10s per image, dataset processed in ~7 hours).
  • Few-Shot Prompting adopted for structured diagnostic reports; Zero-Shot found inconsistent and shallow.
  • Chain-of-Thought prompting guides reasoning: breed health traits → individual condition → practical suggestions.
  • Logical safeguards: reject conflicts, enforce breed lists, flag problematic breeds, and override false positives (e.g., toys misclassified as pets).
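
The logical safeguards above imply a validation step between the LLM's JSON reply and the front end. The following sketch shows one way this could look; the JSON field names, the breed list excerpt, and the fallback behaviour are all assumptions, and no call to the Mistral API is shown.

```python
import json

ALLOWED_BREEDS = {"Beagle", "Pug", "Birman", "Chartreux"}  # hypothetical excerpt
REQUIRED_KEYS = {"is_pet", "breed", "reasoning"}           # hypothetical schema

def validate_llm_reply(raw_reply, cnn_breed):
    """Check an LLM reply against the safeguards and decide what to show.

    Falls back to the CNN breed when the reply is not valid JSON, misses a
    required field, or names a breed outside the enforced list.
    """
    fallback = {"status": "fallback_to_cnn", "breed": cnn_breed}
    try:
        reply = json.loads(raw_reply)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(reply, dict) or not REQUIRED_KEYS.issubset(reply):
        return fallback
    if not reply["is_pet"]:
        # Overrides CNN false positives such as food or toys.
        return {"status": "no_pet_detected", "breed": None}
    if reply["breed"] not in ALLOWED_BREEDS:
        return fallback
    return {"status": "accepted", "breed": reply["breed"]}

accepted = validate_llm_reply(
    '{"is_pet": true, "breed": "Birman", "reasoning": "colour points"}', "Chartreux")
no_pet = validate_llm_reply(
    '{"is_pet": false, "breed": null, "reasoning": "cooked food"}', "Pug")
bad_json = validate_llm_reply("not json at all", "Pug")
off_list = validate_llm_reply(
    '{"is_pet": true, "breed": "Unicorn", "reasoning": "horn"}', "Pug")
```

Keeping this check deterministic means a malformed or off-list reply degrades to the CNN result instead of reaching the user.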
Final Integration
  • Combined CNN perception engine, deterministic guardrails, and LLM cognitive inference into a unified pipeline.
  • Delivered structured, explainable, and biologically consistent outputs with professional veterinary-style insights.

Figure 1. Flowchart of LLM logic

Results (Prototype System Design)

Model Selection Benchmarking
  • Compared five CNN backbones: MobileNetV2, DenseNet121, ResNet50, EfficientNetB0, and MobileNetV3Large.
  • MobileNetV2 selected as optimal backbone with validation accuracies: 0.9912 (Species), 0.9601 (Coat), 0.9181 (Size), 0.6701 (Breed).
  • Demonstrated stable convergence, efficient parameter count, and lowest computational resource demand.
Total System Performance Analysis
  • Species recognition achieved 95.2% overall accuracy across 8 biological classes.
  • Cats (98.3%) and Dogs (96.0%) achieved near-perfect separation; Amphibians and Fish maintained >75% accuracy.
  • Hybrid CNN + LLM integration improved overall accuracy from 49.34% (CNN only) to 56.37% (a lift of 7.03 percentage points).
  • Global metrics (Precision, Recall, F1-Score) improved by 14.4%, validating the cognitive layer's contribution.
  • Coat Type accuracy: 93.0%; Body Size accuracy: 91.8%; Breed identification remains most challenging at 56.37%.
Discriminative Power Analysis (ROC)
  • ROC curve shows PetVision outperforms baseline CNN with higher AUC.
  • Operating point shifted to ~56.4% recall vs. ~50% baseline, with near-zero false positives.
  • LLM acts as “Cognitive Calibrator,” improving discrimination of ambiguous breeds.
Classification Error Analysis
  • Species classification: 98.5% accuracy, near-perfect feline vs. canine separation.
  • Coat: 90.7% accuracy; Size: 87.0% accuracy, minor confusion in adjacent categories.
  • Breed classification: 72.1% accuracy across 280+ classes, errors limited to visually indistinguishable sub-breeds.
  • Case study: Miniature vs. Toy Poodles misclassified due to 2D scale ambiguity; Lhasa vs. Shih Tzu confusion at 53.3% misidentification.
Breed-wide Distribution & Outlier Analysis
  • Out-of-Distribution items (Food, Furniture) correctly filtered at 100% accuracy.
  • LLM recovered low-confidence breeds (e.g., Toad +85%, Cocker Spaniel +55%).
  • Histogram shows ~10% average net accuracy increase across breeds; small subset experienced demotion (~6.9%).
  • Trade-offs: Birman and Chartreux performance decreased due to LLM over-correction.
Decision Path Analysis
  • CNN baseline correctly identified 42.4% of samples.
  • LLM recovered a further 13.9% of all samples, salvaging ~24.1% of the CNN's misclassifications.
  • Total success rate: 56.37%; failure rate reduced to 36.7%.
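
The decision-path percentages can be cross-checked with simple arithmetic. The sketch below uses one reading that makes the reported figures add up: it assumes the ~6.9% demotion figure quoted elsewhere in the results is a share of all samples (cases the CNN had right and the LLM overturned). That interpretation is an assumption, not something the source states.

```python
# Shares of all test samples, in percent, as reported in the text.
cnn_correct_kept = 42.4   # CNN correct and retained
llm_recovered = 13.9      # CNN wrong, LLM corrected
both_failed = 36.7        # neither CNN nor LLM correct
llm_demoted = 6.9         # ASSUMPTION: CNN correct, LLM over-corrected

# Hybrid success: retained CNN hits plus LLM recoveries.
total_success = cnn_correct_kept + llm_recovered

# CNN-only accuracy under this reading: retained hits plus demoted hits.
cnn_only = cnn_correct_kept + llm_demoted

# Share of CNN-path misses that the LLM salvaged.
recovered_share = llm_recovered / (100.0 - cnn_correct_kept)

# The four paths should cover (almost) all samples.
accounted = cnn_correct_kept + llm_recovered + both_failed + llm_demoted
```

Under this reading `total_success` and `cnn_only` land within rounding of the reported 56.37% and 49.34%, and `recovered_share` reproduces the ~24.1% figure.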
LLM Efficiency & Edge Case Handling
  • Baseline CNN failed 100% on adversarial “Other” categories (food, furniture, cartoons).
  • Pixtral-12B Reasoning Layer correctly identified 100% of OOD samples, preventing hallucinations.
  • Local Llama 3.2 Vision deployment too slow (2–3 min/image, ~15 days total); migrated to Mistral AI Cloud API.
  • Cloud inference reduced latency to <10s per image, completing the 6,000-image dataset in ~7 hours.
Limitations of LLM Arbitration
  • Spatial constraints: 2D images lack absolute scale references, causing ambiguity in size-based breeds.
  • Low inter-class variability: LLM defaults to statistically common breeds when morphology is nearly identical.
  • Cognitive over-correction: 6.9% of breeds demoted due to excessive skepticism of CNN predictions.
Qualitative Case Studies
  • Case 1: Perfect identification of Pug as “dog” with 100% confidence.
  • Case 2: LLM rescued low-confidence CNN prediction, correcting breed label.
  • Case 3: Corrected misjudgment of highly similar cat breeds via Logic Guardrail.
  • Case 4: Logical error correction — CNN misclassified species, guardrail corrected to “Dog.”
  • Case 5: Safety test — correctly filtered “Roasted Pig” as “Food/Pork,” triggering “No Pet Detected.”
User Evaluation Plan
  • Survey designed with ethical considerations and consent requirements.
  • Forced-choice 4-point Likert scale used to measure feedback.
  • Three evaluation areas: Usefulness (Grad-CAM confidence), Value (market worth of health/mood reports), Usability (Streamlit interface and LLM response time).
Testing Results
Species Recognition
  • Achieved 95.2% overall accuracy across 8 biological classes.
  • Cats (98.3%) and Dogs (96.0%) reached near-perfect separation.
  • Challenging categories such as Amphibians and Fish maintained >75% accuracy.
  • Provided a reliable foundation for hierarchical classification and logic guardrails.
Breed Identification
  • Pure CNN baseline achieved 49.34% accuracy; hybrid CNN + LLM improved to 56.37% (a lift of 7.03 percentage points).
  • Breed classification accuracy reached 72.1% across 280+ classes, despite high cardinality.
  • LLM recovered ~24.1% of CNN errors, correcting ambiguous or low-confidence predictions.
  • ROC analysis confirmed improved discriminative power with higher AUC and near-zero false positives.
Attribute Classification
  • Coat Type: 93.0% accuracy.
  • Body Size: 91.8% accuracy, though scale ambiguity (e.g., Toy vs. Miniature Poodles) caused misclassifications.
  • Species-level correctness maintained at 99.1%, ensuring taxonomic consistency even when sub-breed confusion occurred.
Error & Outlier Analysis
  • Confusion matrices revealed minor “bleeding” between adjacent categories (e.g., short vs. long hair).
  • LLM successfully filtered Out-of-Distribution (OOD) items such as food and furniture with 100% accuracy.
  • Recovered low-confidence breeds (e.g., Toad +85%, Cocker Spaniel +55%) through morphological reasoning.
  • Trade-offs observed: Birman and Chartreux performance decreased due to LLM over-correction.
Decision Path Evaluation
  • CNN baseline correctly identified 42.4% of samples.
  • LLM salvaged 13.9% of total samples, narrowing failure rate to 36.7%.
  • Total system success rate: 56.37%, validating hybrid intelligence integration.
Qualitative Case Studies
  • Case 1: Perfect identification of Pug with 100% confidence.
  • Case 2: LLM rescued low-confidence CNN prediction, correcting breed label.
  • Case 3: Corrected misjudgment of highly similar cat breeds via Logic Guardrail.
  • Case 4: Logical error correction — CNN misclassified species, guardrail corrected to “Dog.”
  • Case 5: Safety test — correctly filtered “Roasted Pig” as “Food/Pork,” triggering “No Pet Detected.”

Figure 2. Pie chart of overall CNN species accuracy

Figure 3. Pie chart of PetVision decision path distribution

Figure 4. Result sample of Case 1

Figure 5. Result sample of Case 1

Implementation

Deployment Hardware (Remote Server)
  • Device: Remote GPU server provisioned for deep learning workloads.
  • Operating System: Linux-based environment optimized for CUDA and TensorFlow-GPU.
  • Processor: High-performance multi-core CPU with NVIDIA GPU acceleration.
  • Memory: 64GB RAM and 1TB SSD storage for dataset handling and fast I/O.
  • Networking: High-speed Ethernet/Wi-Fi ensuring stable real-time inference and data streaming.
  • AI Processing: Handles CNN training, LLM inference, Grad-CAM visualization, and logic guardrails.
  • Scalability: Supports multiple concurrent inference requests without reliance on third-party APIs.
Final System Workflow
  • Connection: User uploads pet images via Streamlit interface; images are sent to the server.
  • Perception Engine: MobileNetV2 backbone performs multi-task classification (species, breed, size, coat).
  • Logic Guardrails: Deterministic checks validate CNN outputs against biological mappings to prevent hallucinations.
  • Cognitive Layer: Pixtral-12B multimodal LLM synthesizes CNN outputs with contextual reasoning.
  • Output Generation: Grad-CAM heatmaps provide visual evidence; structured JSON reports deliver health/mood insights.
  • User Interface: Streamlit dashboard displays synchronized evidence (heatmaps) and insights (LLM reports) with adaptive accessibility features.
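
The workflow above can be summarised as a small orchestration function. This is a structural sketch only: each stage is passed in as a callable so the example runs without any model, and all function names, stub outputs, and result keys are hypothetical.

```python
from typing import Any, Callable, Dict

Prediction = Dict[str, str]

def run_petvision(
    image: bytes,
    perceive: Callable[[bytes], Prediction],
    guard: Callable[[Prediction], Prediction],
    explain: Callable[[bytes, Prediction], Any],
    reason: Callable[[bytes, Prediction], Dict[str, Any]],
) -> Dict[str, Any]:
    """Run the stages in the order described: CNN, guardrails, Grad-CAM, LLM."""
    raw = perceive(image)             # multi-task CNN outputs
    checked = guard(raw)              # deterministic consistency checks
    heatmap = explain(image, checked) # visual evidence for the UI
    report = reason(image, checked)   # structured LLM report
    return {"prediction": checked, "heatmap": heatmap, "report": report}

# Stub stages standing in for the real components.
result = run_petvision(
    b"fake-image-bytes",
    perceive=lambda img: {"species": "Cat", "breed": "Beagle",
                          "size": "Medium", "coat": "Short"},
    guard=lambda p: {**p, "species": "Dog"} if p["breed"] == "Beagle" else p,
    explain=lambda img, p: "heatmap-placeholder",
    reason=lambda img, p: {"summary": f"{p['breed']} ({p['species']})"},
)
```

Passing the guarded prediction, rather than the raw CNN output, to the later stages keeps the heatmap and the report consistent with what the user sees.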

Conclusion

Summary of Achievements
  • Developed a Hybrid Intelligence System combining MobileNetV2-based Multi-Task Learning (MTL), deterministic logic guardrails, and LLM reasoning.
  • Achieved >99% accuracy in species classification, ensuring taxonomic reliability.
  • Integrated Grad-CAM for explainable AI, providing transparency and user trust in model decisions.
  • LLM (Pixtral-12B / Llama 3.2 Vision) enabled advanced reasoning, delivering holistic and actionable care insights.
  • UI/UX refinements (dual-entry upload, visual affordance, file sync lock, adaptive accessibility) improved usability, reduced cognitive load, and enhanced user confidence.
Limitations of the Solution
  • Rare breed accuracy remains limited due to insufficient training samples for uncommon sub-breeds.
  • Cognitive inference latency: local deployment of Llama 3.2 Vision required ~1 minute per inference, impacting real-time usability.
  • LLM arbitration occasionally over-corrected CNN predictions, leading to ~6.9% demoted breeds.
Suggestions for Future Work
  • Technical Refinement & Scalability: Fine-tune deeper MobileNetV2 layers for detailed texture recognition; implement RAG Lite with SQLite for faster, more specific LLM responses; strengthen security with MIME validation, file-size limits, and API rate-limiting.
  • Interactive Consultation: Transition from static diagnostic reports to interactive expert dialogs using Streamlit modal framework; enable contextual continuity and multi-modal deep inquiry for a “Virtual Veterinarian” experience; introduce focus mode UI for immersive consultations.
  • AI-Driven Veterinary Health Screening: Expand perception engine to detect clinical markers (dermatological issues, ocular abnormalities, postural analysis for joint pain or dysplasia).
  • Integration of Medical-Grade Knowledge: Implement Retrieval-Augmented Generation (RAG) linked to veterinary texts, enabling correlation of visual findings with breed-specific genetic diseases.
  • Health Tracking Over Time: Introduce longitudinal monitoring of body condition score (BCS) and coat quality, allowing proactive health management for pets.

Future Development

Technical Refinement & Scalability
  • Deep Layer Fine-Tuning: Unfreeze deeper layers of MobileNetV2 to capture fine-grained breed textures.
  • RAG Lite Implementation: Enhance the SQLite veterinary knowledge base to support Retrieval-Augmented Generation (RAG) for faster, more specific LLM responses without costly retraining.
  • Security Hardening: Protect the web application against malicious uploads and API abuse with MIME type validation, file-size limits, and rate-limiting (e.g., SlowAPI).
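
As an illustration of the upload checks mentioned under Security Hardening, the sketch below validates size and sniffs the file signature instead of trusting the file extension. The 10 MB limit and the restriction to JPEG and PNG are assumptions; rate-limiting is not shown.

```python
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # hypothetical limit (10 MB)

# Leading "magic" bytes of the accepted image formats.
MAGIC_NUMBERS = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
}

def sniff_image_mime(data: bytes):
    """Return the MIME type implied by the file signature, or None."""
    for magic, mime in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return mime
    return None

def validate_upload(data: bytes, max_bytes: int = MAX_UPLOAD_BYTES):
    """Return (ok, detail): detail is the MIME type or a rejection reason."""
    if not data:
        return False, "empty file"
    if len(data) > max_bytes:
        return False, "file too large"
    mime = sniff_image_mime(data)
    if mime is None:
        return False, "unsupported or disguised file type"
    return True, mime

png_ok = validate_upload(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16)
script_rejected = validate_upload(b"#!/bin/sh\nrm -rf /\n")
too_big = validate_upload(b"\xff\xd8\xff" + b"\x00" * 64, max_bytes=32)
```

Signature checks are a first filter only; fully decoding the image with an imaging library before inference would catch truncated or malformed files as well.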
Interactive Consultation
  • Transition from static diagnostic reports to interactive expert dialogs using Streamlit modal framework.
  • Contextual continuity: AI expert will retain CNN results and structured attributes as context for follow-up inquiries.
  • Multi-modal deep inquiry: Enable users to ask detailed questions about health, care, training, and breed standards for a “Virtual Veterinarian” experience.
  • Focus mode UI: Provide immersive consultation with background-dimmed modal windows to reduce distractions.
AI-Driven Veterinary Health Screening
  • Expand perception engine to detect clinical markers such as dermatological issues (redness, alopecia), ocular abnormalities (lens cloudiness, discharge), and postural analysis for joint pain or dysplasia.
  • Shift system focus from “What kind of pet is this?” to “What is the status of this pet's health?” for added clinical and commercial value.
Integration of Medical-Grade Knowledge
  • Implement Retrieval-Augmented Generation (RAG) linked to professional veterinary texts.
  • Enable correlation of visual findings with breed-specific genetic diseases, providing pre-diagnostic insights for veterinarians.
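
A retrieval step of this kind could be prototyped with Python's built-in `sqlite3` module, assuming the knowledge base stays in SQLite as in the RAG Lite item. The table name, columns, prompt wording, and note texts below are placeholders, not real veterinary content.

```python
import sqlite3

# In-memory database standing in for the veterinary knowledge base.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE breed_notes (breed TEXT, note TEXT)")
conn.executemany(
    "INSERT INTO breed_notes (breed, note) VALUES (?, ?)",
    [
        ("Pug", "placeholder note A"),
        ("Pug", "placeholder note B"),
        ("Birman", "placeholder note C"),
    ],
)

def retrieve_notes(db, breed, limit=3):
    """Fetch up to `limit` stored notes for the predicted breed."""
    rows = db.execute(
        "SELECT note FROM breed_notes WHERE breed = ? ORDER BY rowid LIMIT ?",
        (breed, limit),
    ).fetchall()
    return [row[0] for row in rows]

def build_prompt(breed, notes):
    """Prepend retrieved notes to the question sent to the LLM."""
    context = "\n".join(f"- {note}" for note in notes) or "- (no notes found)"
    return f"Reference notes for {breed}:\n{context}\n\nAssess the pet in the image."

pug_notes = retrieve_notes(conn, "Pug")
prompt = build_prompt("Pug", pug_notes)
```

Because the CNN already supplies a breed label, an exact-match lookup may be enough for a first version; embedding-based search would only be needed for free-text questions.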
Health Tracking Over Time
  • Introduce longitudinal monitoring of pets' body condition score (BCS) and coat quality.
  • Provide proactive health management by tracking changes across multiple uploads over time.