Please use this identifier to cite or link to this item: http://hdl.handle.net/2080/5368
Title: EnhanceNet: Leveraging Facial, Speech, and Textual Cues for Multimodal Emotion Recognition
Authors: Sahoo, Prachyut Priyadarshi
Patra, Dipti
Keywords: Multimodal Emotion Recognition
Deep Learning
Facial Expression Recognition
Speech Emotion Analysis
Natural Language Processing
Mental Health
Issue Date: Oct-2025
Citation: 4th IEEE International Conference on Computer Vision and Machine Intelligence (CVMI), NIT, Rourkela, 12-13 October 2025
Abstract: Multimodal emotion recognition plays a vital role in affective computing, with applications spanning mental health monitoring and adaptive learning systems. This paper introduces EnhanceNet, a comprehensive deep learning framework that fuses three key modalities: facial expression recognition, speech emotion analysis, and spoken language understanding. Each modality employs a specialized neural network architecture — a residual CNN with squeeze-and-excitation blocks for facial cues, a CNN-LSTM model for speech signals, and a BiLSTM network for textual transcripts — trained on diverse, widely used datasets. Unlike conventional late fusion approaches, EnhanceNet adopts an early fusion strategy by averaging the predicted emotion vectors from each modality to form a robust, unified emotional profile: E_final = (E_face + E_speech + E_text) / 3. This approach leverages the complementary strengths of the individual modalities, mitigating challenges such as facial occlusion, ambiguous vocal intonation, and sparse linguistic content. The fused model achieves an overall accuracy of 78.23%, outperforming unimodal baselines. The system supports real-time facial expression analysis via webcam and asynchronous processing of audio and text inputs, demonstrating robustness to environmental variability such as lighting conditions and background noise. The results suggest EnhanceNet as a practical foundation for scalable, real-world multimodal emotion recognition systems, with potential impact on next-generation affective technologies.
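The fusion rule stated in the abstract, E_final = (E_face + E_speech + E_text) / 3, can be sketched as a simple element-wise average of per-modality probability vectors. The emotion label set and the example probability values below are illustrative assumptions, not taken from the paper:

```python
# Minimal sketch of EnhanceNet's averaging fusion, assuming each modality
# outputs a probability distribution over the same ordered label set.
EMOTIONS = ["angry", "happy", "neutral", "sad"]  # hypothetical label set


def fuse_emotions(e_face, e_speech, e_text):
    """Element-wise average of three per-modality emotion probability vectors."""
    return [(f + s + t) / 3.0 for f, s, t in zip(e_face, e_speech, e_text)]


# Illustrative per-modality predictions (each sums to 1).
e_face = [0.1, 0.6, 0.2, 0.1]
e_speech = [0.2, 0.5, 0.2, 0.1]
e_text = [0.0, 0.4, 0.5, 0.1]

fused = fuse_emotions(e_face, e_speech, e_text)
# Final label is the argmax of the fused distribution.
label = EMOTIONS[max(range(len(fused)), key=fused.__getitem__)]
```

Because each input is a valid probability distribution, the averaged vector also sums to one, so no renormalization step is needed after fusion.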
Description: Copyright belongs to the proceeding publisher.
URI: http://hdl.handle.net/2080/5368
Appears in Collections: Conference Papers

Files in This Item:
File: 2025_CVMI_PPSahoo_Enhance.pdf — Size: 2.72 MB — Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.