Please use this identifier to cite or link to this item:
http://hdl.handle.net/2080/5368

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Sahoo, Prachyut Priyadarshi | - |
| dc.contributor.author | Patra, Dipti | - |
| dc.date.accessioned | 2025-11-19T13:08:08Z | - |
| dc.date.available | 2025-11-19T13:08:08Z | - |
| dc.date.issued | 2025-10 | - |
| dc.identifier.citation | 4th IEEE International Conference on Computer Vision and Machine Intelligence (CVMI), NIT, Rourkela, 12-13 October 2025 | en_US |
| dc.identifier.uri | http://hdl.handle.net/2080/5368 | - |
| dc.description | Copyright belongs to the proceeding publisher. | en_US |
| dc.description.abstract | Multimodal emotion recognition plays a vital role in affective computing, with applications spanning mental health monitoring and adaptive learning systems. This paper introduces EnhanceNet, a comprehensive deep learning framework that fuses three key modalities: facial expression recognition, speech emotion analysis, and spoken language understanding. Each modality employs a specialized neural network architecture: a residual CNN with squeeze-and-excitation blocks for facial cues, a CNN-LSTM model for speech signals, and a BiLSTM network for textual transcripts, each trained on diverse, widely used datasets. Unlike conventional late fusion approaches, EnhanceNet adopts an early fusion strategy by averaging the predicted emotion vectors from each modality to form a robust, unified emotional profile: E_final = (1/3)(E_face + E_speech + E_text) (a minimal sketch of this averaging step follows the metadata table). This approach leverages the complementary strengths of the individual modalities, mitigating challenges such as facial occlusion, ambiguous vocal intonation, or sparse linguistic content. The fused model achieves an overall accuracy of 78.23%, outperforming unimodal baselines. The system supports real-time facial expression analysis via webcam and asynchronous processing of audio and text inputs, demonstrating robustness to environmental variability such as lighting conditions and background noise. The results suggest EnhanceNet as a practical foundation for scalable, real-world multimodal emotion recognition systems, with potential impact on next-generation affective technologies. | en_US |
| dc.subject | Multimodal Emotion Recognition | en_US |
| dc.subject | Deep Learning | en_US |
| dc.subject | Facial Expression Recognition | en_US |
| dc.subject | Speech Emotion Analysis | en_US |
| dc.subject | Natural Language Processing | en_US |
| dc.subject | Mental Health | en_US |
| dc.title | EnhanceNet: Leveraging Facial, Speech, and Textual Cues for Multimodal Emotion Recognition | en_US |
| dc.type | Article | en_US |
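
The abstract's fusion rule, E_final = (1/3)(E_face + E_speech + E_text), averages one emotion vector per modality. The sketch below illustrates only that averaging step; the emotion label order, the function name `fuse_emotions`, and the assumption that each modality emits a probability vector over the same classes are illustrative assumptions, not details taken from this record.

```python
import numpy as np

# Hypothetical label order; the record does not list the emotion classes EnhanceNet uses.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def fuse_emotions(e_face, e_speech, e_text):
    """Average the per-modality emotion vectors into one fused profile.

    Follows E_final = (1/3)(E_face + E_speech + E_text) from the abstract,
    assuming each input is a probability vector over the same label order.
    """
    e_face, e_speech, e_text = (np.asarray(v, dtype=float) for v in (e_face, e_speech, e_text))
    e_final = (e_face + e_speech + e_text) / 3.0
    return e_final, EMOTIONS[int(np.argmax(e_final))]

# Example: the face model is uncertain (e.g. partial occlusion), but speech and text agree on "happy".
face   = [0.15, 0.05, 0.10, 0.20, 0.15, 0.15, 0.20]
speech = [0.05, 0.02, 0.03, 0.60, 0.10, 0.10, 0.10]
text   = [0.05, 0.05, 0.05, 0.55, 0.10, 0.10, 0.10]
fused, label = fuse_emotions(face, speech, text)
print(label, fused.round(3))  # fused profile peaks at "happy"
```
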
Appears in Collections: Conference Papers
Files in This Item:
| File | Description | Size | Format |
|---|---|---|---|
| 2025_CVMI_PPSahoo_Enhance.pdf | | 2.72 MB | Adobe PDF |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
