Please use this identifier to cite or link to this item: http://hdl.handle.net/2080/5368
Full metadata record
DC Field | Value | Language
dc.contributor.author | Sahoo, Prachyut Priyadarshi | -
dc.contributor.author | Patra, Dipti | -
dc.date.accessioned | 2025-11-19T13:08:08Z | -
dc.date.available | 2025-11-19T13:08:08Z | -
dc.date.issued | 2025-10 | -
dc.identifier.citation | 4th IEEE International Conference on Computer Vision and Machine Intelligence (CVMI), NIT, Rourkela, 12-13 October 2025 | en_US
dc.identifier.uri | http://hdl.handle.net/2080/5368 | -
dc.description | Copyright belongs to the proceedings publisher. | en_US
dc.description.abstract | Multimodal emotion recognition plays a vital role in affective computing, with applications spanning mental health monitoring and adaptive learning systems. This paper introduces EnhanceNet, a comprehensive deep learning framework that fuses three key modalities: facial expression recognition, speech emotion analysis, and spoken language understanding. Each modality employs a specialized neural network architecture, namely a residual CNN with squeeze-and-excitation blocks for facial cues, a CNN-LSTM model for speech signals, and a BiLSTM network for textual transcripts, each trained on diverse, widely used datasets. Unlike conventional late fusion approaches, EnhanceNet adopts an early fusion strategy by averaging the predicted emotion vectors from each modality to form a robust, unified emotional profile: E_final = (1/3)(E_face + E_speech + E_text). This approach leverages the complementary strengths of the individual modalities, mitigating challenges such as facial occlusion, ambiguous vocal intonation, or sparse linguistic content. The fused model achieves an overall accuracy of 78.23%, outperforming unimodal baselines. The system supports real-time facial expression analysis via webcam and asynchronous processing of audio and text inputs, demonstrating robustness to environmental variability such as lighting conditions and background noise. The results position EnhanceNet as a practical foundation for scalable, real-world multimodal emotion recognition systems, with potential impact on next-generation affective technologies. | en_US
dc.subject | Multimodal Emotion Recognition | en_US
dc.subject | Deep Learning | en_US
dc.subject | Facial Expression Recognition | en_US
dc.subject | Speech Emotion Analysis | en_US
dc.subject | Natural Language Processing | en_US
dc.subject | Mental Health | en_US
dc.title | EnhanceNet: Leveraging Facial, Speech, and Textual Cues for Multimodal Emotion Recognition | en_US
dc.type | Article | en_US
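
The fusion step described in the abstract reduces to averaging three per-modality emotion vectors, E_final = (1/3)(E_face + E_speech + E_text). The Python sketch below illustrates only that averaging step; the function name, the emotion label set, and the assumption that every modality model emits a softmax probability vector over the same labels are illustrative assumptions, not details taken from the paper.

    import numpy as np

    # Hypothetical emotion label set shared by all three modality models
    # (the paper does not list its labels; these are placeholders).
    EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

    def fuse_predictions(e_face, e_speech, e_text):
        """Average the per-modality emotion vectors, i.e.
        E_final = (E_face + E_speech + E_text) / 3, and return the
        winning label together with the fused vector."""
        e_final = (np.asarray(e_face) + np.asarray(e_speech) + np.asarray(e_text)) / 3.0
        return EMOTIONS[int(np.argmax(e_final))], e_final

    # Usage with made-up softmax outputs (each vector sums to 1).
    face   = [0.05, 0.02, 0.03, 0.60, 0.20, 0.05, 0.05]
    speech = [0.10, 0.05, 0.05, 0.40, 0.30, 0.05, 0.05]
    text   = [0.05, 0.05, 0.05, 0.50, 0.25, 0.05, 0.05]
    label, scores = fuse_predictions(face, speech, text)
    print(label, scores)  # expected label: "happy"

Because the fused vector is a simple mean, a modality whose output is missing or unreliable (e.g., an occluded face) shifts the result only by one third, which is the robustness argument made in the abstract.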
Appears in Collections: Conference Papers

Files in This Item:
File | Description | Size | Format
2025_CVMI_PPSahoo_Enhance.pdf |  | 2.72 MB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.