Please use this identifier to cite or link to this item:
http://hdl.handle.net/2080/5368

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Sahoo, Prachyut Priyadarshi | - |
| dc.contributor.author | Patra, Dipti | - |
| dc.date.accessioned | 2025-11-19T13:08:08Z | - |
| dc.date.available | 2025-11-19T13:08:08Z | - |
| dc.date.issued | 2025-10 | - |
| dc.identifier.citation | 4th IEEE International Conference on Computer Vision and Machine Intelligence (CVMI), NIT, Rourkela, 12-13 October 2025 | en_US |
| dc.identifier.uri | http://hdl.handle.net/2080/5368 | - |
| dc.description | Copyright belongs to the proceeding publisher. | en_US |
| dc.description.abstract | Multimodal emotion recognition plays a vital role in affective computing, with applications spanning mental health monitoring and adaptive learning systems. This paper introduces EnhanceNet, a comprehensive deep learning framework that fuses three key modalities: facial expression recognition, speech emotion analysis, and spoken language understanding. Each modality employs a specialized neural network architecture: a residual CNN with squeeze-and-excitation blocks for facial cues, a CNN-LSTM model for speech signals, and a BiLSTM network for textual transcripts, each trained on diverse, widely used datasets. Unlike conventional late fusion approaches, EnhanceNet adopts an early fusion strategy by averaging the predicted emotion vectors from each modality to form a robust, unified emotional profile: E_final = (1/3)(E_face + E_speech + E_text) (a minimal sketch of this averaging step follows the metadata table). This approach leverages the complementary strengths of the individual modalities, mitigating challenges such as facial occlusion, ambiguous vocal intonation, or sparse linguistic content. The fused model achieves an overall accuracy of 78.23%, outperforming unimodal baselines. The system supports real-time facial expression analysis via webcam and asynchronous processing of audio and text inputs, demonstrating robustness to environmental variability such as lighting conditions and background noise. The results suggest EnhanceNet as a practical foundation for scalable, real-world multimodal emotion recognition systems, with potential impact on next-generation affective technologies. | en_US |
| dc.subject | Multimodal Emotion Recognition | en_US |
| dc.subject | Deep Learning | en_US |
| dc.subject | Facial Expression Recognition | en_US |
| dc.subject | Speech Emotion Analysis | en_US |
| dc.subject | Natural Language Processing | en_US |
| dc.subject | Mental Health | en_US |
| dc.title | EnhanceNet: Leveraging Facial, Speech, and Textual Cues for Multimodal Emotion Recognition | en_US |
| dc.type | Article | en_US |
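
The abstract's fusion rule, E_final = (1/3)(E_face + E_speech + E_text), averages one emotion vector per modality. The sketch below illustrates only that averaging step; the emotion label order, the function name `fuse_emotions`, and the assumption that each modality emits a probability vector over the same classes are illustrative assumptions, not details taken from this record.

```python
import numpy as np

# Hypothetical label order; the record does not list the emotion classes EnhanceNet uses.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def fuse_emotions(e_face, e_speech, e_text):
    """Average the per-modality emotion vectors into one fused profile.

    Follows E_final = (1/3)(E_face + E_speech + E_text) from the abstract,
    assuming each input is a probability vector over the same label order.
    """
    e_face, e_speech, e_text = (np.asarray(v, dtype=float) for v in (e_face, e_speech, e_text))
    e_final = (e_face + e_speech + e_text) / 3.0
    return e_final, EMOTIONS[int(np.argmax(e_final))]

# Example: the face model is uncertain (e.g. partial occlusion), but speech and text agree on "happy".
face   = [0.15, 0.05, 0.10, 0.20, 0.15, 0.15, 0.20]
speech = [0.05, 0.02, 0.03, 0.60, 0.10, 0.10, 0.10]
text   = [0.05, 0.05, 0.05, 0.55, 0.10, 0.10, 0.10]
fused, label = fuse_emotions(face, speech, text)
print(label, fused.round(3))  # fused profile peaks at "happy"
```
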
Appears in Collections: Conference Papers
Files in This Item:
| File | Description | Size | Format |
|---|---|---|---|
| 2025_CVMI_PPSahoo_Enhance.pdf | | 2.72 MB | Adobe PDF |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
