A sequential mixing fusion network for enhanced feature representations in multimodal sentiment analysis
Multimodal sentiment analysis exploits multiple modalities to understand a user's sentiment state from video content. Recent work in this area integrates features derived from different modalities. However, current multimodal sentiment datasets are typically small and offer limited diversity of cross-modal interactions, so simple feature fusion mechanisms can lead to modality dependence and model overfitting. Consequently, how to augment diverse cross-modal samples and use non-verbal modalities to dynamically enhance text feature representations is still under-explored. In this paper, we propose a sequential mixing fusion network to tackle this research challenge. Using speech text content as the primary source, we design a sequential fusion strategy that maximises the expressiveness of text features enhanced by auxiliary modalities, namely facial movements and audio features, together with a random feature-level mixing algorithm that augments diverse cross-modal interactions. Experimental results on three benchmark datasets show that our proposed approach significantly outperforms current state-of-the-art methods, while demonstrating strong robustness when a modality is missing.
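The abstract gives no implementation details, so the following is a minimal, illustrative sketch of one plausible reading of the two ideas it names: a sequential fusion stage in which text features are refined first by audio and then by visual (facial-movement) features via cross-attention, and a mixup-style random feature-level mixing step applied to the auxiliary modalities during training. All class names, dimensions, and hyper-parameters here (e.g. SequentialMixingFusion, dim=128, mix_alpha=0.2) are assumptions for illustration and are not taken from the paper.

```python
# Illustrative sketch only; NOT the authors' released code or exact method.
import torch
import torch.nn as nn


class SequentialMixingFusion(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 4, mix_alpha: float = 0.2):
        super().__init__()
        # Text is treated as the primary modality; audio and visual features act
        # as auxiliary sources that sequentially refine the text representation.
        self.text_audio_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_visual_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mix_alpha = mix_alpha
        self.classifier = nn.Linear(dim, 1)  # e.g. a sentiment-intensity head

    def feature_mix(self, x: torch.Tensor) -> torch.Tensor:
        """Mixup-style interpolation of features across the batch (training only).
        How the paper handles the corresponding targets is not shown here."""
        if not self.training:
            return x
        lam = torch.distributions.Beta(self.mix_alpha, self.mix_alpha).sample().to(x.device)
        perm = torch.randperm(x.size(0), device=x.device)
        return lam * x + (1.0 - lam) * x[perm]

    def forward(self, text: torch.Tensor, audio: torch.Tensor, visual: torch.Tensor):
        # Randomly mix auxiliary features to diversify cross-modal interactions.
        audio, visual = self.feature_mix(audio), self.feature_mix(visual)
        # Stage 1: enhance text features with audio cues.
        h, _ = self.text_audio_attn(query=text, key=audio, value=audio)
        h = self.norm1(text + h)
        # Stage 2: further enhance with visual (facial-movement) cues.
        v, _ = self.text_visual_attn(query=h, key=visual, value=visual)
        h = self.norm2(h + v)
        # Pool over the sequence and predict sentiment.
        return self.classifier(h.mean(dim=1))


if __name__ == "__main__":
    B, T, D = 8, 20, 128  # batch, sequence length, feature dimension (assumed)
    model = SequentialMixingFusion(dim=D)
    out = model(torch.randn(B, T, D), torch.randn(B, T, D), torch.randn(B, T, D))
    print(out.shape)  # torch.Size([8, 1])
```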
Funding
Dalian Major Projects of Basic Research [2023JJ11CG002]
111 Project [D23006]
National Foreign Expert Project of China [D20240244]
Interdisciplinary Research Project of Dalian University [DLUXK-2024-YB-007]
Scientific Research Foundation of Education Department of Liaoning Province grant [LJKMZ20221839, JYTMS20230379]
History
School
- Science
Published in
Knowledge-Based Systems
Volume
320
Publisher
Elsevier B.V.
Version
- AM (Accepted Manuscript)
Rights holder
© Elsevier B.V.
Publisher statement
This manuscript version is made available under the CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/by-nc-nd/4.0/
Publication date
2025-05-01
Copyright date
2025
ISSN
0950-7051
eISSN
1872-7409
Publisher version
Language
- en