Exploring data-efficient multi-modal learning in computer vision
With the advancement of deep learning, significant progress has been made in various downstream applications, particularly in visual tasks. However, as task requirements diversify, single-modal approaches are no longer sufficient to address the demands of modern real-world applications. Consequently, research on multi-modal fusion has become a prominent direction.
Despite its potential, incorporating multiple modalities introduces two major challenges. First, at inference time in practical downstream applications, additional modalities substantially increase resource consumption, network complexity, and latency, making such solutions unsuitable for real-time scenarios. Second, multi-modal training increases the reliance on high-quality labeled data, which further raises resource costs because labeling is expensive.
This thesis tackles these challenges by proposing data-efficient solutions to address the limitations of multi-modal training. Using four benchmark computer vision tasks, we demonstrate the effectiveness and generalization capability of our proposed methods, particularly through multi-modal fusion approaches and data-efficient training strategies. These methods reduce dependence on labeled data while optimizing multi-modal performance.
Furthermore, with the growing interest in large language models (LLMs), research has shifted from designing task-specific models to creating versatile foundation models capable of addressing diverse downstream tasks. Our work also explores applications of cutting-edge vision-language models (VLMs) and provides insights into the future development of multi-modal foundation models, paving the way for more efficient and generalized multi-modal learning systems.
School
- Science
Department
- Computer Science
Publisher
Loughborough University
Rights holder
© Linglin Jing
Publication date
2025
Notes
A Doctoral Thesis. Submitted in partial fulfilment of the requirements for the award of the degree of Doctor of Philosophy of Loughborough University.
Language
- en
Supervisor(s)
Hui Fang
Qualification name
- PhD
Qualification level
- Doctoral