Loughborough University
Exploring data-efficient multi-modal learning in computer vision

thesis
posted on 2025-06-17, 15:03 authored by Linglin Jing

With the advancement of deep learning, significant progress has been made in various downstream applications, particularly in visual tasks. However, as task requirements diversify, single-modal approaches are no longer sufficient to address the demands of modern real-world applications. Consequently, research on multi-modal fusion has become a prominent direction.

Despite its potential, incorporating multiple modalities introduces two major challenges. First, during inference in practical downstream applications, additional modalities substantially increase resource consumption, network complexity, and latency, making such solutions unsuitable for real-time scenarios. Second, multi-modal training increases the reliance on high-quality labeled data, and because labeling is expensive, this further raises resource costs.

This thesis tackles these challenges by proposing data-efficient solutions to the limitations of multi-modal training. Using four benchmark computer vision tasks, we demonstrate the effectiveness and generalization capability of the proposed methods, in particular the multi-modal fusion approaches and data-efficient training strategies. These methods reduce dependence on labeled data while optimizing multi-modal performance.

Furthermore, with the growing interest in large language models (LLMs), research has shifted from designing task-specific models to creating versatile foundation models capable of addressing diverse downstream tasks. Our work also explores applications of cutting-edge vision-language models (VLMs) and provides insights into the future development of multi-modal foundation models, paving the way for more efficient and generalized multi-modal learning systems.

History

School

  • Science

Department

  • Computer Science

Publisher

Loughborough University

Rights holder

© Linglin Jing

Publication date

2025

Notes

A Doctoral Thesis. Submitted in partial fulfilment of the requirements for the award of the degree of Doctor of Philosophy of Loughborough University.

Language

  • en

Supervisor(s)

Hui Fang

Qualification name

  • PhD

Qualification level

  • Doctoral
