Loughborough University
Visual-semantic embedding networks for cross-modal learning and information retrieval with search engine integration

posted on 2023-11-16, 15:32 authored by Yan Gong

With the expansion of multimedia, cross-modal information retrieval has become increasingly critical, facilitating integrated retrieval across text, images, and videos. Visual-Semantic Embedding (VSE) networks are state-of-the-art algorithms for cross-modal learning and information retrieval; they extract the semantics of vision and language and embed them into a shared latent space. This thesis focuses on image-to-text and text-to-image retrieval using VSE networks. Its motivations and the associated contributions are outlined as follows.
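To make the idea of a shared latent space concrete, the following is a minimal sketch of retrieval once images and descriptions have already been embedded. It assumes cosine similarity as the matching score and uses hand-written toy vectors in place of real encoder outputs; the function names `cosine` and `rank_by_similarity` are illustrative and do not come from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy embeddings in a shared 3-dimensional space (assumed values, not real model outputs).
image_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]

# Text-to-image retrieval: rank all images for description 1.
text_to_image = rank_by_similarity(text_embeddings[1], image_embeddings)

# Image-to-text retrieval: rank all descriptions for image 2.
image_to_text = rank_by_similarity(image_embeddings[2], text_embeddings)
```

Because both modalities live in the same space, the same ranking function serves both retrieval directions; only the roles of query and candidates are swapped.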

1) In training VSE networks, the commonly used hard negatives loss function learns slowly because its learning objective is fixed. This thesis proposes a novel Semantically-Enhanced Hard negatives Loss function (LSEH) that dynamically determines the learning objective based on the optimal similarity scores between irrelevant image–description pairs, improving both the learning speed and the cross-modal information retrieval performance of VSE networks.

2) State-of-the-art VSE network architectures rely on Vision Transformer (ViT) for pre-training but are limited in their ability to focus on image region relations, because their primary focus is on globally matching images with relevant descriptions. This thesis introduces VIsion Transformers with Relation-focus (VITR), a novel network trained with LSEH that enhances ViT by incorporating a local encoder for regional relation reasoning. The results of relational reasoning are fused with pre-trained global knowledge using a novel fusion module, improving cross-modal information retrieval with a focus on relations.

3) Among the applications of VSE networks, integration into search engines has yet to be fully explored. This thesis presents Boon, a cross-modal search engine that integrates VITR to enhance user experience through relation-focused image-text retrieval and multilingual conversations with Generative Pre-trained Transformer (GPT)-3.5. Boon showcases its capabilities using the RefCOCOg real-world image dataset and the ArtUK painting image dataset. Additionally, Boon enables web multimedia search through integration with Google’s API.
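Contribution 1 is framed against the commonly used hard negatives loss. As a point of reference, the sketch below shows that baseline in its widely used fixed-margin, max-of-hinges form; it is not LSEH, whose formulation is not given in this abstract. The function name, the convention that `sim[i][j]` scores image `i` against description `j` with matching pairs on the diagonal, and the margin value of 0.2 are all assumptions made for illustration.

```python
def hard_negatives_loss(sim, margin=0.2):
    """Fixed-margin hard negatives loss over a square similarity matrix.

    sim[i][j] is the similarity of image i and description j; the pair (i, i)
    is the relevant one. Requires at least two pairs so a negative exists.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        positive = sim[i][i]
        # Hardest irrelevant description for image i (highest-scoring off-diagonal in row i).
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        # Hardest irrelevant image for description i (highest-scoring off-diagonal in column i).
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        # The margin is a constant, which is the fixed learning objective.
        total += max(0.0, margin + hardest_text - positive)
        total += max(0.0, margin + hardest_image - positive)
    return total


# Toy batch of two image-description pairs (assumed scores).
toy_sim = [
    [0.9, 0.8],
    [0.1, 0.7],
]
toy_loss = hard_negatives_loss(toy_sim)
```

In this form the only term that does not change during training is `margin`; a loss that adapts its objective, as the abstract describes for LSEH, would replace that constant with a quantity derived from the similarity scores of irrelevant pairs.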

Future work will explore VSE networks for unsupervised learning and text-video retrieval, and the use of Large Language Models (LLMs) for efficient text embedding. It will also focus on integrating image-to-image retrieval within cross-modal search engines.



  • Science


  • Computer Science


Loughborough University

Rights holder

©Yan Gong

Publication date



A Doctoral Thesis. Submitted in partial fulfilment of the requirements for the award of Doctor of Philosophy of Loughborough University


  • en


Georgina Cosma

Qualification name

  • PhD

Qualification level

  • Doctoral

This submission includes a signed certificate in addition to the thesis file(s)

  • I have submitted a signed certificate