
AI-based robust and secure speech signal processing in real-world communication scenarios: From signal restoration to deepfake detection

Thesis posted on 2025-10-16, 13:34, authored by Haohan Shi
<p dir="ltr">In recent years, deep learning has led to remarkable advances in speech signal processing, enabling significant improvements in research topics such as speech enhancement, recognition, and synthesis. Despite the achievements, real-world speech communication systems continue to face a wide range of challenges. The first significant challenge arises from the degradations of audio signals introduced by practical transmission environments, such as codec compression, packet losses, jitter, and channel distortions, which severely impact audio quality, such as clarity and intelligibility. The second challenge is the growing threat posed by highly convincing deepfake audio technologies, driven by advances in artificial intelligence and sophisticated speech synthesis applications. These challenges underscore the need for robust remedies in Speech Inpainting (SI), Packet Loss Concealment (PLC), and Audio Deepfake Detection (ADD). This thesis focuses on providing innovative solutions in these three research directions by introducing a series of deep learning-based, robust, effective, efficient, and scalable frameworks, which are applicable to real-world communication scenarios.</p><p dir="ltr">First, a speech inpainting framework is proposed based on multi-layer Long Short-Term Memory (LSTM) networks, which formulates the recovery of missing audio segments in end-to-end telephony communications as a time-series prediction task. The models are trained and evaluated on diverse datasets with different numbers of speakers. Extensive experimental results demonstrate that the proposed method can restore up to one second of missing speech in conversations using time-domain features only, achieving Mean Opinion Scores (MOS) of 3-4 for speech gaps due to missing speech frames under 500ms and 2-3 for gaps between 500ms and 1s. It effectively reconstructs the temporal envelope and continuity of the signal; low-frequency spectral structure (smaller than 2.0 kHz) is recovered well, while similarity diminishes in the 2.0-8.0 kHz band. These results demonstrate the practicality of high-quality speech inpainting under real-world communication conditions.</p><p dir="ltr">Second, the thesis proposes a novel PLC approach for linear prediction-based speech codecs, which include the vast majority of contemporary audio/voice/speech codecs used in mobile communication systems and Voice over Internet Protocol applications. The method utilises attention mechanisms and an LSTM to reconstruct the Linear Predictive Coefficients of the coded speech. Specifically, a Multiscale Trend-aware Multi-head Self-attention architecture is designed to capture the long-term global correlations and short-term local dependencies of speech signals across different time scales, enabling effective global and local receptive fields while reconstructing lost speech packets. A new multiscale feature fusion method, Stack Fusion, is proposed to further enhance the reconstruction performance. It assigns higher weights to speech frames closer to the lost packet(s) and lower weights to those further away, enabling the effective integration of global and local features across various time scales. Additionally, a tailored loss function is proposed to guide model training by balancing the numerical precision, structural periodicity, and perceptual fidelity. 
Second, the thesis proposes a novel PLC approach for linear prediction-based speech codecs, which include the vast majority of contemporary audio, voice, and speech codecs used in mobile communication systems and Voice over Internet Protocol applications. The method utilises attention mechanisms and an LSTM to reconstruct the Linear Predictive Coefficients of the coded speech. Specifically, a Multiscale Trend-aware Multi-head Self-attention architecture is designed to capture the long-term global correlations and short-term local dependencies of speech signals across different time scales, providing effective global and local receptive fields while reconstructing lost speech packets. A new multiscale feature fusion method, Stack Fusion, further enhances reconstruction performance: it assigns higher weights to speech frames closer to the lost packet(s) and lower weights to those further away, enabling the effective integration of global and local features across time scales (sketched in the first code example below). Additionally, a tailored loss function guides model training by balancing numerical precision, structural periodicity, and perceptual fidelity. Five metrics are used to evaluate quality: Perceptual Evaluation of Speech Quality in Wideband (PESQ-WB), Short-Time Objective Intelligibility (STOI), Log-Spectral Distance (LSD), Packet Loss Concealment Mean Opinion Score (PLCMOS), and Word Error Rate (WER). Extensive experiments demonstrate that the proposed method outperforms the best-performing State-Of-The-Art (SOTA) benchmarks, achieving average improvements of 1.77 (PESQ-WB), 4.06% (STOI), 50.00% (LSD), 0.48 (PLCMOS), and 0.08 (WER), highlighting significant gains in perceived quality, intelligibility, spectral consistency, and speech recognition accuracy for the reconstructed signals. Moreover, the model requires only 0.16 Giga Multiply-Accumulate operations per second and an average of 2.98 ms to infer a 20 ms speech frame, underscoring its strong potential for high-quality real-time speech communication.

Third, the thesis investigates the vulnerability and robustness of existing ADD methods in real-world communication scenarios. A new benchmark and a new test dataset, ADD-C, are proposed to assess the robustness of ADD methods under varying codec compression and packet loss conditions. Extensive experiments reveal that current SOTA methods suffer substantial performance drops under such conditions, and a novel data augmentation strategy is proposed to enhance their generalisation and robustness. Building on these findings, the thesis further proposes the first unified framework for robust ADD under such degradations, designed to accommodate multiple types of Time-Frequency (TF) representations. The core of the framework is a new Multi-Granularity Adaptive Attention (MGAA) architecture, which employs a set of customisable multi-scale attention heads to capture both global and local receptive fields across varying TF granularities. An adaptive fusion mechanism then adjusts and fuses these attention branches according to the saliency of TF regions, allowing the model to dynamically reallocate its focus to the characteristics of the degradation and thereby localise and amplify subtle forgery traces (sketched in the last code example below). Extensive experiments demonstrate that the proposed framework consistently outperforms SOTA baselines across various real-world communication degradations, spanning six speech codec compression settings and five packet-loss levels. Additionally, a comparative t-SNE analysis reveals that the MGAA-enhanced features significantly improve the separability between real and fake audio classes, sharpening the decision boundaries. These results highlight the robustness and practical deployment potential of the framework in real-world communication environments.
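The proximity weighting behind Stack Fusion, described in the PLC paragraph above, can be sketched as follows. The exponential decay, the relative lost-packet position, and all tensor shapes are assumptions made for illustration; the thesis's exact formulation is not reproduced here.

```python
import torch

def proximity_weights(n_frames, lost_pos, decay=0.5):
    """Hypothetical proximity weighting: frames nearer the lost packet
    (given as a relative position in [0, 1]) receive larger weights."""
    idx = torch.arange(n_frames, dtype=torch.float32)
    dist = (idx - lost_pos * (n_frames - 1)).abs()
    w = torch.exp(-decay * dist)          # exponential decay with distance
    return w / w.sum()                    # normalise so the weights sum to 1

def stack_fusion(scale_feats, lost_pos, decay=0.5):
    """Fuse multiscale features (one (n_frames, dim) tensor per time scale)
    into a single vector: weight each scale's frames by proximity to the
    lost packet, then average across scales. An illustrative stand-in for
    the thesis's Stack Fusion, not its exact formulation."""
    fused = []
    for feats in scale_feats:
        w = proximity_weights(feats.shape[0], lost_pos, decay)
        fused.append((w.unsqueeze(1) * feats).sum(dim=0))   # (dim,)
    return torch.stack(fused).mean(dim=0)                   # (dim,)
```

For example, `stack_fusion([torch.randn(50, 64), torch.randn(25, 64)], lost_pos=1.0)` fuses features from two time scales when the lost packet sits at the end of the available context.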
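The objective metrics listed above can be reproduced with common open-source tools; below is a sketch using the `pesq` and `pystoi` packages together with a standard log-spectral-distance formula. The file names are placeholders, PLCMOS and WER require their own models and are omitted, and the STFT settings are assumptions rather than the thesis's evaluation setup.

```python
import numpy as np
import soundfile as sf
from pesq import pesq          # pip install pesq
from pystoi import stoi        # pip install pystoi

# Placeholder file names; both signals are assumed mono and equal length.
ref, fs = sf.read("reference.wav")
deg, _ = sf.read("reconstructed.wav")

print("PESQ-WB:", pesq(fs, ref, deg, "wb"))   # wideband mode expects fs = 16000
print("STOI:   ", stoi(ref, deg, fs))

def log_spectral_distance(a, b, n_fft=512, hop=256, eps=1e-10):
    """Frame-averaged log-spectral distance in dB (a common definition;
    the thesis may use different STFT settings)."""
    def power_spec(x):
        frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
        return np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    A, B = power_spec(a) + eps, power_spec(b) + eps
    return np.mean(np.sqrt(np.mean((10 * np.log10(A / B)) ** 2, axis=1)))

print("LSD:    ", log_spectral_distance(ref, deg))
```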

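Finally, a minimal sketch of the multi-granularity idea behind the MGAA architecture described in the abstract: several attention branches operate at different time-frequency granularities, and a saliency-based gate adaptively fuses them. All module choices, pooling, and sizes here are illustrative assumptions, not the thesis's actual design.

```python
import torch
import torch.nn as nn

class MultiGranularityAttention(nn.Module):
    """Illustrative take on multi-granularity adaptive attention: each
    branch attends over TF tokens pooled at a different scale, and a
    saliency-based gate fuses the branches."""

    def __init__(self, dim=256, scales=(1, 2, 4), heads=4):
        super().__init__()
        self.pools = nn.ModuleList(nn.AvgPool1d(s, stride=s) for s in scales)
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in scales
        )
        self.gate = nn.Linear(dim, 1)  # scores each branch's saliency

    def forward(self, x):
        # x: (batch, seq, dim) flattened time-frequency tokens
        branches = []
        for pool, attn in zip(self.pools, self.attns):
            xs = pool(x.transpose(1, 2)).transpose(1, 2)   # coarser granularity
            out, _ = attn(xs, xs, xs)                      # self-attention branch
            branches.append(out.mean(dim=1))               # (batch, dim) summary
        stack = torch.stack(branches, dim=1)               # (batch, n_scales, dim)
        w = torch.softmax(self.gate(stack), dim=1)         # adaptive fusion weights
        return (w * stack).sum(dim=1)                      # (batch, dim)
```

The softmax gate is one simple way to make the fusion weights depend on branch saliency; the thesis's adaptive fusion mechanism may differ.
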
Funding

Loughborough University

China Scholarship Council

History

School

  • Loughborough University, London

Publisher

Loughborough University

Rights holder

© Haohan Shi

Publication date

2025

Notes

A Doctoral Thesis. Submitted in partial fulfilment of the requirements for the award of the degree of Doctor of Philosophy of Loughborough University.

Language

  • en

Supervisor(s)

X Shi; S Dogan

Qualification name

  • PhD

Qualification level

  • Doctoral

This submission includes a signed certificate in addition to the thesis file(s)

  • I have submitted a signed certificate
