Speech inpainting based on multi-layer long short-term memory networks

Shi, Haohan; Shi, Xiyu; Dogan, Safak

journal contribution

posted on 2024-03-01, 11:02 authored by Haohan Shi, Xiyu Shi, Safak Dogan

Audio inpainting plays an important role in addressing incomplete, damaged, or missing audio signals, contributing to improved quality of service and overall user experience in multimedia communications over the Internet and mobile networks. This paper presents a solution for speech inpainting using Long Short-Term Memory (LSTM) networks. Speech inpainting is a restoration task in which the missing parts of a speech signal are recovered from the preceding information in the time domain; the lost or corrupted parts of the signal are referred to as gaps. We treat speech inpainting as a time-series prediction problem. To address it, we designed multi-layer LSTM networks and trained them on different speech datasets. Our study investigates the inpainting performance of the proposed models on different datasets and with varying numbers of LSTM layers, and explores the effect of multi-layer LSTM networks on the prediction of speech samples in terms of perceived audio quality. The inpainted speech quality is evaluated through the Mean Opinion Score (MOS) and a frequency analysis of the spectrogram. The proposed multi-layer LSTM models are able to restore gaps of up to 1 s with high perceptual audio quality using features captured from the time domain only. Specifically, for gap lengths under 500 ms, the MOS can reach 3 to 4, and for gap lengths between 500 ms and 1 s, the MOS can reach 2 to 3. In the time domain, the proposed models can proficiently restore the envelope and trend of lost speech signals. In the frequency domain, the proposed models restore spectrogram blocks with higher similarity to the original signals at frequencies below 2.0 kHz and comparatively lower similarity at frequencies in the range of 2.0 kHz to 8.0 kHz.
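The time-series framing described in the abstract can be illustrated with a minimal sketch: a gap is filled one sample at a time, and each predicted sample is appended to the context used for the next prediction. This is not the authors' implementation. The names `inpaint_gap`, `context_len`, and `predict_next` are hypothetical, and `linear_extrapolation` is a deliberately trivial stand-in for a trained multi-layer LSTM, used only so the loop can be run without any model or dataset.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: maps a window of past samples to the next sample.
# In the paper this role is played by a trained multi-layer LSTM.
Predictor = Callable[[Sequence[float]], float]


def inpaint_gap(
    signal: Sequence[float],
    gap_start: int,
    gap_len: int,
    context_len: int,
    predict_next: Predictor,
) -> List[float]:
    """Fill signal[gap_start:gap_start + gap_len] using only preceding samples."""
    if gap_start < context_len:
        raise ValueError("not enough samples before the gap to form a context window")
    if gap_start + gap_len > len(signal):
        raise ValueError("gap extends past the end of the signal")

    restored = list(signal)  # copy so the input is left untouched
    for t in range(gap_start, gap_start + gap_len):
        # The window slides forward and includes samples predicted earlier,
        # so the gap is restored autoregressively.
        window = restored[t - context_len:t]
        restored[t] = predict_next(window)
    return restored


def linear_extrapolation(window: Sequence[float]) -> float:
    """Stand-in predictor (assumption, not an LSTM): continue the last slope."""
    return 2 * window[-1] - window[-2]


if __name__ == "__main__":
    damaged = [0.0, 1.0, 2.0, 3.0, 0.0, 0.0, 6.0]  # samples 4 and 5 are lost
    print(inpaint_gap(damaged, gap_start=4, gap_len=2, context_len=2,
                      predict_next=linear_extrapolation))
```

Because every prediction feeds the next one, errors can accumulate as the gap grows, which is consistent with the lower MOS the abstract reports for gaps between 500 ms and 1 s.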

Funding

Loughborough University (Grant No. GS1016)

China Scholarship Council (Grant No. 202208060237)

History

School

Loughborough University, London

Published in

Future Internet

Volume

16

Issue

2

Publisher

MDPI

Version

VoR (Version of Record)

Publisher statement

This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Acceptance date

2024-02-10

Publication date

2024-02-17

Copyright date

2024

DOI

https://doi.org/10.3390/fi16020063

eISSN

1999-5903

Publisher version

https://doi.org/10.3390/fi16020063

Language

en

Depositor

Deposit date: 28 February 2024

Article number

63

Keywords

Speech signal processing; Speech inpainting; Audio inpainting; Long short-term memory; Deep learning

Licence

CC BY 4.0
