A motion-based approach for audio-visual automatic speech recognition

Ahmad, Nasir

Thesis-2011-Ahmad.pdf (4.27 MB)

Version 2 2020-01-07, 11:27

Version 1 2011-07-04, 11:11

thesis

posted on 2020-01-07, 11:27 authored by Nasir Ahmad

The research presented in this thesis introduces novel approaches to both visual region-of-interest extraction and visual feature extraction for use in audio-visual automatic speech recognition. In particular, the speaker's movement during speech is used to isolate the mouth region in video sequences, and motion-based features obtained from this region provide new visual features for audio-visual automatic speech recognition. The proposed mouth region extraction approach is shown to give superior performance compared with existing colour-based lip segmentation methods. The new features are obtained from three separate representations of motion in the region of interest, namely the difference in luminance between successive images, block-matching-based motion vectors and optical flow. The new visual features are found to improve visual-only and audio-visual speech recognition performance when compared with the commonly used appearance-based feature methods. In addition, a novel approach is proposed for visual feature extraction from either the discrete cosine transform or the discrete wavelet transform representation of the speaker's mouth region. In this work, the image transform is explored from a new viewpoint of data discrimination, in contrast to the more conventional data preservation viewpoint. The main finding is that audio-visual automatic speech recognition systems using the new features, extracted from frequency bands selected according to their discriminatory abilities, generally outperform those using features designed for data preservation. To establish the noise robustness of the new features, their performance has been studied in the presence of a range of different types of noise and at various signal-to-noise ratios. In these experiments, the audio-visual automatic speech recognition systems based on the new approaches were found to give superior performance both to audio-visual systems using appearance-based features and to audio-only speech recognition systems.
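
As an illustration only, the first of the three motion representations named in the abstract, the difference in luminance between successive images, might be sketched as below. This is a minimal sketch under stated assumptions and not the thesis's implementation: the function names `luminance_difference` and `motion_energy`, the use of an absolute difference, and the scalar summary are all hypothetical, and frames are assumed to be equally sized grids of luminance values.

```python
from typing import List

# A frame is a row-major grid of luminance values (assumed representation).
Frame = List[List[float]]


def luminance_difference(prev: Frame, curr: Frame) -> Frame:
    """Absolute per-pixel luminance change between two equally sized frames."""
    same_rows = len(prev) == len(curr)
    if not same_rows or any(len(p) != len(c) for p, c in zip(prev, curr)):
        raise ValueError("frames must have the same shape")
    return [
        [abs(c - p) for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def motion_energy(diff: Frame) -> float:
    """Mean absolute difference: one hypothetical scalar summary of motion."""
    count = sum(len(row) for row in diff)
    return sum(sum(row) for row in diff) / count if count else 0.0


# Two tiny made-up frames of a mouth region, purely for demonstration.
prev_frame = [[10, 10, 10], [10, 10, 10]]
curr_frame = [[10, 12, 10], [10, 10, 14]]
diff = luminance_difference(prev_frame, curr_frame)
energy = motion_energy(diff)
```

Pixels that do not change between frames contribute zero, so the difference image highlights only regions that moved, which is the property the abstract relies on for locating the mouth during speech.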

History

School

Mechanical, Electrical and Manufacturing Engineering

Publisher

Loughborough University

Rights holder

Publication date

2011

Notes

A Doctoral Thesis. Submitted in partial fulfillment of the requirements for the award of Doctor of Philosophy of Loughborough University.

EThOS Persistent ID

uk.bl.ethos.540914

Language

en

Supervisor(s)

David J. Mulvaney ; Sekharjit Datta

Qualification name

PhD

Qualification level

Doctoral

Keywords

Automatic speech recognition (ASR) ; Audio-visual automatic speech recognition (AVASR) ; Bi-modal speech recognition ; Visual front-end ; Features extraction ; Visual ROI ; Speech dynamics ; Mechanical Engineering not elsewhere classified

Licence

CC BY-NC-ND 4.0
