Acceleration of Sphinx 3 for implementation in embedded systems

Hu, Sunyi

SunyiHu-PhDThesis-2011.pdf (10.71 MB)

Version 2 2020-01-29, 16:21

Version 1 2011-12-20, 09:29

thesis

posted on 2020-01-29, 16:21 authored by Sunyi Hu

This thesis presents a fully pipelined and parameterised parallel hardware implementation of a large vocabulary, user-independent and continuous speech recognition system for use in mobile applications. Algorithm acceleration is achieved by realising in hardware the most time-consuming components of the speech recognition system. By adopting a parallel solution, the necessary calculations can be completed in a sufficiently short elapsed time for embedded target systems. Sphinx 3 is identified as an appropriate speech recognition system for this work and is profiled to determine the most time-consuming parts of the code. As these parts of the code rely on floating point operations, which are not well suited to high-performance, low-power execution on embedded systems, the calculations have been converted to scaled integer operations. It is verified using the AN4, RM1 and TIMIT speech databases that the scaled integer version of the speech recognition system achieves a similar word error rate to the original floating point version, while taking less than 8% of the calculation time used by the original version. The scaled integer version of the speech recognition system is redesigned in VHDL for parallel implementation in electronic hardware. The designs of a calculation module and a data module are described. Both can be configured according to the number of parallel units, and the data module can additionally be configured according to the total numbers of feature vectors and senones used in the speech representation. The hardware designs are synthesised to a range of FPGAs, and the results show that the larger Virtex-7 devices can hold several thousand senones, which is sufficient for most recognition tasks. Hardware designs with different numbers of parallel calculation units are simulated at both the behavioural and platform-based levels, and the resulting implementations are able to operate in real time.
The results show that the hardware implementation, even with only one calculation unit, can perform the same calculations almost 80 times faster than a modern embedded microprocessor, even when operating at only one fifth of the clock frequency. With larger numbers of parallel calculation units, the whole design can operate at even lower clock frequencies, saving power while maintaining a rapid calculation speed. The hardware designs are also implemented on a physical system comprising both an FPGA and a microprocessor board to demonstrate the operational capabilities of a full system.
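
The conversion from floating point to scaled integer arithmetic mentioned in the abstract can be illustrated with a minimal sketch. This is not code from the thesis: the function names, the 12-bit scale factor and the sample values are all hypothetical, chosen only to show the general idea of replacing a floating point weighted squared distance (the kind of term that appears in Gaussian scoring) with integer multiplies and shifts.

```python
# Illustrative only: a generic scaled integer (fixed-point) calculation.
# SCALE_BITS and all names below are assumptions, not taken from the thesis.
SCALE_BITS = 12
SCALE = 1 << SCALE_BITS  # a real value v is stored as round(v * 4096)


def to_fixed(value):
    """Convert a float to its scaled integer representation."""
    return int(round(value * SCALE))


def weighted_distance_float(x, mean, inv_var):
    """Reference result: sum of (x - mean)^2 * inv_var in floating point."""
    return sum((xi - mi) ** 2 * wi for xi, mi, wi in zip(x, mean, inv_var))


def weighted_distance_fixed(x_q, mean_q, inv_var_q):
    """Same calculation using only integer operations."""
    acc = 0
    for xi, mi, wi in zip(x_q, mean_q, inv_var_q):
        diff = xi - mi                    # still scaled by 2**SCALE_BITS
        sq = (diff * diff) >> SCALE_BITS  # rescale after the multiply
        acc += (sq * wi) >> SCALE_BITS    # rescale after the weighting
    return acc


# Hypothetical three-dimensional feature vector and model parameters.
x = [0.5, -1.25, 2.0]
mean = [0.25, -1.0, 1.5]
inv_var = [2.0, 0.5, 1.0]

float_result = weighted_distance_float(x, mean, inv_var)
fixed_result = weighted_distance_fixed(
    [to_fixed(v) for v in x],
    [to_fixed(v) for v in mean],
    [to_fixed(v) for v in inv_var],
)
```

Dividing `fixed_result` by `SCALE` recovers an approximation of `float_result`. In a sketch like this the choice of scale factor trades precision against the integer width needed to avoid overflow in the intermediate products.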

School

Mechanical, Electrical and Manufacturing Engineering

Publisher

Loughborough University

Publication date

2011

Notes

A Doctoral Thesis. Submitted in partial fulfilment of the requirements for the award of Doctor of Philosophy of Loughborough University.

EThOS Persistent ID

uk.bl.ethos.547319

Language

en

Supervisor(s)

David Mulvaney ; Sekharjit Datta

Qualification name

PhD

Qualification level

Doctoral

Categories

Mechanical Engineering not elsewhere classified

Keywords

untagged

Licence

CC BY-NC-ND 4.0
