posted on 2013-04-11, 13:14 authored by Stephen Haskey
As computers begin to pervade aspects of our everyday lives, so the problem of
communication from man-to-machine becomes increasingly evident. In recent years, there
has been concerted interest in speech recognition, which allows a user to communicate freely with
a machine. However, this deceptively simple means for exchanging information is in fact
extremely complex. A single utterance can contain a wealth of varied information concerning
the speaker's gender, age, dialect and mood. Numerous subtle differences such as intonation,
rhythm and stress further add to the complexity, increasing both inter- and intra-speaker
variability. These differences pose an enormous problem, especially for a
multi-user system since it is impractical to train for every variation of every utterance from
every speaker. Consequently, adaptation is of great importance, allowing a system with
limited knowledge to adapt dynamically towards a new speaker's characteristics. A new
modified artificial neural network (ANN) was proposed, incorporating One-Class-One-Network
(OCON) subnet architectures connected via a common front-end adaptation layer.
Using vowel phonemes from the TIMIT speech database, the adaptation was concentrated on
neurons within the front-end layer, resulting in only information common to all classes,
primarily speaker characteristics, being adapted. In addition, this prevented new utterances
from interfering with phoneme-specific information in the corresponding OCON subnets.
Hence a more efficient adaptation procedure was created which, after adaptation towards a
single class, also aided in the recognition of the remaining classes within the network.
Compared with a conventional multi-layer perceptron network, results for inter- and intra-speaker
adaptation showed an equally marked improvement for the recognition of adapted
phonemes during both full neuron and front-layer neuron adaptation within the new modified
architecture. When testing the effects of adaptation on the remaining unadapted vowel
phonemes, the modified architecture (allowing only the neurons in the front-end layer to
adapt) yielded better results than the modified architecture allowing full neuron adaptation.
These results highlighted the storing of speaker information, common to all classes, in the
front-end layer allowing efficient inter- and intra-speaker dynamic adaptation.
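The idea of adapting only a shared front-end layer while leaving the per-class subnets untouched can be illustrated with a small sketch. This is not the thesis implementation: the class name OconNet, the method adapt_front, the layer sizes, the learning rate and the reduction of each OCON subnet to a single sigmoid unit are all assumptions made for illustration, and the input vector is arbitrary data, not TIMIT features.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class OconNet:
    """Toy network: one shared front-end layer feeding one subnet per class.

    Assumption: each OCON subnet is reduced to a single sigmoid unit so the
    example stays short; a real subnet would have its own hidden layer.
    """

    def __init__(self, n_in, n_front, n_classes, seed=0):
        rng = random.Random(seed)
        # Shared front-end weights (the only part that is adapted below).
        self.front = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                      for _ in range(n_front)]
        # One weight vector per class subnet (kept frozen during adaptation).
        self.subnets = [[rng.uniform(-0.5, 0.5) for _ in range(n_front)]
                        for _ in range(n_classes)]

    def forward(self, x):
        # Front-end activations, common to every class.
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)))
             for row in self.front]
        # One output per class subnet.
        y = [sigmoid(sum(v * hj for v, hj in zip(sub, h)))
             for sub in self.subnets]
        return h, y

    def adapt_front(self, x, class_idx, target=1.0, lr=0.1):
        """One gradient step on the squared error of a single class output,
        updating the front-end weights only."""
        h, y = self.forward(x)
        out = y[class_idx]
        delta_out = (out - target) * out * (1.0 - out)
        sub = self.subnets[class_idx]
        for j, row in enumerate(self.front):
            delta_h = delta_out * sub[j] * (1.0 - h[j] ** 2)
            for i in range(len(row)):
                row[i] -= lr * delta_h * x[i]


if __name__ == "__main__":
    net = OconNet(n_in=3, n_front=4, n_classes=3, seed=1)
    x = [0.2, -0.4, 0.7]  # arbitrary feature vector for illustration
    print("before:", net.forward(x)[1])
    for _ in range(20):
        net.adapt_front(x, class_idx=0)
    print("after: ", net.forward(x)[1])
```

Because the update passes through the shared layer, adapting towards one class also shifts the outputs of the other classes, while their subnet weights stay exactly as they were. Whether that shift helps or harms the unadapted classes depends on the data; the sketch only shows the mechanism.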
History
School
Mechanical, Electrical and Manufacturing Engineering