Robust variational Bayesian clustering for underdetermined speech separation
2016-11-14T11:35:39Z (GMT) by
The main focus of this thesis is the enhancement of the statistical framework employed for underdetermined T-F masking blind separation of speech. While humans are capable of extracting a speech signal of interest in the presence of other interference and noise; actual speech recognition systems and hearing aids cannot match this psychoacoustic ability. They perform well in noise and reverberant free environments but suffer in realistic environments. Time-frequency masking algorithms based on computational auditory scene analysis attempt to separate multiple sound sources from only two reverberant stereo mixtures. They essentially rely on the sparsity that binaural cues exhibit in the time-frequency domain to generate masks which extract individual sources from their corresponding spectrogram points to solve the problem of underdetermined convolutive speech separation. Statistically, this can be interpreted as a classical clustering problem. Due to analytical simplicity, a finite mixture of Gaussian distributions is commonly used in T-F masking algorithms for modelling interaural cues. Such a model is however sensitive to outliers, therefore, a robust probabilistic model based on the Student's t-distribution is first proposed to improve the robustness of the statistical framework. This heavy tailed distribution, as compared to the Gaussian distribution, can potentially better capture outlier values and thereby lead to more accurate probabilistic masks for source separation. This non-Gaussian approach is applied to the state-of the-art MESSL algorithm and comparative studies are undertaken to confirm the improved separation quality. A Bayesian clustering framework that can better model uncertainties in reverberant environments is then exploited to replace the conventional expectation-maximization (EM) algorithm within a maximum likelihood estimation (MLE) framework. A variational Bayesian (VB) approach is then applied to the MESSL algorithm to cluster interaural phase differences thereby avoiding the drawbacks of MLE; specifically the probable presence of singularities and experimental results confirm an improvement in the separation performance. Finally, the joint modelling of the interaural phase and level differences and the integration of their non-Gaussian modelling within a variational Bayesian framework, is proposed. This approach combines the advantages of the robust estimation provided by the Student's t-distribution and the robust clustering inherent in the Bayesian approach. In other words, this general framework avoids the difficulties associated with MLE and makes use of the heavy tailed Student's t-distribution to improve the estimation of the soft probabilistic masks at various reverberation times particularly for sources in close proximity. Through an extensive set of simulation studies which compares the proposed approach with other T-F masking algorithms under different scenarios, a significant improvement in terms of objective and subjective performance measures is achieved.