Posted on 2009-12-14, 12:18. Authored by Wenwu Wang, Darren Cosker, Yulia Hicks, Saeid Sanei, Jonathon Chambers
In this paper we investigate the problem of integrating the complementary audio and visual modalities for speech separation. Rather than relying on the independence criteria used in most blind source separation (BSS) systems, we use visual features extracted from a video signal as additional information to optimize the unmixing matrix. We achieve this by using a statistical model characterizing the nonlinear coherence between audio and visual features as a separation criterion for both instantaneous and convolutive mixtures. We obtain the model by applying a Bayesian framework to the fused feature observations from a training corpus. We point out several key existing challenges to the success of the system. Experimental results verify the proposed approach, which outperforms an audio-only separation system in a noisy environment and also provides a solution to the permutation problem.
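To make the idea concrete, below is a minimal illustrative sketch, not the authors' implementation, of using audio-visual coherence to choose an unmixing matrix for an instantaneous two-channel mixture. The synthetic data, the envelope-correlation stand-in for the paper's Bayesian coherence model, and the grid search over whitening-domain rotations are all assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two sources; the "visual" feature is assumed here to
# track source 1's amplitude envelope (e.g., lip opening over time).
n = 4000
s1 = np.sin(2 * np.pi * 0.01 * np.arange(n)) * rng.standard_normal(n)
s2 = rng.laplace(size=n)
S = np.vstack([s1, s2])
visual = np.abs(s1) + 0.1 * rng.standard_normal(n)  # noisy envelope proxy

A = np.array([[1.0, 0.6], [0.4, 1.0]])  # instantaneous mixing matrix
X = A @ S                               # observed two-channel mixture

# Whiten the mixtures so the unmixing search reduces to a 2-D rotation.
Xc = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(Xc))
Z = np.diag(d ** -0.5) @ E.T @ Xc

def av_coherence(y, v):
    # Correlation between the recovered signal's envelope |y| and the
    # visual feature; a simple stand-in for a learned coherence model.
    return np.corrcoef(np.abs(y), v)[0, 1]

# Grid search over rotation angles: pick the unmixing rotation whose
# first output is most coherent with the video. Because the criterion is
# tied to a specific speaker's video, it also fixes the output ordering,
# i.e., it resolves the permutation ambiguity of audio-only BSS.
best = max(
    (np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
     for t in np.linspace(0, np.pi, 180)),
    key=lambda W: av_coherence((W @ Z)[0], visual),
)
Y = best @ Z
print("AV coherence of recovered source 1:", av_coherence(Y[0], visual))

The envelope correlation used above is only a convenient scalar proxy; the paper's criterion is a statistical model of the nonlinear audio-visual coherence learned from training data, and the same selection principle extends to convolutive mixtures applied per frequency bin.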
History
School
Mechanical, Electrical and Manufacturing Engineering
Citation
WANG, W. ... et al., 2005. Video assisted speech source separation. IN: Proceedings of 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2005), Philadelphia, Pennsylvania, USA, 18-23 March, Vol.5, pp. 425-428.