Video-aided model-based source separation in real reverberant rooms

2134/12237
Muhammad Salman Khan, Mohsen Naqvi, Ata ur-Rehman, Wenwu Wang, Jonathon Chambers
Loughborough University, 2013
Source separation; Reverberation; Spatial cues; Expectation-maximization; Time-frequency masking; Mechanical Engineering not elsewhere classified
2013-05-02 08:31:30
Journal contribution
https://repository.lboro.ac.uk/articles/journal_contribution/Video-aided_model-based_source_separation_in_real_reverberant_rooms/9566018

Source separation algorithms that use only audio data can perform poorly when multiple sources or reverberation are present. In this paper we therefore propose a video-aided model-based source separation algorithm for a two-channel reverberant recording in which the sources are assumed static. By exploiting cues from video, we first localize individual speech sources in the enclosure and then estimate their directions. The interaural spatial cues (the interaural phase difference and the interaural level difference) and the mixing vectors are probabilistically modeled. The models make use of the source direction information and are evaluated at discrete time-frequency points. The model parameters are refined with the well-known expectation-maximization (EM) algorithm. The algorithm outputs time-frequency masks that are used to reconstruct the individual sources. Simulation results show that, by utilizing the visual modality, the proposed algorithm can produce better time-frequency masks, thereby giving improved source estimates. We test the proposed algorithm experimentally in different scenarios, compare it with other audio-only and audio-visual algorithms, and achieve improved performance on both synthetic and real data.
We also include dereverberation-based pre-processing in our algorithm in order to suppress the late reverberant components of the observed stereo mixture and further enhance the overall output of the algorithm. This advantage makes our algorithm a suitable candidate for use in under-determined, highly reverberant settings where the performance of other audio-only and audio-visual methods is limited.
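
As a rough illustration of the interaural cues and time-frequency masks described above, the following Python sketch computes the interaural level and phase differences from two complex spectrograms and derives soft masks from the phase cue alone. It is not the authors' implementation. The function names `interaural_cues` and `soft_masks`, the fixed-variance Gaussian phase model, the parameter `sigma`, and the sign convention (a positive delay means the right channel lags the left) are all assumptions made for this example. The per-source delays are taken as known, standing in for the directions estimated from video. The sketch performs a single posterior computation and omits the level-difference model, the mixing-vector model, and the EM parameter refinement.

```python
import numpy as np


def interaural_cues(left, right, eps=1e-12):
    """Return (ILD in dB, IPD in radians) for two complex STFT arrays of equal shape."""
    ratio = (left + eps) / (right + eps)  # interaural spectrogram
    ild = 20.0 * np.log10(np.abs(ratio))  # level difference
    ipd = np.angle(ratio)                 # phase difference, wrapped to (-pi, pi]
    return ild, ipd


def soft_masks(ipd, freqs_hz, delays_s, sigma=0.5):
    """Posterior probability of each source at every time-frequency point.

    ipd      : (F, T) observed interaural phase differences in radians
    freqs_hz : (F,) centre frequency of each bin in Hz
    delays_s : one interaural delay per source in seconds (assumed known)
    sigma    : assumed standard deviation of the phase residual in radians
    """
    log_lik = []
    for tau in delays_s:
        predicted = 2.0 * np.pi * freqs_hz[:, None] * tau
        # Wrap the residual so that phase errors are compared on the circle.
        residual = np.angle(np.exp(1j * (ipd - predicted)))
        log_lik.append(-0.5 * (residual / sigma) ** 2)
    log_lik = np.stack(log_lik)                    # (sources, F, T)
    log_lik -= log_lik.max(axis=0, keepdims=True)  # numerical stability
    post = np.exp(log_lik)
    return post / post.sum(axis=0, keepdims=True)  # masks sum to 1 per point
```

Multiplying one channel's spectrogram by `soft_masks(...)[i]` and inverting the transform would give an estimate of source `i` under these simplified assumptions.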