2134/12237
Muhammad Salman Khan
Muhammad Salman
Khan
Mohsen Naqvi
Mohsen
Naqvi
Ata ur-Rehman
Ata
ur-Rehman
Wenwu Wang
Wenwu
Wang
Jonathon Chambers
Jonathon
Chambers
Video-aided model-based source separation in real reverberant rooms
Loughborough University
2013
Source separation
Reverberation
Spatial cues
Expectation-maximization
Time-frequency masking
Mechanical Engineering not elsewhere classified
2013-05-02 08:31:30
Journal contribution
https://repository.lboro.ac.uk/articles/journal_contribution/Video-aided_model-based_source_separation_in_real_reverberant_rooms/9566018
Source separation algorithms that utilize only audio
data can perform poorly if multiple sources or reverberation
are present. In this paper we therefore propose a video-aided
model-based source separation algorithm for a two-channel
reverberant recording in which the sources are assumed static.
By exploiting cues from video, we first localize individual speech
sources in the enclosure and then estimate their directions.
The interaural spatial cues, the interaural phase difference and
the interaural level difference, as well as the mixing vectors
are probabilistically modeled. The models make use of the
source direction information and are evaluated at discrete time-frequency
points. The model parameters are refined with the well-known
expectation-maximization (EM) algorithm. The algorithm
outputs time-frequency masks that are used to reconstruct the
individual sources. Simulation results show that by utilizing the
visual modality the proposed algorithm can produce better time-frequency
masks, thereby giving improved source estimates. We
test the proposed algorithm experimentally in different scenarios
and compare it with other audio-only and audio-visual
algorithms, achieving improved performance on both
synthetic and real data. We also include
dereverberation based pre-processing in our algorithm in order
to suppress the late reverberant components from the observed
stereo mixture and further enhance the overall output of the algorithm.
This advantage makes our algorithm a suitable candidate
for use in under-determined highly reverberant settings where
the performance of other audio-only and audio-visual methods
is limited.
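
The cue-and-mask idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes complex STFT arrays for the two channels are already available, models the interaural level difference (ILD) and interaural phase difference (IPD) with independent Gaussians, and performs a single E-step-like pass. The function names, the per-source means (standing in for values that would be derived from video-based direction estimates), and the standard deviations are all hypothetical.

```python
import numpy as np

def interaural_cues(left, right, eps=1e-12):
    """Return (ILD in dB, IPD in radians) for two complex STFT arrays."""
    ild = 20.0 * np.log10((np.abs(left) + eps) / (np.abs(right) + eps))
    ipd = np.angle(left * np.conj(right))
    return ild, ipd

def soft_masks(ild, ipd, ild_means, ipd_means, ild_std=3.0, ipd_std=0.5):
    """Per-source posteriors at each time-frequency point (one E-step-like pass).

    ild_means / ipd_means are hypothetical per-source cue means; in a
    video-aided system they would be initialized from source directions.
    """
    log_liks = []
    for mu_l, mu_p in zip(ild_means, ipd_means):
        # Wrap the phase residual into (-pi, pi] before scoring it.
        dphi = np.angle(np.exp(1j * (ipd - mu_p)))
        ll = -0.5 * ((ild - mu_l) / ild_std) ** 2 - 0.5 * (dphi / ipd_std) ** 2
        log_liks.append(ll)
    log_liks = np.stack(log_liks)                     # (sources, freq, time)
    log_liks -= log_liks.max(axis=0, keepdims=True)   # numerical stability
    post = np.exp(log_liks)
    return post / post.sum(axis=0, keepdims=True)     # masks sum to 1 per point

# Toy input: one frequency bin, two frames.
left = np.array([[2 + 0j, 1j]])
right = np.array([[1 + 0j, 1 + 0j]])
ild, ipd = interaural_cues(left, right)
masks = soft_masks(ild, ipd, ild_means=[6.0, -6.0], ipd_means=[0.0, 0.0])
```

Under these assumptions, a source estimate would be obtained by multiplying each mask with one channel's STFT and inverting the transform; a full EM procedure would alternate this step with re-estimation of the model parameters.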