Convolutive speech separation by combining probabilistic models employing the interaural spatial cues and properties of the room assisted by vision

This paper proposes a new combination of a model of the interaural spatial cues with a model that utilizes the spatial properties of the sources, in order to enhance speech separation in reverberant environments. The algorithm exploits knowledge of the locations of the speech sources, estimated through vision. The interaural phase difference, the interaural level difference and the contribution of each source to all mixture channels are each modeled as Gaussian distributions in the time-frequency domain and evaluated at individual time-frequency points. An expectation-maximization (EM) algorithm refines the estimates of the model parameters. The algorithm outputs enhanced time-frequency masks that are used to reconstruct the individual speech sources. Experimental results confirm that the combined video-assisted method is a promising approach for separating sources in real reverberant rooms.
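To make the masking idea concrete, the following is a minimal sketch of one possible EM loop over the interaural cues only. It is an illustration under stated assumptions, not the paper's implementation: it omits the model of each source's contribution to the mixture channels, it assumes independent Gaussians for the level and phase differences per source and frequency bin, and it assumes the initial means would be derived elsewhere from the vision-based source locations. All function and parameter names (`interaural_cues`, `e_step`, `m_step`, `params`) are hypothetical.

```python
import numpy as np

def interaural_cues(left, right, eps=1e-12):
    """ILD in dB and IPD in radians from left/right STFTs of shape (F, T)."""
    ild = 20.0 * np.log10((np.abs(left) + eps) / (np.abs(right) + eps))
    ipd = np.angle(left * np.conj(right))
    return ild, ipd

def wrap(phase):
    """Wrap phase differences into (-pi, pi]."""
    return np.angle(np.exp(1j * phase))

def log_gauss(residual, var):
    """Log density of a zero-mean Gaussian evaluated at the residual."""
    return -0.5 * (np.log(2.0 * np.pi * var) + residual ** 2 / var)

def e_step(ild, ipd, params):
    """Posterior of each source at every time-frequency point: shape (K, F, T)."""
    d_ild = ild[None] - params["mu_ild"][:, :, None]
    d_ipd = wrap(ipd[None] - params["mu_ipd"][:, :, None])
    log_p = (np.log(params["prior"] + 1e-12)[:, None, None]
             + log_gauss(d_ild, params["var_ild"][:, :, None])
             + log_gauss(d_ipd, params["var_ipd"][:, :, None]))
    log_p -= log_p.max(axis=0, keepdims=True)  # numerical stability
    post = np.exp(log_p)
    return post / post.sum(axis=0, keepdims=True)

def m_step(ild, ipd, post, floor=1e-3):
    """Re-estimate per-source, per-frequency parameters from the posteriors."""
    w = post.sum(axis=2) + 1e-12                      # (K, F)
    mu_ild = (post * ild[None]).sum(axis=2) / w
    var_ild = (post * (ild[None] - mu_ild[:, :, None]) ** 2).sum(axis=2) / w + floor
    # Circular (weighted) mean for the phase difference.
    mu_ipd = np.angle((post * np.exp(1j * ipd[None])).sum(axis=2))
    d_ipd = wrap(ipd[None] - mu_ipd[:, :, None])
    var_ipd = (post * d_ipd ** 2).sum(axis=2) / w + floor
    prior = post.mean(axis=(1, 2))
    return {"mu_ild": mu_ild, "var_ild": var_ild,
            "mu_ipd": mu_ipd, "var_ipd": var_ipd, "prior": prior}

# Toy usage with random spectra; the initial means stand in for values that
# would (by assumption) be predicted from the vision-based source locations.
rng = np.random.default_rng(0)
F, T, K = 4, 50, 2
left = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
right = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
ild, ipd = interaural_cues(left, right)
params = {"mu_ild": np.array([[6.0] * F, [-6.0] * F]),
          "var_ild": np.full((K, F), 4.0),
          "mu_ipd": np.array([[0.5] * F, [-0.5] * F]),
          "var_ipd": np.full((K, F), 1.0),
          "prior": np.full(K, 1.0 / K)}
for _ in range(5):
    masks = e_step(ild, ipd, params)   # soft time-frequency masks
    params = m_step(ild, ipd, masks)
separated = masks * left[None]         # masked spectra, one per source
```

In this sketch the posteriors from the E-step double as soft time-frequency masks; applying them to a mixture channel and inverting the STFT would give the source estimates.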