Causal Deep CASA for Monaural Talker-Independent Speaker Separation.

Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing (2020)

Talker-independent monaural speaker separation aims to separate concurrent speakers from a single-microphone recording. Inspired by human auditory scene analysis (ASA) mechanisms, a two-stage deep CASA approach was recently proposed to address this problem, and it achieves state-of-the-art results in separating mixtures of two or three speakers. A main limitation of deep CASA is that it is a non-causal system, whereas many speech processing applications, e.g., telecommunication and hearing prostheses, require causal processing. In this study, we propose a causal version of deep CASA to address this limitation. First, we modify the temporal connections, normalization, and clustering algorithms in deep CASA so that no future information is used anywhere in the deep network. We then train a C-speaker (C ≥ 2) deep CASA system in a speaker-number-independent fashion, so that it generalizes to speech mixtures with up to C speakers without prior knowledge of the number of speakers. Experimental results show that causal deep CASA achieves excellent speaker separation performance with known or unknown speaker numbers.
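
To make the causality constraint on normalization concrete, the following is a minimal sketch of one way to normalize a feature sequence using only current and past frames. It is an illustration under stated assumptions, not the normalization used in the paper: the function name `causal_normalize`, the (T, F) input layout, and the choice of cumulative mean and variance statistics are all hypothetical.

```python
import numpy as np


def causal_normalize(x, eps=1e-8):
    """Normalize each frame with statistics from frames 0..t only.

    x: array of shape (T, F), with T time frames and F features per frame.
    This is an illustrative cumulative normalization (an assumption),
    not necessarily the scheme used in causal deep CASA.
    """
    x = np.asarray(x, dtype=np.float64)
    T, F = x.shape
    # Number of scalar values seen up to and including each frame.
    counts = F * np.arange(1, T + 1, dtype=np.float64)
    # Running sums of values and squared values over frames 0..t.
    cum_sum = np.cumsum(x.sum(axis=1))
    cum_sq = np.cumsum((x ** 2).sum(axis=1))
    mean = cum_sum / counts
    # Clamp at zero to guard against small negative values from round-off.
    var = np.maximum(cum_sq / counts - mean ** 2, 0.0)
    return (x - mean[:, None]) / np.sqrt(var[:, None] + eps)


# Usage: perturbing future frames must leave earlier outputs unchanged.
rng = np.random.default_rng(0)
x = rng.standard_normal((50, 8))
y = causal_normalize(x)
x_future_changed = x.copy()
x_future_changed[30:] += 5.0
y_future_changed = causal_normalize(x_future_changed)
```

In this sketch, the output at frame t depends only on frames 0 through t, so the first 30 frames of `y` and `y_future_changed` agree. A normalization that computes statistics over the whole utterance would not have this property, which is why such layers need to be replaced in a causal system.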

Keyphrases