Summary
Two approaches to detecting and tracking speakers in multispeaker audio are described. Both approaches use an adapted Gaussian mixture model, universal background model (GMM-UBM) speaker detection system as the core speaker recognition engine. In one approach, the individual log-likelihood ratio scores, which are produced on a frame-by-frame basis by the GMM-UBM system, are used to first partition the speech file into speaker homogenous regions and then to create scores for these regions. We refer to this approach as internal segmentation. Another approach uses an external segmentation algorithm, based on blind clustering, to partition the speech file into speaker homogenous regions. The adapted GMM-UBM system then scores each of these regions as in the single-speaker recognition case. We show that the external segmentation system outperforms the internal segmentation system for both detection and tracking. In addition, we show how different components of the detection and tracking algorithms contribute to the overall system performance.