Publications

The 2019 NIST Speaker Recognition Evaluation CTS Challenge

Published in:
The Speaker and Language Recognition Workshop: Odyssey 2020, 1-5 November 2020.

Summary

In 2019, the U.S. National Institute of Standards and Technology (NIST) conducted a leaderboard-style speaker recognition challenge using conversational telephone speech (CTS) data extracted from the unexposed portion of the Call My Net 2 (CMN2) corpus previously used in the 2018 Speaker Recognition Evaluation (SRE). The SRE19 CTS Challenge was organized in a similar manner to SRE18, except it offered only the open training condition. In addition, similar to the NIST i-vector challenge, the evaluation set consisted of two subsets: a progress subset and a test subset. The progress subset comprised 30% of the trials and was used to monitor progress on the leaderboard, while the remaining 70% of the trials formed the test subset, which was used to generate the official final results determined at the end of the challenge. Which subset (i.e., progress or test) a trial belonged to was unknown to challenge participants, and each system submission had to contain outputs for all trials. The CTS Challenge also served as a prerequisite for entrance to the main SRE19, whose primary task was audio-visual person recognition. A total of 67 organizations (forming 51 teams) from academia and industry participated in the CTS Challenge and submitted 1347 valid system outputs. This paper presents an overview of the evaluation and several analyses of system performance for all primary conditions in the CTS Challenge. Compared to the CTS track of SRE18, the SRE19 CTS Challenge results indicate remarkable improvements in performance, which are mainly attributed to 1) the availability of large amounts of in-domain development data from a large number of labeled speakers, 2) speaker representations (aka embeddings) extracted using extended and more complex end-to-end neural network frameworks, and 3) effective use of the provided large development set.
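
The progress/test protocol described above can be illustrated with a minimal Python sketch. It assumes trials are assigned to the two subsets uniformly at random; the summary does not say how NIST actually partitioned them. The function name `split_trials`, the seed, and the trial identifiers are hypothetical.

```python
import random

def split_trials(trial_ids, progress_fraction=0.3, seed=0):
    """Randomly assign trials to a progress subset and a test subset.

    Assumption: uniform random assignment. The summary only states the
    30%/70% proportions, not the partitioning procedure.
    """
    rng = random.Random(seed)           # fixed seed keeps the hidden split reproducible
    shuffled = list(trial_ids)
    rng.shuffle(shuffled)
    n_progress = round(progress_fraction * len(shuffled))
    return set(shuffled[:n_progress]), set(shuffled[n_progress:])

# Hypothetical trial list of 1000 integer IDs.
progress, test = split_trials(range(1000))
```

Because participants cannot tell which subset a trial belongs to, a submission has to score every trial, and only the organizer-side split decides which scores feed the leaderboard and which feed the final results.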

The 2019 NIST Audio-Visual Speaker Recognition Evaluation

Published in:
The Speaker and Language Recognition Workshop: Odyssey 2020, 1-5 November 2020.

Summary

In 2019, the U.S. National Institute of Standards and Technology (NIST) conducted the most recent in an ongoing series of speaker recognition evaluations (SRE). There were two components to SRE19: 1) a leaderboard-style Challenge using unexposed conversational telephone speech (CTS) data from the Call My Net 2 (CMN2) corpus, and 2) an Audio-Visual (AV) evaluation using video material extracted from the unexposed portions of the Video Annotation for Speech Technologies (VAST) corpus. This paper presents an overview of the Audio-Visual SRE19 activity, including the task, the performance metric, the data, the evaluation protocol, the results, and system performance analyses. The Audio-Visual SRE19 was organized in a similar manner to the audio from video (AfV) track in SRE18, except it offered only the open training condition. In addition, instead of extracting and releasing only the AfV data, unexposed multimedia data from the VAST corpus was used to support the Audio-Visual SRE19. It featured two core evaluation tracks, namely audio only and audio-visual, as well as an optional visual only track. A total of 26 organizations (forming 14 teams) from academia and industry participated in the Audio-Visual SRE19 and submitted 102 valid system outputs. Evaluation results indicate: 1) notable performance improvements for the audio only speaker recognition task on the challenging amateur online video domain due to the use of more complex neural network architectures (e.g., ResNet) along with soft margin losses, 2) state-of-the-art speaker and face recognition technologies provide comparable person recognition performance on the amateur online video domain, and 3) audio-visual fusion results in remarkable performance gains (greater than 85% relative) over the audio only or visual only systems.
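
To make the phrase "greater than 85% relative" concrete, the short Python sketch below computes a relative reduction in a cost-style metric, where lower is better. The two cost values are hypothetical placeholders, not results from the paper, and the summary does not say which metric the gain refers to.

```python
def relative_reduction(baseline_cost, new_cost):
    """Fraction of the baseline cost removed by the new system (lower cost is better)."""
    return (baseline_cost - new_cost) / baseline_cost

# Hypothetical costs, chosen only to illustrate the arithmetic.
audio_only_cost = 0.20
fused_cost = 0.02
gain = relative_reduction(audio_only_cost, fused_cost)
```

With these placeholder values the fused system removes about nine tenths of the single-modality cost, which would count as a gain of more than 85% relative.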

The MITLL NIST LRE 2015 Language Recognition System

Summary

In this paper we describe the most recent MIT Lincoln Laboratory language recognition system developed for the NIST 2015 Language Recognition Evaluation (LRE). The submission features a fusion of five core classifiers, with most systems developed in the context of an i-vector framework. The 2015 evaluation presented new paradigms. First, the evaluation included fixed training and open training tracks for the first time; second, language classification performance was measured across 6 language clusters using 20 language classes instead of an N-way language task; and third, performance was measured across a nominal 3-30 second range. Results are presented for the overall performance across the six language clusters for both the fixed and open training tasks. On the 6-cluster metric the Lincoln system achieved overall costs of 0.173 and 0.168 for the fixed and open tasks respectively.

Domain mismatch compensation for speaker recognition using a library of whiteners

Published in:
IEEE Signal Process. Lett., Vol. 22, No. 11, November 2015, pp. 2000-2003.

Summary

The development of the i-vector framework for generating low dimensional representations of speech utterances has led to considerable improvements in speaker recognition performance. Although these gains have been achieved in periodic National Institute of Standards and Technology (NIST) evaluations, the problem of domain mismatch, where the system development data and the application data are collected from different sources, remains a challenging one. The impact of domain mismatch was a focus of the Johns Hopkins University (JHU) 2013 speaker recognition workshop, where a domain adaptation challenge (DAC13) corpus was created to address this problem. This paper proposes an approach to domain mismatch compensation for applications where in-domain development data is assumed to be unavailable. The method is based on a generalization of data whitening used in association with i-vector length normalization and utilizes a library of whitening transforms trained at system development time using strictly out-of-domain data. The approach is evaluated on the 2013 domain adaptation challenge task and is shown to compare favorably to in-domain conventional whitening and to nuisance attribute projection (NAP) inter-dataset variability compensation.
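
The idea of a whitener library followed by length normalization can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the method from the paper: it uses diagonal (per-dimension) whitening instead of a full-covariance transform, and it picks the whitener under which the input vector has the highest Gaussian log-likelihood, a selection rule the summary does not specify. All function names and the toy data are hypothetical.

```python
import math

def fit_whitener(vectors):
    """Estimate a per-dimension mean and standard deviation from one out-of-domain set."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [sum((v[d] - mean[d]) ** 2 for v in vectors) / n for d in range(dim)]
    std = [math.sqrt(x) + 1e-8 for x in var]   # small floor avoids division by zero
    return mean, std

def log_likelihood(x, whitener):
    """Log-likelihood of x under the diagonal Gaussian defined by a whitener."""
    mean, std = whitener
    return sum(-0.5 * ((xi - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
               for xi, m, s in zip(x, mean, std))

def select_whitener(x, library):
    """Assumed selection rule: the whitener that best explains x."""
    return max(library, key=lambda w: log_likelihood(x, w))

def whiten_and_normalize(x, library):
    """Whiten x with the selected transform, then scale it to unit length."""
    mean, std = select_whitener(x, library)
    y = [(xi - m) / s for xi, m, s in zip(x, mean, std)]
    norm = math.sqrt(sum(v * v for v in y)) or 1.0
    return [v / norm for v in y]

# Toy two-dimensional "i-vectors" from two hypothetical out-of-domain sources.
set_a = [[0.0, 1.0], [2.0, 3.0], [1.0, 2.0]]
set_b = [[10.0, 10.0], [12.0, 14.0], [11.0, 12.0]]
library = [fit_whitener(set_a), fit_whitener(set_b)]
normalized = whiten_and_normalize([11.5, 12.5], library)
```

The library is trained entirely on out-of-domain data at development time, so no in-domain data is needed when a new vector arrives.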

Query-by-example using speaker content graphs

Published in:
INTERSPEECH 2012: 13th Annual Conf. of the Int. Speech Communication Assoc., 9-13 September 2012.

Summary

We describe methods for constructing and using content graphs for query-by-example speaker recognition tasks within a large speech corpus. This goal is achieved as follows: First, we describe an algorithm for constructing speaker content graphs, where nodes represent speech signals and edges represent speaker similarity. Speech signal similarity can be based on any standard vector-based speaker comparison method, and the content graph can be constructed using an efficient incremental method for streaming data. Second, we apply random walk methods to the content graph to find matching examples to an unlabeled query set of speech signals. The content-graph based method is contrasted to a more traditional approach that uses supervised training and stack detectors. Performance is compared in terms of information retrieval measures and computational complexity. The new content-graph based method is shown to provide a promising low-complexity scalable alternative to standard speaker recognition methods.
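
The two steps above (graph construction, then a random walk from the query) can be sketched in Python as follows. The specific choices here are assumptions made for illustration: cosine similarity as the vector comparison, a symmetric k-nearest-neighbor graph, and a random walk with restart. The summary only states that any standard vector-based comparison and "random walk methods" are used. Function names and data are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def build_knn_graph(vectors, k=2):
    """Connect each node to its k most similar nodes; edges are made symmetric."""
    n = len(vectors)
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: cosine(vectors[i], vectors[j]), reverse=True)
        for j in ranked[:k]:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors

def random_walk_scores(neighbors, query_nodes, restart=0.3, iterations=50):
    """Random walk with restart at the query nodes; a higher score means closer to the query."""
    n = len(neighbors)
    start = [1.0 / len(query_nodes) if i in query_nodes else 0.0 for i in range(n)]
    scores = start[:]
    for _ in range(iterations):
        spread = [0.0] * n
        for i, nbrs in neighbors.items():
            for j in nbrs:
                spread[j] += scores[i] / len(nbrs)   # split mass evenly over edges
        scores = [restart * s0 + (1 - restart) * s for s0, s in zip(start, spread)]
    return scores

# Toy speaker vectors: nodes 0-2 resemble one speaker, nodes 3-5 another.
vectors = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.1, 1.0], [0.0, 0.9], [0.2, 1.0]]
graph = build_knn_graph(vectors, k=2)
scores = random_walk_scores(graph, query_nodes={0})
```

Ranking nodes by score gives the retrieval list for the query. No per-speaker model is trained, which is where the low-complexity argument comes from.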

The MITLL NIST LRE 2011 language recognition system

Summary

This paper presents a description of the MIT Lincoln Laboratory (MITLL) language recognition system developed for the NIST 2011 Language Recognition Evaluation (LRE). The submitted system consisted of a fusion of four core classifiers, three based on spectral similarity and one based on tokenization. Additional system improvements were achieved following the submission deadline. In a major departure from previous evaluations, the 2011 LRE task focused on closed-set pairwise performance so as to emphasize a system's ability to distinguish confusable language pairs. Results are presented for the 24-language confusable pair task at test utterance durations of 30, 10, and 3 seconds. Results are also shown using the standard detection metrics (DET, minDCF), and it is demonstrated that the previous metrics adequately cover difficult pair performance. On the 30 s 24-language confusable pair task, the submitted and post-evaluation systems achieved average costs of 0.079 and 0.070 and standard detection costs of 0.038 and 0.033.

The MITLL NIST LRE 2009 language recognition system

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, 15 March 2010, pp. 4994-4997.

Summary

This paper presents a description of the MIT Lincoln Laboratory language recognition system submitted to the NIST 2009 Language Recognition Evaluation (LRE). This system consists of a fusion of three core recognizers, two based on spectral similarity and one based on tokenization. The 2009 LRE differed from previous ones in that test data included narrowband segments from worldwide Voice of America broadcasts as well as conventional recorded conversational telephone speech. Results are presented for the 23-language closed-set and open-set detection tasks at the 30, 10, and 3 second durations along with a discussion of the language-pair task. On the 30 second 23-language closed-set detection task, the system achieved a 1.64 average error rate.

The MITLL NIST LRE 2007 language recognition system

Summary

This paper presents a description of the MIT Lincoln Laboratory language recognition system submitted to the NIST 2007 Language Recognition Evaluation. This system consists of a fusion of four core recognizers, two based on tokenization and two based on spectral similarity. Results for NIST's 14-language detection task are presented for both the closed-set and open-set tasks and for the 30, 10 and 3 second durations. On the 30 second 14-language closed-set detection task, the system achieves a 1% equal error rate.
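
For readers unfamiliar with the equal error rate quoted above, the Python sketch below shows one simple way to estimate it from detection scores. It sweeps the observed scores as thresholds and reports the point where the miss rate and the false-alarm rate are closest. It is an illustrative approximation, not the NIST scoring tooling, and the scores are hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the error rate at the threshold where miss and false-alarm rates meet."""
    best_gap, eer = float("inf"), None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        # A trial is accepted when its score is at or above the threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer

# Hypothetical detection scores for target-language and non-target-language trials.
targets = [0.9, 0.8, 0.7, 0.35]
nontargets = [0.1, 0.2, 0.3, 0.4]
eer = equal_error_rate(targets, nontargets)
```

With so few trials the estimate is coarse. On an evaluation-sized trial list the same sweep gives a much finer operating curve.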

Beyond frame independence: parametric modelling of time duration in speaker and language recognition

Published in:
INTERSPEECH 2008, 22-26 September 2008, pp. 767-770.

Summary

In this work, we address the question of generating accurate likelihood estimates from multi-frame observations in speaker and language recognition. Using a simple theoretical model, we extend the basic assumption of independent frames to include two refinements: a local correlation model across neighboring frames, and a global uncertainty due to train/test channel mismatch. We present an algorithm for discriminative training of the resulting duration model based on logistic regression combined with a bisection search. We show that using this model we can achieve state-of-the-art performance for the NIST LRE07 task. Finally, we show that these more accurate class likelihood estimates can be combined to solve multiple problems using Bayes' rule, so that we can expand our single parametric backend to replace all six separate back-ends used in our NIST LRE submission for both closed and open sets.
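
The final point, combining class likelihoods with Bayes' rule so that one backend serves both closed-set and open-set tasks, can be sketched in Python as follows. This illustrates only the Bayes' rule step; the duration model and its discriminative training are not reproduced. The class names, likelihood values, priors, and the treatment of "out_of_set" as one extra class are all assumptions.

```python
import math

def posteriors(log_likelihoods, priors):
    """Bayes' rule in the log domain: P(class | x) is proportional to p(x | class) * P(class)."""
    joint = {c: ll + math.log(priors[c]) for c, ll in log_likelihoods.items()}
    peak = max(joint.values())                      # subtract the max for numerical stability
    total = sum(math.exp(v - peak) for v in joint.values())
    return {c: math.exp(v - peak) / total for c, v in joint.items()}

# Hypothetical per-class log-likelihoods for one test utterance.
log_liks = {"english": -10.0, "french": -12.0, "out_of_set": -11.0}

# Closed set: only the known languages compete.
closed = posteriors({c: log_liks[c] for c in ("english", "french")},
                    {"english": 0.5, "french": 0.5})

# Open set: the same likelihoods, with prior mass reserved for an out-of-set class.
open_set = posteriors(log_liks, {"english": 0.4, "french": 0.4, "out_of_set": 0.2})
```

The likelihoods are computed once, and only the priors and the set of competing classes change between tasks. A single backend can replace several task-specific ones for that reason.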

Triage framework for resource conservation in a speaker identification system

Published in:
Proc. 32nd IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, April 2007, pp. IV-69 - IV-72.

Summary

We present a novel framework for triaging (prioritizing and discarding) data to conserve resources for a speaker identification (SID) system. Our work is motivated by applications that require a SID system to process an overwhelming volume of audio data. We design a triage filter whose goal is to conserve recognizer resources while preserving relevant content. We propose triage methods that use signal quality assessment tools, a scaled-down version of the main recognizer itself, and a fusion of these measures. We define a new precision-based measure of effectiveness for our triage framework. Our experimental results with the 35-speaker tactical SID corpus bear out the validity of our approach.