Publications


Bayesian estimation of PLDA with noisy training labels, with applications to speaker verification

Published in:
2020 IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 4-8 May 2020.

Summary

This paper proposes a method for Bayesian estimation of probabilistic linear discriminant analysis (PLDA) when training labels are noisy. Label errors can be expected in, for example, large or distributed data collections, or in crowd-sourced data labeling. By interpreting the true labels as latent random variables, the observed labels are modeled as outputs of a discrete memoryless channel, and the maximum a posteriori (MAP) estimate of the PLDA model is derived via Variational Bayes. The proposed framework can be used for PLDA estimation, PLDA domain adaptation, or to infer the reliability of a PLDA training list. Although presented as a general method, the paper discusses specific applications to speaker verification. When applied to the Speakers in the Wild (SITW) Task, the proposed method degrades gracefully as label errors are introduced into the training or domain adaptation lists. When applied to the NIST 2018 Speaker Recognition Evaluation (SRE18) Task, which includes adaptation data with noisy speaker labels, the proposed technique provides performance improvements relative to unsupervised domain adaptation.
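
As a rough illustration of the label-noise model described in this abstract, the sketch below treats each observed speaker label as the output of a symmetric discrete memoryless channel applied to a latent true label and computes the posterior over the true label. The flip probability, uniform prior, and function names are illustrative assumptions, not the paper's full Variational Bayes recipe.

```python
import numpy as np

# Minimal sketch of the label-noise model described above (not the paper's full VB recipe):
# each observed label is treated as the output of a symmetric discrete memoryless channel
# applied to a latent true label; we compute the posterior over the true label.
def true_label_posterior(observed, n_classes, flip_prob, prior=None):
    """P(true label | observed label) under a symmetric label-flip channel."""
    if prior is None:
        prior = np.full(n_classes, 1.0 / n_classes)   # uniform prior (assumption)
    channel = np.full((n_classes, n_classes), flip_prob / (n_classes - 1))
    np.fill_diagonal(channel, 1.0 - flip_prob)        # P(observed = j | true = i)
    unnorm = prior * channel[:, observed]             # prior times likelihood for each true label
    return unnorm / unnorm.sum()

# Example: a 5% flip rate leaves some posterior mass on other speakers, which is what
# keeps the PLDA estimate from fully trusting a possibly mislabeled sample.
print(true_label_posterior(observed=2, n_classes=10, flip_prob=0.05))
```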

Discriminative PLDA for speaker verification with X-vectors

Published in:
IEEE Signal Processing Letters [submitted]

Summary

This paper proposes a novel approach to discriminative training of probabilistic linear discriminant analysis (PLDA) for speaker verification with x-vectors. Model over-fitting is a well-known issue with discriminative PLDA (D-PLDA) for speaker verification. As opposed to prior approaches, which address this by limiting the number of trainable parameters, the proposed method parameterizes the D-PLDA model in a manner that allows for intuitive regularization, permitting the entire model to be optimized. Specifically, the within-class and across-class covariance matrices that comprise the PLDA model are expressed as products of orthonormal and diagonal matrices, and the structure of these matrices is enforced during model training. The proposed approach provides consistent performance improvements relative to previous D-PLDA methods when applied to a variety of speaker recognition evaluations, including the Speakers in the Wild Core-Core, SRE16, SRE18 CMN2, SRE19 CMN2, and VoxCeleb1 Tasks. Additionally, when implemented in TensorFlow using a modern GPU, D-PLDA optimization is highly efficient, requiring less than 20 minutes.
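
The numpy sketch below shows one way to realize the structured covariance parameterization mentioned in this abstract: each covariance is formed as Q diag(d) Qᵀ with an orthonormal Q and a positive diagonal, so symmetry and positive definiteness hold by construction. The QR-based orthonormalization and softplus are assumptions for illustration; the paper's TensorFlow training details may differ.

```python
import numpy as np

# Illustrative numpy version of the structured covariance parameterization described above:
# each PLDA covariance is built as Q diag(d) Q^T with Q orthonormal and d positive, so the
# matrix stays symmetric positive definite by construction.
def structured_covariance(raw_square, raw_diag):
    q, _ = np.linalg.qr(raw_square)        # orthonormal factor from unconstrained parameters
    d = np.log1p(np.exp(raw_diag))         # softplus keeps the diagonal strictly positive
    return q @ np.diag(d) @ q.T

rng = np.random.default_rng(0)
dim = 4                                    # toy dimension; x-vector backends are far larger
within = structured_covariance(rng.standard_normal((dim, dim)), rng.standard_normal(dim))
print(np.linalg.eigvalsh(within))          # all eigenvalues positive by construction
```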

State-of-the-art speaker recognition for telephone and video speech: the JHU-MIT submission for NIST SRE18

Summary

We present a condensed description of the joint effort of JHU-CLSP, JHU-HLTCOE, MIT-LL, MIT CSAIL, and LSE-EPITA for NIST SRE18. All of the developed systems consisted of x-vector/i-vector embeddings with some flavor of PLDA backend. Very deep x-vector architectures (Extended and Factorized TDNN, and ResNets) clearly outperformed shallower x-vectors and i-vectors. The systems were tailored to the video (VAST) or to the telephone (CMN2) condition. The VAST data was challenging, yielding 4 times worse performance than other video-based datasets such as Speakers in the Wild. We were able to calibrate on the VAST data with very few development trials by using careful adaptation and score normalization methods. The VAST primary fusion yielded EER=10.18% and Cprimary=0.431. By improving calibration post-evaluation, we reached Cprimary=0.369. In CMN2, we used unsupervised SPLDA adaptation based on agglomerative clustering and score normalization to correct the domain shift between English and Tunisian Arabic models. The CMN2 primary fusion yielded EER=4.5% and Cprimary=0.313. The Extended TDNN x-vector was the best single system, obtaining EER=11.1% and Cprimary=0.452 in VAST, and EER=4.95% and Cprimary=0.354 in CMN2.
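
For reference, the snippet below sketches symmetric score normalization (s-norm), one family of the score normalization methods referred to in this abstract: a raw trial score is z-normalized against cohort scores from both the enrollment and test sides, and the two normalized scores are averaged. Cohort construction and the exact adaptive variant used in the submission are assumptions.

```python
import numpy as np

# Sketch of symmetric score normalization (s-norm): the raw trial score is z-normalized
# against enrollment-side and test-side cohort scores, and the two results are averaged.
def s_norm(raw_score, enroll_cohort_scores, test_cohort_scores):
    ze = (raw_score - enroll_cohort_scores.mean()) / enroll_cohort_scores.std()
    zt = (raw_score - test_cohort_scores.mean()) / test_cohort_scores.std()
    return 0.5 * (ze + zt)

rng = np.random.default_rng(1)
print(s_norm(2.3, rng.standard_normal(200), rng.standard_normal(200)))
```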

Discriminative PLDA for speaker verification with X-vectors

Published in:
International Conference on Acoustics, Speech, and Signal Processing, May 2019 [submitted]

Summary

This paper proposes a novel approach to discriminative training of probabilistic linear discriminant analysis (PLDA) for speaker verification with x-vectors. The Newton Method is used to discriminatively train the PLDA model by minimizing the log loss of verification trials. By diagonalizing the across-class and within-class covariance matrices as a pre-processing step, the PLDA model can be trained without relying on approximations, and while maintaining important properties of the underlying covariance matrices. The training procedure is extended to allow for efficient domain adaptation. When applied to the Speakers in the Wild and SRE16 tasks, the proposed approach provides significant performance improvements relative to conventional PLDA.
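
The diagonalizing pre-processing step mentioned here amounts to finding one linear transform that whitens the within-class covariance and simultaneously diagonalizes the across-class covariance. A minimal numpy sketch of such a transform is below; regularization and the subsequent Newton optimization are omitted.

```python
import numpy as np

# One linear transform T that whitens the within-class covariance (T^T W T = I) and
# diagonalizes the across-class covariance (T^T B T diagonal).
def diagonalizing_transform(within, across):
    evals_w, evecs_w = np.linalg.eigh(within)
    whiten = evecs_w / np.sqrt(evals_w)                  # scales columns so W maps to identity
    evals_b, evecs_b = np.linalg.eigh(whiten.T @ across @ whiten)
    return whiten @ evecs_b

rng = np.random.default_rng(2)
a, b = rng.standard_normal((5, 5)), rng.standard_normal((5, 5))
W, B = a @ a.T + np.eye(5), b @ b.T                      # toy SPD within/across covariances
T = diagonalizing_transform(W, B)
print(np.round(T.T @ W @ T, 6))                          # approximately identity
print(np.round(T.T @ B @ T, 6))                          # approximately diagonal
```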

The MIT Lincoln Laboratory/JHU/EPITA-LSE LRE17 System

Summary

Competitive international language recognition evaluations have been hosted by NIST for over two decades. This paper describes the MIT Lincoln Laboratory (MITLL) and Johns Hopkins University (JHU) submission for the recent 2017 NIST language recognition evaluation (LRE17) [1]. The MITLL/JHU LRE17 submission represents a collaboration between researchers at MITLL and JHU, with multiple sub-systems reflecting a range of language recognition technologies including traditional MFCC/SDC i-vector systems, deep neural network (DNN) bottleneck feature based i-vector systems, state-of-the-art DNN x-vector systems, and a sparse coding system. Each sub-system uses the same backend processing for domain adaptation and score calibration. Multiple sub-systems were fused using simple logistic regression [2] to create system combinations. The MITLL/JHU submissions were selected based on the top-ranking combinations of up to 5 sub-systems using development data provided by NIST. The MITLL/JHU primary submitted systems attained a Cavg of 0.181 and 0.163 for the fixed and open conditions, respectively. Post-evaluation analysis revealed the importance of carefully partitioning the development data, using augmented training data, and using a condition-dependent backend. Addressing these issues, including retraining the x-vector system with augmented data, yielded performance gains of over 17%: a Cavg of 0.149 for the fixed condition and 0.132 for the open condition.
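
As a toy illustration of the score-level fusion described in this abstract, the sketch below stacks per-sub-system scores as features and fits a logistic regression over trial labels. The synthetic scores and the scikit-learn fuser are assumptions, not the submission's actual fusion tooling.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy score-level fusion with logistic regression: each column is one sub-system's scores
# on the same trials, and the trial keys are the labels.
rng = np.random.default_rng(3)
n_trials = 1000
labels = rng.integers(0, 2, n_trials)                          # 1 = target-language trial
scores = np.column_stack(
    [labels + 0.8 * rng.standard_normal(n_trials) for _ in range(3)]
)

fuser = LogisticRegression()
fuser.fit(scores, labels)
fused = fuser.decision_function(scores)                        # fused log-odds per trial
print(fuser.coef_, fuser.intercept_)
```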

Supervector LDA - a new approach to reduced-complexity i-vector language recognition

Published in:
INTERSPEECH 2012: 13th Annual Conf. of the Int. Speech Communication Assoc., 9-13 September 2012.

Summary

In this paper, we extend our previous analysis of Gaussian Mixture Model (GMM) subspace compensation techniques using Gaussian modeling in the supervector space combined with additive channel and observation noise. We show that under the modeling assumptions of a total-variability i-vector system, full Gaussian supervector scoring can also be performed cheaply in the total subspace, and that i-vector scoring can be viewed as an approximation to this. Next, we show that covariance matrix estimation in the i-vector space can be used to generate PCA estimates of supervector covariance matrices needed for Joint Factor Analysis (JFA). Finally, we derive a new technique for reduced-dimension i-vector extraction which we call Supervector LDA (SV-LDA), and demonstrate a 100-dimensional i-vector language recognition system with equivalent performance to a 600-dimensional version at much lower complexity.
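
The snippet below sketches one idea from this abstract: estimating a class covariance in the low-dimensional i-vector space and mapping it to the supervector space through the total-variability matrix (supervector ≈ m + Tw). The shapes, the random T, and the single within-class covariance are illustrative assumptions; this is not the paper's SV-LDA derivation.

```python
import numpy as np

# Estimate a within-class covariance in the i-vector space and map it to the supervector
# space through the total-variability matrix T (illustrative sizes and data only).
rng = np.random.default_rng(4)
sv_dim, iv_dim, n_classes = 2000, 100, 10      # toy sizes; real supervectors are much larger
T = rng.standard_normal((sv_dim, iv_dim))      # total-variability matrix (placeholder)

ivectors = rng.standard_normal((500, iv_dim))
labels = rng.integers(0, n_classes, 500)
class_means = np.stack([ivectors[labels == k].mean(axis=0) for k in range(n_classes)])
within_iv = np.cov((ivectors - class_means[labels]).T)   # within-class cov in i-vector space

within_sv = T @ within_iv @ T.T                # low-rank supervector-space covariance estimate
print(within_sv.shape)
```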

Exploring the impact of advanced front-end processing on NIST speaker recognition microphone tasks

Summary

The NIST speaker recognition evaluation (SRE) featured microphone data in the 2005-2010 evaluations. The preprocessing and use of this data have typically been performed with telephone bandwidth and quantization. Although this approach is viable, it ignores the richer properties of the microphone data: multiple channels, high-rate sampling, linear encoding, ambient noise properties, etc. In this paper, we explore alternate choices of preprocessing and examine their effects on speaker recognition performance. Specifically, we consider the effects of quantization, sampling rate, enhancement, and two-channel speech activity detection. Experiments on the NIST 2010 SRE interview microphone corpus demonstrate that performance can be dramatically improved with a different preprocessing chain.
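
To make the preprocessing choices concrete, the sketch below simulates two of them on a placeholder waveform: downsampling 16 kHz microphone audio to 8 kHz telephone bandwidth, and mu-law companding with coarse quantization instead of linear encoding. Enhancement and two-channel speech activity detection are not modeled, and the specific parameters are assumptions.

```python
import numpy as np
from scipy.signal import resample_poly

# Simulate telephone-style preprocessing on a placeholder waveform: downsample to 8 kHz
# and apply mu-law companding with coarse quantization.
def mu_law(x, mu=255):
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

rng = np.random.default_rng(6)
wideband = 0.1 * rng.standard_normal(16000)            # one second of 16 kHz audio (placeholder)
narrowband = resample_poly(wideband, up=1, down=2)     # 16 kHz -> 8 kHz
companded = np.round(mu_law(narrowband) * 127) / 127   # roughly 8-bit quantization after companding
print(wideband.shape, narrowband.shape)
```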

Linear prediction modulation filtering for speaker recognition of reverberant speech

Published in:
Odyssey 2012, The Speaker and Language Recognition Workshop, 25-28 June 2012.

Summary

This paper proposes a framework for spectral enhancement of reverberant speech based on inversion of the modulation transfer function. All-pole modeling of the modulation spectra of clean and degraded speech is utilized to derive the linear prediction inverse modulation transfer function (LP-IMTF) solution as a low-order IIR filter in the modulation envelope domain. By considering spectral estimation under speech presence uncertainty, speech presence probabilities are derived for the case of reverberation. Aside from enhancement, the LP-IMTF framework allows for blind estimation of reverberation time by extracting a minimum-phase approximation of the short-time spectral channel impulse response. The proposed speech enhancement method is used as a front-end processing step for speaker recognition. When applied to the microphone condition of the NIST SRE 2010 with artificially added reverberation, the proposed spectral enhancement method yields significant improvements across a variety of performance metrics.
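
One building block of this approach is fitting a low-order all-pole (linear prediction) model to a subband modulation envelope. The sketch below does this with the autocorrelation method and applies the resulting prediction-error filter; it is a generic illustration under assumed placeholder data and model order, not the full LP-IMTF derivation.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

# Fit a low-order all-pole (LP) model to a modulation envelope via the autocorrelation
# method, then apply the prediction-error filter A(z) = 1 - sum_k a_k z^{-k}.
def lpc(x, order):
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]   # autocorrelation, lags 0..order
    a = solve_toeplitz(r[:-1], r[1:])                                # Yule-Walker predictor coefficients
    return np.concatenate(([1.0], -a))

rng = np.random.default_rng(5)
envelope = np.abs(rng.standard_normal(500))        # placeholder subband modulation envelope
A = lpc(envelope - envelope.mean(), order=3)       # low-order all-pole model
residual = lfilter(A, [1.0], envelope)             # inverse (prediction-error) filtering by A(z)
print(A)
```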
