Publications

Bayesian estimation of PLDA with noisy training labels, with applications to speaker verification

Published in:
2020 IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 4-8 May 2020.

Summary

This paper proposes a method for Bayesian estimation of probabilistic linear discriminant analysis (PLDA) when training labels are noisy. Label errors can be expected during e.g. large or distributed data collections, or for crowd-sourced data labeling. By interpreting true labels as latent random variables, the observed labels are modeled as outputs of a discrete memoryless channel, and the maximum a posteriori (MAP) estimate of the PLDA model is derived via Variational Bayes. The proposed framework can be used for PLDA estimation, PLDA domain adaptation, or to infer the reliability of a PLDA training list. Although presented as a general method, the paper discusses specific applications for speaker verification. When applied to the Speakers in the Wild (SITW) Task, the proposed method achieves graceful performance degradation when label errors are introduced into the training or domain adaptation lists. When applied to the NIST 2018 Speaker Recognition Evaluation (SRE18) Task, which includes adaptation data with noisy speaker labels, the proposed technique provides performance improvements relative to unsupervised domain adaptation.

Discriminative PLDA for speaker verification with X-vectors

Published in:
IEEE Signal Processing Letters [submitted]

Summary

This paper proposes a novel approach to discriminative training of probabilistic linear discriminant analysis (PLDA) for speaker verification with x-vectors. Model over-fitting is a well-known issue with discriminative PLDA (D-PLDA) for speaker verification. As opposed to prior approaches which address this by limiting the number of trainable parameters, the proposed method parameterizes the discriminative PLDA (D-PLDA) model in a manner which allows for intuitive regularization, permitting the entire model to be optimized. Specifically, the within-class and across-class covariance matrices which comprise the PLDA model are expressed as products of orthonormal and diagonal matrices, and the structure of these matrices is enforced during model training. The proposed approach provides consistent performance improvements relative to previous D-PLDA methods when applied to a variety of speaker recognition evaluations, including the Speakers in the Wild Core-Core, SRE16, SRE18 CMN2, SRE19 CMN2, and VoxCeleb1 Tasks. Additionally, when implemented in TensorFlow using a modern GPU, D-PLDA optimization is highly efficient, requiring less than 20 minutes.
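
As a rough illustration of expressing a covariance matrix as a product of orthonormal and diagonal matrices, the sketch below builds a symmetric positive-definite matrix of the form Q diag(d) Q^T. The helper name `build_covariance`, the use of a QR factorization to obtain an orthonormal factor, and the exponential used to keep the diagonal positive are all assumptions made for this example; the summary does not say how the paper enforces the structure.

```python
import numpy as np

def build_covariance(raw_matrix, log_diag):
    """Return Q diag(exp(log_diag)) Q^T with Q orthonormal (hypothetical helper)."""
    # QR factorization of a full-rank square matrix gives an orthonormal factor Q.
    q, _ = np.linalg.qr(raw_matrix)
    # Exponentiating keeps every diagonal entry strictly positive (an assumption here).
    d = np.exp(log_diag)
    return q @ np.diag(d) @ q.T

rng = np.random.default_rng(0)
dim = 4
log_diag = rng.standard_normal(dim)
within_class = build_covariance(rng.standard_normal((dim, dim)), log_diag)

# The result is symmetric and its eigenvalues are exactly the diagonal entries.
print(np.allclose(within_class, within_class.T))
print(np.sort(np.linalg.eigvalsh(within_class)))
```

With this kind of factorization, the eigenvalues are exposed directly as the diagonal entries, so a penalty on them is one conceivable route to regularization; whether the paper does this is not stated in the summary.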

State-of-the-art speaker recognition for telephone and video speech: the JHU-MIT submission for NIST SRE18

Summary

We present a condensed description of the joint effort of JHU-CLSP, JHU-HLTCOE, MIT-LL, MIT CSAIL and LSE-EPITA for NIST SRE18. All the developed systems consisted of x-vector/i-vector embeddings with some flavor of PLDA backend. Very deep x-vector architectures (Extended and Factorized TDNN, and ResNets) clearly outperformed shallower x-vectors and i-vectors. The systems were tailored to the video (VAST) or to the telephone (CMN2) condition. The VAST data was challenging, yielding 4 times worse performance than other video-based datasets like Speakers in the Wild. We were able to calibrate the VAST data with very few development trials by using careful adaptation and score normalization methods. The VAST primary fusion yielded EER=10.18% and Cprimary=0.431. By improving calibration in post-eval, we reached Cprimary=0.369. In CMN2, we used unsupervised SPLDA adaptation based on agglomerative clustering and score normalization to correct the domain shift between English and Tunisian Arabic models. The CMN2 primary fusion yielded EER=4.5% and Cprimary=0.313. Extended TDNN x-vector was the best single system, obtaining EER=11.1% and Cprimary=0.452 in VAST, and EER=4.95% and Cprimary=0.354 in CMN2.
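
For readers unfamiliar with the EER figures quoted above, the equal error rate is the operating point at which the miss rate on target trials equals the false-alarm rate on non-target trials. The sketch below is a minimal threshold-sweep approximation; the function name `equal_error_rate` and the toy scores are made up for illustration, and this is not the official NIST scoring tool.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # Miss rate: fraction of target trials scoring below the threshold.
    miss = np.array([(target < t).mean() for t in thresholds])
    # False-alarm rate: fraction of non-target trials scoring at or above it.
    false_alarm = np.array([(nontarget >= t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest and average them.
    idx = np.argmin(np.abs(miss - false_alarm))
    return 0.5 * (miss[idx] + false_alarm[idx])

# Toy scores (made up): perfectly separated trials give an EER of zero.
print(equal_error_rate([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0]))
```

Production scoring tools typically interpolate along the detection error tradeoff curve rather than picking the nearest observed threshold, so values from this sketch can differ slightly on small trial lists.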

Discriminative PLDA for speaker verification with X-vectors

Published in:
International Conference on Acoustics, Speech, and Signal Processing, May 2019 [submitted]

Summary

This paper proposes a novel approach to discriminative training of probabilistic linear discriminant analysis (PLDA) for speaker verification with x-vectors. The Newton Method is used to discriminatively train the PLDA model by minimizing the log loss of verification trials. By diagonalizing the across-class and within-class covariance matrices as a pre-processing step, the PLDA model can be trained without relying on approximations, and while maintaining important properties of the underlying covariance matrices. The training procedure is extended to allow for efficient domain adaptation. When applied to the Speakers in the Wild and SRE16 tasks, the proposed approach provides significant performance improvements relative to conventional PLDA.

Supervector LDA - a new approach to reduced-complexity i-vector language recognition

Published in:
INTERSPEECH 2012: 13th Annual Conf. of the Int. Speech Communication Assoc., 9-13 September 2012.

Summary

In this paper, we extend our previous analysis of Gaussian Mixture Model (GMM) subspace compensation techniques using Gaussian modeling in the supervector space combined with additive channel and observation noise. We show that under the modeling assumptions of a total-variability i-vector system, full Gaussian supervector scoring can also be performed cheaply in the total subspace, and that i-vector scoring can be viewed as an approximation to this. Next, we show that covariance matrix estimation in the i-vector space can be used to generate PCA estimates of supervector covariance matrices needed for Joint Factor Analysis (JFA). Finally, we derive a new technique for reduced-dimension i-vector extraction which we call Supervector LDA (SV-LDA), and demonstrate a 100-dimensional i-vector language recognition system with equivalent performance to a 600-dimensional version at much lower complexity.

Exploring the impact of advanced front-end processing on NIST speaker recognition microphone tasks

Summary

The NIST speaker recognition evaluation (SRE) featured microphone data in the 2005-2010 evaluations. The preprocessing and use of this data has typically been performed with telephone bandwidth and quantization. Although this approach is viable, it ignores the richer properties of the microphone data: multiple channels, high-rate sampling, linear encoding, ambient noise properties, etc. In this paper, we explore alternate choices of preprocessing and examine their effects on speaker recognition performance. Specifically, we consider the effects of quantization, sampling rate, enhancement, and two-channel speech activity detection. Experiments on the NIST 2010 SRE interview microphone corpus demonstrate that performance can be dramatically improved with a different preprocessing chain.

Linear prediction modulation filtering for speaker recognition of reverberant speech

Published in:
Odyssey 2012, The Speaker and Language Recognition Workshop, 25-28 June 2012.

Summary

This paper proposes a framework for spectral enhancement of reverberant speech based on inversion of the modulation transfer function. All-pole modeling of modulation spectra of clean and degraded speech is utilized to derive the linear prediction inverse modulation transfer function (LP-IMTF) solution as a low-order IIR filter in the modulation envelope domain. By considering spectral estimation under speech presence uncertainty, speech presence probabilities are derived for the case of reverberation. Aside from enhancement, the LP-IMTF framework allows for blind estimation of reverberation time by extracting a minimum phase approximation of the short-time spectral channel impulse response. The proposed speech enhancement method is used as a front-end processing step for speaker recognition. When applied to the microphone condition of the NIST SRE 2010 with artificially added reverberation, the proposed spectral enhancement method yields significant improvements across a variety of performance metrics.