Publications

Bayesian estimation of PLDA with noisy training labels, with applications to speaker verification

Published in:
2020 IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 4-8 May 2020.

Summary

This paper proposes a method for Bayesian estimation of probabilistic linear discriminant analysis (PLDA) when training labels are noisy. Label errors can be expected in, for example, large or distributed data collections, or in crowd-sourced data labeling. By interpreting true labels as latent random variables, the observed labels are modeled as outputs of a discrete memoryless channel, and the maximum a posteriori (MAP) estimate of the PLDA model is derived via Variational Bayes. The proposed framework can be used for PLDA estimation, PLDA domain adaptation, or to infer the reliability of a PLDA training list. Although presented as a general method, the paper discusses specific applications to speaker verification. When applied to the Speakers in the Wild (SITW) Task, the proposed method degrades gracefully as label errors are introduced into the training or domain adaptation lists. When applied to the NIST 2018 Speaker Recognition Evaluation (SRE18) Task, which includes adaptation data with noisy speaker labels, the proposed technique provides performance improvements relative to unsupervised domain adaptation.
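The label-noise model in the abstract can be made concrete with a small sketch. Assuming a symmetric discrete memoryless channel in which an observed label is correct with probability 1 - eps and flips uniformly to any of the other K - 1 speakers otherwise, the posterior over the latent true label follows from Bayes' rule. The function name and the symmetric channel parameterization are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def label_posterior(observed, prior, eps, K):
    # Symmetric discrete memoryless channel: the observed label is
    # correct with probability 1 - eps, and flips uniformly otherwise.
    lik = np.full(K, eps / (K - 1))
    lik[observed] = 1.0 - eps
    post = lik * prior          # Bayes' rule: likelihood times prior
    return post / post.sum()    # normalize to a distribution
```

With a uniform prior, eps = 0.1, and K = 5 speakers, the posterior mass on the observed label is 0.9; a Variational Bayes treatment marginalizes over such posteriors rather than trusting the labels outright.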

Discriminative PLDA for speaker verification with X-vectors

Published in:
IEEE Signal Processing Letters [submitted]

Summary

This paper proposes a novel approach to discriminative training of probabilistic linear discriminant analysis (PLDA) for speaker verification with x-vectors. Model overfitting is a well-known issue with discriminative PLDA (D-PLDA) for speaker verification. As opposed to prior approaches, which address this by limiting the number of trainable parameters, the proposed method parameterizes the D-PLDA model in a manner that allows for intuitive regularization, permitting the entire model to be optimized. Specifically, the within-class and across-class covariance matrices that comprise the PLDA model are expressed as products of orthonormal and diagonal matrices, and the structure of these matrices is enforced during model training. The proposed approach provides consistent performance improvements relative to previous D-PLDA methods when applied to a variety of speaker recognition evaluations, including the Speakers in the Wild Core-Core, SRE16, SRE18 CMN2, SRE19 CMN2, and VoxCeleb1 Tasks. Additionally, when implemented in TensorFlow using a modern GPU, D-PLDA optimization is highly efficient, requiring less than 20 minutes.
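The covariance parameterization described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's TensorFlow implementation: an unconstrained square matrix is mapped to an orthonormal factor via QR decomposition, and a softplus keeps the diagonal factor positive, so the resulting covariance is symmetric positive definite by construction:

```python
import numpy as np

def covariance_from_params(A, raw_d):
    # Orthonormal factor from an unconstrained matrix via QR decomposition.
    Q, _ = np.linalg.qr(A)
    # Softplus keeps the diagonal entries strictly positive.
    d = np.log1p(np.exp(raw_d))
    # Symmetric positive definite by construction: Q diag(d) Q^T.
    return Q @ np.diag(d) @ Q.T
```

Because the structure is enforced by the parameterization itself, gradient steps on the unconstrained parameters can never leave the space of valid covariance matrices, which is one plausible reading of how the entire model becomes safe to optimize.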

State-of-the-art speaker recognition for telephone and video speech: the JHU-MIT submission for NIST SRE18

Summary

We present a condensed description of the joint effort of JHU-CLSP, JHU-HLTCOE, MIT-LL, MIT CSAIL, and LSE-EPITA for NIST SRE18. All the developed systems consisted of x-vector/i-vector embeddings with some flavor of PLDA backend. Very deep x-vector architectures (Extended and Factorized TDNN, and ResNets) clearly outperformed shallower x-vectors and i-vectors. The systems were tailored to the video (VAST) or to the telephone (CMN2) condition. The VAST data was challenging, yielding four times worse performance than other video-based datasets like Speakers in the Wild. We were able to calibrate the VAST data with very few development trials by using careful adaptation and score normalization methods. The VAST primary fusion yielded EER = 10.18% and Cprimary = 0.431. By improving calibration in post-eval, we reached Cprimary = 0.369. In CMN2, we used unsupervised SPLDA adaptation based on agglomerative clustering and score normalization to correct the domain shift between English and Tunisian Arabic models. The CMN2 primary fusion yielded EER = 4.5% and Cprimary = 0.313. Extended TDNN x-vector was the best single system, obtaining EER = 11.1% and Cprimary = 0.452 in VAST, and 4.95% and 0.354 in CMN2.

Discriminative PLDA for speaker verification with X-vectors

Published in:
International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2019 [submitted]

Summary

This paper proposes a novel approach to discriminative training of probabilistic linear discriminant analysis (PLDA) for speaker verification with x-vectors. The Newton Method is used to discriminatively train the PLDA model by minimizing the log loss of verification trials. By diagonalizing the across-class and within-class covariance matrices as a pre-processing step, the PLDA model can be trained without relying on approximations, and while maintaining important properties of the underlying covariance matrices. The training procedure is extended to allow for efficient domain adaptation. When applied to the Speakers in the Wild and SRE16 tasks, the proposed approach provides significant performance improvements relative to conventional PLDA.
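The core idea, Newton's method driving down the log loss of verification trials, can be shown in miniature. The sketch below fits only a two-parameter score calibration (slope and offset) rather than the full PLDA model, so the function and variable names are illustrative, not the paper's:

```python
import numpy as np

def newton_logloss(scores, labels, iters=20):
    # Fit sigmoid(a * s + b) to target/non-target trial labels
    # by Newton's method on the logistic (log) loss.
    X = np.stack([scores, np.ones_like(scores)], axis=1)
    w = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))          # trial posteriors
        grad = X.T @ (p - labels)                 # gradient of the log loss
        H = X.T @ (X * (p * (1 - p))[:, None])    # Hessian of the log loss
        w -= np.linalg.solve(H + 1e-6 * np.eye(2), grad)
    return w
```

The abstract suggests the same gradient/Hessian machinery is applied to the diagonalized PLDA parameters rather than to a scalar slope and offset.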

Corpora for the evaluation of robust speaker recognition systems

Published in:
INTERSPEECH 2016: 17th Annual Conf. of the Int. Speech Communication Assoc., 8-12 September 2016.

Summary

The goal of this paper is to describe significant corpora available to support speaker recognition research and evaluation, along with details about their collection and design. We describe the attributes of high-quality speaker recognition corpora. Considerations of the application, domain, and performance metrics are also discussed. Additionally, a literature survey of corpora used in speaker recognition research over the last 10 years is presented. Finally, we identify the corpora most commonly used in the research community and review their success in enabling meaningful speaker recognition research.

Relating estimated cyclic spectral peak frequency to measured epilarynx length using magnetic resonance imaging

Published in:
INTERSPEECH 2016: 17th Annual Conf. of the Int. Speech Communication Assoc., 8-12 September 2016.

Summary

The epilarynx plays an important role in speech production, carrying information about the individual speaker and manner of articulation. However, precise acoustic behavior of this lower vocal tract structure is difficult to establish. Focusing on acoustics observable in natural speech, recent spectral processing techniques isolate a unique resonance with characteristics of the epilarynx previously shown via simulation, specifically cyclicity (i.e., energy differences between the closed and open phases of the glottal cycle) in a 3-5 kHz region observed across vowels. Using Magnetic Resonance Imaging (MRI), the present work relates this estimated cyclic peak frequency to measured epilarynx length. Assuming a simple quarter-wavelength relationship, the cavity length estimated from the cyclic peak frequency is shown to be directly proportional (linear fit slope = 1.1) and highly correlated (ρ = 0.85, p < 10^-4) to the measured epilarynx length across speakers. Results are discussed, as are implications in speech science and application domains.
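The quarter-wavelength relationship used above is easy to make concrete. A closed-tube resonator of length L has its first resonance at c / (4 L), so a cyclic peak frequency implies a cavity length of c / (4 f). The speed-of-sound constant below is an assumed typical value for warm, humid air in the vocal tract, not a figure from the paper:

```python
SPEED_OF_SOUND = 350.0  # m/s; assumed value for warm, humid vocal-tract air

def cavity_length_cm(peak_hz):
    # Quarter-wavelength resonator: f = c / (4 L)  =>  L = c / (4 f).
    return SPEED_OF_SOUND / (4.0 * peak_hz) * 100.0  # metres -> centimetres
```

A peak at 3.5 kHz, in the middle of the 3-5 kHz region, maps to a cavity of about 2.5 cm, i.e. the centimetre scale at which epilarynx lengths are measured via MRI.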

Speaker linking and applications using non-parametric hashing methods

Published in:
INTERSPEECH 2016: 17th Annual Conf. of the Int. Speech Communication Assoc., 8-12 September 2016.

Summary

Large unstructured audio data sets have become ubiquitous and present a challenge for organization and search. One logical approach for structuring data is to find common speakers and link occurrences across different recordings. Prior approaches to this problem have focused on basic methodology for the linking task. In this paper, we introduce a novel trainable non-parametric hashing method for indexing large speaker recording data sets. This approach leads to tunable computational complexity methods for speaker linking. We focus on a scalable clustering method based on canopy clustering with hashing. We apply this method to a large corpus of speaker recordings, demonstrate performance tradeoffs, and compare to other hashing methods.
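One simple instance of hashing for canopy formation is random-hyperplane locality-sensitive hashing: recordings whose embeddings share a sign pattern under a set of random projections fall into the same candidate canopy, and expensive pairwise comparison is done only within canopies. The sketch below is a generic illustration, not the paper's trainable method:

```python
import numpy as np

def hash_canopies(vectors, n_bits=8, seed=0):
    # Random-hyperplane LSH: each bit is the sign of one random projection.
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[1], n_bits))
    codes = (vectors @ planes) > 0
    canopies = {}
    for i, code in enumerate(codes):
        # Vectors with identical sign patterns share a canopy.
        canopies.setdefault(code.tobytes(), []).append(i)
    return canopies
```

Tuning n_bits trades canopy size, and hence exact-comparison cost, against the risk of splitting one speaker across canopies; this is the kind of tunable computational-complexity knob the abstract refers to.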

Speaker recognition using real vs synthetic parallel data for DNN channel compensation

Published in:
INTERSPEECH 2016: 17th Annual Conf. of the Int. Speech Communication Assoc., 8-12 September 2016.

Summary

Recent work has shown large performance gains using denoising DNNs for speech processing tasks under challenging acoustic conditions. However, training these DNNs requires large amounts of parallel multichannel speech data which can be impractical or expensive to collect. The effective use of synthetic parallel data as an alternative has been demonstrated for several speech technologies including automatic speech recognition and speaker recognition (SR). This paper demonstrates that denoising DNNs trained with real Mixer 2 multichannel data perform only slightly better than DNNs trained with synthetic multichannel data for microphone SR on Mixer 6. Large reductions in pooled error rates of 50% EER and 30% min DCF are achieved using DNNs trained on real Mixer 2 data. Nearly the same performance gains are achieved using synthetic data generated with a limited number of room impulse responses (RIRs) and noise sources derived from Mixer 2. Using RIRs from three publicly available sources used in the Kaldi ASpIRE recipe yields somewhat lower pooled gains of 34% EER and 25% min DCF. These results confirm the effective use of synthetic parallel data for DNN channel compensation even when the RIRs used for synthesizing the data are not particularly well matched to the task.
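Synthetic parallel data of the kind described here is typically produced by convolving clean (telephone) speech with a room impulse response (RIR) and adding noise at a chosen SNR. The helper below is a generic sketch of that recipe under those assumptions, not the authors' pipeline:

```python
import numpy as np

def synthesize_channel(clean, rir, noise, snr_db):
    # Simulate a distant microphone: reverberate with the RIR,
    # then add noise scaled to the requested signal-to-noise ratio.
    reverbed = np.convolve(clean, rir)[: len(clean)]
    noise = noise[: len(reverbed)]
    sig_pow = np.mean(reverbed ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(sig_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    return reverbed + scale * noise
```

Each (corrupted, clean) pair then serves as input and target for denoising-DNN training, which is why the choice of RIRs and noise sources matters to the results above.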

Channel compensation for speaker recognition using MAP adapted PLDA and denoising DNNs

Published in:
Odyssey 2016, The Speaker and Language Recognition Workshop, 21-24 June 2016.

Summary

Over several decades, speaker recognition performance has steadily improved for applications using telephone speech. A big part of this improvement has been the availability of large quantities of speaker-labeled data from telephone recordings. For new applications, such as audio from room microphones, we would like to effectively use existing telephone data to build systems with high accuracy while maintaining good performance on existing telephone tasks. In this paper we compare and combine approaches to compensating model parameters and features for this purpose. For model adaptation we explore MAP adaptation of hyper-parameters, and for feature compensation we examine the use of denoising DNNs. On a multi-room, multi-microphone speaker recognition experiment we show a reduction of 61% in EER with a combination of these approaches, while slightly improving performance on telephone data.
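MAP adaptation of hyper-parameters in this setting often reduces to a count-weighted interpolation between out-of-domain and in-domain statistics. The sketch below shows one such relevance-MAP update for a covariance hyper-parameter; the function name and relevance factor are illustrative assumptions, not values from the paper:

```python
import numpy as np

def map_adapt_cov(cov_out, cov_in, n_in, relevance=16.0):
    # Relevance MAP: weight the in-domain estimate by how much
    # data backs it; with little data, the prior dominates.
    alpha = n_in / (n_in + relevance)
    return alpha * cov_in + (1.0 - alpha) * cov_out
```

With little in-domain microphone data the prior (telephone-trained) covariance dominates, which is one way a system can shift toward the new channel while retaining good telephone performance.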

Domain mismatch compensation for speaker recognition using a library of whiteners

Published in:
IEEE Signal Process. Lett., Vol. 22, No. 11, November 2015, pp. 2000-2003.

Summary

The development of the i-vector framework for generating low dimensional representations of speech utterances has led to considerable improvements in speaker recognition performance. Although these gains have been achieved in periodic National Institute of Standards and Technology (NIST) evaluations, the problem of domain mismatch, where the system development data and the application data are collected from different sources, remains a challenging one. The impact of domain mismatch was a focus of the Johns Hopkins University (JHU) 2013 speaker recognition workshop, where a domain adaptation challenge (DAC13) corpus was created to address this problem. This paper proposes an approach to domain mismatch compensation for applications where in-domain development data is assumed to be unavailable. The method is based on a generalization of data whitening used in association with i-vector length normalization and utilizes a library of whitening transforms trained at system development time using strictly out-of-domain data. The approach is evaluated on the 2013 domain adaptation challenge task and is shown to compare favorably to in-domain conventional whitening and to nuisance attribute projection (NAP) inter-dataset variability compensation.
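Data whitening used in association with i-vector length normalization can be sketched as follows: center and whiten with statistics from a development set, then project each i-vector onto the unit sphere. The function is a generic illustration; the paper's contribution, selecting among a library of such whiteners trained on out-of-domain data, is not shown here:

```python
import numpy as np

def whiten_and_normalize(ivectors, dev_data):
    # Whitening transform from development-set statistics.
    mu = dev_data.mean(axis=0)
    cov = np.cov(dev_data, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T   # inverse matrix square root
    x = (ivectors - mu) @ W
    # Length normalization: project onto the unit sphere.
    return x / np.linalg.norm(x, axis=1, keepdims=True)
```

A library of whiteners would train several such (mu, W) pairs on different out-of-domain subsets and choose among them at application time, avoiding any dependence on in-domain development data.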