Publications

Refine Results

(Filters Applied) Clear All

R&D Areas

R&D Groups

Year

Items per page

Tagged As

speaker recognition Clear filter

Advances in cross-lingual and cross-source audio-visual speaker recognition: The JHU-MIT system for NIST SRE21

June 28, 2022

Conference Paper

Author:

Jesus Villalba

…

Published in:

Speaker and Language Recognition Workshop, Odyssey 2022, pp. 213-220.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

We present a condensed description of the joint effort of JHUCLSP/HLTCOE, MIT-LL and AGH for NIST SRE21. NIST SRE21 consisted of speaker detection over multilingual conversational telephone speech (CTS) and audio from video (AfV). Besides the regular audio track, the evaluation also contains visual (face recognition) and multi-modal tracks. This evaluation exposes new challenges, including cross-source–i.e., CTS vs. AfV– and cross-language trials. Each speaker can speak two or three languages among English, Mandarin and Cantonese. For the audio track, we evaluated embeddings based on Res2Net and ECAPA-TDNN, where the former performed the best. We used PLDA based back-ends trained on previous SRE and VoxCeleb and adapted to a subset of Mandarin/Cantonese speakers. Some novel contributions of this submission are: the use of neural bandwidth extension (BWE) to reduce the mismatch between the AFV and CTS conditions; and invariant representation learning (IRL) to make the embeddings from a given speaker invariant to language. Res2Net with neural BWE was the best monolithic system. We used a pre-trained RetinaFace face detector and ArcFace embeddings for the visual track, following our NIST SRE19 work. We also included a new system using a deep pyramid single shot face detector and face embeddings trained on Crystal loss and probabilistic triplet loss, which performed the best. The number of face embeddings in the test video was reduced by agglomerative clustering or weighting the embedding based on the face detection confidence. Cosine scoring was used to compare embeddings. For the multi-modal track, we just added the calibrated likelihood ratios of the audio and visual conditions, assuming independence between modalities. The multi-modal fusion improved Cprimary by 72% w.r.t. audio.

READ LESS

Summary

Advances in cross-lingual and cross-source audio-visual speaker recognition: The JHU-MIT system for NIST SRE21

Unsupervised Bayesian adaptation of PLDA for speaker verification

August 30, 2021

Conference Paper

Author:

Bengt J. Borgstrom

Published in:

Interspeech, 30 August - 3 September 2021.

Topic:

speech

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

This paper presents a Bayesian framework for unsupervised domain adaptation of Probabilistic Linear Discriminant Analysis (PLDA). By interpreting class labels as latent random variables, Variational Bayes (VB) is used to derive a maximum a posterior (MAP) solution of the adapted PLDA model when labels are missing, referred to as VB-MAP. The VB solution iteratively infers class labels and updates PLDA hyperparameters, offering a systematic framework for dealing with unlabeled data. While presented as a general solution, this paper includes experimental results for domain adaptation in speaker verification. VBMAP estimation is applied to the 2016 and 2018 NIST Speaker Recognition Evaluations (SREs), both of which included small and unlabeled in-domain data sets, and is shown to provide performance improvements over a variety of state-of-the-art domain adaptation methods. Additionally, VB-MAP estimation is used to train a fully unsupervised PLDA model, suffering only minor performance degradation relative to conventional supervised training, offering promise for training PLDA models when no relevant labeled data exists.

READ LESS

Summary

Unsupervised Bayesian adaptation of PLDA for speaker verification

The 2019 NIST Audio-Visual Speaker Recognition Evaluation

November 1, 2020

Conference Paper

Author:

Seyed Omid Sadjadi

…

Published in:

The Speaker and Language Recognition Workshop: Odyssey 2020, 1-5 November 2020.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

In 2019, the U.S. National Institute of Standards and Technology (NIST) conducted the most recent in an ongoing series of speaker recognition evaluations (SRE). There were two components to SRE19: 1) a leaderboard style Challenge using unexposed conversational telephone speech (CTS) data from the Call My Net 2 (CMN2) corpus, and 2) an Audio-Visual (AV) evaluation using video material extracted from the unexposed portions of the Video Annotation for Speech Technologies (VAST) corpus. This paper presents an overview of the Audio-Visual SRE19 activity including the task, the performance metric, data, and the evaluation protocol, results and system performance analyses. The Audio-Visual SRE19 was organized in a similar manner to the audio from video (AfV) track in SRE18, except it offered only the open training condition. In addition, instead of extracting and releasing only the AfV data, unexposed multimedia data from the VAST corpus was used to support the Audio-Visual SRE19. It featured two core evaluation tracks, namely audio only and audio-visual, as well as an optional visual only track. A total of 26 organizations (forming 14 teams) from academia and industry participated in the Audio-Visual SRE19 and submitted 102 valid system outputs. Evaluation results indicate: 1) notable performance improvements for the audio only speaker recognition task on the challenging amateur online video domain due to the use of more complex neural network architectures (e.g., ResNet) along with soft margin losses, 2) state-of-the-art speaker and face recognition technologies provide comparable person recognition performance on the amateur online video domain, and 3) audio-visual fusion results in remarkable performance gains (greater than 85% relative) over the audio only or visual only systems.

READ LESS

Summary

The 2019 NIST Audio-Visual Speaker Recognition Evaluation

The 2019 NIST Speaker Recognition Evaluation CTS Challenge

November 1, 2020

Conference Paper

Author:

Seyed Omid Sadjadi

…

Published in:

The Speaker and Language Recognition Workshop: Odyssey 2020, 1-5 November 2020.

Topic:

human language technology

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

In 2019, the U.S. National Institute of Standards and Technology (NIST) conducted a leaderboard style speaker recognition challenge using conversational telephone speech (CTS) data extracted from the unexposed portion of the Call My Net 2 (CMN2) corpus previously used in the 2018 Speaker Recognition Evaluation (SRE). The SRE19 CTS Challenge was organized in a similar manner to SRE18, except it offered only the open training condition. In addition, similar to the NIST i-vector challenge, the evaluation set consisted of two subsets: a progress subset, and a test subset. The progress subset comprised 30% of the trials and was used to monitor progress on the leaderboad, while the remaining 70% of the trials formed the test subset, which was used to generate the official final results determined at the end of the challenge. Which subset (i.e., progress or test) a trial belonged to was unknown to challenge participants, and each system submission had to contain outputs for all of trials. The CTS Challenge also served as a prerequisite for entrance to the main SRE19 whose primary task was audio-visual person recognition. A total of 67 organizations (forming 51 teams) from academia and industry participated in the CTS Challenge and submitted 1347 valid system outputs. This paper presents an overview of the evaluation and several analyses of system performance for all primary conditions in the CTS Challenge. Compared to the CTS track of the SRE18, the SRE19 CTS Challenge results indicate remarkable improvements in performance which are mainly attributed to 1) the availability of large amounts of in-domain development data from a large number of labeled speakers, 2) speaker representations (aka embeddings) extracted using extended and more complex end-to-end neural network frameworks, and 3) effective use of the provided large development set.

READ LESS

Summary

The 2019 NIST Speaker Recognition Evaluation CTS Challenge

Bayesian estimation of PLDA with noisy training labels, with applications to speaker verification

May 4, 2020

Conference Paper

Author:

Bengt J. Borgstrom

…

Pedro A. Torres-Carrasquillo

Published in:

2020 IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 4-8 May 2020.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

This paper proposes a method for Bayesian estimation of probabilistic linear discriminant analysis (PLDA) when training labels are noisy. Label errors can be expected during e.g. large or distributed data collections, or for crowd-sourced data labeling. By interpreting true labels as latent random variables, the observed labels are modeled as outputs of a discrete memoryless channel, and the maximum a posteriori (MAP) estimate of the PLDA model is derived via Variational Bayes. The proposed framework can be used for PLDA estimation, PLDA domain adaptation, or to infer the reliability of a PLDA training list. Although presented as a general method, the paper discusses specific applications for speaker verification. When applied to the Speakers in the Wild (SITW) Task, the proposed method achieves graceful performance degradation when label errors are introduced into the training or domain adaptation lists. When applied to the NIST 2018 Speaker Recognition Evaluation (SRE18) Task, which includes adaptation data with noisy speaker labels, the proposed technique provides performance improvements relative to unsupervised domain adaptation.

READ LESS

Summary

Bayesian estimation of PLDA with noisy training labels, with applications to speaker verification

Discriminative PLDA for speaker verification with X-vectors

April 9, 2020

Journal Article

Author:

Bengt J. Borgstrom

Published in:

IEEE Signal Processing Letters [submitted]

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

This paper proposes a novel approach to discrimina-tive training of probabilistic linear discriminant analysis (PLDA) for speaker verification with x-vectors. Model over-fitting is a well-known issue with discriminative PLDA (D-PLDA) forspeaker verification. As opposed to prior approaches which address this by limiting the number of trainable parameters, the proposed method parameterizes the discriminative PLDA (D-PLDA) model in a manner which allows for intuitive regularization, permitting the entire model to be optimized. Specifically, the within-class and across-class covariance matrices which comprise the PLDA model are expressed as products of orthonormal and diagonal matrices, and the structure of these matrices is enforced during model training. The proposed approach provides consistent performance improvements relative to previous D-PLDA methods when applied to a variety of speaker recognition evaluations, including the Speakers in the Wild Core-Core, SRE16, SRE18 CMN2, SRE19 CMN2, and VoxCeleb1 Tasks. Additionally, when implemented in Tensorflow using a modernGPU, D-PLDA optimization is highly efficient, requiring less than 20 minutes.

READ LESS

Summary

Discriminative PLDA for speaker verification with X-vectors

State-of-the-art speaker recognition for telephone and video speech: the JHU-MIT submission for NIST SRE18

September 15, 2019

Conference Paper

Author:

Jesus Villalba

…

Published in:

Interspeech, 15-19 September 2019.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

We present a condensed description of the joint effort of JHUCLSP, JHU-HLTCOE, MIT-LL., MIT CSAIL and LSE-EPITA for NIST SRE18. All the developed systems consisted of xvector/i-vector embeddings with some flavor of PLDA backend. Very deep x-vector architectures–Extended and Factorized TDNN, and ResNets– clearly outperformed shallower xvectors and i-vectors. The systems were tailored to the video (VAST) or to the telephone (CMN2) condition. The VAST data was challenging, yielding 4 times worse performance than other video based datasets like Speakers in the Wild. We were able to calibrate the VAST data with very few development trials by using careful adaptation and score normalization methods. The VAST primary fusion yielded EER=10.18% and Cprimary= 0.431. By improving calibration in post-eval, we reached Cprimary=0.369. In CMN2, we used unsupervised SPLDA adaptation based on agglomerative clustering and score normalization to correct the domain shift between English and Tunisian Arabic models. The CMN2 primary fusion yielded EER=4.5% and Cprimary=0.313. Extended TDNN x-vector was the best single system obtaining EER=11.1% and Cprimary=0.452 in VAST; and 4.95% and 0.354 in CMN2.

READ LESS

Summary

State-of-the-art speaker recognition for telephone and video speech: the JHU-MIT submission for NIST SRE18

Discriminative PLDA for speaker verification with X-vectors

May 12, 2019

Conference Paper

Author:

Bengt J. Borgstrom

Published in:

International Conference on Acoustics, Speech, and Signal Processing, May 2019 [submitted]

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

This paper proposes a novel approach to discriminative training ofprobabilistic linear discriminant analysis (PLDA) for speaker veri-fication with x-vectors. The Newton Method is used to discrimi-natively train the PLDA model by minimizing the log loss of ver-ification trials. By diagonalizing the across-class and within-classcovariance matrices as a pre-processing step, the PLDA model canbe trained without relying on approximations, and while maintain-ing important properties of the underlying covariance matrices. Thetraining procedure is extended to allow for efficient domain adapta-tion. When applied to the Speakers in the Wild and SRE16 tasks, theproposed approach provides significant performance improvementsrelative to conventional PLDA.

READ LESS

Summary

Discriminative PLDA for speaker verification with X-vectors

Speaker recognition using real vs synthetic parallel data for DNN channel compensation

September 8, 2016

Conference Paper

Author:

Frederick S. Richardson

…

Published in:

INTERSPEECH 2016: 16th Annual Conf. of the Int. Speech Communication Assoc., 8-12 September 2016.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

Recent work has shown large performance gains using denoising DNNs for speech processing tasks under challenging acoustic conditions. However, training these DNNs requires large amounts of parallel multichannel speech data which can be impractical or expensive to collect. The effective use of synthetic parallel data as an alternative has been demonstrated for several speech technologies including automatic speech recognition and speaker recognition (SR). This paper demonstrates that denoising DNNs trained with real Mixer 2 multichannel data perform only slightly better than DNNs trained with synthetic multichannel data for microphone SR on Mixer 6. Large reductions in pooled error rates of 50% EER and 30% min DCF are achieved using DNNs trained on real Mixer 2 data. Nearly the same performance gains are achieved using synthetic data generated with a limited number of room impulse responses (RIRs) and noise sources derived from Mixer 2. Using RIRs from three publicly available sources used in the Kaldi ASpIRE recipe yields somewhat lower pooled gains of 34% EER and 25% min DCF. These results confirm the effective use of synthetic parallel data for DNN channel compensation even when the RIRs used for synthesizing the data are not particularly well matched to the task.

READ LESS

Summary

Speaker recognition using real vs synthetic parallel data for DNN channel compensation

Speaker linking and applications using non-parametric hashing methods

September 8, 2016

Conference Paper

Author:

Douglas E. Sturim

…

William M. Campbell

Published in:

INTERSPEECH 2016: 16th Annual Conf. of the Int. Speech Communication Assoc., 8-12 September 2016.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

Large unstructured audio data sets have become ubiquitous and present a challenge for organization and search. One logical approach for structuring data is to find common speakers and link occurrences across different recordings. Prior approaches to this problem have focused on basic methodology for the linking task. In this paper, we introduce a novel trainable nonparametric hashing method for indexing large speaker recording data sets. This approach leads to tunable computational complexity methods for speaker linking. We focus on a scalable clustering method based on hashing canopy-clustering. We apply this method to a large corpus of speaker recordings, demonstrate performance tradeoffs, and compare to other hashing methods.

READ LESS

Summary

Speaker linking and applications using non-parametric hashing methods

Publications

Refine Results

Tagged As

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Showing Results