Publications

Refine Results

(Filters Applied) Clear All

A generative approach to condition-aware score calibration for speaker verification

Published in:
IEEE/ACM Trans. Audio, Speech, Language Process., Vol. 31, 2023, pp. 891-901.

Summary

In speaker verification, score calibration is employed to transform verification scores to log-likelihood ratios (LLRs) which are statistically interpretable. Conventional calibration techniques apply a global score transform. However, in condition-aware (CA) calibration, information conveying signal conditions is provided as input, allowing calibration to be adaptive. This paper explores a generative approach to condition-aware score calibration. It proposes a novel generative model for speaker verification trials, each which includes a trial score, a trial label, and the associated pair of speaker embeddings. Trials are assumed to be drawn from a discrete set of underlying signal conditions which are modeled as latent Categorical random variables, so that trial scores and speaker embeddings are drawn from condition-dependent distributions. An Expectation-Maximization (EM) Algorithm for parameter estimation of the proposed model is presented, which does not require condition labels and instead discovers relevant conditions in an unsupervised manner. The generative condition-aware (GCA) calibration transform is then derived as the log-likelihood ratio of a verification score given the observed pair of embeddings. Experimental results show the proposed approach to provide performance improvements on a variety of speaker verification tasks, outperforming static and condition-aware baseline calibration methods. GCA calibration is observed to improve the discriminative ability of the speaker verification system, as well as provide good calibration performance across a range of operating points. The benefits of the proposed method are observed for task-dependent models where signal conditions are known, for universal models which are robust across a range of conditions, and when facing unseen signal conditions.
READ LESS

Summary

In speaker verification, score calibration is employed to transform verification scores to log-likelihood ratios (LLRs) which are statistically interpretable. Conventional calibration techniques apply a global score transform. However, in condition-aware (CA) calibration, information conveying signal conditions is provided as input, allowing calibration to be adaptive. This paper explores a generative...

READ MORE

The JHU-MIT System Description for NIST SRE19 AV

Summary

This document represents the SRE19 AV submission by the team composed of JHU-CLSP, JHU-HLTCOE and MIT Lincoln Labs. All the developed systems for the audio and videoconditions consisted of Neural network embeddings with some flavor of PLDA/cosine back-end. Primary fusions obtained Actual DCF of 0.250 on SRE18 VAST eval, 0.183 on SRE19 AV dev audio, 0.140 on SRE19 AV dev video and 0.054 on SRE19AV multi-modal.
READ LESS

Summary

This document represents the SRE19 AV submission by the team composed of JHU-CLSP, JHU-HLTCOE and MIT Lincoln Labs. All the developed systems for the audio and videoconditions consisted of Neural network embeddings with some flavor of PLDA/cosine back-end. Primary fusions obtained Actual DCF of 0.250 on SRE18 VAST eval, 0.183...

READ MORE

Approaches for language identification in mismatched environments

Summary

In this paper, we consider the task of language identification in the context of mismatch conditions. Specifically, we address the issue of using unlabeled data in the domain of interest to improve the performance of a state-of-the-art system. The evaluation is performed on a 9-language set that includes data in both conversational telephone speech and narrowband broadcast speech. Multiple experiments are conducted to assess the performance of the system in this condition and a number of alternatives to ameliorate the drop in performance. The best system evaluated is based on deep neural network (DNN) bottleneck features using i-vectors utilizing a combination of all the approaches proposed in this work. The resulting system improved baseline DNN system performance by 30%.
READ LESS

Summary

In this paper, we consider the task of language identification in the context of mismatch conditions. Specifically, we address the issue of using unlabeled data in the domain of interest to improve the performance of a state-of-the-art system. The evaluation is performed on a 9-language set that includes data in...

READ MORE

Multi-modal audio, video and physiological sensor learning for continuous emotion prediction

Summary

The automatic determination of emotional state from multimedia content is an inherently challenging problem with a broad range of applications including biomedical diagnostics, multimedia retrieval, and human computer interfaces. The Audio Video Emotion Challenge (AVEC) 2016 provides a well-defined framework for developing and rigorously evaluating innovative approaches for estimating the arousal and valence states of emotion as a function of time. It presents the opportunity for investigating multimodal solutions that include audio, video, and physiological sensor signals. This paper provides an overview of our AVEC Emotion Challenge system, which uses multi-feature learning and fusion across all available modalities. It includes a number of technical contributions, including the development of novel high- and low-level features for modeling emotion in the audio, video, and physiological channels. Low-level features include modeling arousal in audio with minimal prosodic-based descriptors. High-level features are derived from supervised and unsupervised machine learning approaches based on sparse coding and deep learning. Finally, a state space estimation approach is applied for score fusion that demonstrates the importance of exploiting the time-series nature of the arousal and valence states. The resulting system outperforms the baseline systems [10] on the test evaluation set with an achieved Concordant Correlation Coefficient (CCC) for arousal of 0.770 vs 0.702 (baseline) and for valence of 0.687 vs 0.638. Future work will focus on exploiting the time-varying nature of individual channels in the multi-modal framework.
READ LESS

Summary

The automatic determination of emotional state from multimedia content is an inherently challenging problem with a broad range of applications including biomedical diagnostics, multimedia retrieval, and human computer interfaces. The Audio Video Emotion Challenge (AVEC) 2016 provides a well-defined framework for developing and rigorously evaluating innovative approaches for estimating the...

READ MORE

Detecting depression using vocal, facial and semantic communication cues

Summary

Major depressive disorder (MDD) is known to result in neurophysiological and neurocognitive changes that affect control of motor, linguistic, and cognitive functions. MDD's impact on these processes is reflected in an individual's communication via coupled mechanisms: vocal articulation, facial gesturing and choice of content to convey in a dialogue. In particular, MDD-induced neurophysiological changes are associated with a decline in dynamics and coordination of speech and facial motor control, while neurocognitive changes influence dialogue semantics. In this paper, biomarkers are derived from all of these modalities, drawing first from previously developed neurophysiologically motivated speech and facial coordination and timing features. In addition, a novel indicator of lower vocal tract constriction in articulation is incorporated that relates to vocal projection. Semantic features are analyzed for subject/avatar dialogue content using a sparse coded lexical embedding space, and for contextual clues related to the subject's present or past depression status. The features and depression classification system were developed for the 6th International Audio/Video Emotion Challenge (AVEC), which provides data consisting of audio, video-based facial action units, and transcribed text of individuals communicating with the human-controlled avatar. A clinical Patient Health Questionnaire (PHQ) score and binary depression decision are provided for each participant. PHQ predictions were obtained by fusing outputs from a Gaussian staircase regressor for each feature set, with results on the development set of mean F1=0.81, RMSE=5.31, and MAE=3.34. These compare favorably to the challenge baseline development results of mean F1=0.73, RMSE=6.62, and MAE=5.52. On test set evaluation, our system obtained a mean F1=0.70, which is similar to the challenge baseline test result. Future work calls for consideration of joint feature analyses across modalities in an effort to detect neurological disorders based on the interplay of motor, linguistic, affective, and cognitive components of communication.
READ LESS

Summary

Major depressive disorder (MDD) is known to result in neurophysiological and neurocognitive changes that affect control of motor, linguistic, and cognitive functions. MDD's impact on these processes is reflected in an individual's communication via coupled mechanisms: vocal articulation, facial gesturing and choice of content to convey in a dialogue. In...

READ MORE

I-vector speaker and language recognition system on Android

Published in:
HPEC 2016: IEEE Conf. on High Performance Extreme Computing, 13-15 September 2016.

Summary

I-Vector based speaker and language identification provides state of the art performance. However, this comes as a more computationally complex solution, which can often lead to challenges in resource-limited devices, such as phones or tablets. We present the implementation of an I-Vector speaker and language recognition system on the Android platform in the form of a fully functional application that allows speaker enrollment and language/speaker scoring within mobile contexts. We include a detailed account of the challenges to port the system and its dependencies, which were necessary to optimize matrix operations in the I-Vector implementation. The system was benchmarked on a for a Google Nexus 6, showing a speed increase of 61.68% in scoring and 82.63% in enrollment operations with the implemented optimizations. The application was tested in mobile settings on a Nexus 7 tablet with forty participants, showing a rough accuracy of 84%. The optimized platform showed the capacity to perform near real-time recognition within a mobile setting and showcases the viability of I-Vector systems on resource-limited environments.
READ LESS

Summary

I-Vector based speaker and language identification provides state of the art performance. However, this comes as a more computationally complex solution, which can often lead to challenges in resource-limited devices, such as phones or tablets. We present the implementation of an I-Vector speaker and language recognition system on the Android...

READ MORE

The MITLL NIST LRE 2015 Language Recognition System

Summary

In this paper we describe the most recent MIT Lincoln Laboratory language recognition system developed for the NIST 2015 Language Recognition Evaluation (LRE). The submission features a fusion of five core classifiers, with most systems developed in the context of an i-vector framework. The 2015 evaluation presented new paradigms. First, the evaluation included fixed training and open training tracks for the first time; second, language classification performance was measured across 6 language clusters using 20 language classes instead of an N-way language task; and third, performance was measured across a nominal 3-30 second range. Results are presented for the overall performance across the six language clusters for both the fixed and open training tasks. On the 6-cluster metric the Lincoln system achieved overall costs of 0.173 and 0.168 for the fixed and open tasks respectively.
READ LESS

Summary

In this paper we describe the most recent MIT Lincoln Laboratory language recognition system developed for the NIST 2015 Language Recognition Evaluation (LRE). The submission features a fusion of five core classifiers, with most systems developed in the context of an i-vector framework. The 2015 evaluation presented new paradigms. First...

READ MORE

Estimating lower vocal tract features with closed-open phase spectral analyses

Published in:
INTERSPEECH 2015: 15th Annual Conf. of the Int. Speech Communication Assoc., 6-10 September 2015.

Summary

Previous studies have shown that, in addition to being speaker-dependent yet context-independent, lower vocal tract acoustics significantly impact the speech spectrum at mid-to-high frequencies (e.g 3-6kHz). The present work automatically estimates spectral features that exhibit acoustic properties of the lower vocal tract. Specifically aiming to capture the cyclicity property of the epilarynx tube, a novel multi-resolution approach to spectral analyses is presented that exploits significant differences between the closed and open phases of a glottal cycle. A prominent null linked to the piriform fossa is also estimated. Examples of the feature estimation on natural speech of the VOICES multi-speaker corpus illustrate that a salient spectral pattern indeed emerges between 3-6kHz across all speakers. Moreover, the observed pattern is consistent with that canonically shown for the lower vocal tract in previous works. Additionally, an instance of a speaker's formant (i.e. spectral peak around 3kHz that has been well-established as a characteristic of voice projection) is quantified here for the VOICES template speaker in relation to epilarynx acoustics. The corresponding peak is shown to be double the power on average compared to the other speakers (20 vs 10 dB).
READ LESS

Summary

Previous studies have shown that, in addition to being speaker-dependent yet context-independent, lower vocal tract acoustics significantly impact the speech spectrum at mid-to-high frequencies (e.g 3-6kHz). The present work automatically estimates spectral features that exhibit acoustic properties of the lower vocal tract. Specifically aiming to capture the cyclicity property of...

READ MORE

Language recognition via i-vectors and dimensionality reduction

Published in:
2011 INTERSPEECH, 27-31 August 2011, pp. 857-860.

Summary

In this paper, a new language identification system is presented based on the total variability approach previously developed in the field of speaker identification. Various techniques are employed to extract the most salient features in the lower dimensional i-vector space and the system developed results in excellent performance on the 2009 LRE evaluation set without the need for any post-processing or backend techniques. Additional performance gains are observed when the system is combined with other acoustic systems.
READ LESS

Summary

In this paper, a new language identification system is presented based on the total variability approach previously developed in the field of speaker identification. Various techniques are employed to extract the most salient features in the lower dimensional i-vector space and the system developed results in excellent performance on the...

READ MORE

The MITLL NIST LRE 2009 language recognition system

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, 15 March 2010, pp. 4994-4997.

Summary

This paper presents a description of the MIT Lincoln Laboratory language recognition system submitted to the NIST 2009 Language Recognition Evaluation (LRE). This system consists of a fusion of three core recognizers, two based on spectral similarity and one based on tokenization. The 2009 LRE differed from previous ones in that test data included narrowband segments from worldwide Voice of America broadcasts as well as conventional recorded conversational telephone speech. Results are presented for the 23-language closed-set and open-set detection tasks at the 30, 10, and 3 second durations along with a discussion of the language-pair task. On the 30 second 23-language closed set detection task, the system achieved a 1.64 average error rate.
READ LESS

Summary

This paper presents a description of the MIT Lincoln Laboratory language recognition system submitted to the NIST 2009 Language Recognition Evaluation (LRE). This system consists of a fusion of three core recognizers, two based on spectral similarity and one based on tokenization. The 2009 LRE differed from previous ones in...

READ MORE