Publications

Compensating for mismatch in high-level speaker recognition

Published in:
2006 IEEE Odyssey, the Speaker and Language Recognition Workshop, 28-30 June 2006.

Summary

Speaker recognition using high-level features has been a successful area of exploration. Features obtained from many different levels (phones, words, prosodic events, etc.) are used to characterize the speaker. A good modeling technique for these features is the support vector machine (SVM). SVMs model the n-gram frequencies from speaker utterances in a high-dimensional SVM feature space and have shown excellent performance over a wide variety of high-level features. A complementary method recently explored in SVM speaker recognition is nuisance attribute projection (NAP). NAP removes directions from SVM feature space that are superfluous to the task of speaker recognition (channel information, session variability, etc.). In this paper, we consider the application of NAP to high-level speaker recognition. We describe the difficulties in applying this method and propose solutions. We also conduct experiments showing that NAP can reduce variability in SVM feature space, leading to improved performance.

Understanding scores in forensic speaker recognition

Summary

Recent work in forensic speaker recognition has introduced many new scoring methodologies. First, confidence scores (posterior probabilities) have become a useful method of presenting results to an analyst. The introduction of an objective measure of confidence score quality, the normalized cross entropy, has resulted in a systematic manner of evaluating and designing these systems. A second scoring methodology that has become popular is support vector machines (SVMs) for high-level features. SVMs are accurate and produce excellent results across a wide variety of token types: words, phones, and prosodic features. In both cases, an analyst may be at a loss to explain the significance and meaning of the score produced by these methods. We tackle the problem of interpretation by exploring concepts from the statistical and pattern classification literature. In both cases, our preliminary results show interesting aspects of scores not obvious from viewing them "only as numbers."

The mixer and transcript reading corpora: resources for multilingual, crosschannel speaker recognition research

Summary

This paper describes the planning and creation of the Mixer and Transcript Reading corpora, their properties and yields, and reports on the lessons learned during their development.

SVM based speaker verification using a GMM supervector kernel and NAP variability compensation

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Speech and Language Processing, ICASSP, Vol. 1, 14-19 May 2006, pp. 97-100.

Summary

Gaussian mixture models with universal backgrounds (UBMs) have become the standard method for speaker recognition. Typically, a speaker model is constructed by MAP adaptation of the means of the UBM. A GMM supervector is constructed by stacking the means of the adapted mixture components. A recent discovery is that latent factor analysis of this GMM supervector is an effective method for variability compensation. We consider this GMM supervector in the context of support vector machines. We construct a support vector machine kernel using the GMM supervector. We show similarities based on this kernel between the method of SVM nuisance attribute projection (NAP) and the recent results in latent factor analysis. Experiments on a NIST SRE 2005 corpus demonstrate the effectiveness of the new technique.

Support vector machines using GMM supervectors for speaker verification

Published in:
IEEE Signal Process. Lett., Vol. 13, No. 5, May 2006, pp. 308-311.

Summary

Gaussian mixture models (GMMs) have proven extremely successful for text-independent speaker recognition. The standard training method for GMM models is to use MAP adaptation of the means of the mixture components based on speech from a target speaker. Recent methods in compensation for speaker and channel variability have proposed the idea of stacking the means of the GMM model to form a GMM mean supervector. We examine the idea of using the GMM supervector in a support vector machine (SVM) classifier. We propose two new SVM kernels based on distance metrics between GMM models. We show that these SVM kernels produce excellent classification accuracy in a NIST speaker recognition evaluation task.
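
The supervector idea in this summary can be illustrated with a minimal sketch. The summary does not give the form of the two proposed kernels, so the scaling used here (each component mean weighted by the square root of its mixture weight and divided by its standard deviation, with diagonal covariances) is an assumption chosen for illustration, and the function names are hypothetical.

```python
import math

def supervector(means):
    """Stack per-component mean vectors into one flat supervector."""
    return [x for mean in means for x in mean]

def scaled_supervector(weights, variances, means):
    """Scale each mean by sqrt(weight) / sqrt(variance), then stack.

    Assumes diagonal covariances shared by both models (illustrative
    assumption, not taken from the summary).
    """
    scaled = []
    for w, var, mean in zip(weights, variances, means):
        scaled.extend(math.sqrt(w) * m / math.sqrt(v) for m, v in zip(mean, var))
    return scaled

def linear_kernel(a, b):
    """Plain inner product between two supervectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy two-component models with 2-D means (made-up numbers).
weights = [0.5, 0.5]
variances = [[1.0, 4.0], [1.0, 1.0]]
means_a = [[1.0, 2.0], [0.0, 1.0]]
means_b = [[2.0, 2.0], [1.0, 1.0]]

k_ab = linear_kernel(
    scaled_supervector(weights, variances, means_a),
    scaled_supervector(weights, variances, means_b),
)
```

Because the scaling is folded into the vectors, the kernel is an ordinary inner product, which is what lets a standard linear SVM be trained on the scaled supervectors.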

Support vector machines for speaker and language recognition

Published in:
Comput. Speech Lang., Vol. 20, No. 2-3, April/July 2006, pp. 210-229.

Summary

Support vector machines (SVMs) have proven to be a powerful technique for pattern classification. SVMs map inputs into a high-dimensional space and then separate classes with a hyperplane. A critical aspect of using SVMs successfully is the design of the inner product, the kernel, induced by the high-dimensional mapping. We consider the application of SVMs to speaker and language recognition. A key part of our approach is the use of a kernel that compares sequences of feature vectors and produces a measure of similarity. Our sequence kernel is based upon generalized linear discriminants. We show that this strategy has several important properties. First, the kernel uses an explicit expansion into SVM feature space; this property makes it possible to collapse all support vectors into a single model vector and have low computational complexity. Second, the SVM builds upon a simpler mean-squared error classifier to produce a more accurate system. Finally, the system is competitive with and complementary to other approaches, such as Gaussian mixture models (GMMs). We give results for the 2003 NIST speaker and language evaluations of the system and also show fusion with the traditional GMM approach.
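
The first property, collapsing all support vectors into one model vector, follows from the kernel being an inner product of explicit expansions. The sketch below shows the equivalence on toy data. The degree-2 monomial expansion, the frame averaging, and every name and number are assumptions for illustration, not the paper's exact construction.

```python
def expand(x):
    """Hypothetical explicit expansion: monomials up to degree 2 of a 2-D input."""
    x1, x2 = x
    return [1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2]

def sequence_map(frames):
    """Map a sequence of frames to one vector by averaging the expansions."""
    vecs = [expand(f) for f in frames]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def collapse(support_vectors, coeffs):
    """Form w = sum_i coeff_i * sv_i, where coeff_i is alpha_i times the label."""
    dim = len(support_vectors[0])
    return [sum(c * sv[j] for c, sv in zip(coeffs, support_vectors)) for j in range(dim)]

# Toy support vectors, coefficients, and bias (made-up values).
sv = [sequence_map([(1.0, 0.0), (0.0, 1.0)]), sequence_map([(2.0, 1.0)])]
coeffs = [0.5, -0.25]
bias = 0.1
test_vec = sequence_map([(1.0, 1.0), (0.5, 2.0)])

# Scoring with one kernel evaluation per support vector...
score_sum = sum(c * dot(s, test_vec) for c, s in zip(coeffs, sv)) + bias
# ...matches scoring with a single inner product against the collapsed model.
model = collapse(sv, coeffs)
score_collapsed = dot(model, test_vec) + bias
```

After collapsing, scoring cost depends only on the expansion dimension, not on the number of support vectors.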

Exploiting nonacoustic sensors for speech encoding

Summary

The intelligibility of speech transmitted through low-rate coders is severely degraded when high levels of acoustic noise are present in the acoustic environment. Recent advances in nonacoustic sensors, including microwave radar, skin vibration, and bone conduction sensors, provide the exciting possibility of both glottal excitation and, more generally, vocal tract measurements that are relatively immune to acoustic disturbances and can supplement the acoustic speech waveform. We are currently investigating methods of combining the output of these sensors for use in low-rate encoding according to their capability in representing specific speech characteristics in different frequency bands. Nonacoustic sensors have the ability to reveal certain speech attributes lost in the noisy acoustic signal; for example, low-energy consonant voice bars, nasality, and glottalized excitation. By fusing nonacoustic low-frequency and pitch content with acoustic-microphone content, we have achieved significant intelligibility performance gains using the DRT across a variety of environments over the government standard 2400-bps MELPe coder. By fusing quantized high-band 4-to-8-kHz speech, requiring only an additional 116 bps, we obtain further DRT performance gains by exploiting the ear's insensitivity to fine spectral detail in this frequency region.

The 2004 MIT Lincoln Laboratory speaker recognition system

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, 19-23 March 2005, pp. I-177 - I-180.

Summary

The MIT Lincoln Laboratory submission for the 2004 NIST Speaker Recognition Evaluation (SRE) was built upon seven core systems using speaker information from short-term acoustics, pitch and duration prosodic behavior, and phoneme and word usage. These different levels of information were modeled and classified using Gaussian Mixture Models, Support Vector Machines and N-gram language models and were combined using a single-layer perceptron fuser. The 2004 SRE used a new multi-lingual, multi-channel speech corpus that provided a challenging speaker detection task for the above systems. In this paper we describe the core systems used and provide an overview of their performance on the 2004 SRE detection tasks.

Speaker adaptive cohort selection for Tnorm in text-independent speaker verification

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, 19-23 March 2005, pp. I-741 - I-744.

Summary

In this paper we discuss an extension to the widely used score normalization technique of test normalization (Tnorm) for text-independent speaker verification. A new method of speaker Adaptive-Tnorm that offers advantages over the standard Tnorm by adjusting the speaker set to the target model is presented. Examples of this improvement using the 2004 NIST SRE data are also presented.
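
As a rough illustration of the normalization being extended, standard Tnorm shifts and scales a raw score by the mean and standard deviation of the scores the same test utterance obtains against a cohort of other speaker models. The summary does not say how the adaptive variant chooses its cohort, so `select_cohort` below, which keeps the cohort models with the highest hypothetical similarity to the target model, is purely an assumption, as are all names and numbers.

```python
from statistics import mean, pstdev

def tnorm(raw_score, cohort_scores):
    """Standard Tnorm: (score - cohort mean) / cohort standard deviation.

    Uses the population standard deviation; a sample estimate is an
    equally plausible choice.
    """
    sigma = pstdev(cohort_scores)
    if sigma == 0:
        raise ValueError("cohort scores have zero spread")
    return (raw_score - mean(cohort_scores)) / sigma

def select_cohort(similarity_to_target, k):
    """Hypothetical adaptive step: keep the k cohort models most similar to the target."""
    ranked = sorted(similarity_to_target, key=similarity_to_target.get, reverse=True)
    return ranked[:k]

# Scores of one test utterance against three cohort models (made-up values).
cohort_scores_by_model = {"c1": 1.0, "c2": 3.0, "c3": 10.0}
# Assumed similarity of each cohort model to the target model.
similarity = {"c1": 0.9, "c2": 0.8, "c3": 0.1}

chosen = select_cohort(similarity, 2)
normalized = tnorm(4.0, [cohort_scores_by_model[m] for m in chosen])
```

In this toy case the outlying cohort model `c3` is dropped, so the normalization statistics come only from models resembling the target.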

Advances in channel compensation for SVM speaker recognition

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, Vol. 1, 19-23 March 2005, pp. I-629 - I-631.

Summary

Cross-channel degradation is one of the significant challenges facing speaker recognition systems. We study the problem for speaker recognition using support vector machines (SVMs). We perform channel compensation in SVM modeling by removing non-speaker nuisance dimensions in the SVM expansion space via projections. Training to remove these dimensions is accomplished via an eigenvalue problem. The eigenvalue problem attempts to reduce multisession variation for the same speaker, reduce different channel effects, and increase "distance" between different speakers. We apply our methods to a subset of the Switchboard 2 corpus. Experiments show dramatic improvement in performance for the cross-channel case.
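
The projection step described here can be sketched as applying P = I - U U^T to a vector in the SVM expansion space, where the columns of U are orthonormal nuisance directions. In the paper those directions come from the eigenvalue problem; in this minimal example they are simply given, and the function name and data are assumptions for illustration.

```python
def nap_project(v, nuisance_dirs):
    """Remove from v its components along orthonormal nuisance directions.

    Equivalent to (I - U U^T) v when the directions are orthonormal.
    """
    out = list(v)
    for u in nuisance_dirs:
        coeff = sum(ui * vi for ui, vi in zip(u, v))  # u^T v
        out = [o - coeff * ui for o, ui in zip(out, u)]
    return out

# Toy 3-D expansion vector with one assumed nuisance direction (the third axis).
vector = [3.0, 4.0, 5.0]
directions = [[0.0, 0.0, 1.0]]

projected = nap_project(vector, directions)
```

Applying the function a second time leaves the result unchanged, as expected of a projection.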