Publications


Understanding scores in forensic speaker recognition

Summary

Recent work in forensic speaker recognition has introduced many new scoring methodologies. First, confidence scores (posterior probabilities) have become a useful method of presenting results to an analyst. The introduction of an objective measure of confidence score quality, the normalized cross entropy, has enabled systematic evaluation and design of these systems. A second scoring methodology that has become popular is support vector machines (SVMs) for high-level features. SVMs are accurate and produce excellent results across a wide variety of token types: words, phones, and prosodic features. In both cases, an analyst may be at a loss to explain the significance and meaning of the score produced by these methods. We tackle the problem of interpretation by exploring concepts from the statistical and pattern classification literature. In both cases, our preliminary results show interesting aspects of scores not obvious from viewing them "only as numbers."
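The normalized cross entropy mentioned above can be computed directly from a set of confidence scores. A minimal sketch in Python, assuming posteriors for the target hypothesis and binary trial labels; the exact trial weighting in a given NIST evaluation plan may differ:

```python
import math

def normalized_cross_entropy(posteriors, labels, p_target=0.5):
    """Cross entropy of confidence scores (posterior probabilities of the
    target hypothesis), normalized by the entropy of a prior-only system.
    Values below 1.0 mean the scores carry useful information; a perfect
    system approaches 0.0."""
    tgt = [p for p, l in zip(posteriors, labels) if l == 1]
    non = [p for p, l in zip(posteriors, labels) if l == 0]
    h = -(p_target * sum(math.log2(p) for p in tgt) / len(tgt)
          + (1 - p_target) * sum(math.log2(1 - p) for p in non) / len(non))
    # entropy of always answering with the prior alone
    h_prior = -(p_target * math.log2(p_target)
                + (1 - p_target) * math.log2(1 - p_target))
    return h / h_prior
```

A system that always outputs the prior scores exactly 1.0 under this measure, which is what makes it a natural reference point for confidence-score quality.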

Nonlinear equalization for RF receivers

Published in:
Proc. Conf. on High Performance Computer Modernization Program, 26-29 June 2006, pp. 303-307.

Summary

This paper describes the need for High Performance Computing (HPC) to facilitate the development and implementation of a nonlinear equalizer that is capable of mitigating and/or eliminating nonlinear distortion to extend the dynamic range of radar front-end receivers decades beyond the analog state of the art. The search space for the optimal nonlinear equalization (NLEQ) solution is computationally intractable using only a single desktop computer. However, we have been able to leverage a combination of an efficient greedy search with the high performance computing technologies of LLGrid and MatlabMPI to construct an NLEQ architecture that is capable of extending the dynamic range of radar front-end receivers by over 25 dB.
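As a toy illustration of what a single point in that search space looks like, the sketch below fits a small memory-polynomial equalizer by least squares. The names and structure are illustrative only; the paper's actual NLEQ architecture, and the greedy search over candidates on LLGrid, are far larger than this.

```python
import numpy as np

def fit_memory_polynomial(x, d, memory=3, order=3):
    """Fit coefficients c so that sum over (m, k) of c[m, k] * x[n-m]**k
    approximates the desired distortion-free signal d in the least-squares
    sense.  One tiny candidate architecture, not the paper's solution."""
    n = len(x)
    cols = []
    for m in range(memory):                       # delay taps
        xm = np.concatenate([np.zeros(m), x[:n - m]])
        for k in range(1, order + 1):             # polynomial orders
            cols.append(xm ** k)
    A = np.column_stack(cols)
    c, *_ = np.linalg.lstsq(A, d, rcond=None)     # least-squares fit
    return c, A @ c                               # coefficients, equalized signal
```

Fitting this to a signal with a known static cubic distortion recovers most of the lost fidelity; searching over `memory` and `order` (and richer term sets) is where the computational cost explodes.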

The Mixer and Transcript Reading corpora: resources for multilingual, cross-channel speaker recognition research

Summary

This paper describes the planning and creation of the Mixer and Transcript Reading corpora, their properties and yields, and reports on the lessons learned during their development.

A scalable phonetic vocoder framework using joint predictive vector quantization of MELP parameters

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Speech and Language Processing, ICASSP, 14-19 May 2006, pp. 705-708.

Summary

We present the framework for a Scalable Phonetic Vocoder (SPV) capable of operating at bit rates from 300 to 1100 bps. The underlying system uses an HMM-based phonetic speech recognizer to estimate the parameters for MELP speech synthesis. We extend this baseline technique in three ways. First, we introduce the concept of predictive time evolution to generate a smoother path for the synthesizer parameters, and show that it improves speech quality. Then, since the output speech from the phonetic vocoder is still limited by such low bit rates, we propose a scalable system where the accuracy of the MELP parameters is increased by vector quantizing the error signal between the true and phonetic-estimated MELP parameters. Finally, we apply an extremely flexible technique for exploiting correlations in these parameters over time, which we call Joint Predictive Vector Quantization (JPVQ). We show that significant quality improvement can be attained by adding as few as 400 bps to the baseline phonetic vocoder using JPVQ. The resulting SPV system provides a flexible platform for adjusting the phonetic vocoder bit rate and speech quality.
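The predict-then-quantize-the-residual idea can be sketched in a few lines. Below, each parameter vector is predicted by the previous reconstruction and only the prediction residual is vector-quantized against a codebook; the jointly trained, multi-parameter codebooks of JPVQ proper are omitted, and every name here is illustrative rather than from the paper.

```python
import numpy as np

def predictive_vq_encode(params, codebook):
    """Predictive VQ sketch: residual = parameter - prediction, where the
    prediction is simply the previous reconstructed vector; the residual is
    quantized to the nearest codebook entry."""
    prev = np.zeros(params.shape[1])
    indices, recon = [], []
    for p in params:
        residual = p - prev
        i = int(np.argmin(np.sum((codebook - residual) ** 2, axis=1)))
        prev = prev + codebook[i]        # decoder-side reconstruction
        indices.append(i)
        recon.append(prev.copy())
    return indices, np.array(recon)
```

Because only codebook indices are transmitted, the bit rate is `log2(len(codebook))` bits per frame, which is how a few hundred bps of residual coding can refine the phonetic-estimated MELP parameters.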

SVM based speaker verification using a GMM supervector kernel and NAP variability compensation

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Speech and Language Processing, ICASSP, Vol. 1, 14-19 May 2006, pp. 97-100.

Summary

Gaussian mixture models with universal background models (UBMs) have become the standard method for speaker recognition. Typically, a speaker model is constructed by MAP adaptation of the means of the UBM. A GMM supervector is constructed by stacking the means of the adapted mixture components. A recent discovery is that latent factor analysis of this GMM supervector is an effective method for variability compensation. We consider this GMM supervector in the context of support vector machines. We construct a support vector machine kernel using the GMM supervector. We show similarities based on this kernel between the method of SVM nuisance attribute projection (NAP) and recent results in latent factor analysis. Experiments on a NIST SRE 2005 corpus demonstrate the effectiveness of the new technique.

Support vector machines using GMM supervectors for speaker verification

Published in:
IEEE Signal Process. Lett., Vol. 13, No. 5, May 2006, pp. 308-311.

Summary

Gaussian mixture models (GMMs) have proven extremely successful for text-independent speaker recognition. The standard training method for GMM models is to use MAP adaptation of the means of the mixture components based on speech from a target speaker. Recent methods in compensation for speaker and channel variability have proposed the idea of stacking the means of the GMM model to form a GMM mean supervector. We examine the idea of using the GMM supervector in a support vector machine (SVM) classifier. We propose two new SVM kernels based on distance metrics between GMM models. We show that these SVM kernels produce excellent classification accuracy in a NIST speaker recognition evaluation task.
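The supervector construction itself is simple enough to sketch. Assuming diagonal covariances, scaling each adapted mean by the square roots of its mixture weight and inverse variance makes a plain dot product between two supervectors behave like a linear kernel of the kind derived in the paper from a divergence bound; the derivation and normalization details are in the paper itself, and this sketch is only an illustration of the stacking step.

```python
import numpy as np

def gmm_supervector(means, weights, covs):
    """Stack MAP-adapted mixture means into one long vector, scaling each
    component mean by sqrt(weight) / sqrt(variance) (diagonal covariances
    assumed) so the plain dot product acts as a linear kernel."""
    scaled = [np.sqrt(w) * m / np.sqrt(c)
              for m, w, c in zip(means, weights, covs)]
    return np.concatenate(scaled)

def supervector_kernel(sv_a, sv_b):
    """Linear SVM kernel between two GMM supervectors."""
    return float(np.dot(sv_a, sv_b))
```

With the kernel linear in an explicit feature space, a trained SVM collapses to a single model supervector, so scoring a test utterance is one inner product.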

Support vector machines for speaker and language recognition

Published in:
Comput. Speech Lang., Vol. 20, No. 2-3, April/July 2006, pp. 210-229.

Summary

Support vector machines (SVMs) have proven to be a powerful technique for pattern classification. SVMs map inputs into a high-dimensional space and then separate classes with a hyperplane. A critical aspect of using SVMs successfully is the design of the inner product, the kernel, induced by the high-dimensional mapping. We consider the application of SVMs to speaker and language recognition. A key part of our approach is the use of a kernel that compares sequences of feature vectors and produces a measure of similarity. Our sequence kernel is based upon generalized linear discriminants. We show that this strategy has several important properties. First, the kernel uses an explicit expansion into SVM feature space; this property makes it possible to collapse all support vectors into a single model vector and have low computational complexity. Second, the SVM builds upon a simpler mean-squared error classifier to produce a more accurate system. Finally, the system is competitive and complementary to other approaches, such as Gaussian mixture models (GMMs). We give results for the 2003 NIST speaker and language evaluations of the system and also show fusion with the traditional GMM approach.
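The explicit-expansion property can be made concrete with a toy version of the sequence kernel: expand each frame with low-order monomials, average the expansions over the sequence, and take a dot product between the resulting fixed-length vectors. The background-data correlation normalization used in the actual GLDS kernel is omitted, so treat this only as a sketch of the structure.

```python
import numpy as np

def expand(frame):
    """Degree-2 monomial expansion of one feature vector, with a bias term;
    a small stand-in for the generalized linear discriminant basis."""
    f = np.asarray(frame, dtype=float)
    outer = np.outer(f, f)[np.triu_indices(len(f))]   # unique degree-2 terms
    return np.concatenate([[1.0], f, outer])

def sequence_embedding(frames):
    """Variable-length sequence -> one fixed vector: the mean of the
    expanded frames.  This is why all support vectors collapse into a
    single model vector."""
    return np.mean([expand(f) for f in frames], axis=0)

def sequence_kernel(frames_a, frames_b):
    """Similarity between two sequences as a dot product in the explicit
    expansion space."""
    return float(np.dot(sequence_embedding(frames_a),
                        sequence_embedding(frames_b)))
```

Because the embedding is computed once per utterance, scoring cost is independent of sequence length and of the number of training support vectors.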

Exploiting nonacoustic sensors for speech encoding

Summary

The intelligibility of speech transmitted through low-rate coders is severely degraded when high levels of noise are present in the acoustic environment. Recent advances in nonacoustic sensors, including microwave radar, skin vibration, and bone conduction sensors, provide the exciting possibility of both glottal excitation and, more generally, vocal tract measurements that are relatively immune to acoustic disturbances and can supplement the acoustic speech waveform. We are currently investigating methods of combining the output of these sensors for use in low-rate encoding according to their capability in representing specific speech characteristics in different frequency bands. Nonacoustic sensors have the ability to reveal certain speech attributes lost in the noisy acoustic signal; for example, low-energy consonant voice bars, nasality, and glottalized excitation. By fusing nonacoustic low-frequency and pitch content with acoustic-microphone content, we have achieved significant intelligibility gains, as measured by the Diagnostic Rhyme Test (DRT), over the government-standard 2400-bps MELPe coder across a variety of environments. By fusing quantized high-band 4-to-8-kHz speech, requiring only an additional 116 bps, we obtain further DRT performance gains by exploiting the ear's insensitivity to fine spectral detail in this frequency region.

The MIT-LL/AFRL MT System

Published in:
Int. Workshop on Spoken Language Translation, IWSLT, 24-25 October 2005.

Summary

The MIT-LL/AFRL MT system is a statistical phrase-based translation system that implements many modern SMT training and decoding techniques. Our system was designed with the long-term goal of dealing with corrupted ASR input for speech-to-speech MT applications. This paper discusses the architecture of the MIT-LL/AFRL MT system, and experiments with manual and ASR transcription data that were run as part of the IWSLT-2005 Chinese-to-English evaluation campaign.

Synthesis, analysis, and pitch modification of the breathy vowel

Published in:
2005 Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 16-19 October 2005, pp. 199-202.

Summary

Breathiness is an aspect of voice quality that is difficult to analyze and synthesize, especially since its periodic and noise components are typically overlapping in frequency. The decomposition and manipulation of these two components is of importance in a variety of speech application areas such as text-to-speech synthesis, speech encoding, and clinical assessment of disordered voices. This paper first investigates the perceptual relevance of a speech production model that assumes the speech noise component is modulated by the glottal airflow waveform. After verifying the importance of noise modulation in breathy vowels, we use the modulation model to address the particular problem of pitch modification of this signal class. Using a decomposition method referred to as pitch-scaled harmonic filtering to extract the additive noise component, we introduce a pitch modification algorithm that explicitly modifies the modulation characteristic of this noise component. The approach applies envelope shaping to the noise source that is derived from the inverse-filtered noise component. Modification examples using synthetic and real breathy vowels indicate promising performance with spectrally-overlapping periodic and noise components.
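The production model under test, noise whose amplitude follows the glottal airflow, can be illustrated with a toy synthesizer. Everything here is a stand-in: a half-wave-rectified sinusoid replaces the true glottal airflow, no vocal-tract filter is applied, and none of the names come from the paper.

```python
import numpy as np

def breathy_vowel(f0=120.0, fs=8000, dur=0.5, noise_level=0.3, seed=0):
    """Toy breathy-vowel synthesis: a harmonic (periodic) component plus
    white noise whose amplitude is modulated by a crude glottal-airflow
    envelope, so the two components overlap in frequency but the noise is
    pitch-synchronously modulated."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(fs * dur)) / fs
    # periodic component: a few decaying harmonics of f0
    periodic = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 6))
    # half-wave-rectified sinusoid as a stand-in glottal airflow envelope
    envelope = np.maximum(np.sin(2 * np.pi * f0 * t), 0.0)
    noise = noise_level * envelope * rng.standard_normal(len(t))
    return periodic + noise, periodic, noise
```

Because the noise gates on and off with the envelope, it carries the pitch-synchronous modulation the paper argues is perceptually relevant, which is exactly what a pitch modification algorithm must preserve when it reshapes the noise component.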