This paper introduces and motivates the use of Gaussian mixture models (GMM) for robust text-independent speaker identification. The individual Gaussian components of a GMM are shown to represent some general speaker-dependent spectral shapes that are effective for modeling speaker identify. The focus of this work is on applications which require high identification rates using short utterance from unconstrained conversational speech and robustness to degradations produced by transmission over a telephone channel. A complete experimental evaluation of the Gaussian mixture speaker model is conducted on a 49 speaker, conversational telephone speech database. The experiments examine algorithmic issues (initializations, variance limiting, model order selection), spectral variability robustness techniques, large population performance, and comparisons to other speaker modeling techniques (uni-modal Gaussian, VQ codebook, tied Gaussian mixture, and radial basis functions). The Gaussian mixture speaker model attains 96.8% identification accuracy using 5 second clean speech utterances and 80.8% accuracy using 15 second telephone speech utterances with a 49 speaker population and is shown to outperform the other speaker modeling techniques on an identical 16 speaker telephone speech task.

READ LESS

Summary

Robust text-independent speaker identification using Gaussian mixture speaker models

Speaker identification and verification using Gaussian mixture speaker models

January 1, 1995

Conference Paper

Author:

Douglas A. Reynolds

Published in:

Speech Commun., Vol. 17, 1995, pp. 91-108.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

This paper presents high performance speaker identification and verification systems based on Gaussian mixture speaker models: robust, statistically based representations of speaker identification. The identification system is a maximum likelihood classifier and the verification system is a likelihood ratio hypothesis tester using background speaker normalization. The systems are evaluated on four publically available speech databases: TIMIT, NTIMIT, Switchboard and YOHO. The different levels of degradation and variabilities found in these databases allow the examination of system performance for different task domains. Constraints on the speech range from vocabulary-dependent to extemporaneous and speech quality varies from near-ideal, clean speech to noisy, telephone speech. Closed set identification accuracies on the 630 speaker TIMIT and NTIMIT databases were 99.5% and 60.7% respectively. On a 113 speaker population from the Switchboard database the identification accuracy was 82.8%. Global threshold equal error rates of 0.24%, 7.19%, 5.15% and 0.51% were obtained in verification experiments on the TIMIT, NTIMIT, Switchboard and YOHO databases, respectively.

READ LESS

Summary

Speaker identification and verification using Gaussian mixture speaker models

Energy onset times for speaker identification

November 1, 1994

Journal Article

Author:

Thomas F. Quatieri

…

Published in:

IEEE Signal Process. Lett., Vol. 1, No. 11, November 1994, pp. 160-162.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

Onset times of resonant energy pulses are measured with the high-resolution Teager operator and used as features in the Reynolds Gaussian-mixture speaker identification algorithm. Feature sets are constructed with primary pitch and secondary pulse locations derived from low and high speech formants. Preliminary testing was performed with a confusable 40-speaker subset from the NTIMIT (telephone channel) database. Speaker identification improved from 55 to 70% correct classification when the full set of new resonant energy-based features were added as an independent stream to conventional mel-cepstra.

READ LESS

Summary

Energy onset times for speaker identification

Formant AM-FM for speaker identification

October 25, 1994

Conference Paper

Author:

Charles R. Jankowski Jr

…

Published in:

Proc. IEEE-SP Int. Symp. on Time-Frequency and Time-Scale Analysis, 25-28 October 1994, pp. 608-611.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

The performance of systems for speaker identification (SID) can be quite good with clean speech, though much lower with degraded speech. Thus it is useful to search for new features for SID, particularly features that are robust over a degraded channel. This paper investigates features that are robust over a degraded channel. This paper investigates features that are based on amplitude and frequency modulations of speech formants. Such modulations are measured using a high-resolution energy operator and related algorithms for recovering amplitude and frequency from an AM-FM signal. When these features are added to traditional features using an existing SID system with a telephone speech database, SID performance improved by as much as 15%. Energy onset time measurements that yielded improved SID performance are also discussed.

READ LESS

Summary

Formant AM-FM for speaker identification

Experimental evaluation of features for robust speaker identification

October 1, 1994

Journal Article

Author:

Douglas A. Reynolds

Published in:

IEEE Trans. Speech Audio Process., Vol. 2, No. 4, October 1994, pp. 639-643.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

This correspondence presents an experimental evaluation of different features and channel compensation techniques for robust speaker identification. The goal is to keep all processing and classification steps constant and to vary only the features and compensations used to allow a controlled comparison. A general, maximum-likelihood classifier based on Gaussian mixture densities is used as the classifier, and experiments are conducted on the King speech database, a conversational, telephone-speech database. The features examined are mel-frequency and linear-frequency filterbank cepstral coefficients, linear prediction ceptral coefficients. The channel compensation techniques examined are cepstral mean removal, RASTA processing, and a quadratic trend removal technique. It is shown for this database that performance difference between the basic features is small, and the major gains are due to the channel compensation techniques. The best "across-the-divide" recognition accuracy of 92% is obtained for both high-order LPC features and band-limited filterbank features.

READ LESS

Summary

Experimental evaluation of features for robust speaker identification

Large population speaker recognition using wideband and telephone speech

July 28, 1994

Conference Paper

Author:

Douglas A. Reynolds

Published in:

Proc. SPIE, Vol. 2277, Automatic Systems for the Identification and Inspection of Humans, 28-29 July 1994, pp. 111-120.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

The two largest factors affecting automatic speaker identification performance are the size of the population to be distinguished among and the degradations introduced by noisy communication channels (e.g. telephone transmission). To experimentally examine these two factors, this paper presents text-independent speaker identification results for varying speaker population sizes up to 630 speakers for both clean, wideband speech and telephone speech. A system based on Gaussian mixture speaker models is used for speaker identification and experiments are conducted on the TIMIT and NTIMIT databases. The aims of this study are to (1) establish how well text-independent speaker identification can perform under near ideal conditions for very large populations (using the TIMIT database), (2) gauge the performance loss incurred by transmitting the speech over the telephone network (using the NTIMIT database), and (3) examine the validity of current models of telephone degradations commonly used in developing compensation techniques (using the NTIMIT calibration signals). This is believed to be the first speaker identification experiments on the complete 630 speaker TIMIT and NTIMIT databases and the largest text-independent speaker identification task reported to date. Identification accuracies of 99.5% and 60.7% are achieved on the TIMIT and NTIMIT databases, respectively.

READ LESS

Summary

Large population speaker recognition using wideband and telephone speech

Integrated models of signal and background with application to speaker identification in noise

April 1, 1994

Journal Article

Author:

Richard C. Rose

…

Published in:

IEEE Trans. Speech Audio Process., Vol. 2, No. 2, April 1994, pp. 245-257.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

This paper is concerned with the problem of robust parametric model estimation and classification in noisy acoustic environments. Characterization and modeling of the external noise sources in these environments is in itself an important issue in noise compensation. The techniques described here provide a mechanism for integrating parametric models of acoustic background with the signal model so that noise compensation is tightly coupled with signal model training and classification. Prior information about the acoustic background process is provided using a maximum likelihood parameter estimation procedure that integrates an a priori model of acoustic background with the signal model. An experimental study is presented in the paper on the application of this approach to text-independent speaker identification in noisy acoustic environments. Considerable improvement in speaker classification performance was obtained for classifying unlabeled sections of conversational speech utterances from a 16-speaker population under cross-environment training and testing conditions.

READ LESS

Summary

Integrated models of signal and background with application to speaker identification in noise

An integrated speech-background model for robust speaker identification

March 23, 1992

Conference Paper

Author:

Douglas A. Reynolds

…

Richard C. Rose

Published in:

Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 2, 23-26 March 1992, pp. 185-188.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

This paper examines a procedure for text independent speaker identification in noisy environments where the interfering background signals cannot be characterized using traditional broadband or impulsive noise models. In the procedure, both the speaker and the background processes are modeled using mixtures of Gaussians. Speaker and background models are integrated into a unified statistical framework allowing the decoupling of the underlying speech process from the noise corrupted observations via the expectation-maximization algorithm. Using this formalism, speaker model parameters are estimated in the presence of the background process, and a scoring procedure is implemented for computing the speaker likelihood in the noise corrupted environment. Performance is evaluated using a 16 speaker conversational speech database with both "speech babble" and white noise background processes.

READ LESS

Summary

An integrated speech-background model for robust speaker identification

Publications

Refine Results

By

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Summary

Showing Results