Publications

Large population speaker identification using clean and telephone speech

Published in:
IEEE Signal Process. Lett., Vol. 2, No. 3, March 1995, pp. 46-48.

Summary

This paper presents text-independent speaker identification results for varying speaker population sizes up to 630 speakers for both clean, wideband speech and telephone speech. A system based on Gaussian mixture speaker models is used for speaker identification, and experiments are conducted on the TIMIT and NTIMIT databases. The TIMIT results show large population performance under near-ideal conditions, and the NTIMIT results show the corresponding accuracy loss due to telephone transmission. These are believed to be the first speaker identification experiments on the complete 630 speaker TIMIT and NTIMIT databases and the largest text-independent speaker identification task reported to date. Identification accuracies of 99.5% and 60.7% were achieved on the TIMIT and NTIMIT databases, respectively.

Robust text-independent speaker identification using Gaussian mixture speaker models

Published in:
IEEE Trans. Speech Audio Process., Vol. 3, No. 1, January 1995, pp. 72-83.

Summary

This paper introduces and motivates the use of Gaussian mixture models (GMM) for robust text-independent speaker identification. The individual Gaussian components of a GMM are shown to represent some general speaker-dependent spectral shapes that are effective for modeling speaker identity. The focus of this work is on applications which require high identification rates using short utterances from unconstrained conversational speech and robustness to degradations produced by transmission over a telephone channel. A complete experimental evaluation of the Gaussian mixture speaker model is conducted on a 49 speaker, conversational telephone speech database. The experiments examine algorithmic issues (initializations, variance limiting, model order selection), spectral variability robustness techniques, large population performance, and comparisons to other speaker modeling techniques (uni-modal Gaussian, VQ codebook, tied Gaussian mixture, and radial basis functions). The Gaussian mixture speaker model attains 96.8% identification accuracy using 5 second clean speech utterances and 80.8% accuracy using 15 second telephone speech utterances with a 49 speaker population and is shown to outperform the other speaker modeling techniques on an identical 16 speaker telephone speech task.
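
As a rough illustration of the idea in this summary, the sketch below scores a sequence of feature frames against one diagonal-covariance GMM per speaker and picks the speaker with the highest total log-likelihood. It is a minimal sketch, not the paper's implementation: the model layout (a list of `(weight, mean, var)` tuples), the function names, and the toy parameters are all assumptions made for the example, and training (initialization, variance limiting, model order selection) is not shown.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(frames, gmm):
    """Total log-likelihood of a list of feature frames under one GMM.

    gmm is a list of (weight, mean, var) tuples; this layout is an
    assumption made for the example.
    """
    total = 0.0
    for x in frames:
        # Log-sum-exp over mixture components for numerical stability.
        terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total

def identify(frames, speaker_models):
    """Return the speaker label whose GMM gives the highest log-likelihood."""
    return max(
        speaker_models,
        key=lambda spk: gmm_log_likelihood(frames, speaker_models[spk]),
    )

# Toy 2-D models with invented parameters, for illustration only.
models = {
    "spk_a": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [2.0, 2.0], [1.0, 1.0])],
    "spk_b": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [7.0, 7.0], [1.0, 1.0])],
}
test_frames = [[0.1, -0.2], [1.9, 2.2], [0.3, 0.1]]
print(identify(test_frames, models))  # prints "spk_a"
```

Summing per-frame log-likelihoods treats the frames as independent, which is the usual simplification in this kind of classifier.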

Sinusoidal coding

Published in:
Chapter 4 in Speech Coding and Synthesis, Elsevier Science Publishers, 1995, pp. 121-173.

Summary

This chapter summarizes the sinewave-based pitch extractor, and the high-order all-pole modelling techniques that provided the basis for the multirate Sinusoidal Transform Coder and its application to multi-speaker conferencing.

Speaker identification and verification using Gaussian mixture speaker models

Published in:
Speech Commun., Vol. 17, 1995, pp. 91-108.

Summary

This paper presents high performance speaker identification and verification systems based on Gaussian mixture speaker models: robust, statistically based representations of speaker identity. The identification system is a maximum likelihood classifier and the verification system is a likelihood ratio hypothesis tester using background speaker normalization. The systems are evaluated on four publicly available speech databases: TIMIT, NTIMIT, Switchboard and YOHO. The different levels of degradation and variabilities found in these databases allow the examination of system performance for different task domains. Constraints on the speech range from vocabulary-dependent to extemporaneous and speech quality varies from near-ideal, clean speech to noisy, telephone speech. Closed set identification accuracies on the 630 speaker TIMIT and NTIMIT databases were 99.5% and 60.7%, respectively. On a 113 speaker population from the Switchboard database the identification accuracy was 82.8%. Global threshold equal error rates of 0.24%, 7.19%, 5.15% and 0.51% were obtained in verification experiments on the TIMIT, NTIMIT, Switchboard and YOHO databases, respectively.
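
The verification decision described here can be sketched as a log-likelihood ratio between the claimed speaker's model and a set of background speaker models. The example below assumes the per-model log-likelihoods of the test utterance have already been computed, and it normalizes by the log of the average background likelihood, which is one common form and may differ in detail from the paper. The function names and the threshold value are hypothetical.

```python
import math

def log_mean_exp(values):
    """Log of the arithmetic mean of exp(values), computed stably."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values) / len(values))

def llr_score(claimant_ll, background_lls):
    """Log-likelihood ratio: claimed-speaker model vs. averaged background models."""
    return claimant_ll - log_mean_exp(background_lls)

def verify(claimant_ll, background_lls, threshold=0.0):
    """Accept the identity claim when the ratio reaches the threshold.

    The default threshold of 0.0 is an arbitrary placeholder; in practice
    it would be tuned on held-out data.
    """
    return llr_score(claimant_ll, background_lls) >= threshold

# Invented log-likelihood values, for illustration only.
print(verify(-10.0, [-12.0, -12.0]))  # claimed model fits better than background
print(verify(-15.0, [-12.0, -12.0]))  # background fits better than claimed model
```

A single threshold shared by all speakers corresponds to the "global threshold" setting mentioned in the summary; sweeping it trades false acceptances against false rejections.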

Energy onset times for speaker identification

Published in:
IEEE Signal Process. Lett., Vol. 1, No. 11, November 1994, pp. 160-162.

Summary

Onset times of resonant energy pulses are measured with the high-resolution Teager operator and used as features in the Reynolds Gaussian-mixture speaker identification algorithm. Feature sets are constructed with primary pitch and secondary pulse locations derived from low and high speech formants. Preliminary testing was performed with a confusable 40-speaker subset from the NTIMIT (telephone channel) database. Speaker identification improved from 55% to 70% correct classification when the full set of new resonant energy-based features was added as an independent stream to conventional mel-cepstra.
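
For readers unfamiliar with the Teager operator, the sketch below shows its standard discrete form, psi[n] = x[n]^2 - x[n-1] * x[n+1], applied to a pure sinusoid, where the output is the constant (A sin(omega))^2. This only illustrates the operator itself; the onset-time detection and feature construction described in the summary are not reproduced, and the function name and test signal are assumptions for the example.

```python
import math

def teager_energy(x):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    The output is two samples shorter than the input because the
    operator needs one neighbour on each side.
    """
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

# For x[n] = A * cos(omega * n) the operator output is the constant
# (A * sin(omega)) ** 2, so it tracks both amplitude and frequency.
amplitude, omega = 2.0, 0.3
signal = [amplitude * math.cos(omega * n) for n in range(50)]
energy = teager_energy(signal)
expected = (amplitude * math.sin(omega)) ** 2
print(max(abs(e - expected) for e in energy))  # numerical error only
```

Because the operator uses only three adjacent samples, it responds almost instantly to changes in the signal, which is what makes it attractive for locating energy onsets.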

Formant AM-FM for speaker identification

Published in:
Proc. IEEE-SP Int. Symp. on Time-Frequency and Time-Scale Analysis, 25-28 October 1994, pp. 608-611.

Summary

The performance of systems for speaker identification (SID) can be quite good with clean speech, though much lower with degraded speech. Thus it is useful to search for new features for SID, particularly features that are robust over a degraded channel. This paper investigates features that are based on amplitude and frequency modulations of speech formants. Such modulations are measured using a high-resolution energy operator and related algorithms for recovering amplitude and frequency from an AM-FM signal. When these features are added to traditional features using an existing SID system with a telephone speech database, SID performance improved by as much as 15%. Energy onset time measurements that yielded improved SID performance are also discussed.

Experimental evaluation of features for robust speaker identification

Published in:
IEEE Trans. Speech Audio Process., Vol. 2, No. 4, October 1994, pp. 639-643.

Summary

This correspondence presents an experimental evaluation of different features and channel compensation techniques for robust speaker identification. The goal is to keep all processing and classification steps constant and to vary only the features and compensations used to allow a controlled comparison. A general maximum-likelihood classifier based on Gaussian mixture densities is used, and experiments are conducted on the King speech database, a conversational, telephone-speech database. The features examined are mel-frequency and linear-frequency filterbank cepstral coefficients, and linear prediction cepstral coefficients. The channel compensation techniques examined are cepstral mean removal, RASTA processing, and a quadratic trend removal technique. It is shown for this database that the performance difference between the basic features is small, and the major gains are due to the channel compensation techniques. The best "across-the-divide" recognition accuracy of 92% is obtained for both high-order LPC features and band-limited filterbank features.
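
Of the compensation techniques listed, cepstral mean removal is the simplest to sketch: subtract the per-coefficient mean over the utterance from every frame. A fixed linear channel appears approximately as a constant additive offset in the cepstral domain, so removing the mean cancels it. The example below is a minimal sketch with invented numbers and a hypothetical function name; RASTA processing and quadratic trend removal are not shown.

```python
def cepstral_mean_removal(frames):
    """Subtract the utterance-level mean of each cepstral coefficient."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]

# Invented 2-coefficient cepstra, for illustration only.
cepstra = [[1.0, 4.0], [2.0, 6.0], [3.0, 8.0]]

# Simulate a stationary channel as a constant offset on every frame.
offset = [0.5, -1.0]
shifted = [[c + o for c, o in zip(f, offset)] for f in cepstra]

# After mean removal the clean and channel-shifted versions coincide.
print(cepstral_mean_removal(cepstra))
print(cepstral_mean_removal(shifted))
```

The cost of this compensation is that any speaker information carried in the long-term cepstral mean is removed along with the channel.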

Large population speaker recognition using wideband and telephone speech

Published in:
Proc. SPIE, Vol. 2277, Automatic Systems for the Identification and Inspection of Humans, 28-29 July 1994, pp. 111-120.

Summary

The two largest factors affecting automatic speaker identification performance are the size of the population to be distinguished among and the degradations introduced by noisy communication channels (e.g. telephone transmission). To experimentally examine these two factors, this paper presents text-independent speaker identification results for varying speaker population sizes up to 630 speakers for both clean, wideband speech and telephone speech. A system based on Gaussian mixture speaker models is used for speaker identification and experiments are conducted on the TIMIT and NTIMIT databases. The aims of this study are to (1) establish how well text-independent speaker identification can perform under near-ideal conditions for very large populations (using the TIMIT database), (2) gauge the performance loss incurred by transmitting the speech over the telephone network (using the NTIMIT database), and (3) examine the validity of current models of telephone degradations commonly used in developing compensation techniques (using the NTIMIT calibration signals). These are believed to be the first speaker identification experiments on the complete 630 speaker TIMIT and NTIMIT databases and the largest text-independent speaker identification task reported to date. Identification accuracies of 99.5% and 60.7% are achieved on the TIMIT and NTIMIT databases, respectively.

Wordspotter training using figure-of-merit back propagation

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, Speech Processing, 19-22 April 1994, pp. 389-392.

Summary

A new approach to wordspotter training is presented which directly maximizes the Figure of Merit (FOM), defined as the average detection rate over a specified range of false alarm rates. This systematic approach to discriminant training for wordspotters eliminates the necessity of ad hoc thresholds and tuning. It improves the FOM of wordspotters tested using cross-validation on the credit-card speech corpus training conversations by 4 to 5 percentage points to roughly 70%. This improved performance requires little extra complexity during wordspotting and only two extra passes through the training data during training. The FOM gradient is computed analytically for each putative hit, back-propagated through HMM word models using the Viterbi alignment, and used to adjust RBF hidden node centers and state-weights associated with every node in HMM keyword models.
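
The FOM definition in this summary can be made concrete with a simplified sketch: rank putative hits by score, record the detection rate reached just before each successive false alarm, and average those rates over a fixed number of false alarms. This is an assumption-laden simplification; the standard evaluation expresses the range as false alarms per keyword per hour and may interpolate, and the paper's gradient-based training is not shown. All names and data below are hypothetical.

```python
def figure_of_merit(putative_hits, num_true_keywords, max_false_alarms):
    """Average detection rate over the first max_false_alarms false alarms.

    putative_hits is a list of (score, is_true_hit) pairs.
    """
    ranked = sorted(putative_hits, key=lambda hit: hit[0], reverse=True)
    detected = 0
    rates = []
    for _score, is_true_hit in ranked:
        if is_true_hit:
            detected += 1
        else:
            # Detection rate achieved just before this false alarm.
            rates.append(detected / num_true_keywords)
            if len(rates) == max_false_alarms:
                break
    # With fewer false alarms than the range, the rate stays at its final value.
    while len(rates) < max_false_alarms:
        rates.append(detected / num_true_keywords)
    return sum(rates) / max_false_alarms

# Invented scored hits, for illustration only.
hits = [(0.9, True), (0.8, False), (0.7, True), (0.6, True), (0.5, False), (0.4, True)]
print(figure_of_merit(hits, num_true_keywords=4, max_false_alarms=2))  # prints 0.5
```

Because this metric depends only on the ranking of putative hits, it needs no decision threshold, which is consistent with the summary's point about avoiding ad hoc thresholds.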

Automatic language identification of telephone speech messages using phoneme recognition and N-gram modeling

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, Speech Processing, 19-22 April 1994, pp. 305-308.

Summary

This paper compares the performance of four approaches to automatic language identification (LID) of telephone speech messages: Gaussian mixture model classification (GMM), language-independent phoneme recognition followed by language-dependent language modeling (PRLM), parallel PRLM (PRLM-P), and language-dependent parallel phoneme recognition (PPR). These approaches span a wide range of training requirements and levels of recognition complexity. All approaches were tested on the development test subset of the OGI multi-language telephone speech corpus. Generally, system performance was directly related to system complexity, with PRLM-P and PPR performing best. On 45-second test utterances, average two-language, closed-set, forced-choice classification performance reached 94.5% correct. The best 10-language, closed-set, forced-choice performance was 79.2% correct.
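
The language-modeling half of PRLM can be sketched as follows: assuming a phoneme recognizer has already turned each utterance into a phone sequence, train one bigram model per language on those sequences and classify a test sequence by the model that assigns it the highest log-probability. This is a minimal sketch under stated assumptions: the phone recognizer is not shown, add-one smoothing is a stand-in for whatever smoothing the paper used, and the phone symbols, language labels, and function names are invented for the example.

```python
import math
from collections import Counter

def train_bigram(phone_sequences):
    """Count bigrams and their left contexts over a list of phone sequences."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for seq in phone_sequences:
        vocab.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab

def bigram_log_prob(seq, model, vocab_size):
    """Log-probability of a phone sequence with add-one smoothing."""
    bigrams, contexts, _vocab = model
    return sum(
        math.log((bigrams[(p, c)] + 1) / (contexts[p] + vocab_size))
        for p, c in zip(seq, seq[1:])
    )

def identify_language(seq, models):
    """Pick the language whose bigram model scores the sequence highest."""
    vocab_size = len(set().union(*(m[2] for m in models.values())))
    return max(models, key=lambda lang: bigram_log_prob(seq, models[lang], vocab_size))

# Invented phone sequences for two hypothetical languages.
models = {
    "lang_x": train_bigram([["a", "b", "a", "b", "a"]]),
    "lang_y": train_bigram([["a", "a", "b", "b", "a", "a"]]),
}
print(identify_language(["a", "b", "a", "b"], models))  # prints "lang_x"
print(identify_language(["a", "a", "b", "b"], models))  # prints "lang_y"
```

In the parallel variant (PRLM-P) described in the summary, several such front ends and language-model banks would run side by side and their scores would be combined.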