Publications


Blind clustering of speech utterances based on speaker and language characteristics

Published in:
5th Int. Conf. Spoken Language Processing (ICSLP), 30 November - 4 December 1998.

Summary

Classical speaker and language recognition techniques can be applied to the classification of unknown utterances by computing the likelihoods of the utterances given a set of well-trained target models. This paper addresses the problem of grouping unknown utterances when no information is available regarding the speaker or language classes or even the total number of classes. Approaches to blind message clustering are presented based on conventional hierarchical clustering techniques and an integrated cluster generation and selection method called the d* algorithm. Results are presented using message sets derived from the Switchboard and Callfriend corpora. Potential applications include automatic indexing of recorded speech corpora by speaker/language tags and automatic or semiautomatic selection of speaker-specific speech utterances for speaker recognition adaptation.
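
As a rough illustration of the conventional hierarchical clustering the summary mentions, the sketch below merges utterances bottom-up and stops when the closest pair of clusters is farther apart than a threshold, so the number of classes does not have to be known in advance. It is not the d* algorithm, whose details are not given here. The distance matrix, the average-linkage rule, and the names `agglomerative_cluster` and `stop_threshold` are assumptions made for the example.

```python
from itertools import combinations

def average_linkage(dist, a, b):
    """Mean pairwise distance between two clusters of utterance indices."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

def agglomerative_cluster(dist, stop_threshold):
    """Merge the closest pair of clusters until no pair is within stop_threshold."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        # Find the closest pair of current clusters (x < y always holds).
        (x, y), d = min(
            (((x, y), average_linkage(dist, clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > stop_threshold:
            break  # remaining clusters are treated as distinct classes
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

# Hypothetical symmetric distances between five utterances
# (in practice these might come from model likelihood scores).
toy_dist = [
    [0.0, 0.2, 0.3, 2.0, 2.1],
    [0.2, 0.0, 0.25, 1.9, 2.2],
    [0.3, 0.25, 0.0, 2.3, 2.0],
    [2.0, 1.9, 2.3, 0.0, 0.1],
    [2.1, 2.2, 2.0, 0.1, 0.0],
]
clusters = agglomerative_cluster(toy_dist, stop_threshold=1.0)
```

The stopping threshold stands in for cluster selection; choosing it well is exactly the hard part that a blind clustering method has to address.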

Sheep, goats, lambs and wolves: a statistical analysis of speaker performance in the NIST 1998 speaker recognition evaluation

Summary

Performance variability in speech and speaker recognition systems can be attributed to many factors. One major factor, which is often acknowledged but seldom analyzed, is inherent differences in the recognizability of different speakers. In speaker recognition systems such differences are characterized by the use of animal names for different types of speakers, including sheep, goats, lambs and wolves, depending on their behavior with respect to automatic recognition systems. In this paper we propose statistical tests for the existence of these animals and apply these tests to hunt for such animals using results from the 1998 NIST speaker recognition evaluation.

Magnitude-only estimation of handset nonlinearity with application to speaker recognition

Published in:
Proc. of the 1998 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. II, Speech Processing II; Neural Networks for Signal Processing, 12-15 May 1998, pp. 745-748.

Summary

A method is described for estimating telephone handset nonlinearity by matching the spectral magnitude of the distorted signal to the output of a nonlinear channel model, driven by an undistorted reference. The "magnitude-only" representation allows the model to directly match unwanted speech formants that arise over nonlinear channels and that are a potential source of degradation in speaker and speech recognition algorithms. As such, the method is particularly suited to algorithms that use only spectral magnitude information. The distortion model consists of a memoryless polynomial nonlinearity sandwiched between two finite-length linear filters. Minimization of a mean-squared spectral magnitude error, with respect to model parameters, relies on iterative estimation via a gradient descent technique, using a Jacobian in the iterative correction term with gradients calculated by finite-element approximation. Initial work has demonstrated the algorithm's usefulness in speaker recognition over telephone channels by reducing mismatch between high- and low-quality handset conditions.
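
The model structure described above (a memoryless polynomial between two finite-length linear filters, fitted by minimizing a mean-squared spectral magnitude error) can be sketched in a few lines. This is a simplified illustration, not the paper's estimator: it uses plain gradient descent with forward-difference gradients and step halving instead of a Jacobian-based correction, and the filter lengths, polynomial order, test signal, and all function names are assumptions.

```python
import cmath
import math

def fir(x, h):
    """Causal FIR filter: y[n] = sum_k h[k] * x[n - k], truncated to len(x)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def polynomial(x, coeffs):
    """Memoryless polynomial: y = coeffs[0]*x + coeffs[1]*x**2 + ..."""
    return [sum(c * v ** (k + 1) for k, c in enumerate(coeffs)) for v in x]

def channel(x, params, n_pre, n_poly):
    """Pre-filter, then polynomial nonlinearity, then post-filter."""
    pre = params[:n_pre]
    poly = params[n_pre:n_pre + n_poly]
    post = params[n_pre + n_poly:]
    return fir(polynomial(fir(x, pre), poly), post)

def magnitude_spectrum(x):
    """DFT magnitudes for bins 0..N/2 (phase is discarded)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)))
            for f in range(n // 2 + 1)]

def magnitude_error(params, x, target_mag, n_pre, n_poly):
    """Mean-squared error between model and target spectral magnitudes."""
    model_mag = magnitude_spectrum(channel(x, params, n_pre, n_poly))
    return sum((m - t) ** 2 for m, t in zip(model_mag, target_mag)) / len(target_mag)

def numeric_gradient(f, params, eps=1e-6):
    """Forward-difference approximation of the gradient of f at params."""
    base = f(params)
    grad = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += eps
        grad.append((f(bumped) - base) / eps)
    return grad

def descend(f, params, steps=20, rate=0.1):
    """Gradient descent; a step is accepted only if it lowers the error."""
    params = list(params)
    for _ in range(steps):
        grad = numeric_gradient(f, params)
        step = rate
        while step > 1e-12:
            trial = [p - step * g for p, g in zip(params, grad)]
            if f(trial) < f(params):
                params = trial
                break
            step /= 2
    return params

# Hypothetical setup: 2-tap pre-filter, cubic polynomial, 2-tap post-filter.
N_PRE, N_POLY = 2, 3
reference = [math.sin(0.4 * n) + 0.5 * math.sin(1.1 * n) for n in range(32)]
true_params = [1.0, 0.3, 1.0, 0.0, -0.2, 1.0, -0.1]
target = magnitude_spectrum(channel(reference, true_params, N_PRE, N_POLY))

def objective(p):
    return magnitude_error(p, reference, target, N_PRE, N_POLY)

start = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]  # identity channel as initial guess
fitted = descend(objective, start)
```

Because only magnitudes are compared, different parameter sets can fit the target equally well, so the sketch checks the reduction in error rather than recovery of `true_params`.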

Comparison of background normalization methods for text-independent speaker verification

Published in:
5th European Conf. on Speech Communication and Technology, EUROSPEECH, 22-25 September 1997.

Summary

This paper compares two approaches to background model representation for a text-independent speaker verification task using Gaussian mixture models. We compare speaker-dependent background speaker sets to the use of a universal, speaker-independent background model (UBM). For the UBM, we describe how Bayesian adaptation can be used to derive claimant speaker models, providing a structure leading to significant computational savings during recognition. Experiments are conducted on the 1996 NIST Speaker Recognition Evaluation corpus and it is clearly shown that a system using a UBM and Bayesian adaptation of claimant models produces superior performance compared to speaker-dependent background sets or the UBM with independent claimant models. In addition, the creation and use of a telephone handset-type detector and a procedure called hnorm are also described, which show further large improvements in verification performance, especially under the difficult mismatched handset conditions. This is believed to be the first use of a handset-type detector and explicit handset-type normalization for the speaker verification task.
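
To make the scoring idea concrete, the sketch below computes a verification score as a log-likelihood ratio between the claimant model and the background model, then applies a handset-dependent normalization. It assumes hnorm takes the form of shifting and scaling a score by the mean and standard deviation of impostor scores observed for the detected handset type; the summary does not spell this out, so treat that form, the handset labels `carbon` and `electret`, and every name and number below as illustrative assumptions.

```python
from statistics import mean, stdev

def llr_score(claimant_loglik, ubm_loglik):
    """Log-likelihood ratio of the claimant model against the background model."""
    return claimant_loglik - ubm_loglik

def fit_hnorm(impostor_scores_by_handset):
    """Per-handset mean and standard deviation of impostor scores (assumed form)."""
    return {handset: (mean(scores), stdev(scores))
            for handset, scores in impostor_scores_by_handset.items()}

def hnorm(score, handset, stats):
    """Normalize a score using the statistics for the detected handset type."""
    mu, sigma = stats[handset]
    return (score - mu) / sigma

# Hypothetical impostor scores for one claimant model, grouped by handset label.
impostor_scores = {"carbon": [0.0, 2.0, 4.0], "electret": [-1.0, 0.0, 1.0]}
stats = fit_hnorm(impostor_scores)

raw = llr_score(claimant_loglik=-41.0, ubm_loglik=-45.0)
normalized = hnorm(raw, "carbon", stats)
```

The same raw score maps to different normalized values depending on the handset label, which is what lets a single decision threshold be used across handset types.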

HTIMIT and LLHDB: speech corpora for the study of handset transducer effects

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 2, 21-24 April 1997, pp. 1535-1538.

Summary

This paper describes two corpora collected at Lincoln Laboratory for the study of handset transducer effects on the speech signal: the handset TIMIT (HTIMIT) corpus and the Lincoln Laboratory Handset Database (LLHDB). The goal of these corpora is to minimize all confounding factors and to produce speech predominately differing only in handset transducer effects. The speech is recorded directly from a telephone unit in a sound-booth using prompted text and extemporaneous photograph descriptions. The two corpora allow comparison of speech collected from a person speaking into a handset (LLHDB) versus speech played through a loudspeaker into a headset (HTIMIT). A comparison of analysis and results between the two corpora will address the realism of artificially creating handset-degraded speech by playing recorded speech through handsets. The corpora are designed primarily for speaker recognition experimentation (in terms of amount of speech and level of transcription), but since both speaker and speech recognition systems operate on the same acoustic features affected by the handset, knowledge gleaned is directly transferable to speech recognizers.

Fine structure features for speaker identification

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 2, Speech (Part II), 7-10 May 1996, pp. 689-692.

Summary

The performance of speaker identification (SID) systems can be improved by the addition of the rapidly varying "fine structure" features of formant amplitude and/or frequency modulation and multiple excitation pulses. This paper shows how the estimation of such fine structure features can be improved further by obtaining better estimates of formant frequency locations and uncovering various sources of error in the feature extraction systems. Most female telephone speech showed "spurious" formants, due to distortion in the telephone network. Nevertheless, SID performance was greatest with these spurious formants as formant estimates. A new feature has also been identified which can increase SID performance: cepstral coefficients from noise in the estimated excitation waveform. Finally, statistical tools have been developed to explore the relative importance of features used for SID, with the ultimate goal of uncovering the source of the features that provide SID performance improvement.

The effects of handset variability on speaker recognition performance: experiments on the Switchboard corpus

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, 7-10 May 1996, pp. 113-116.

Summary

This paper presents an empirical study of the effects of handset variability on text-independent speaker recognition performance using the Switchboard corpus. Handset variability occurs when training speech is collected using one type of handset, but a different handset is used for collecting test speech. For the Switchboard corpus, the calling telephone number associated with a file is used to imply the handset used. Analysis of experiments designed to focus on handset variability on the SPIDRE database and the May95 NIST speaker recognition evaluation database shows that a performance gap between matched and mismatched handset tests persists even after applying several standard channel compensation techniques. Error rates for the mismatched tests are over 4 times those for the matched tests. Lastly, a new energy-dependent cepstral mean subtraction technique is proposed to compensate for nonlinear distortions, but is not found to improve performance on the databases used.
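
For readers unfamiliar with cepstral mean subtraction, here is a minimal sketch of the plain (not energy-dependent) version, which is a common channel compensation step. A fixed linear channel adds a roughly constant offset to each cepstral coefficient, so removing the per-coefficient mean over an utterance cancels that offset. The energy-dependent variant proposed in the paper is not specified in this summary and is not attempted; the function name and toy frames are assumptions.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over all frames of one utterance."""
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n_frames for d in range(n_coeffs)]
    return [[frame[d] - means[d] for d in range(n_coeffs)] for frame in frames]

# Hypothetical two-coefficient cepstral frames for one utterance.
clean = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]

# The same utterance through a fixed linear channel, modeled as a constant offset.
channel_offset = [0.5, -1.5]
shifted = [[c + o for c, o in zip(frame, channel_offset)] for frame in clean]

compensated_clean = cepstral_mean_subtraction(clean)
compensated_shifted = cepstral_mean_subtraction(shifted)
```

After compensation the two versions coincide, which is why this step removes linear channel effects but cannot by itself address the nonlinear handset distortions discussed above.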

Measuring fine structure in speech: application to speaker identification

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, 9-12 May 1995, pp. 325-328.

Summary

The performance of systems for speaker identification (SID) can be quite good with clean speech, though much lower with degraded speech. Thus it is useful to search for new features for SID, particularly features that are robust over a degraded channel. This paper investigates features that are based on amplitude and frequency modulations of speech formants, high resolution measurement of fundamental frequency and location of "secondary pulses," measured using a high-resolution energy operator. When these features are added to traditional features using an existing SID system with a 168-speaker telephone speech database, SID performance improved by as much as 4% for male speakers and 8.2% for female speakers.
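
The summary does not say which energy operator is used. As an assumption, the sketch below shows the discrete Teager-Kaiser energy operator, a widely used high-resolution operator that needs only three adjacent samples: psi[n] = x[n]^2 - x[n-1]*x[n+1]. For a pure tone A*cos(omega*n + phi) its output is the constant A^2 * sin(omega)^2, which is why it can track amplitude and frequency modulation sample by sample. The function name and the test tone are illustrative.

```python
import math

def teager_energy(x):
    """Discrete Teager-Kaiser operator: psi[n] = x[n]**2 - x[n-1] * x[n+1]."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

# Hypothetical pure tone with amplitude 2.0 and frequency 0.3 rad/sample.
amplitude, omega = 2.0, 0.3
tone = [amplitude * math.cos(omega * n) for n in range(50)]
energy = teager_energy(tone)

# Closed-form value the operator should return for every sample of the tone.
expected = amplitude ** 2 * math.sin(omega) ** 2
```

The output has two fewer samples than the input because the first and last samples lack a neighbor.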

The effects of telephone transmission degradations on speaker recognition performance

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, Speech, 9-12 May 1995, pp. 329-332.

Summary

The two largest factors affecting automatic speaker identification performance are the size of the population and the degradations introduced by noisy communication channels (e.g., telephone transmission). To examine these two factors experimentally, this paper presents text-independent speaker identification results for varying speaker population sizes up to 630 speakers for both clean, wideband speech and telephone speech. A system based on Gaussian mixture speaker models is used for speaker identification, and experiments are conducted on the TIMIT and NTIMIT databases. These are believed to be the first speaker identification experiments on the complete 630-speaker TIMIT and NTIMIT databases and the largest text-independent speaker identification task reported to date. Identification accuracies of 99.5% and 60.7% are achieved on the TIMIT and NTIMIT databases, respectively. This paper also presents experiments which examine and attempt to quantify the performance loss associated with various telephone degradations by systematically degrading the TIMIT speech in a manner consistent with measured NTIMIT degradations and measuring the performance loss at each step. It is found that the standard degradations of filtering and additive noise do not account for all of the performance gap between the TIMIT and NTIMIT data. Measurements of nonlinear microphone distortions are also...

Large population speaker identification using clean and telephone speech

Published in:
IEEE Signal Process. Lett., Vol. 2, No. 3, March 1995, pp. 46-48.

Summary

This paper presents text-independent speaker identification results for varying speaker population sizes up to 630 speakers for both clean, wideband speech, and telephone speech. A system based on Gaussian mixture speaker models is used for speaker identification, and experiments are conducted on the TIMIT and NTIMIT databases. The TIMIT results show large population performance under near-ideal conditions, and the NTIMIT results show the corresponding accuracy loss due to telephone transmission. These are believed to be the first speaker identification experiments on the complete 630-speaker TIMIT and NTIMIT databases and the largest text-independent speaker identification task reported to date. Identification accuracies of 99.5% and 60.7% were achieved on the TIMIT and NTIMIT databases, respectively.
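
Identification with Gaussian mixture speaker models amounts to scoring the test frames against every enrolled speaker's mixture and picking the speaker with the highest total log-likelihood. The sketch below shows that decision rule only, under simplifying assumptions: features are scalars rather than cepstral vectors, the models are given rather than trained, and the speaker names, mixture parameters, and function names are hypothetical.

```python
import math

def gmm_log_likelihood(frames, gmm):
    """Total log-likelihood of scalar frames under a list of (weight, mean, variance)."""
    total = 0.0
    for x in frames:
        # Log of each weighted Gaussian component density at x.
        terms = [math.log(w) - 0.5 * math.log(2 * math.pi * var)
                 - (x - mu) ** 2 / (2 * var)
                 for w, mu, var in gmm]
        # Log-sum-exp over components, shifted by the peak for numerical stability.
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total

def identify(frames, speaker_models):
    """Return the enrolled speaker whose model gives the highest log-likelihood."""
    return max(speaker_models,
               key=lambda name: gmm_log_likelihood(frames, speaker_models[name]))

# Hypothetical two-component models for two enrolled speakers.
speaker_models = {
    "spk_a": [(0.5, -1.0, 0.5), (0.5, 1.0, 0.5)],
    "spk_b": [(0.5, 4.0, 0.5), (0.5, 6.0, 0.5)],
}
first_guess = identify([0.9, -1.1, 1.2], speaker_models)
second_guess = identify([4.2, 5.8], speaker_models)
```

With a closed set of N enrolled speakers, the cost of this rule grows linearly in N, which is one reason population size matters in experiments like those described above.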