Publications

An overview of automatic speaker recognition technology

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. IV, 13-17 May 2002, pp. IV-4072 - IV-4075.

Summary

In this paper we provide a brief overview of the area of speaker recognition, describing applications, underlying techniques and some indications of performance. Following this overview we will discuss some of the strengths and weaknesses of current speaker recognition technologies and outline some potential future trends in research, development and applications, such as verifying speakers while they are conducting other speech interactions (background verification). As speaker and speech recognition systems merge and speech recognition accuracy improves, the distinction between text-independent and text-dependent applications will decrease. Of the two basic tasks, text-dependent speaker verification is currently ...

Speaker verification using text-constrained Gaussian mixture models

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. I, 13-17 May 2002, pp. I-677 - I-680.

Summary

In this paper we present an approach to close the gap between text-dependent and text-independent speaker verification performance. Text-constrained GMM-UBM systems are created using word segmentations produced by an LVCSR system on conversational speech, allowing the system to focus on speaker differences over a constrained set of acoustic units. Results on the 2001 NIST extended data task show this approach can be used to produce an equal error rate of < 1%.
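
A minimal sketch of the text-constrained scoring idea described above: keep only the frames that fall inside word segments from a constrained word list (the word times would come from an LVCSR transcript), then score them as a log-likelihood ratio between a speaker model and a UBM. The keyword list, feature sizes, and the use of a separately fitted speaker GMM are illustrative assumptions; the paper's systems MAP-adapt the speaker model from the UBM.

```python
# Illustrative text-constrained GMM-UBM scoring, not the paper's exact system.
import numpy as np
from sklearn.mixture import GaussianMixture

KEYWORDS = {"yeah", "okay", "right", "know"}   # hypothetical constrained word set

def select_frames(features, word_segs, frame_rate=100.0):
    """features: (T, D) array; word_segs: list of (word, start_s, end_s)."""
    keep = np.zeros(len(features), dtype=bool)
    for word, start, end in word_segs:
        if word in KEYWORDS:
            keep[int(start * frame_rate):int(end * frame_rate)] = True
    return features[keep]

def llr_score(frames, speaker_gmm, ubm):
    """Average per-frame log-likelihood ratio of speaker model vs. UBM."""
    return float(np.mean(speaker_gmm.score_samples(frames) -
                         ubm.score_samples(frames)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(2000, 20))            # stand-in cepstral features
    test = rng.normal(size=(500, 20))
    segs = [("yeah", 0.0, 1.0), ("um", 1.0, 2.0), ("okay", 2.0, 3.0)]
    ubm = GaussianMixture(n_components=8, covariance_type="diag",
                          random_state=0).fit(train)
    # Placeholder for MAP adaptation: here a separate GMM is simply fitted.
    spk = GaussianMixture(n_components=8, covariance_type="diag",
                          random_state=1).fit(train[:1000])
    print(llr_score(select_frames(test, segs), spk, ubm))
```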

Speaker detection and tracking for telephone transactions

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, 13-17 May 2002, pp. 129-132.

Summary

As ever greater numbers of telephone transactions are being conducted solely between a caller and an automated answering system, the need increases for software which can automatically identify and authenticate these callers without the need for an onerous speaker enrollment process. In this paper we introduce and investigate a novel speaker detection and tracking (SDT) technique, which dynamically merges the traditional enrollment and recognition phases of the static speaker recognition task. In this speaker recognition application, no prior speaker models exist and the goal is to detect and model new speakers as they call into the system while also recognizing utterances from the previously modeled callers. New speakers are added to the enrolled set of speakers and speech from speakers in the currently enrolled set is used to update models. We describe a system based on a GMM speaker identification (SID) system and develop a new measure to evaluate the performance of the system on the SDT task. Results for both static, open-set detection and the SDT task are presented using a portion of the Switchboard corpus of telephone speech communications. Static open-set detection produces an equal error rate of about 5%. As expected, performance for SDT is quite varied, depending greatly on the speaker set and ordering of the test sequence. These initial results, however, are quite promising and point to potential areas in which to improve the system performance.
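
A rough sketch of the SDT control loop described above: each incoming utterance is scored against the currently enrolled models; if no model scores above a threshold, a new speaker is enrolled, otherwise the best-matching model is updated with the new speech. The model type, scoring, and threshold value are placeholders, not the paper's GMM-SID configuration.

```python
# Illustrative speaker detection and tracking (SDT) loop with dynamic enrollment.
import numpy as np
from sklearn.mixture import GaussianMixture

THRESHOLD = 0.0          # hypothetical open-set decision threshold (LLR-like)

def train_gmm(frames, n=4):
    return GaussianMixture(n_components=n, covariance_type="diag",
                           random_state=0).fit(frames)

def sdt(utterances, ubm):
    """utterances: iterable of (T, D) feature arrays, in call order."""
    models, data, labels = [], [], []
    for utt in utterances:
        scores = [np.mean(m.score_samples(utt) - ubm.score_samples(utt))
                  for m in models]
        if not scores or max(scores) < THRESHOLD:
            labels.append(len(models))             # detect: new speaker
            data.append(utt)
            models.append(train_gmm(utt))
        else:
            best = int(np.argmax(scores))           # track: known speaker
            labels.append(best)
            data[best] = np.vstack([data[best], utt])
            models[best] = train_gmm(data[best])    # re-train on pooled speech
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ubm = train_gmm(rng.normal(size=(2000, 12)), n=8)
    calls = [rng.normal(loc=mu, size=(300, 12)) for mu in (0.0, 3.0, 0.0, 3.0)]
    print(sdt(calls, ubm))   # e.g. [0, 1, 0, 1] if detection succeeds
```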

Gender-dependent phonetic refraction for speaker recognition

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 13-17 May 2002, Vol. 1, pp. 149-152.

Summary

This paper describes improvements to an innovative high-performance speaker recognition system. Recent experiments showed that, with sufficient training data, phone strings from multiple languages are exceptional features for speaker recognition. The prototype phonetic speaker recognition system used phone sequences from six languages to produce an equal error rate of 11.5% on Switchboard-I audio files. The improved system described in this paper reduces the equal error rate to less than 4%. This is accomplished by incorporating gender-dependent phone models, pre-processing the speech files to remove cross-talk, and developing more sophisticated fusion techniques for the multi-language likelihood scores.
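
As an illustration of fusing per-language likelihood scores into a single detection score, the sketch below trains a simple logistic-regression fusion on held-out trial scores. This is one common fusion approach, not the paper's technique; the score matrix and labels here are synthetic stand-ins.

```python
# Illustrative fusion of per-language phonetic speaker-recognition scores.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_lang = 6                                      # six phone recognizers, as above
# dev_scores: (n_trials, n_languages) per-language scores; dev_labels: 1 for
# target trials, 0 for impostor trials (both synthetic for this sketch).
dev_labels = rng.integers(0, 2, size=400)
dev_scores = rng.normal(size=(400, n_lang)) + dev_labels[:, None]

fuser = LogisticRegression().fit(dev_scores, dev_labels)

def fused_score(per_language_scores):
    """Combine one trial's per-language scores into a single log-odds score."""
    s = np.asarray(per_language_scores).reshape(1, -1)
    return float(fuser.decision_function(s)[0])

print(fused_score(rng.normal(size=n_lang) + 1.0))
```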

Speaker recognition from coded speech and the effects of score normalization

Published in:
Proc. Thirty-Fifth Asilomar Conf. on Signals, Systems and Computers, Vol. 2, 4-7 November 2001, pp. 1562-1567.

Summary

We investigate the effect of speech coding on automatic speaker recognition when training and testing conditions are matched and mismatched. Experiments used standard speech coding algorithms (GSM, G.729, G.723, MELP) and a speaker recognition system based on Gaussian mixture models adapted from a universal background model. There is little loss in recognition performance for toll-quality speech coders and slightly more loss when lower-quality speech coders are used. Speaker recognition from coded speech using handset-dependent score normalization and test score normalization is also examined. Both types of score normalization significantly improve performance and can eliminate the performance loss that occurs when there is a mismatch between training and testing conditions.
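
The two normalizations discussed above can be sketched as simple shift-and-scale operations on the raw trial score: hnorm uses impostor-score statistics estimated per handset type for the target model, while tnorm uses statistics taken from scoring the same test utterance against a cohort of other speaker models. The data structures below are illustrative, not the systems' actual interfaces.

```python
# Illustrative handset-dependent (hnorm) and test (tnorm) score normalization.
import numpy as np

def hnorm(raw_score, model_handset_stats, handset_label):
    """model_handset_stats: {handset_label: (mean, std)} of impostor scores
    for this target model, estimated offline per handset type."""
    mu, sigma = model_handset_stats[handset_label]
    return (raw_score - mu) / sigma

def tnorm(raw_score, cohort_scores):
    """cohort_scores: scores of the same test utterance against a set of
    impostor (cohort) models."""
    cohort_scores = np.asarray(cohort_scores, dtype=float)
    return (raw_score - cohort_scores.mean()) / cohort_scores.std()

# Hypothetical usage: a raw trial score of 1.2 on an "electret" handset.
stats = {"carbon": (0.4, 0.8), "electret": (-0.1, 0.6)}
print(hnorm(1.2, stats, "electret"))
print(tnorm(1.2, [0.1, -0.3, 0.5, 0.0, 0.2]))
```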

Speaker recognition from coded speech in matched and mismatched conditions

Published in:
Proc. 2001: A Speaker Odyssey, The Speaker Recognition Workshop, 18-22 June 2001, pp. 115-120.

Summary

We investigate the effect of speech coding on automatic speaker recognition when training and testing conditions are matched and mismatched. Experiments use standard speech coding algorithms (GSM, G.729, G.723, MELP) and a speaker recognition system based on Gaussian mixture models adapted from a universal background model. There is little loss in recognition performance for toll-quality speech coders and slightly more loss when lower-quality speech coders are used. Speaker recognition from coded speech using handset-dependent score normalization is examined, and we find that this significantly improves performance, particularly when there is a mismatch between training and testing conditions.

Speaker indexing in large audio databases using anchor models

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, 7-11 May 2001, pp. 429-432.

Summary

This paper introduces the technique of anchor modeling in the applications of speaker detection and speaker indexing. The anchor modeling algorithm is refined by pruning the number of models needed. The system is applied to the speaker detection problem where its performance is shown to fall short of the state-of-the-art Gaussian Mixture Model with Universal Background Model (GMM-UBM) system. However, it is further shown that its computational efficiency lends itself to speaker indexing for searching large audio databases for desired speakers. Here, excessive computation may prohibit the use of the GMM-UBM recognition system. Finally, the paper presents a method for cascading anchor model and GMM-UBM detectors for speaker indexing. This approach benefits from the efficiency of anchor modeling and high accuracy of GMM-UBM recognition.
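
In the spirit of the cascade described above, the sketch below represents each utterance by its vector of scores against a fixed set of anchor models, performs a cheap nearest-neighbour search in that score space, and passes only the top candidates to a more expensive detector. The model sizes, distance measure, and the stand-in detector are illustrative assumptions, not the paper's configuration.

```python
# Illustrative anchor-model indexing with a GMM-UBM rescoring cascade.
import numpy as np
from sklearn.mixture import GaussianMixture

def anchor_vector(frames, anchors, ubm):
    """Project an utterance onto its per-anchor average log-likelihood ratios."""
    return np.array([np.mean(a.score_samples(frames) - ubm.score_samples(frames))
                     for a in anchors])

def index_then_rescore(query_frames, db, anchors, ubm, detector_score, top_n=5):
    """db: list of (utterance_id, anchor_vector, frames). Returns the best id."""
    q = anchor_vector(query_frames, anchors, ubm)
    dists = [np.linalg.norm(q - vec) for _, vec, _ in db]
    shortlist = np.argsort(dists)[:top_n]          # cheap anchor-space search
    # Expensive, accurate rescoring only on the shortlisted utterances.
    best = max(shortlist, key=lambda i: detector_score(query_frames, db[i][2]))
    return db[best][0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fit = lambda x: GaussianMixture(n_components=4, covariance_type="diag",
                                    random_state=0).fit(x)
    ubm = fit(rng.normal(size=(1500, 10)))
    anchors = [fit(rng.normal(loc=m, size=(400, 10))) for m in (-2.0, 0.0, 2.0)]
    db_frames = [rng.normal(loc=m, size=(300, 10)) for m in (-1.0, 1.0)]
    db = [(i, anchor_vector(f, anchors, ubm), f) for i, f in enumerate(db_frames)]
    # Stand-in detector: fit a model on the database utterance and score the
    # query against it (a placeholder for GMM-UBM MAP adaptation and scoring).
    score = lambda q, ref: float(np.mean(fit(ref).score_samples(q) -
                                         ubm.score_samples(q)))
    print(index_then_rescore(rng.normal(loc=1.0, size=(300, 10)), db,
                             anchors, ubm, score, top_n=2))
```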

The Lincoln speaker recognition system: NIST EVAL2000

Published in:
6th Int. Conf. on Spoken Language, ICSLP, 16-20 October 2000.

Summary

This paper presents an overview of the Lincoln Laboratory systems fielded for the 2000 NIST speaker recognition evaluation (SRE00). In addition to the standard one-speaker detection tasks, this year's evaluation, as in 1999, included multi-speaker spokes dealing with detection, tracking and segmentation. The design approach for the Lincoln system in SRE00 was to develop a set of core one-speaker detection and multi-speaker clustering tools that could be applied to all the tasks. This paper will describe these core systems, how they are applied to the SRE00 tasks and the results they produce. Additionally, a new channel normalization technique known as handset-dependent test-score norm (HTnorm) is introduced.

Estimation of handset nonlinearity with application to speaker recognition

Published in:
IEEE Trans. Speech Audio Process., Vol. 8, No. 5, September 2000, pp. 567-584.

Summary

A method is described for estimating telephone handset nonlinearity by matching the spectral magnitude of the distorted signal to the output of a nonlinear channel model, driven by an undistorted reference. This "magnitude-only" representation allows the model to directly match unwanted speech formants that arise over nonlinear channels and that are a potential source of degradation in speaker and speech recognition algorithms. As such, the method is particularly suited to algorithms that use only spectral magnitude information. The distortion model consists of a memoryless nonlinearity sandwiched between two finite-length linear filters. Nonlinearities considered include arbitrary finite-order polynomials and parametric sigmoidal functionals derived from a carbon-button handset model. Minimization of a mean-squared spectral magnitude distance with respect to model parameters relies on iterative estimation via a gradient descent technique. Initial work has demonstrated the importance of addressing handset nonlinearity, in addition to linear distortion, in speaker recognition over telephone channels. A nonlinear handset "mapping," applied to training or testing data to reduce mismatch between different types of handset microphone outputs, improves speaker verification performance relative to linear compensation only. Finally, a method is proposed to merge the mapper strategy with a method of likelihood score normalization (hnorm) for further mismatch reduction and speaker verification performance improvement.
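
A minimal sketch of the distortion-model fit described above: a memoryless polynomial nonlinearity sandwiched between two short FIR filters, with parameters chosen to minimize a mean-squared spectral-magnitude distance between the model output (driven by the undistorted reference) and the observed distorted signal. The paper uses an analytic gradient-descent procedure; this sketch simply hands the objective to a general-purpose optimizer, and all filter lengths, polynomial orders, and signals are arbitrary stand-ins.

```python
# Illustrative fit of a FIR -> memoryless nonlinearity -> FIR channel model.
import numpy as np
from scipy.optimize import minimize
from scipy.signal import lfilter

FIR_LEN, POLY_ORDER = 5, 3

def channel(params, x):
    """FIR filter -> memoryless polynomial nonlinearity -> FIR filter."""
    h1 = params[:FIR_LEN]
    poly = params[FIR_LEN:FIR_LEN + POLY_ORDER]
    h2 = params[FIR_LEN + POLY_ORDER:]
    y = lfilter(h1, [1.0], x)
    y = sum(c * y ** (k + 1) for k, c in enumerate(poly))
    return lfilter(h2, [1.0], y)

def spectral_mag_mse(params, reference, distorted):
    """Mean-squared distance between spectral magnitudes."""
    err = np.abs(np.fft.rfft(channel(params, reference))) - \
          np.abs(np.fft.rfft(distorted))
    return float(np.mean(err ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=2048)                       # stand-in reference signal
    true = np.r_[0.9, 0.1, 0, 0, 0, 1.0, 0.0, 0.3, 1.0, 0.05, 0, 0, 0]
    dist = channel(true, ref)                         # synthetic "handset" output
    x0 = np.r_[1.0, np.zeros(FIR_LEN - 1), 1.0, np.zeros(POLY_ORDER - 1),
               1.0, np.zeros(FIR_LEN - 1)]
    fit = minimize(spectral_mag_mse, x0, args=(ref, dist), method="BFGS")
    print(fit.fun)                                    # residual spectral distance
```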

Speaker recognition using G.729 speech codec parameters

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. II, 5-9 June 2000, pp. 1089-1092.

Summary

Experiments in Gaussian-mixture-model speaker recognition from mel-filter bank energies (MFBs) of the G.729 codec all-pole spectral envelope showed significant performance loss relative to the standard mel-cepstral coefficients of G.729 synthesized (coded) speech. In this paper, we investigate two approaches to recovering speaker recognition performance from G.729 parameters. In the first approach, rather than deriving cepstra from MFBs of an all-pole spectrum, the G.729 LSFs are converted to "direct" cepstral coefficients, for which there exists a one-to-one correspondence with the LSFs. The G.729 residual is also considered; in particular, appending G.729 pitch as a single parameter to the direct cepstral coefficients gives a further performance gain. The second, nonparametric approach uses the original MFB paradigm, but adds harmonic striations to the G.729 all-pole spectral envelope. Although we obtain considerable performance gains with these methods, we have yet to match the performance of G.729 synthesized speech, motivating the need to represent additional fine structure of the G.729 residual.
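
The kind of conversion the first approach relies on can be sketched with the standard minimum-phase recursion from an all-pole polynomial A(z) = 1 + a1 z^-1 + ... + ap z^-p to the cepstrum of 1/A(z). The LSF-to-LPC step is assumed to have been done elsewhere (for example with a helper such as lsf2poly from the third-party spectrum package), and the example coefficients below are hypothetical.

```python
# Illustrative LPC-to-cepstrum recursion (the LSF-to-LPC conversion is assumed).
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """a: LPC coefficients [1, a1, ..., ap]; returns c[1..n_ceps] of 1/A(z)."""
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for m in range(1, n_ceps + 1):
        acc = a[m] if m <= p else 0.0
        for k in range(1, min(p, m - 1) + 1):
            acc += (m - k) / m * a[k] * c[m - k]
        c[m] = -acc
    return c[1:]

# Hypothetical 10th-order LPC polynomial (G.729 uses a 10th-order model).
a = np.array([1.0, -1.2, 0.5, 0.1, -0.05, 0.02, 0.0, 0.01, -0.02, 0.03, -0.01])
print(lpc_to_cepstrum(a, 12))
```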