Publications


Speaker verification using text-constrained Gaussian mixture models

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. I, 13-17 May 2002, pp. I-677 - I-680.

Summary

In this paper we present an approach to close the gap between text-dependent and text-independent speaker verification performance. Text-constrained GMM-UBM systems are created using word segmentations produced by an LVCSR system on conversational speech, allowing the system to focus on speaker differences over a constrained set of acoustic units. Results on the 2001 NIST extended data task show this approach can be used to produce an equal error rate of less than 1%.
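The GMM-UBM verification score underlying this system is an average log-likelihood ratio between a speaker-adapted Gaussian mixture model and the universal background model. A minimal sketch of that scoring step, with toy diagonal-covariance models rather than the paper's actual trained models:

```python
import numpy as np

def gmm_logpdf(frames, weights, means, variances):
    # Per-frame log-likelihood under a diagonal-covariance GMM.
    # frames: (T, D); weights: (M,); means, variances: (M, D)
    diff = frames[:, None, :] - means[None, :, :]                     # (T, M, D)
    exponent = -0.5 * np.sum(diff ** 2 / variances, axis=-1)          # (T, M)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=-1)  # (M,)
    log_comp = np.log(weights) + log_norm + exponent                  # (T, M)
    m = log_comp.max(axis=1, keepdims=True)                           # log-sum-exp
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True))).ravel()

def llr_score(frames, speaker_gmm, ubm):
    # Average log-likelihood ratio: speaker model vs. universal background model.
    return np.mean(gmm_logpdf(frames, *speaker_gmm) - gmm_logpdf(frames, *ubm))
```

In the text-constrained variant, this score would be computed only over frames falling inside the selected word segments rather than over the whole utterance.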

Speaker detection and tracking for telephone transactions

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, 13-17 May 2002, pp. 129-132.

Summary

As ever greater numbers of telephone transactions are being conducted solely between a caller and an automated answering system, the need increases for software which can automatically identify and authenticate these callers without the need for an onerous speaker enrollment process. In this paper we introduce and investigate a novel speaker detection and tracking (SDT) technique, which dynamically merges the traditional enrollment and recognition phases of the static speaker recognition task. In this speaker recognition application, no prior speaker models exist and the goal is to detect and model new speakers as they call into the system while also recognizing utterances from the previously modeled callers. New speakers are added to the enrolled set of speakers and speech from speakers in the currently enrolled set is used to update models. We describe a system based on a GMM speaker identification (SID) system and develop a new measure to evaluate the performance of the system on the SDT task. Results for both static, open-set detection and the SDT task are presented using a portion of the Switchboard corpus of telephone speech communications. Static open-set detection produces an equal error rate of about 5%. As expected, performance for SDT is quite varied, depending greatly on the speaker set and ordering of the test sequence. These initial results, however, are quite promising and point to potential areas in which to improve the system performance.
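The detect-and-track loop described above can be sketched as an open-set decision per utterance: score against all currently enrolled models, accept the best match if it clears a threshold, otherwise enroll a new speaker. This is an illustrative skeleton with a pluggable scoring function, not the paper's GMM-SID system:

```python
def detect_and_track(utterances, score_fn, threshold):
    """utterances: iterable of (utt_id, features); score_fn(model, feats) -> score.
    Returns one label per utterance; enrolls a new model when no score clears
    the threshold. Models here are simply the enrolling utterance's features."""
    models = {}   # speaker label -> model
    labels = []
    for utt_id, feats in utterances:
        scores = {spk: score_fn(model, feats) for spk, model in models.items()}
        best = max(scores, key=scores.get) if scores else None
        if best is not None and scores[best] >= threshold:
            labels.append(best)        # known speaker; a real system would also
                                       # update the matched model with this speech
        else:
            new_spk = f"spk{len(models)}"
            models[new_spk] = feats    # unknown speaker: enroll a new model
            labels.append(new_spk)
    return labels
```

As the abstract notes, performance of such a loop depends heavily on the ordering of the test sequence, since early mistakes contaminate the enrolled models.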

Speech enhancement based on auditory spectral change

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. I, Speech Processing Neural Networks for Signal Processing, 13-17 May 2002, pp. I-257 - I-260.

Summary

In this paper, an adaptive approach to the enhancement of speech signals is developed based on auditory spectral change. The algorithm is motivated by sensitivity of aural biologic systems to signal dynamics, by evidence that noise is aurally masked by rapid changes in a signal, and by analogies to these two aural phenomena in biologic visual processing. Emphasis is on preserving nonstationarity, i.e., speech transient and time-varying components, such as plosive bursts, formant transitions, and vowel onsets, while suppressing additive noise. The essence of the enhancement technique is a Wiener filter that uses a desired signal spectrum whose estimation adapts to stationarity of the measured signal. The degree of stationarity is derived from a signal change measurement, based on an auditory spectrum that accentuates change in spectral bands. The adaptive filter is applied in an unconventional overlap-add analysis/synthesis framework, using a very short 4-ms analysis window and a 1-ms frame interval. In informal listening, the reconstructions are judged to be "crisp" corresponding to good temporal resolution of transient and rapidly-moving speech events.
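The core of the technique is a Wiener gain whose desired-signal spectrum estimate smooths heavily when the signal is stationary and adapts quickly at transients. A simplified sketch of that per-frame gain computation (the change measure, smoothing constants, and spectral-subtraction estimate of the desired signal are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def adaptive_wiener_gains(noisy_psd, noise_psd, change):
    # noisy_psd: (frames, bins) measured power spectra; noise_psd: (bins,) noise
    # estimate; change: (frames,) spectral-change measure in [0, 1], 1 = transient.
    gains = np.empty_like(noisy_psd)
    desired = np.maximum(noisy_psd[0] - noise_psd, 1e-10)
    for t in range(noisy_psd.shape[0]):
        inst = np.maximum(noisy_psd[t] - noise_psd, 1e-10)  # instantaneous estimate
        # Heavy smoothing when the signal is stationary, fast adaptation at
        # transients, so plosive bursts and formant transitions are preserved.
        alpha = 1.0 - 0.9 * (1.0 - change[t])
        desired = alpha * inst + (1.0 - alpha) * desired
        gains[t] = desired / (desired + noise_psd)          # Wiener gain per bin
    return gains
```

In the paper's framework these gains would be applied within a short-window (4-ms) overlap-add analysis/synthesis loop.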

Speech-to-speech translation: technology and applications study

Published in:
MIT Lincoln Laboratory Report TR-1080

Summary

This report describes a study effort on the state-of-the-art and lessons learned in automated, two-way, speech-to-speech translation and its potential application to military problems. The study includes and comments upon an extensive set of references on prior and current work in speech translation. The study includes recommendations on future military applications and on R&D needed to successfully achieve those applications. Key findings of the study include: (1) R&D speech translation systems have been demonstrated, but only in limited domains, and their performance is inadequate for operational use; (2) as far as we have been able to determine, there are currently no operational two-way speech translation systems; (3) intensive, sustained R&D will be needed to develop usable two-way speech translation systems. Major recommendations include: (1) a substantial R&D program in speech translation is needed, especially including full end-to-end system prototyping and evaluation; (2) close cooperation among researchers and users speaking multiple languages will be needed for the development of useful application systems; (3) to get military users involved and interacting in a mode which enables them to provide useful inputs and feedback on system requirements and performance, it will be necessary to provide them at the start with a fairly robust, open-domain system which works to the degree that some two-way speech translation is operational.

Gender-dependent phonetic refraction for speaker recognition

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 13-17 May 2002, Vol. 1, pp. 149-152.

Summary

This paper describes improvements to an innovative high-performance speaker recognition system. Recent experiments showed that with sufficient training data, phone strings from multiple languages are exceptional features for speaker recognition. The prototype phonetic speaker recognition system used phone sequences from six languages to produce an equal error rate of 11.5% on Switchboard-I audio files. The improved system described in this paper reduces the equal error rate to less than 4%. This is accomplished by incorporating gender-dependent phone models, pre-processing the speech files to remove cross-talk, and developing more sophisticated fusion techniques for the multi-language likelihood scores.

Language identification using Gaussian mixture model tokenization

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. I, 13-17 May 2002, pp. I-757 - I-760.

Summary

Phone tokenization followed by n-gram language modeling has consistently provided good results for the task of language identification. In this paper, this technique is generalized by using Gaussian mixture models as the basis for tokenizing. Performance results are presented for a system employing a GMM tokenizer in conjunction with multiple language processing and score combination techniques. On the 1996 CallFriend LID evaluation set, a 12-way closed set error rate of 17% was obtained.
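The tokenize-then-model pipeline can be sketched in two pieces: map each frame to its closest GMM component (a hard-decision simplification of the tokenizer), then score the token stream against per-language n-gram statistics. All model parameters below are toy assumptions for illustration:

```python
import numpy as np
from collections import Counter

def tokenize(frames, means):
    # Map each frame to the index of the nearest GMM mean. A real GMM tokenizer
    # would pick the highest-likelihood component; nearest-mean is the
    # equal-weight, equal-variance special case.
    d = ((frames[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def bigram_logprob(tokens, counts, vocab, smoothing=1.0):
    # Score a token stream against one language's add-one-smoothed bigram counts.
    total = sum(counts.values())
    logp = 0.0
    for bg in zip(tokens, tokens[1:]):
        logp += np.log((counts.get(bg, 0) + smoothing) /
                       (total + smoothing * vocab * vocab))
    return logp
```

Closed-set language identification then amounts to picking the language whose bigram model assigns the test utterance's token stream the highest score.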

Interlingua-based English-Korean two-way speech translation of doctor-patient dialogues with CCLINC

Published in:
Machine Trans. Vol. 17, No. 3, 2002, pp. 213-243.

Summary

Development of a robust two-way real-time speech translation system exposes researchers and system developers to various challenges of machine translation (MT) and spoken language dialogues. The need for communicating in at least two different languages poses problems not present for a monolingual spoken language dialogue system, where no MT engine is embedded within the process flow. Integration of various component modules for real-time operation poses challenges not present for text translation. In this paper, we present the CCLINC (Common Coalition Language System at Lincoln Laboratory) English-Korean two-way speech translation system prototype trained on doctor-patient dialogues, which integrates various techniques to tackle the challenges of automatic real-time speech translation. Key features of the system include (i) language-independent meaning representation which preserves the hierarchical predicate-argument structure of an input utterance, providing a powerful mechanism for discourse understanding of utterances originating from different languages, word-sense disambiguation and generation of various word orders of many languages, (ii) adoption of the DARPA Communicator architecture, a plug-and-play distributed system architecture which facilitates integration of component modules and system operation in real time, and (iii) automatic acquisition of grammar rules and lexicons for easy porting of the system to different languages and domains. We describe these features in detail and present experimental results.

Speaker recognition from coded speech and the effects of score normalization

Published in:
Proc. Thirty-Fifth Asilomar Conf. on Signals, Systems and Computers, Vol. 2, 4-7 November 2001, pp. 1562-1567.

Summary

We investigate the effect of speech coding on automatic speaker recognition when training and testing conditions are matched and mismatched. Experiments used standard speech coding algorithms (GSM, G.729, G.723, MELP) and a speaker recognition system based on Gaussian mixture models adapted from a universal background model. There is little loss in recognition performance for toll quality speech coders and slightly more loss when lower quality speech coders are used. Speaker recognition from coded speech using handset dependent score normalization and test score normalization are examined. Both types of score normalization significantly improve performance, and can eliminate the performance loss that occurs when there is a mismatch between training and testing conditions.
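Test score normalization (T-norm), one of the two normalizations examined, standardizes a trial's raw score using the same test utterance scored against a cohort of impostor models. A minimal sketch:

```python
import numpy as np

def tnorm(raw_score, cohort_scores):
    # T-norm: standardize the claimed-speaker score by the mean and standard
    # deviation of the test utterance's scores against a cohort of impostor
    # models, making a single decision threshold meaningful across conditions.
    mu = np.mean(cohort_scores)
    sigma = np.std(cohort_scores)
    return (raw_score - mu) / sigma
```

Handset-dependent normalization follows the same idea but draws the normalization statistics from the matching handset type rather than from a per-utterance cohort.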

Toward an improved concept-based information retrieval system

Published in:
Proc. of the 24th Annual ACM SIGIR Conf. on Research and Development in Information Retrieval, 9-13 September 2001, pp. 384-385.

Summary

This paper presents a novel information retrieval system that includes 1) the addition of concepts to facilitate the identification of the correct word sense, 2) a natural language query interface, 3) the inclusion of weights and penalties for proper nouns that build upon the Okapi weighting scheme, and 4) a term clustering technique that exploits the spatial proximity of search terms in a document to further improve the performance. The effectiveness of the system is validated by experimental results.
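The Okapi weighting scheme the system builds upon combines an inverse-document-frequency factor with a saturating, length-normalized term-frequency factor. A sketch of that baseline weight (the BM25 form with conventional parameter values; the paper's proper-noun weights and penalties would be layered on top):

```python
import math

def bm25_weight(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    # Okapi BM25 term weight: rarer terms (low df) score higher, repeated terms
    # saturate via k1, and long documents are discounted via b.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
```

A document's score for a query is then the sum of these weights over the query terms it contains.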

Preliminary speaker recognition experiments on the NATO N4 corpus

Published in:
Proc. Workshop on Multilingual Speech and Language Processing, 8 September 2001.

Summary

The NATO N4 corpus contains speech collected at naval training schools within several NATO countries. The speech utterances comprising the corpus are short, tactical transmissions typical of NATO naval communications. In this paper, we report the results of some preliminary speaker recognition experiments on the N4 corpus. We compare the performance of three speaker recognition systems developed at TNO Human Factors, the US Air Force Research Laboratory, Information Directorate and MIT Lincoln Laboratory on the segment of N4 data collected in the Netherlands. Performance is reported as a function of both training and test data duration. We also investigate the impact of cross-language training and testing.