Publications

A subband approach to time-scale expansion of complex acoustic signals

Published in:
IEEE Trans. Speech Audio Process., Vol. 3, No. 6, November 1995, pp. 515-519.

Summary

A new approach to time-scale expansion of short-duration complex acoustic signals is introduced. Using a subband signal representation, channel phases are selected to preserve a desired time-scaled temporal envelope. The phase representation is derived from locations of events that occur within filter bank outputs. A frame-based generalization of the method imposes phase consistency across consecutive synthesis frames. The method is applied to synthetic and actual complex acoustic signals consisting of closely spaced, rapidly damped sine waves. Time-frequency resolution limitations are discussed.

Time-scale modification with inconsistent constraints

Published in:
Proc. 1995 Workshop on Applications of Signal Processing to Audio Acoustics, 15-18 October 1995.

Summary

A set theoretic estimation approach is introduced for timescale modification of complex acoustic signals. The method determines a signal that meets, in a least-squared error sense, desired temporal and spectral envelope constraints that are inconsistent. These constraints are generalized within the set theoretic framework to include other signal characteristics such as instantaneous frequency and group delay. The approach can enhance acoustic signals consisting of closely-spaced sequential time components, and is applicable to biological, underwater, and music sound processing.

Military and government applications of human-machine communication by voice

Published in:
Proc. Natl. Acad. Sci., Vol. 92, October 1995, pp. 10011-10016.

Summary

This paper describes a range of opportunities for military and government applications of human-machine communication by voice, based on visits and contacts with numerous user organizations in the United States. The applications include some that appear to be feasible by careful integration of current state-of-the-art technology and others that will require a varying mix of advances in speech technology and in integration of the technology into applications environments. Applications that are described include (1) speech recognition and synthesis for mobile command and control; (2) speech processing for a portable multifunction soldier's computer; (3) speech- and language-based technology for naval combat team tactical training; (4) speech technology for command and control on a carrier flight deck; (5) control of auxiliary systems, and alert and warning generation, in fighter aircraft and helicopters; and (6) voice check-in, report entry, and communication for law enforcement agents or special forces. A phased approach for transfer of the technology into applications is advocated, where integration of applications systems is pursued in parallel with advanced research to meet future needs.

Sine-wave amplitude coding using a mixed LSF/PARCOR representation

Published in:
Proc. 1995 IEEE Workshop on Speech Coding for Telecommunications, 20-22 September 1995, pp. 77-8.

Summary

An all-pole model of the speech spectral envelope is used to code the sine-wave amplitudes in the Sinusoidal Transform Coder. While line spectral frequencies (LSFs) are currently used to represent this all-pole model, it is shown that a mixture of line spectral frequencies and partial correlation (PARCOR) coefficients can be used to reduce complexity without a loss in quantization efficiency. Objective and subjective measures demonstrate that speech quality is maintained. In addition, the use of split vector quantization is shown to substantially reduce the number of bits needed to code the all-pole model.
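
As a rough illustration of how PARCOR coefficients can stand in for an all-pole model, the sketch below uses the standard step-up and step-down recursions to convert between reflection (PARCOR) coefficients and the coefficients of the predictor polynomial A(z). It is not the coder's mixed LSF/PARCOR scheme and covers no quantization. The function names, the sign convention, and the coefficient values are assumptions made for this example.

```python
import numpy as np

def parcor_to_lpc(k):
    """Step-up recursion: reflection (PARCOR) coefficients -> A(z) coefficients.

    Returns [1, a_1, ..., a_p] for A(z) = 1 + a_1 z^-1 + ... + a_p z^-p.
    Sign conventions for PARCOR coefficients vary; this one is an assumption.
    """
    a = np.array([1.0])
    for km in k:
        a_ext = np.concatenate([a, [0.0]])
        # a_i(m) = a_i(m-1) + k_m * a_(m-i)(m-1); the last coefficient becomes k_m
        a = a_ext + km * a_ext[::-1]
    return a

def lpc_to_parcor(a):
    """Step-down recursion: A(z) coefficients -> reflection coefficients."""
    a = np.asarray(a, dtype=float)
    k = []
    while len(a) > 1:
        km = a[-1]
        k.append(km)
        # Undo one step-up stage and drop the highest-order term
        a = (a - km * a[::-1])[:-1] / (1.0 - km ** 2)
    return k[::-1]

# Hypothetical second-order example (values are not from the paper)
k = [0.5, -0.3]
a = parcor_to_lpc(k)
k_back = lpc_to_parcor(a)
```

With this parameterization, keeping every reflection coefficient strictly inside (-1, 1) keeps all roots of A(z) inside the unit circle, so the all-pole filter 1/A(z) stays stable.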

A comparison of signal processing front ends for automatic word recognition

Published in:
IEEE Trans. Speech Audio Process., Vol. 3, No. 4, July 1995, pp. 286-293.

Summary

This paper compares the word error rate of a speech recognizer using several signal processing front ends based on auditory properties. Front ends were compared with a control mel filter bank (MFB) based cepstral front end in clean speech and with speech degraded by noise and spectral variability, using the TI-105 isolated word database. MFB recognition error rates ranged from 0.5 to 3.1%, and the reduction in error rates provided by auditory models was less than 0.5 percentage points. Some earlier studies that demonstrated considerably more improvement with auditory models used linear predictive coding (LPC) based control front ends. This paper shows that MFB cepstra significantly outperform LPC cepstra under noisy conditions. Techniques using an optimal linear combination of features for data reduction were also evaluated.

Measuring fine structure in speech: application to speaker identification

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, 9-12 May 1995, pp. 325-328.

Summary

The performance of systems for speaker identification (SID) can be quite good with clean speech, though much lower with degraded speech. Thus it is useful to search for new features for SID, particularly features that are robust over a degraded channel. This paper investigates features that are based on amplitude and frequency modulations of speech formants, high resolution measurement of fundamental frequency and location of "secondary pulses," measured using a high-resolution energy operator. When these features are added to traditional features using an existing SID system with a 168 speaker telephone speech database, SID performance improved by as much as 4% for male speakers and 8.2% for female speakers.

Language identification using phoneme recognition and phonotactic language modeling

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 5, 9-12 May 1995, pp. 3503-3506.

Summary

A language identification technique using multiple single-language phoneme recognizers followed by n-gram language models yielded top performance at the March 1994 NIST language identification evaluation. Since the NIST evaluation, work has been aimed at further improving performance by using the acoustic likelihoods emitted from gender-dependent phoneme recognizers to weight the phonotactic likelihoods output from gender-dependent language models. We have investigated the effect of restricting processing to the most highly discriminating n-grams, and we have also added explicit duration modeling at the phonotactic level. On the OGI Multi-language Telephone Speech Corpus, accuracy on an 11-language identification task has risen to 89% on 45-s utterances and 79% on 10-s utterances. Two-language classification accuracy is 98% and 95% for the 45-s and 10-s utterances, respectively. Finally, we have started to apply these same techniques to the problem of dialect identification.
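
The phonotactic scoring idea can be sketched in a few lines: train one n-gram model per language on phone sequences, then pick the language whose model assigns the highest log-likelihood to a new sequence. This is a heavily simplified sketch, not the system described above. It assumes a single phone stream rather than multiple single-language recognizers, uses bigrams with add-alpha smoothing, and omits acoustic weighting, gender dependence, and duration modeling. All names and the toy phone strings are hypothetical.

```python
import math
from collections import Counter

def train_bigram_model(phone_sequences, vocab_size, alpha=1.0):
    """Count phone bigrams for one language (add-alpha smoothing applied at scoring time)."""
    bigrams, contexts = Counter(), Counter()
    for seq in phone_sequences:
        padded = ["<s>"] + list(seq)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab_size, alpha

def log_likelihood(seq, model):
    """Smoothed bigram log-likelihood of a phone sequence under one language model."""
    bigrams, contexts, vocab_size, alpha = model
    padded = ["<s>"] + list(seq)
    return sum(
        math.log((bigrams[(prev, cur)] + alpha) / (contexts[prev] + alpha * vocab_size))
        for prev, cur in zip(padded, padded[1:])
    )

def identify_language(seq, models):
    """Return the language whose phonotactic model scores the sequence highest."""
    return max(models, key=lambda lang: log_likelihood(seq, models[lang]))

# Toy training data over a two-phone inventory (hypothetical, for illustration only)
models = {
    "lang_x": train_bigram_model([["a", "b", "a", "b"], ["a", "b", "a", "b"]], vocab_size=2),
    "lang_y": train_bigram_model([["a", "a", "b", "b"], ["b", "b", "a", "a"]], vocab_size=2),
}
guess_1 = identify_language(["a", "b", "a"], models)
guess_2 = identify_language(["b", "b", "a", "a"], models)
```

Here the alternating sequence is attributed to `lang_x` and the doubled-phone sequence to `lang_y`, because each matches the bigram statistics of the corresponding toy training set.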

The effects of telephone transmission degradations on speaker recognition performance

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, Speech, 9-12 May 1995, pp. 329-332.

Summary

The two largest factors affecting automatic speaker identification performance are the size of the population and the degradations introduced by noisy communication channels (e.g., telephone transmission). To examine these two factors experimentally, this paper presents text-independent speaker identification results for varying speaker population sizes up to 630 speakers for both clean, wideband speech and telephone speech. A system based on Gaussian mixture speaker models is used for speaker identification, and experiments are conducted on the TIMIT and NTIMIT databases. These are believed to be the first speaker identification experiments on the complete 630 speaker TIMIT and NTIMIT databases and the largest text-independent speaker identification task reported to date. Identification accuracies of 99.5% and 60.7% are achieved on the TIMIT and NTIMIT databases, respectively. This paper also presents experiments which examine and attempt to quantify the performance loss associated with various telephone degradations by systematically degrading the TIMIT speech in a manner consistent with measured NTIMIT degradations and measuring the performance loss at each step. It is found that the standard degradations of filtering and additive noise do not account for all of the performance gap between the TIMIT and NTIMIT data. Measurements of nonlinear microphone distortions are also...

Large population speaker identification using clean and telephone speech

Published in:
IEEE Signal Process. Lett., Vol. 2, No. 3, March 1995, pp. 46-48.

Summary

This paper presents text-independent speaker identification results for varying speaker population sizes up to 630 speakers for both clean, wideband speech, and telephone speech. A system based on Gaussian mixture speaker models is used for speaker identification, and experiments are conducted on the TIMIT and NTIMIT databases. The TIMIT results show large population performance under near-ideal conditions, and the NTIMIT results show the corresponding accuracy loss due to telephone transmission. These are believed to be the first speaker identification experiments on the complete 630 speaker TIMIT and NTIMIT databases and the largest text-independent speaker identification task reported to date. Identification accuracies of 99.5 and 60.7% were achieved on the TIMIT and NTIMIT databases, respectively.

Robust text-independent speaker identification using Gaussian mixture speaker models

Published in:
IEEE Trans. Speech Audio Process., Vol. 3, No. 1, January 1995, pp. 72-83.

Summary

This paper introduces and motivates the use of Gaussian mixture models (GMM) for robust text-independent speaker identification. The individual Gaussian components of a GMM are shown to represent some general speaker-dependent spectral shapes that are effective for modeling speaker identity. The focus of this work is on applications which require high identification rates using short utterances from unconstrained conversational speech and robustness to degradations produced by transmission over a telephone channel. A complete experimental evaluation of the Gaussian mixture speaker model is conducted on a 49 speaker, conversational telephone speech database. The experiments examine algorithmic issues (initializations, variance limiting, model order selection), spectral variability robustness techniques, large population performance, and comparisons to other speaker modeling techniques (uni-modal Gaussian, VQ codebook, tied Gaussian mixture, and radial basis functions). The Gaussian mixture speaker model attains 96.8% identification accuracy using 5 second clean speech utterances and 80.8% accuracy using 15 second telephone speech utterances with a 49 speaker population and is shown to outperform the other speaker modeling techniques on an identical 16 speaker telephone speech task.
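
The decision rule behind GMM-based speaker identification can be illustrated with a minimal sketch: score the test feature frames against each speaker's mixture model and choose the speaker with the highest average log-likelihood. The code assumes diagonal covariances and already-trained model parameters. It does not cover training, variance limiting, or the feature extraction used in the paper. The function names, speaker labels, and synthetic parameters are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of frames (T, D) under a diagonal-covariance GMM.

    weights: (M,), means: (M, D), variances: (M, D)
    """
    frames = np.asarray(frames, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    diff = frames[:, None, :] - means[None, :, :]                      # (T, M, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (M,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)  # (T, M)
    log_mix = np.log(weights) + log_comp                               # (T, M)
    # Log-sum-exp over mixture components for numerical stability
    peak = log_mix.max(axis=1, keepdims=True)
    frame_ll = peak[:, 0] + np.log(np.exp(log_mix - peak).sum(axis=1))
    return frame_ll.mean()

def identify_speaker(frames, models):
    """Return (best_speaker, scores) using the maximum average log-likelihood."""
    scores = {name: gmm_log_likelihood(frames, *params) for name, params in models.items()}
    return max(scores, key=scores.get), scores

# Two hypothetical speakers with synthetic 2-D "features" (illustration only)
rng = np.random.default_rng(0)
models = {
    "spk_a": (np.array([0.5, 0.5]), np.array([[0.0, 0.0], [1.0, 1.0]]), np.ones((2, 2))),
    "spk_b": (np.array([0.5, 0.5]), np.array([[5.0, 5.0], [6.0, 6.0]]), np.ones((2, 2))),
}
test_frames = rng.normal(loc=5.5, scale=1.0, size=(50, 2))
best, scores = identify_speaker(test_frames, models)
```

Averaging over frames keeps scores comparable across utterances of different lengths. Because the synthetic test frames are drawn near the means of `spk_b`, that model scores highest here.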