Publications

Two-talker pitch tracking for co-channel talker interference suppression

Published in:
MIT Lincoln Laboratory Report TR-951

Summary

Almost all co-channel talker interference suppression systems use the difference in the pitches of the target and jammer speakers to suppress the jammer and enhance the target. While joint pitch estimators outputting two pitch estimates as a function of time have been proposed, the task of proper assignment of pitch to speaker (two-talker pitch tracking) has proven difficult. This report describes several approaches to the two-talker pitch tracking problem including algorithms for pitch track interpolation, spectral envelope tracking, and spectral envelope classification. When evaluated on an all-voiced two-talker database, the best of these new tracking systems correctly assigned pitch 87% of the time given perfect joint pitch estimation.
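
As an illustration of the assignment step that the report calls two-talker pitch tracking, here is a minimal sketch (not the report's interpolation or spectral-envelope algorithms) that assigns the two joint pitch estimates in each frame to speaker tracks by favoring frame-to-frame pitch continuity; the input layout is an assumption made for the example.

```python
import numpy as np

def assign_pitch_tracks(pitch_pairs):
    """Assign two joint pitch estimates per frame to two speaker tracks.

    pitch_pairs: (N, 2) array of unordered pitch estimates (Hz) per frame
    (hypothetical layout). Returns an (N, 2) array where column 0 follows
    speaker A and column 1 follows speaker B, chosen greedily so that
    frame-to-frame pitch jumps are as small as possible.
    """
    pitch_pairs = np.asarray(pitch_pairs, dtype=float)
    tracks = np.empty_like(pitch_pairs)
    tracks[0] = pitch_pairs[0]                    # arbitrary initial assignment
    for n in range(1, len(pitch_pairs)):
        p, q = pitch_pairs[n]
        a_prev, b_prev = tracks[n - 1]
        keep = abs(p - a_prev) + abs(q - b_prev)  # cost of keeping the order
        swap = abs(q - a_prev) + abs(p - b_prev)  # cost of swapping
        tracks[n] = (p, q) if keep <= swap else (q, p)
    return tracks
```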

An integrated speech-background model for robust speaker identification

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 2, 23-26 March 1992, pp. 185-188.

Summary

This paper examines a procedure for text-independent speaker identification in noisy environments where the interfering background signals cannot be characterized using traditional broadband or impulsive noise models. In the procedure, both the speaker and the background processes are modeled using mixtures of Gaussians. Speaker and background models are integrated into a unified statistical framework allowing the decoupling of the underlying speech process from the noise-corrupted observations via the expectation-maximization algorithm. Using this formalism, speaker model parameters are estimated in the presence of the background process, and a scoring procedure is implemented for computing the speaker likelihood in the noise-corrupted environment. Performance is evaluated using a 16-speaker conversational speech database with both "speech babble" and white noise background processes.
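
For context, a minimal baseline sketch of Gaussian-mixture speaker identification, assuming scikit-learn and a hypothetical per-speaker feature layout; it omits the paper's integrated speech-background model and EM-based decoupling, and only illustrates how mixture-of-Gaussian speaker models are trained and scored.

```python
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_by_speaker, n_components=8):
    """Fit one Gaussian mixture model per speaker.

    features_by_speaker: dict mapping speaker id -> (n_frames, dim) array of
    cepstral feature vectors (hypothetical data layout for illustration).
    """
    return {spk: GaussianMixture(n_components, covariance_type="diag").fit(X)
            for spk, X in features_by_speaker.items()}

def identify_speaker(models, test_features):
    """Return the speaker whose model gives the highest average log-likelihood."""
    return max(models, key=lambda spk: models[spk].score(test_features))
```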

A speech recognizer using radial basis function neural networks in an HMM framework

Published in:
ICASSP'92, Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 1, Speech Processing 1, 23-26 March 1992, pp. 629-632.

Summary

A high performance speaker-independent isolated-word speech recognizer was developed which combines hidden Markov models (HMMs) and radial basis function (RBF) neural networks. RBF networks in this recognizer use discriminant training techniques to estimate Bayesian probabilities for each speech frame while HMM decoders estimate overall word likelihood scores for network outputs. RBF training is performed after the HMM recognizer has automatically segmented training tokens using forced Viterbi alignment. In recognition experiments using a speaker-independent E-set database, the hybrid recognizer had an error rate of 11.5% compared to 15.7% for the robust unimodal Gaussian HMM recognizer upon which the hybrid system was based. The error rate was also lower than that of a tied-mixture HMM recognizer with the same number of centers. These results demonstrate that RBF networks can be successfully incorporated in hybrid recognizers and suggest that they may be capable of good performance with fewer parameters than required by Gaussian mixture classifiers.
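
A common way to use network posterior outputs inside an HMM decoder (a hedged sketch, not necessarily the exact formulation in this paper) is to divide each frame posterior by the class prior so the result behaves like a scaled likelihood:

```python
import numpy as np

def posteriors_to_scaled_loglik(posteriors, priors, floor=1e-8):
    """Convert frame-level state posteriors P(state | frame) from a network
    into scaled log-likelihoods log P(frame | state) + const by dividing out
    the class priors. The resulting scores can feed a standard Viterbi decoder.

    posteriors: (T, S) array, rows sum to one.
    priors:     (S,) array of state priors estimated from the training alignment.
    """
    posteriors = np.clip(np.asarray(posteriors, dtype=float), floor, None)
    priors = np.clip(np.asarray(priors, dtype=float), floor, None)
    return np.log(posteriors) - np.log(priors)
```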

Shape invariant time-scale and pitch modification of speech

Published in:
IEEE Trans. Signal Process., Vol. 40, No. 3, March 1992, pp. 497-510.

Summary

The simplified linear model of speech production predicts that when the rate of articulation is changed, the resulting waveform takes on the appearance of the original, except for a change in the time scale. The goal of this paper is to develop a time-scale modification system that preserves this shape-invariance property during voicing. This is done using a version of the sinusoidal analysis-synthesis system that models and independently modifies the phase contributions of the vocal tract and vocal cord excitation. An important property of the system is its capability of performing time-varying rates of change. Extensions of the method are applied to fixed and time-varying pitch modification of speech. The sine-wave analysis-synthesis system also allows for shape-invariant joint time-scale and pitch modification, and allows for the adjustment of the time scale and pitch according to speech characteristics such as the degree of voicing.
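
To illustrate only the sinusoidal synthesis step with a modified time scale (not the paper's decomposition of vocal-tract and excitation phase, which is what yields shape invariance), here is a toy resynthesis sketch using piecewise-constant per-frame sine-wave parameters; the input layout, sampling rate, and frame length are assumptions.

```python
import numpy as np

def stretch_sinusoids(amps, freqs, rate, fs=8000, frame_len=160):
    """Resynthesize a waveform from per-frame sine-wave amplitudes and
    frequencies with the time scale stretched by `rate` (> 1 slows speech).

    amps, freqs: (n_frames, K) arrays of sine-wave amplitudes and frequencies
    in Hz, taken from a sinusoidal analysis (hypothetical input layout).
    """
    out_frame = int(round(frame_len * rate))
    n_frames, K = np.shape(amps)
    phase = np.zeros(K)
    out = []
    for m in range(n_frames):
        t = np.arange(out_frame)
        w = 2 * np.pi * np.asarray(freqs[m]) / fs   # radians per sample
        frame = np.zeros(out_frame)
        for k in range(K):
            # each track keeps its frequency, so pitch is preserved while
            # the frame simply lasts `rate` times longer
            frame += amps[m][k] * np.cos(phase[k] + w[k] * t)
        phase = (phase + w * out_frame) % (2 * np.pi)  # carry phase forward
        out.append(frame)
    return np.concatenate(out)
```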

Improved hidden Markov model speech recognition using radial basis function networks

Published in:
Advances in Neural Information Processing Systems, Denver, CO, 2-5 December 1991.

Summary

A high performance speaker-independent isolated-word hybrid speech recognizer was developed which combines Hidden Markov Models (HMMs) and Radial Basis Function (RBF) neural networks. In recognition experiments using a speaker-independent E-set database, the hybrid recognizer had an error rate of 11.5% compared to 15.7% for the robust unimodal Gaussian HMM recognizer upon which the hybrid system was based. These results and additional experiments demonstrate that RBF networks can be successfully incorporated in hybrid recognizers and suggest that they may be capable of good performance with fewer parameters than required by Gaussian mixture classifiers. A global parameter optimization method designed to minimize the overall word error rather than the frame recognition error failed to reduce the error rate.

Neural network classifiers estimate Bayesian a posteriori probabilities

Published in:
Neural Comput., Vol. 3, No. 4, Winter 1991, pp. 461-483.

Summary

Many neural network classifiers provide outputs which estimate Bayesian a posteriori probabilities. When the estimation is accurate, network outputs can be treated as probabilities and sum to one. Simple proofs show that Bayesian probabilities are estimated when desired network outputs are 1 of M (one output unity, all others zero) and a squared-error or cross-entropy cost function is used. Results of Monte Carlo simulations performed using multilayer perceptron (MLP) networks trained with backpropagation, radial basis function (RBF) networks, and high-order polynomial networks graphically demonstrate that network outputs provide good estimates of Bayesian probabilities. Estimation accuracy depends on network complexity, the amount of training data, and the degree to which training data reflect true likelihood distributions and a priori class probabilities. Interpretation of network outputs as Bayesian probabilities allows outputs from multiple networks to be combined for higher level decision making, simplifies creation of rejection thresholds, makes it possible to compensate for differences between pattern class probabilities in training and test data, allows outputs to be used to minimize alternative risk functions, and suggests alternative measures of network performance.
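
The squared-error case of the "simple proofs" can be summarized as follows (a sketch of the standard result, in generic notation rather than the paper's):

```latex
% With 1-of-M targets d_i \in \{0,1\} and E[d_i \mid x] = P(C_i \mid x),
% the expected squared error of a network output y_i(x) decomposes as
\begin{align}
E\big[(y_i(x) - d_i)^2\big]
  &= E_x\Big[\big(y_i(x) - P(C_i \mid x)\big)^2\Big]
   + E_x\big[\operatorname{Var}(d_i \mid x)\big].
\end{align}
% The second term does not depend on the network, so the cost is minimized
% when y_i(x) = P(C_i \mid x): the outputs estimate the posterior probabilities.
```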

Opportunities for advanced speech processing in military computer-based systems

Published in:
Proc. IEEE, Vol. 79, No. 11, November 1991, pp. 1626-1641.

Summary

This paper presents a study of military applications of advanced speech processing technology which includes three major elements: 1) review and assessment of current efforts in military applications of speech technology; 2) identification of opportunities for future military applications of advanced speech technology; and 3) identification of problem areas where research in speech processing is needed to meet application requirements, and of current research thrusts which appear promising. The relationship of this study to previous assessments of military applications of speech technology is discussed and substantial recent progress is noted. Current efforts in military applications of speech technology which are highlighted include: 1) narrow-band (2400 b/s) and very low-rate (50-1200 b/s) secure voice communication; 2) voice/data integration in computer networks; 3) speech recognition in fighter aircraft, military helicopters, battle management, and air traffic control training systems; and 4) noise and interference removal for human listeners. Opportunities for advanced applications are identified by means of descriptions of several generic systems which would be possible with advances in speech technology and in system integration. These generic systems include 1) an integrated multirate voice data communications terminal; 2) an interactive speech enhancement system; 3) a voice-controlled pilot's associate system; 4) advanced air traffic control training systems; 5) a battle management command and control support system with spoken natural language interface; and 6) a spoken language translation system. In identifying problem areas and research efforts to meet application requirements, it is observed that some of the most promising research involves the integration of speech algorithm techniques including speech coding, speech recognition, and speaker recognition.

Low-rate speech coding based on the sinusoidal model

Published in:
Chapter 6 in Advances in Speech Signal Processing, Marcel Dekker, Inc., 1992, pp. 165-208.

Summary

One approach to the problem of representation of speech signals is to use the speech production model in which speech is viewed as the result of passing a glottal excitation waveform through a time-varying linear filter that models the resonant characteristics of the vocal tract. In many applications it suffices to assume that the glottal excitation can be in one of two possible states corresponding to voiced or unvoiced speech. In attempts to design high-quality speech coders at the midband rates, generalizations of the binary excitation model have been developed. One such approach is multipulse (Atal and Remde, 1982) which uses more than one pitch pulse to model voiced speech and a possibly random set of pulses to model unvoiced speech. Code excited linear prediction (CELP) (Schroeder and Atal, 1985) is another representation which models the excitation as one of a number of random sequences or "codewords" superimposed on periodic pitch pulses. In this chapter the goal is also to generalize the model for the glottal excitation; but instead of using impulses as in multipulse or random sequences as in CELP, the excitation is assumed to be composed of sinusoidal components of arbitrary amplitudes, frequencies, and phases (McAulay and Quatieri, 1986).
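
As a pointer to what the sine-wave representation involves, here is a bare-bones sketch of the analysis side: estimating per-frame sinusoidal amplitudes, frequencies, and phases by peak-picking a windowed DFT. It illustrates the general technique rather than the chapter's coder; the window choice and peak count are assumptions.

```python
import numpy as np

def sine_wave_params(frame, fs, n_peaks=40):
    """Estimate sinusoidal amplitudes, frequencies (Hz), and phases for one
    speech frame by picking local maxima of its windowed DFT magnitude."""
    window = np.hamming(len(frame))
    spectrum = np.fft.rfft(frame * window)
    mag = np.abs(spectrum)
    # indices of local maxima of the magnitude spectrum
    peaks = np.where((mag[1:-1] > mag[:-2]) & (mag[1:-1] >= mag[2:]))[0] + 1
    peaks = peaks[np.argsort(mag[peaks])[::-1][:n_peaks]]   # strongest peaks
    freqs = peaks * fs / len(frame)                          # bin -> Hz
    amps = 2 * mag[peaks] / np.sum(window)                   # undo window gain
    phases = np.angle(spectrum[peaks])
    return amps, freqs, phases
```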

Speech nonlinearities, modulations, and energy operators

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 14-17 May 1991, pp. 421-424.

Summary

In this paper, we investigate an AM-FM model for representing modulations in speech resonances. Specifically, we propose a frequency modulation (FM) model for the time-varying formants whose amplitude varies as the envelope of an amplitude-modulated (AM) signal. To detect the modulations we apply the energy operator Ψ(x) = (ẋ)^2 - xẍ and its discrete counterpart. We found that Ψ can approximately track the envelope of AM signals, the instantaneous frequency of FM signals, and the product of these two functions in the general case of AM-FM signals. Several experiments are reported on the applications of this AM-FM modeling to speech signals, bandpass filtered via Gabor filtering.
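
The discrete counterpart referred to here is the Teager-Kaiser form Ψ[x(n)] = x(n)^2 - x(n-1)x(n+1). A minimal sketch showing it tracking the envelope of a synthetic AM signal (the test-signal parameters are illustrative, not from the paper):

```python
import numpy as np

def teager_energy(x):
    """Discrete energy operator: psi[x](n) = x(n)^2 - x(n-1) * x(n+1)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a tone A*cos(w*n + phi), psi is approximately A^2 * sin(w)^2, so it
# tracks the squared amplitude envelope of a slowly varying AM signal.
fs = 8000
n = np.arange(fs)
env = 1 + 0.5 * np.cos(2 * np.pi * 5 * n / fs)    # slow 5 Hz AM envelope
x = env * np.cos(2 * np.pi * 800 * n / fs)        # 800 Hz carrier
est_env = np.sqrt(np.maximum(teager_energy(x), 0.0)) / np.sin(2 * np.pi * 800 / fs)
```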

Peak-to-rms reduction of speech based on a sinusoidal model

Published in:
IEEE Trans. Signal Process., Vol. 39, No. 2, February 1991, pp. 273-288.

Summary

In a number of applications, a speech waveform is processed using phase dispersion and amplitude compression to reduce its peak-to-rms ratio so as to increase loudness and intelligibility while minimizing perceived distortion. In this paper, a sinusoidal-based analysis/synthesis system is used to apply a radar design solution to the problem of dispersing the phase of a speech waveform. Unlike conventional methods of phase dispersion, this solution technique adapts dynamically to the pitch and spectral characteristics of the speech, while maintaining the original spectral envelope. The solution can also be used to drive the sine-wave amplitude modification for amplitude compression, and is coupled to the desired shaping of the speech spectrum. The new dispersion solution, when integrated with amplitude compression, results in a significant reduction in the peak-to-rms ratio of the speech waveform with acceptable loss in quality. Application of a real-time prototype sine-wave preprocessor to AM radio broadcasting is described.
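
For reference, the figure of merit discussed here can be computed with a short helper (an illustration only, not the paper's dispersion or compression algorithm):

```python
import numpy as np

def peak_to_rms_db(x):
    """Peak-to-rms ratio of a waveform in dB: 20*log10(max|x| / rms(x))."""
    x = np.asarray(x, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    return 20 * np.log10(np.max(np.abs(x)) / rms)
```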