Publications
Adaptive short-time analysis-synthesis for speech enhancement
Summary
In this paper we propose a multiresolution short-time analysis method for speech enhancement. It is well known that fixed-resolution methods such as the traditional short-time Fourier transform do not generally match the time-frequency structure of the signal being analyzed, resulting in poor estimates of the speech and noise spectra...
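To illustrate the fixed-resolution tradeoff this work addresses, the sketch below computes a conventional fixed-window STFT of a chirp at two window lengths (the window sizes, hop, and signal are illustrative, not the paper's configuration): the long window resolves frequency finely but smears time, and vice versa.

```python
import numpy as np

def stft(x, win_len, hop):
    """Fixed-resolution STFT: Hann-windowed frames, one DFT per frame."""
    win = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, win_len // 2 + 1)

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * (200 * t + 300 * t ** 2))  # slowly rising chirp

# Long window: fine frequency resolution, coarse time resolution.
X_long = stft(x, win_len=512, hop=128)
# Short window: fine time resolution, coarse frequency resolution.
X_short = stft(x, win_len=64, hop=16)
```

A multiresolution method, by contrast, adapts the analysis window to the local time-frequency structure rather than committing to one of these two extremes.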
Exploiting temporal change in pitch in formant estimation
Summary
This paper considers the problem of obtaining an accurate spectral representation of speech formant structure when the voicing source exhibits a high fundamental frequency. Our work is inspired by auditory perception and physiological modeling studies suggesting that humans exploit temporal changes in speech. Specifically, we develop and assess...
Multisensor very low bit rate speech coding using segment quantization
Summary
We present two approaches to noise-robust very low bit rate speech coding using wideband MELP analysis/synthesis. Both methods exploit multiple acoustic and non-acoustic input sensors, using our previously presented dynamic waveform fusion algorithm to simultaneously perform waveform fusion, noise suppression, and cross-channel noise cancellation. One coder uses a 600 bps...
Spectral representations of nonmodal phonation
Summary
Regions of nonmodal phonation, which exhibit deviations from uniform glottal-pulse periods and amplitudes, occur often in speech and convey information about linguistic content, speaker identity, and vocal health. Some aspects of these deviations are random, including small perturbations, known as jitter and shimmer, as well as more significant aperiodicities. Other...
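The jitter and shimmer perturbations mentioned above have standard local definitions: the mean absolute cycle-to-cycle change in glottal-pulse period (or amplitude), normalized by the mean period (or amplitude). A minimal sketch, with made-up pulse measurements:

```python
import numpy as np

def jitter_shimmer(periods, amplitudes):
    """Local jitter and shimmer: mean absolute pulse-to-pulse difference,
    normalized by the mean value (standard local definitions)."""
    periods = np.asarray(periods, float)
    amplitudes = np.asarray(amplitudes, float)
    jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)
    shimmer = np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)
    return jitter, shimmer

# Near-modal phonation: small random perturbations around a 10 ms period.
j, s = jitter_shimmer(periods=[10.0, 10.1, 9.9, 10.05],
                      amplitudes=[1.0, 0.98, 1.02, 1.0])
```

Values near zero indicate modal phonation; larger values indicate the nonmodal deviations the paper characterizes spectrally.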
Sinewave analysis/synthesis based on the fan-chirp transform
Summary
There have been numerous recent strides toward making sinewave analysis consistent with time-varying sinewave models. This is particularly important in high-frequency speech regions where harmonic frequency modulation (FM) can be significant. One notable approach is the fan-chirp transform, which provides a set of FM-sinewave basis functions consistent with...
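As a rough sketch of the idea (not the paper's implementation), a discrete fan-chirp analysis correlates the signal with complex exponentials on a warped time axis, so that every analysis frequency f has instantaneous frequency f·(1 + αt) for a common chirp rate α. For an FM harmonic, the magnitude at the matched chirp rate is much larger than with a plain Fourier (α = 0) basis; the signal parameters below are illustrative.

```python
import numpy as np

def fan_chirp(x, fs, freqs, alpha):
    """Discrete fan-chirp analysis (sketch): correlate x with exponentials
    on the warped time axis phi(t) = (1 + alpha*t/2)*t, so each analysis
    frequency f has instantaneous frequency f*(1 + alpha*t)."""
    t = np.arange(len(x)) / fs
    phi = (1 + 0.5 * alpha * t) * t
    w = np.sqrt(np.abs(1 + alpha * t))  # amplitude normalization
    return np.exp(-2j * np.pi * np.outer(freqs, phi)) @ (w * x)

fs, f0, alpha = 8000, 400.0, 4.0
t = np.arange(int(0.05 * fs)) / fs                        # one 50 ms frame
x = np.cos(2 * np.pi * f0 * (1 + 0.5 * alpha * t) * t)    # FM "harmonic"

matched = fan_chirp(x, fs, [f0], alpha)[0]     # chirp rate matches signal
mismatched = fan_chirp(x, fs, [f0], 0.0)[0]    # plain Fourier basis
```

The matched basis concentrates the harmonic's energy into a single coefficient, which is what makes the transform attractive for sinewave analysis of FM-rich speech.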
An evaluation of audio-visual person recognition on the XM2VTS corpus using the Lausanne protocols
Summary
A multimodal person recognition architecture has been developed to improve overall recognition performance and to address channel-specific performance shortfalls. This multimodal architecture includes the fusion of a face recognition system with the MIT/LL GMM/UBM speaker recognition architecture. This architecture exploits the complementary and redundant nature of the face...
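At its simplest, fusing two biometric systems happens at the score level. The sketch below is a minimal illustration, not the paper's method: each modality's scores are z-normalized so they are commensurate, then combined with a convex weight (the weight and normalization choices are assumptions).

```python
import numpy as np

def fuse_scores(face_scores, speaker_scores, w=0.5):
    """Score-level fusion sketch: z-normalize each modality's scores,
    then combine with a convex weight w (hypothetical parameters)."""
    def znorm(s):
        s = np.asarray(s, float)
        return (s - s.mean()) / s.std()
    return w * znorm(face_scores) + (1 - w) * znorm(speaker_scores)

# Three trials scored by both systems (illustrative numbers).
fused = fuse_scores([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```

When the two modalities fail on different channels, a fused score can stay discriminative even if one modality degrades.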
Robust speaker recognition with cross-channel data: MIT-LL results on the 2006 NIST SRE auxiliary microphone task
Summary
One particularly difficult challenge for cross-channel speaker verification is the auxiliary microphone task introduced in the 2005 and 2006 NIST Speaker Recognition Evaluations, where training uses telephone speech and verification uses speech from multiple auxiliary microphones. This paper presents two approaches to compensate for the effects of auxiliary microphones on...
Multisensor dynamic waveform fusion
Summary
Speech communication is significantly more difficult in severe acoustic background noise environments, especially when low-rate speech coders are used. Non-acoustic sensors, such as radar sensors, vibrometers, and bone-conduction microphones, offer significant potential in these situations. We extend previous work on fixed waveform fusion from multiple sensors to an optimal dynamic...
Auditory modeling as a basis for spectral modulation analysis with application to speaker recognition
Summary
This report explores auditory modeling as a basis for robust automatic speaker verification. Specifically, we have developed feature-extraction front-ends that incorporate (1) time-varying, level-dependent filtering, (2) variations in analysis filterbank size, and (3) nonlinear adaptation. Our methods are motivated both by a desire to better mimic auditory processing relative to traditional...
Analysis of nonmodal phonation using minimum entropy deconvolution
Summary
Nonmodal phonation occurs when glottal pulses exhibit nonuniform pulse-to-pulse characteristics such as irregular spacings, amplitudes, and/or shapes. The analysis of regions of such nonmodality has application to automatic speech, speaker, language, and dialect recognition. In this paper, we examine the usefulness of a technique called minimum-entropy deconvolution, or MED, for...
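The classic Wiggins iteration is one way to realize minimum-entropy deconvolution: repeatedly solve for the FIR filter whose output is maximally "spiky" under a normalized-kurtosis (varimax) criterion. The sketch below is a minimal implementation on a synthetic blurred pulse train (the filter length, iteration count, and test signal are illustrative), not the paper's analysis pipeline.

```python
import numpy as np

def normalized_kurtosis(y):
    """Varimax-style spikiness measure that MED seeks to maximize."""
    y = np.asarray(y, float)
    return np.sum(y ** 4) / np.sum(y ** 2) ** 2

def med(x, filt_len=16, n_iter=20):
    """Wiggins-style minimum-entropy deconvolution (sketch): iterate
    toward an FIR filter f whose output y = f * x is maximally spiky."""
    x = np.asarray(x, float)
    N = len(x)
    r = np.correlate(x, x, mode='full')          # autocorrelation, lag 0 at N-1
    R = np.array([[r[N - 1 + i - j] for j in range(filt_len)]
                  for i in range(filt_len)])     # Toeplitz normal matrix
    R += 1e-8 * np.trace(R) / filt_len * np.eye(filt_len)  # tiny ridge
    f = np.zeros(filt_len)
    f[filt_len // 2] = 1.0                       # start from a delayed impulse
    for _ in range(n_iter):
        y = np.convolve(x, f)                    # full convolution
        y3 = y ** 3
        # Cross-correlation of y^3 with x drives the fixed-point update.
        g = np.array([np.dot(y3[l:l + N], x) for l in range(filt_len)])
        f = np.linalg.solve(R, g)
        f /= np.linalg.norm(f)
    return f, np.convolve(x, f)

# Sparse glottal-like pulse train blurred by an exponential "channel".
rng = np.random.default_rng(0)
spikes = np.zeros(200)
spikes[::25] = rng.uniform(0.5, 1.0, size=8)
x = np.convolve(spikes, np.exp(-np.arange(12) / 3.0))
f, y = med(x)
```

After deconvolution the output should be markedly spikier than the input, exposing the individual pulse locations and amplitudes whose irregularity characterizes nonmodal phonation.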