Publications

An evaluation of audio-visual person recognition on the XM2VTS corpus using the Lausanne protocols

Published in:
Proc. 32nd IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, April 2007, pp. IV-237 - IV-240.

Summary

A multimodal person recognition architecture has been developed to improve overall recognition performance and to address channel-specific performance shortfalls. The architecture fuses a face recognition system with the MIT/LL GMM/UBM speaker recognition architecture, exploiting the complementary and redundant nature of the face and speech modalities. Evaluated on the XM2VTS corpus using the Lausanne open-set verification protocols, the resulting multimodal system demonstrates excellent recognition performance, with strong gains over the individual modalities.
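
The abstract does not spell out the fusion rule, so the sketch below assumes a common choice for this kind of architecture: score-level fusion as a weighted sum of calibrated per-modality scores. All names, the linear rule, and the tuning notes are illustrative, not the paper's implementation.

```python
def fuse_scores(face_score: float, speaker_score: float,
                w_face: float = 0.5) -> float:
    """Weighted linear fusion of per-modality verification scores.

    Assumes both scores are calibrated to a common scale (e.g.,
    log-likelihood ratios); w_face would be tuned on development data.
    """
    return w_face * face_score + (1.0 - w_face) * speaker_score

def verify(face_score: float, speaker_score: float,
           threshold: float = 0.0, w_face: float = 0.5) -> bool:
    """Accept an identity claim when the fused score clears a threshold
    set for the desired operating point (e.g., on the Lausanne protocols)."""
    return fuse_scores(face_score, speaker_score, w_face) >= threshold
```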

Multisensor dynamic waveform fusion

Published in:
Proc. 32nd IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, April 2007, pp. IV-577 - IV-580.

Summary

Speech communication is significantly more difficult in severe acoustic background noise environments, especially when low-rate speech coders are used. Non-acoustic sensors, such as radar sensors, vibrometers, and bone-conduction microphones, offer significant potential in these situations. We extend previous work on fixed waveform fusion from multiple sensors to an optimal dynamic waveform fusion algorithm that minimizes both additive noise and signal distortion in the estimated speech signal. We show that a minimum mean squared error (MMSE) waveform matching criterion results in a generalized multichannel Wiener filter, and that this filter simultaneously performs waveform fusion, noise suppression, and cross-channel noise cancellation. Formal intelligibility and quality testing demonstrates significant improvement from this approach.
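
The abstract's central claim, that the MMSE waveform-matching criterion yields a generalized multichannel Wiener filter, follows a standard derivation; the sketch below uses notation of our own choosing, not the paper's symbols.

```latex
% Stacked sensor observations at each frequency \omega, with channel
% responses \mathbf{d} and noise \mathbf{n} (model assumed for illustration):
\mathbf{x}(\omega) = \mathbf{d}(\omega)\,s(\omega) + \mathbf{n}(\omega),
\qquad \hat{s}(\omega) = \mathbf{w}^{H}(\omega)\,\mathbf{x}(\omega)

% Minimizing the mean squared waveform-matching error gives
\mathbf{w}_{\mathrm{MMSE}}(\omega)
  = \arg\min_{\mathbf{w}} E\!\left[\,\lvert s(\omega) - \mathbf{w}^{H}\mathbf{x}(\omega)\rvert^{2}\right]
  = \boldsymbol{\Phi}_{xx}^{-1}(\omega)\,\boldsymbol{\phi}_{xs}(\omega)
```

Here \Phi_xx = E[x x^H] is the sensor cross-power matrix and \phi_xs = E[x s*]; because one filter jointly weights and combines all channels, it performs waveform fusion, noise suppression, and cross-channel cancellation in a single step.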

Understanding scores in forensic speaker recognition

Summary

Recent work in forensic speaker recognition has introduced many new scoring methodologies. First, confidence scores (posterior probabilities) have become a useful method of presenting results to an analyst. The introduction of an objective measure of confidence score quality, the normalized cross entropy, has resulted in a systematic manner of evaluating and designing these systems. A second scoring methodology that has become popular is support vector machines (SVMs) for high-level features. SVMs are accurate and produce excellent results across a wide variety of token types: words, phones, and prosodic features. In both cases, an analyst may be at a loss to explain the significance and meaning of the score produced by these methods. We tackle the problem of interpretation by exploring concepts from the statistical and pattern classification literature. In both cases, our preliminary results show interesting aspects of scores not obvious from viewing them "only as numbers."
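
The normalized cross entropy mentioned above has a widely used form in speaker recognition evaluations: the cross entropy of the reported posteriors against ground truth, normalized by the entropy of a system that always reports the prior. A minimal sketch under that assumed definition (conventions such as log base and prior weighting vary):

```python
import numpy as np

def normalized_cross_entropy(posteriors, labels, prior=0.5):
    """Cross entropy of predicted target posteriors against truth,
    normalized by the entropy of always answering with the prior.

    posteriors: P(target | trial) values in (0, 1)
    labels:     0/1 ground truth (1 = same-speaker trial)
    """
    p = np.clip(np.asarray(posteriors, dtype=float), 1e-12, 1 - 1e-12)
    t = np.asarray(labels, dtype=float)
    h = -np.mean(t * np.log2(p) + (1 - t) * np.log2(1 - p))
    h_prior = -(prior * np.log2(prior) + (1 - prior) * np.log2(1 - prior))
    return h / h_prior  # values below 1 mean the scores carry information
```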

Exploiting nonacoustic sensors for speech encoding

Summary

The intelligibility of speech transmitted through low-rate coders is severely degraded when high levels of noise are present in the acoustic environment. Recent advances in nonacoustic sensors, including microwave radar, skin vibration, and bone conduction sensors, provide the exciting possibility of both glottal excitation and, more generally, vocal tract measurements that are relatively immune to acoustic disturbances and can supplement the acoustic speech waveform. We are currently investigating methods of combining the output of these sensors for use in low-rate encoding according to their capability in representing specific speech characteristics in different frequency bands. Nonacoustic sensors have the ability to reveal certain speech attributes lost in the noisy acoustic signal: for example, low-energy consonant voice bars, nasality, and glottalized excitation. By fusing nonacoustic low-frequency and pitch content with acoustic-microphone content, we have achieved significant intelligibility gains, as measured by the diagnostic rhyme test (DRT), across a variety of environments over the government-standard 2400-bps MELPe coder. By fusing quantized high-band 4-to-8-kHz speech, requiring only an additional 116 bps, we obtain further DRT performance gains by exploiting the ear's insensitivity to fine spectral detail in this frequency region.
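
As a rough illustration of fusing nonacoustic low-frequency content with acoustic-microphone content, spectra can be spliced at a crossover frequency. The fixed crossover and the whole-signal FFT below are simplifying assumptions; the paper's fusion operates inside a low-rate coding framework rather than on raw waveforms.

```python
import numpy as np

def bandwise_fuse(acoustic, nonacoustic, fs, crossover_hz=700.0):
    """Keep noise-robust low-frequency content from the nonacoustic
    sensor and high-frequency content from the acoustic microphone."""
    n = min(len(acoustic), len(nonacoustic))
    A = np.fft.rfft(acoustic[:n])
    S = np.fft.rfft(nonacoustic[:n])
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    fused = np.where(freqs < crossover_hz, S, A)  # splice at the crossover
    return np.fft.irfft(fused, n)
```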

Estimating and evaluating confidence for forensic speaker recognition

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, Vol. 1, 19-23 March 2005, pp. I-717 - I-720.

Summary

Estimating and evaluating confidence has become a key aspect of the speaker recognition problem because of the increased use of this technology in forensic applications. We discuss evaluation measures for speaker recognition and some of their properties. We then propose a framework for confidence estimation based upon scores and meta-information, such as utterance duration, channel type, and SNR. The framework uses regression techniques with multilayer perceptrons to estimate confidence with a data-driven methodology. As an application, we show the use of the framework in a speaker comparison task drawn from the NIST 2000 evaluation. A relative comparison of different types of meta-information is given. We demonstrate that the new framework can give substantial improvements over standard distribution methods of estimating confidence.
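
A modern re-creation of the regression step might look like the following, with scikit-learn's MLPRegressor standing in for the paper's multilayer perceptrons. The feature layout, network size, and placeholder training data are all assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Per-trial features: [score, utterance duration (s), channel type
# (integer-coded), SNR (dB)] -- the meta-information listed above.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 4))                    # placeholder features
y_train = (rng.random(1000) > 0.5).astype(float)   # placeholder correctness

mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)

def confidence(score, duration, channel, snr):
    """Map a raw score plus meta-information to a confidence in [0, 1]."""
    c = mlp.predict([[score, duration, channel, snr]])[0]
    return float(np.clip(c, 0.0, 1.0))
```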

Multisensor MELPE using parameter substitution

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, Vol. 1, 17-21 May 2004, pp. I-477 - I-480.

Summary

The estimation of speech parameters and the intelligibility of speech transmitted through low-rate coders, such as MELP, are severely degraded when there are high levels of acoustic noise in the speaking environment. The application of nonacoustic and nontraditional sensors, which are less sensitive to acoustic noise than the standard microphone, is being investigated as a means to address this problem. Sensors being investigated include the General Electromagnetic Motion Sensor (GEMS) and the Physiological Microphone (P-mic). As an initial effort in this direction, a multisensor MELPe coder using parameter substitution has been developed, where pitch and voicing parameters are obtained from the GEMS and P-mic sensors, respectively, and the remaining parameters are obtained as usual from a standard acoustic microphone. This parameter substitution technique is shown to produce significant and promising DRT intelligibility improvements over the standard 2400-bps MELPe coder in several high-noise military environments. Further work is in progress to utilize the nontraditional sensors for additional intelligibility improvements and for more effective lower-rate coding in noise.
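
The substitution step itself is conceptually simple; the sketch below uses an illustrative parameter set, not the actual MELPe bitstream fields.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class MelpFrameParams:
    """Illustrative subset of per-frame MELP-style parameters."""
    pitch: float     # pitch-period estimate
    voicing: tuple   # bandpass voicing decisions
    lsf: tuple       # line spectral frequencies (spectral shape)
    gain: tuple      # frame gains

def substitute(acoustic: MelpFrameParams, gems_pitch: float,
               pmic_voicing: tuple) -> MelpFrameParams:
    """Swap in noise-robust pitch (GEMS) and voicing (P-mic) estimates;
    the spectral parameters still come from the acoustic microphone."""
    return replace(acoustic, pitch=gems_pitch, voicing=pmic_voicing)
```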

Automated lip-reading for improved speech intelligibility

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, Vol. 1, 17-21 May 2004, pp. I-701 - I-704.

Summary

Various psycho-acoustical experiments have concluded that visual features strongly affect the perception of speech. This contribution is most pronounced in noisy environments, where the intelligibility of audio-only speech degrades quickly. This paper explores the effectiveness of extracted visual features, such as lip height and width, for improving speech intelligibility in noisy environments. The intelligibility content of these extracted visual features is investigated through an intelligibility test on an animated rendition of the video generated from the extracted visual features, as well as on the original video. These experiments demonstrate that the extracted video features do contain important aspects of intelligibility that may be utilized in augmenting speech enhancement and coding applications. Alternatively, these extracted visual features can be transmitted in a bandwidth-effective way to augment speech coders.
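
Computing the two named features is straightforward once a mouth contour is available; the sketch below assumes some landmark tracker has already produced outer-lip coordinates, which is not necessarily the paper's extraction method.

```python
import numpy as np

def lip_features(mouth_landmarks: np.ndarray) -> tuple:
    """Lip height and width from an (N, 2) array of outer-lip (x, y)
    points (layout assumed)."""
    xs, ys = mouth_landmarks[:, 0], mouth_landmarks[:, 1]
    width = float(xs.max() - xs.min())   # corner-to-corner extent
    height = float(ys.max() - ys.min())  # mouth-opening extent
    return height, width
```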

Exploiting nonacoustic sensors for speech enhancement

Summary

Nonacoustic sensors such as the general electromagnetic motion sensor (GEMS), the physiological microphone (P-mic), and the electroglottograph (EGG) offer multimodal approaches to speech processing and speaker and speech recognition. These sensors provide measurements of functions of the glottal excitation and, more generally, of the vocal tract articulator movements that are relatively immune to acoustic disturbances and can supplement the acoustic speech waveform. This paper describes an approach to speech enhancement that exploits these nonacoustic sensors according to their capability in representing specific speech characteristics in different frequency bands. Frequency-domain sensor phase, as well as magnitude, is found to contribute to signal enhancement. Preliminary testing involves the time-synchronous multi-sensor DARPA Advanced Speech Encoding Pilot Speech Corpus collected in a variety of harsh acoustic noise environments. The enhancement approach is illustrated with examples that indicate its applicability as a pre-processor to low-rate vocoding and speaker authentication, and for enhanced listening from degraded speech.
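
To make the magnitude-and-phase point concrete: in the short-time Fourier domain, both quantities can be drawn, band by band, from whichever sensor is judged more reliable there. The fixed crossover below is an assumption; the paper allocates sensors per band according to their measured capability.

```python
import numpy as np
from scipy.signal import stft, istft

def enhance(mic, sensor, fs, crossover_hz=500.0, nperseg=256):
    """Band-selective sketch: below the crossover take magnitude AND
    phase from the nonacoustic sensor; above it, keep the microphone."""
    f, _, M = stft(mic, fs=fs, nperseg=nperseg)
    _, _, S = stft(sensor, fs=fs, nperseg=nperseg)
    low = (f < crossover_hz)[:, None]                 # band mask
    mag = np.where(low, np.abs(S), np.abs(M))         # per-band magnitude
    phase = np.where(low, np.angle(S), np.angle(M))   # per-band phase
    _, y = istft(mag * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return y
```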